[csw-maintainers] Symbols information and checkpkg database

Fri Mar 29 15:06:20 CET 2013

In our package database, metadata of each package are stored in a
simplistic way: there is a single data structure, which is serialized and
stored as a binary blob. I based this design on a rough estimate of how
much data there is per package. These data objects were small enough that
load times were reasonable: you could load the whole catalog into RAM in
8 minutes, and then run very fast queries against it.

Enter symbols information. It increased the amount of data 10-fold. The
whole database used to take about 150MB after compression. After symbols
information was added, the database dump is 1.8GB (compressed;
uncompressed size is over 14GB).

The following problems have occurred since:

We can no longer display package metadata on the buildfarm website, which
makes it hard to inspect packages. By opening one URL, we used to get the
whole information about the package. We had to disable that feature,
because it doesn't work fast enough with the increased amount of data.

Catalog generation used to take about 25-30 minutes, and it now takes hours.

Since metadata for each package are in a single blob, you can't read part
of that data without reading (deserializing) all of it.

I looked around the code to figure out what we could do. A reasonable
amount of refactoring will be necessary. One thing seems to be standing in
the way: the automatic mode.

Some of you might still remember how everyone used to have their own
checkpkg database in sqlite that was automatically created when checkpkg
was run. I don't think we can maintain that functionality with increased
data pressure. Therefore, the proposed plan is this:

1. Drop the automatic mode and rip out the old unnecessary code. Downside:
people on private buildfarms will no longer be able to just run checkpkg
and expect it to work! Everyone will be forced to set up a shared database
and an HTTP server if they want to run checkpkg.
2. Move more interaction towards the RESTful interface. This will allow for
easy parallel processing.
3. Move storage of data from the database to the filesystem. Split out each
package's metadata into two or more chunks. Only the webserver will have
access to these files.
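
To make point 3 a bit more concrete, here is a rough sketch of what
per-chunk storage could look like. It is not the existing checkpkg code:
the chunk names, the directory layout keyed by md5 sum, the JSON format
and the function names are all assumptions made up for illustration, and
it assumes Python 3.

```python
import json
import os
import tempfile

# Hypothetical chunk names; the real split would be decided during the
# refactoring. The point is that "symbols" lives in its own file.
CHUNKS = ("basic", "files", "symbols")


def save_package(root, md5_sum, metadata):
    """Write each top-level chunk of a package's metadata to its own file."""
    pkg_dir = os.path.join(root, md5_sum)
    os.makedirs(pkg_dir, exist_ok=True)
    for chunk in CHUNKS:
        path = os.path.join(pkg_dir, chunk + ".json")
        with open(path, "w") as f:
            json.dump(metadata.get(chunk, {}), f)


def load_chunk(root, md5_sum, chunk):
    """Read and deserialize one chunk without touching the others."""
    if chunk not in CHUNKS:
        raise ValueError("unknown chunk: %s" % chunk)
    with open(os.path.join(root, md5_sum, chunk + ".json")) as f:
        return json.load(f)


if __name__ == "__main__":
    # Made-up package data, stored in a throwaway directory.
    root = tempfile.mkdtemp()
    save_package(root, "0123abcd", {
        "basic": {"pkgname": "CSWfoo", "version": "1.0"},
        "files": {"paths": ["/opt/csw/bin/foo"]},
        "symbols": {"needed": ["libc.so.1"]},
    })
    # Only basic.json is opened here; symbols.json is never read.
    print(load_chunk(root, "0123abcd", "basic"))
```

With a layout like this, a page that only shows the basic package
information would never have to deserialize the symbols data.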

It is a considerable amount of work, and we'll have to suffer the current
degradation for several weeks to months more.

Thoughts?

Maciej