[csw-maintainers] Small/Medium-size coding project available

Wed May 8 13:49:58 CEST 2013

Hello maintainers,

Here's a small/medium-sized coding project that we would significantly
benefit from. It's nicely self-contained and it does have some algorithmic
components. It can be written in any programming language (that we can run
on our buildfarm).

Our current catalog generation takes about 80 minutes. If we can make it
faster, we can generate the catalog more often and have a quicker
build-push-release turnaround, and relieve the buildfarm from most of the
current catalog-generation-induced disk stress. We currently run the
generation every 3h. If we can make the generation complete in something
like 10 minutes (which I think is possible), we could run catalog
generation e.g. every hour.

We have a directory on disk with a package catalog, as we can see on the
mirror:

http://mirror.opencsw.org/opencsw/unstable/i386/5.10/

We can query the RESTful interface for the current state of the same
catalog in the database:

curl -s http://buildfarm.opencsw.org/pkgdb/rest/catalogs/unstable/i386/SunOS5.10/ \
  | python -m json.tool | (head -n 30; cat >/dev/null)

[
    {
        "basename": "389_admin-1.1.30,REV=2013.01.07-SunOS5.10-i386-CSW.pkg.gz",
        "catalogname": "389_admin",
        "file_basename": "389_admin-1.1.30,REV=2013.01.07-SunOS5.10-i386-CSW.pkg.gz",
        "md5_sum": "6110aad210240504ede48f9cd8b4501c",
        "mtime": "2013-01-07T12:02:22",
        "rev": "2013.01.07",
        "size": 403046,
        "version": "1.1.30,REV=2013.01.07",
        "version_string": "1.1.30,REV=2013.01.07"
    },
    (...)
]

(the query takes about 25s to evaluate; the python bit is here just for
data pretty-printing)
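
As a rough sketch of how a new generator might consume this output, the
snippet below parses the JSON and indexes it by file name. The function name
index_catalog is made up for illustration, and the sample string is a
shortened copy of the entry shown above; in real use the text would come
from the REST URL.

```python
import json


def index_catalog(json_text):
    """Map each package file basename to its md5 sum.

    Hypothetical helper: assumes the REST output is a JSON list of
    objects with "basename" and "md5_sum" keys, as in the sample above.
    """
    entries = json.loads(json_text)
    return {entry["basename"]: entry["md5_sum"] for entry in entries}


# Shortened copy of the entry shown above, used instead of a live query.
sample = """[
    {
        "basename": "389_admin-1.1.30,REV=2013.01.07-SunOS5.10-i386-CSW.pkg.gz",
        "catalogname": "389_admin",
        "md5_sum": "6110aad210240504ede48f9cd8b4501c",
        "size": 403046
    }
]"""

catalog = index_catalog(sample)
for basename, md5_sum in sorted(catalog.items()):
    print(md5_sum, basename)
```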

We also have the 'allpkgs' directory:
http://mirror.opencsw.org/opencsw/allpkgs/

It's excluded from rsync, so it doesn't get propagated to mirrors, but it
does exist on the master mirror and the buildfarm. It's the central pool
for all the package data files.

When we generate catalogs, we do not copy anything; instead, we make
hardlinks to the files in the allpkgs directory. For example, we hardlink
allpkgs/foo-i386-CSW.pkg.gz into unstable/i386/5.9. However, when we generate
a catalog for the next OS release (e.g. 5.10), we do not make a hardlink;
if possible, we make a symlink from the 5.10 directory to the 5.9
directory. This way we save space on mirrors: we only send out 1 copy of
the file (in the lowest OS release in which it occurs), and then we create
symlinks to it.

For example:
allpkgs/foo-i386-CSW.pkg.gz (not synced to mirrors)
unstable/i386/5.9/foo-i386-CSW.pkg.gz (hardlink to the file in allpkgs)
unstable/i386/5.10/foo-i386-CSW.pkg.gz → ../5.9/foo-i386-CSW.pkg.gz (symlink)
unstable/i386/5.11/foo-i386-CSW.pkg.gz → ../5.9/foo-i386-CSW.pkg.gz (symlink)
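
The hardlink/symlink rule could be sketched roughly as follows. This is only
an illustration, not the existing code: plan_layout and its data shapes are
invented here, and it assumes a symlink is always possible whenever the same
file name already occurs in a lower OS release.

```python
def plan_layout(catalogs):
    """Decide, per OS release, how each package file should appear on disk.

    catalogs: dict mapping an OS release string (e.g. "5.9") to the set of
    file basenames in that release's catalog (hypothetical input shape).
    Returns a dict mapping (osrel, basename) to (kind, target), where kind
    is "hardlink" (target inside allpkgs) or "symlink" (relative target).
    """
    def release_key(osrel):
        # Compare "5.9" and "5.10" numerically, not as strings.
        return tuple(int(part) for part in osrel.split("."))

    first_seen = {}  # basename -> lowest OS release that contains it
    layout = {}
    for osrel in sorted(catalogs, key=release_key):
        for basename in sorted(catalogs[osrel]):
            if basename not in first_seen:
                first_seen[basename] = osrel
                layout[(osrel, basename)] = ("hardlink", "allpkgs/" + basename)
            else:
                target = "../%s/%s" % (first_seen[basename], basename)
                layout[(osrel, basename)] = ("symlink", target)
    return layout


layout = plan_layout({
    "5.9": {"foo-i386-CSW.pkg.gz"},
    "5.10": {"foo-i386-CSW.pkg.gz"},
    "5.11": {"foo-i386-CSW.pkg.gz"},
})
for (osrel, basename), (kind, target) in sorted(layout.items()):
    print(osrel, basename, kind, target)
```

Sorting the releases numerically matters here, because a plain string sort
would put 5.10 before 5.9.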

Because the symlinks in higher OS releases point into lower ones, a single
program run needs to generate the catalogs for one catalog release (e.g.
unstable), one architecture, and all OS releases together.

We do currently have code that does it, but the code is really stupid. It
unlinks everything from the directory, and starts from scratch every time.
This generates a lot of unnecessary disk operations, and makes the whole
process slow. It would be much better to see what's in the database, see
what's on disk, and figure out the smallest set of operations to bring the
disk to the new state.
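
To make the idea a bit more concrete, here is a minimal sketch of the
comparison step. It assumes both states have already been collected into
dictionaries that map a relative path to a (kind, target) pair; that
representation and the name diff_operations are made up for this example.
Scanning the disk and querying the database are left out.

```python
def diff_operations(on_disk, desired):
    """List the operations needed to turn on_disk into desired.

    Both arguments map a relative path to a (kind, target) tuple, e.g.
    ("symlink", "../5.9/foo.pkg.gz"). Entries that already match are left
    alone; only stale, changed, or missing entries produce operations.
    """
    ops = []
    # Remove entries that should not exist or that point at the wrong thing.
    for path in sorted(on_disk):
        if desired.get(path) != on_disk[path]:
            ops.append(("remove", path))
    # Create entries that are missing or were just removed as wrong.
    for path in sorted(desired):
        if on_disk.get(path) != desired[path]:
            kind, target = desired[path]
            ops.append(("create", path, kind, target))
    return ops


on_disk = {
    "5.9/foo.pkg.gz": ("hardlink", "allpkgs/foo.pkg.gz"),
    "5.10/foo.pkg.gz": ("symlink", "../5.9/foo.pkg.gz"),
    "5.9/old.pkg.gz": ("hardlink", "allpkgs/old.pkg.gz"),
}
desired = {
    "5.9/foo.pkg.gz": ("hardlink", "allpkgs/foo.pkg.gz"),
    "5.10/foo.pkg.gz": ("symlink", "../5.9/foo.pkg.gz"),
    "5.10/new.pkg.gz": ("hardlink", "allpkgs/new.pkg.gz"),
}
for op in diff_operations(on_disk, desired):
    print(op)
```

In this example the two foo entries are untouched, so only one remove and
one create would hit the disk.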

Would anyone be up for writing it?

Maciej