<div dir="ltr">Hello maintainers,<div><br></div><div style>Here's a small-to-medium-sized coding project that would benefit us significantly. It's nicely self-contained and has some algorithmic components. It can be written in any programming language (that we can run on our buildfarm).</div>


<div style><br></div><div style>Our current catalog generation takes about 80 minutes. If we can make it faster, we can generate the catalog more often, get a quicker build-push-release turnaround, and relieve the buildfarm of most of the disk stress that catalog generation currently causes. We currently run the generation every 3 hours. If we can make the generation complete in something like 10 minutes (which I think is possible), we could run catalog generation e.g. every hour.<br>


</div><div style><br></div><div style>We have a directory on disk with a package catalog, as we can see on the mirror:</div><div style><br></div><div style><a href="http://mirror.opencsw.org/opencsw/unstable/i386/5.10/">http://mirror.opencsw.org/opencsw/unstable/i386/5.10/</a></div>


<div style><br></div><div style>We can query the RESTful interface for the current state of the same catalog in the database:<br></div><div style><br></div><div style>curl -s <a href="http://buildfarm.opencsw.org/pkgdb/rest/catalogs/unstable/i386/SunOS5.10/">http://buildfarm.opencsw.org/pkgdb/rest/catalogs/unstable/i386/SunOS5.10/</a> \</div>


<div style>| python -m json.tool | (head -n 30; cat >/dev/null)</div><div style><br></div><div style><div>[</div><div>    {</div><div>        "basename": "389_admin-1.1.30,REV=2013.01.07-SunOS5.10-i386-CSW.pkg.gz", </div>


<div>        "catalogname": "389_admin", </div><div>        "file_basename": "389_admin-1.1.30,REV=2013.01.07-SunOS5.10-i386-CSW.pkg.gz", </div><div>        "md5_sum": "6110aad210240504ede48f9cd8b4501c", </div>


<div>        "mtime": "2013-01-07T12:02:22", </div><div>        "rev": "2013.01.07", </div><div>        "size": 403046, </div><div>        "version": "1.1.30,REV=2013.01.07", </div>


<div>        "version_string": "1.1.30,REV=2013.01.07"</div><div>    }, </div><div>    (...)</div><div>]</div></div><div style><br></div><div style>(the query takes about 25s to evaluate; the python bit is here just for data pretty-printing)</div>
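<div style><br></div><div style>As a rough illustration of consuming this data from a program, here is a minimal Python sketch. The function names fetch_catalog and index_catalog are hypothetical; the only things taken from the output above are the "basename" and "md5_sum" fields.</div>

```python
import json
import urllib.request


def index_catalog(json_text):
    """Turn the REST response (a JSON list of package entries)
    into a dict mapping basename -> md5_sum."""
    entries = json.loads(json_text)
    return {entry["basename"]: entry["md5_sum"] for entry in entries}


def fetch_catalog(url):
    """Download the catalog JSON from the REST interface and index it.
    (Hypothetical helper; the query itself is slow, see above.)"""
    with urllib.request.urlopen(url) as response:
        return index_catalog(response.read().decode("utf-8"))
```

<div style>Keying on the basename is only one possible choice; it is convenient here because the basename is also the file name on disk.</div>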


<div style><br></div><div style>We also have the 'allpkgs' directory: <a href="http://mirror.opencsw.org/opencsw/allpkgs/">http://mirror.opencsw.org/opencsw/allpkgs/</a></div><div style><br></div><div style>It's excluded from rsync, so it doesn't get propagated to mirrors, but it does exist on the master mirror and the buildfarm. It's the central pool for all the package data files.</div>


<div style><br></div><div style>When we generate catalogs, we do not copy anything; instead, we make hardlinks to the files in the allpkgs directory. For example, we hardlink allpkgs/foo-i386-CSW.pkg.gz into unstable/i386/5.9. However, when we generate a catalog for the next OS release (e.g. 5.10), we do not make another hardlink; if possible, we make a symlink from the 5.10 directory to the 5.9 directory. This way we save space on mirrors: we only send out 1 copy of the file (in the lowest OS release in which it occurs), and then we create symlinks to it.</div>


<div style><br></div><div style>For example:</div><div style>allpkgs/foo-i386-CSW.pkg.gz (not synced to mirrors)</div><div style>unstable/i386/5.9/foo-i386-CSW.pkg.gz (hardlink to the file in allpkgs)</div><div style>unstable/i386/5.10/foo-i386-CSW.pkg.gz → ../5.9/foo-i386-CSW.pkg.gz (symlink)<br>


</div><div style>unstable/i386/5.11/foo-i386-CSW.pkg.gz → ../5.9/foo-i386-CSW.pkg.gz (symlink)<br></div><div style><br></div><div style>You can now see that we need to generate catalogs for one catalog release (e.g. unstable) and one architecture, and all OS releases in one program run.</div>
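<div style><br></div><div style>The layout rule above could be sketched in Python roughly as follows. This is only an illustration, not our existing code: the function name desired_layout and the data shapes are made up, and it assumes that a file missing from an intermediate OS release still gets symlinked to the lowest release in which it occurs.</div>

```python
from typing import Dict, Iterable, List, Tuple

# (kind, target): kind is "hardlink" or "symlink"
Entry = Tuple[str, str]


def desired_layout(
    catalogs: Dict[str, Iterable[str]],
    os_releases: List[str],
) -> Dict[Tuple[str, str], Entry]:
    """Map (os_release, basename) to the link that should exist on disk.

    catalogs maps an OS release to the basenames in its catalog.
    os_releases must be ordered from the lowest release to the highest.
    """
    layout: Dict[Tuple[str, str], Entry] = {}
    first_seen: Dict[str, str] = {}  # basename -> release holding the hardlink
    for osrel in os_releases:
        for basename in sorted(catalogs.get(osrel, ())):
            if basename in first_seen:
                # Already present in a lower release: symlink to it.
                target = "../%s/%s" % (first_seen[basename], basename)
                layout[(osrel, basename)] = ("symlink", target)
            else:
                # Lowest release with this file: hardlink to allpkgs.
                first_seen[basename] = osrel
                layout[(osrel, basename)] = ("hardlink", "allpkgs/" + basename)
    return layout
```

<div style>Because the symlink target depends on the lower releases, the function needs all OS releases of one catalog release and architecture at once.</div>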


<div style><br></div><div style>We do currently have code that does it, but the code is really stupid. It unlinks everything from the directory, and starts from scratch every time. This generates a lot of unnecessary disk operations, and makes the whole process slow. It would be much better to see what's in the database, see what's on disk, and figure out the smallest set of operations to bring the disk to the new state.</div>
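<div style><br></div><div style>To make the idea concrete, here is a minimal Python sketch of the diffing step. It is one possible approach, not a finished design: plan_operations is a hypothetical name, and both states are assumed to be plain dicts mapping a path to a (kind, target) pair, where kind is "hardlink" or "symlink". How the on-disk state is collected (for example with os.lstat and os.readlink) is left out.</div>

```python
def plan_operations(desired, actual):
    """Return the operations needed to turn `actual` into `desired`.

    Both arguments map path -> (kind, target). Entries that are
    identical in both states produce no operation at all.
    """
    ops = []
    # Remove entries that should not exist or that point to the wrong place.
    for path in sorted(actual):
        if desired.get(path) != actual[path]:
            ops.append(("unlink", path))
    # Create entries that are missing or were just removed above.
    for path in sorted(desired):
        if actual.get(path) != desired[path]:
            kind, target = desired[path]
            ops.append((kind, path, target))
    return ops


if __name__ == "__main__":
    desired = {
        "5.9/foo.pkg.gz": ("hardlink", "allpkgs/foo.pkg.gz"),
        "5.10/foo.pkg.gz": ("symlink", "../5.9/foo.pkg.gz"),
    }
    actual = {
        "5.9/foo.pkg.gz": ("hardlink", "allpkgs/foo.pkg.gz"),
        "5.10/old.pkg.gz": ("symlink", "../5.9/old.pkg.gz"),
    }
    for op in plan_operations(desired, actual):
        print(op)
```

<div style>When nothing has changed between two runs, the plan is empty and the disk is not touched at all, which is where the savings would come from.</div>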


<div style><br></div><div style>Would anyone be up for writing it?</div><div style><br></div><div style>Maciej</div></div>