[csw-devel] Help with checkpkg: optimizing CatalogMixin.GetPkgByPath()

Maciej Bliziński maciej at opencsw.org
Sun Mar 20 13:44:13 CET 2011


Hey guys,

Here's one more checkpkg-related mini-project.

One of the performance bottlenecks in checkpkg is the GetPkgByPath
function in the CatalogMixin class.  It does a relatively simple task:
based on the information in the database, it finds out which
package owns a certain file in a given catalog.  A catalog is defined as
a triplet of catalog release, architecture and OS release[1].

The slowest part seems to be the MySQL query.  The mini-project would be to:

1. Extract the exact SQL query compiled by SQLObject
2. Analyze the query in the database
3. Propose a fix (an additional index, etc.)

This particular bit will require some research on SQLObject internals
(switching on debugging mode) and MySQL query performance analysis.
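
Steps 2 and 3 could look roughly like the sketch below.  It is only an
illustration under loud assumptions: it uses Python's built-in sqlite3
module as a stand-in for MySQL (EXPLAIN QUERY PLAN plays the role that
EXPLAIN has in MySQL), and the table names, column names, index name
and the query itself are all made up.  They are not the real checkpkg
schema, nor the SQL that SQLObject actually generates.

```python
import sqlite3

# Hypothetical query: which packages in one catalog own a given file?
OWNER_QUERY = """
    SELECT DISTINCT f.pkgname
    FROM pkg_file AS f
    JOIN catalog_entry AS c ON c.pkgname = f.pkgname
    WHERE f.path = ? AND f.basename = ?
      AND c.catrel = ? AND c.arch = ? AND c.osrel = ?
    ORDER BY f.pkgname
"""


def build_demo_db():
    """Create an in-memory database with a made-up, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pkg_file (
            id INTEGER PRIMARY KEY,
            pkgname TEXT NOT NULL,
            path TEXT NOT NULL,
            basename TEXT NOT NULL
        );
        CREATE TABLE catalog_entry (
            pkgname TEXT NOT NULL,
            catrel TEXT NOT NULL,
            arch TEXT NOT NULL,
            osrel TEXT NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO pkg_file (pkgname, path, basename) VALUES (?, ?, ?)",
        [
            ("CSWfoo", "/opt/csw/bin", "foo"),
            ("CSWbar", "/opt/csw/bin", "bar"),
            ("CSWold", "/opt/csw/bin", "foo"),  # same file, other catalog
        ],
    )
    conn.executemany(
        "INSERT INTO catalog_entry VALUES (?, ?, ?, ?)",
        [
            ("CSWfoo", "unstable", "sparc", "SunOS5.10"),
            ("CSWbar", "unstable", "sparc", "SunOS5.10"),
            ("CSWold", "stable", "sparc", "SunOS5.10"),
        ],
    )
    return conn


def find_owners(conn, params):
    """Run the lookup and return the owning package names."""
    return [row[0] for row in conn.execute(OWNER_QUERY, params)]


def query_plan(conn, params):
    """Step 2: ask the database how it intends to execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + OWNER_QUERY, params).fetchall()
    # The last column of each row is the human-readable plan step.
    return [row[-1] for row in rows]


def add_path_index(conn):
    """Step 3: one candidate fix, an index on the columns being filtered."""
    conn.execute(
        "CREATE INDEX pkg_file_path_idx ON pkg_file (path, basename)")


if __name__ == "__main__":
    conn = build_demo_db()
    params = ("/opt/csw/bin", "foo", "unstable", "sparc", "SunOS5.10")
    print("owners:", find_owners(conn, params))
    print("plan before:", query_plan(conn, params))
    add_path_index(conn)
    print("plan after: ", query_plan(conn, params))
```

The point is to compare the plan before and after the change: if the
new index shows up in the plan, the database no longer has to scan the
whole file table for every lookup.  On the real MySQL database the same
comparison would be done with EXPLAIN on the query extracted in step 1.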

Why is this optimization important?

Based on the profiling of checkpkg I did, this function is the
slowest part of the whole checkpkg run.  Checking the whole catalog
currently takes a lot of time: about two days, if not three; I'm not
sure any more.  Before the file collision check was introduced,
checkpkg was able to analyze the whole catalog in about an hour.  Two
days is way too slow if we want to have up-to-date QA information for
our packages, especially when a new update breaks other
packages.  This optimization is vital to our future QA workflow.

I will provide all guidance and help I can.

Maciej

[1] It will eventually change to a quadruplet of catalog release,
tier, architecture and OS release.

