[csw-maintainers] Symbols information and checkpkg database

Sun Apr 21 14:11:04 CEST 2013

I spent more time on this problem, and I don't see an easy way out of
it.

I can define the problem as this: When I view a page of a given
package, I want to see its main metadata, such as pkgmap, and the
output of the dump utility to see which binaries depend on which
sonames. For example:

http://buildfarm.opencsw.org/pkgdb/srv4/1bd915f6cdbf1217addd0e6d28823dff/

This page is currently forced to display the following depressing message:

"As of January 2013, the stats stored are so big that processing them
can take several minutes before they can be served. Disabling until a
proper solution is in place."

If you try to re-enable the metadata display, you just get a timeout.
The current state of affairs means that it's much harder to review and
diagnose packages than it needs to be. Things that used to take 10
seconds to check now take 5 to 10 minutes and require shell access:
you can't review packages by using a web browser. I'm unhappy about this.

An obvious solution is to keep the elfdump and ldd data in a separate
place, and not include it with the main bulk of metadata. It sounds
easier than it is. The core problem is this:

We are no longer capable of keeping a single package's metadata in RAM
to analyze it.

We might be able to do it on the buildfarm, but our longer-term plan is
to allow other people with smaller hardware to have their own buildfarm.
I'm using a virtual machine with 1.6GB of RAM as a reference. I am not
able to index our catalogs on that machine; it just fails because of
insufficient RAM.

Our package checking code has a nice and simple API: you define a
function, which gets your package's metadata and a few interaction
objects it uses to report errors:

https://sourceforge.net/apps/trac/gar/browser/csw/mgar/gar/v2/lib/python/package_checks.py

Here's a showcase of a check that verifies that a package does not
depend on itself:

def CheckDependsOnSelf(pkg_data, error_mgr, logger, messenger):
  pkgname = pkg_data["basic_stats"]["pkgname"]
  for depname, dep_desc in pkg_data["depends"]:
    if depname == pkgname:
      error_mgr.ReportError("depends-on-self")

It is a really simple API, because pkg_data is just a data structure
deserialized from JSON, the same one you can get via REST, which is a
simple HTTP GET you can do with curl or anything else that can talk
over HTTP. There are no mysteries and no lazy evaluation. You just see
what the data structure is, and you can traverse it for whatever data
you need.
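
To show how little is involved, here is a minimal, self-contained
sketch of running the check above against a JSON document. The sample
JSON and the FakeErrorManager class are made up for illustration; the
real pkg_data has many more keys, and the real error manager does more
than collect tags.

```python
import json

# Hypothetical sample; the real pkg_data carries far more data than this.
SAMPLE_JSON = """
{
  "basic_stats": {"pkgname": "CSWfoo"},
  "depends": [["CSWcommon", "common - common files"],
              ["CSWfoo", "foo - depends on itself"]]
}
"""


class FakeErrorManager(object):
  """Stand-in for the real error manager; it only records error tags."""

  def __init__(self):
    self.errors = []

  def ReportError(self, tag):
    self.errors.append(tag)


def CheckDependsOnSelf(pkg_data, error_mgr, logger, messenger):
  pkgname = pkg_data["basic_stats"]["pkgname"]
  for depname, dep_desc in pkg_data["depends"]:
    if depname == pkgname:
      error_mgr.ReportError("depends-on-self")


# pkg_data is nothing more than the deserialized JSON document.
pkg_data = json.loads(SAMPLE_JSON)
error_mgr = FakeErrorManager()
CheckDependsOnSelf(pkg_data, error_mgr, logger=None, messenger=None)
print(error_mgr.errors)
```

The whole structure sits in memory at once, which is exactly what makes
it easy to work with and exactly what no longer scales.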

With the current amount of data, we cannot have a simple pkg_data any
more. We'll have to switch to something doing lazy evaluation, and we
are running a risk that a check function can leak memory and cause
checkpkg to crash. Of course, we can implement this, but I can already
hear people saying "this is so complicated, why are you preventing me
from writing checks?"

The new API would have to look something like this:

def CheckDependsOnSelf(data_access_object, error_mgr, logger, messenger):
  pkgname = data_access_object.get("basic_stats")["pkgname"]
  for depname, dep_desc in data_access_object.get("depends"):
    if depname == pkgname:
      error_mgr.ReportError("depends-on-self")

It doesn't look that different, but it is different in that instead of
accessing a normal dict/list data structure, you're calling an object,
which makes REST queries under the hood, and generally does who knows
what.
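
To make the difference concrete, here is a rough sketch of what such an
object might look like. Everything in it is an assumption for
illustration: the LazyPkgData class, its forget method, the fake
backend and the "big_binary_data" section are all made up, and a real
version would do an HTTP GET where FakeFetch does a dict lookup.

```python
class LazyPkgData(object):
  """Hypothetical data access object: fetches a section on first use."""

  def __init__(self, fetch_section):
    # fetch_section is any callable that takes a section name and
    # returns deserialized data for it.
    self._fetch_section = fetch_section
    self._cache = {}

  def get(self, section):
    if section not in self._cache:
      self._cache[section] = self._fetch_section(section)
    return self._cache[section]

  def forget(self, section):
    """Drops a cached section so that big ones do not pile up in RAM."""
    self._cache.pop(section, None)


# Fake backend standing in for the REST interface.
FAKE_BACKEND = {
    "basic_stats": {"pkgname": "CSWfoo"},
    "depends": [["CSWfoo", "foo - depends on itself"]],
    "big_binary_data": ["imagine many megabytes here"],
}
fetched = []  # Records which sections were actually requested.


def FakeFetch(section):
  fetched.append(section)
  return FAKE_BACKEND[section]


class FakeErrorManager(object):
  """Stand-in for the real error manager; it only records error tags."""

  def __init__(self):
    self.errors = []

  def ReportError(self, tag):
    self.errors.append(tag)


def CheckDependsOnSelf(data_access_object, error_mgr, logger, messenger):
  pkgname = data_access_object.get("basic_stats")["pkgname"]
  for depname, dep_desc in data_access_object.get("depends"):
    if depname == pkgname:
      error_mgr.ReportError("depends-on-self")


dao = LazyPkgData(FakeFetch)
error_mgr = FakeErrorManager()
CheckDependsOnSelf(dao, error_mgr, logger=None, messenger=None)
print(error_mgr.errors)
print(fetched)
```

In this sketch the check only ever pulls in the two small sections and
the big one is never fetched. The price is the hidden state: the cache
grows unless somebody calls forget, which is the kind of thing a check
author should not have to think about.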

I don't have a good solution. Any ideas?

Maciej