[csw-maintainers] Garbage collection in allpkgs

Dagobert Michelsen dam at opencsw.org
Mon Dec 31 22:09:46 CET 2012


Hi Maciej,

On 31.12.2012 at 14:04, Maciej (Matchek) Bliziński <maciej at opencsw.org> wrote:
> 2012/12/31 Dagobert Michelsen <dam at opencsw.org>:
>> On 30.12.2012 at 19:53, Maciej (Matchek) Bliziński <maciej at opencsw.org> wrote:
>>> 2012/12/29 Dagobert Michelsen <dam at opencsw.org>:
>>>>> Could we do the
>>>>> same somewhere on the buildfarm but not on the master mirror? Then we
>>>>> would have an official archive for those that need it but since it
>>>>> wouldn't be used that much it would be unnecessary to mirror it, we
>>>>> would just link to it from our mirror page.
>>>> 
>>>> This is already the case: allpkgs/ is not included in the main rsync
>>>> offering, just in opencsw-full:
>>>> 
>>>>> dam at login [login]:/home/dam > rsync rsync://mirror.opencsw.org
>>>>> csw             Legacy name, please switch to the identical 'opencsw'
>>>>> opencsw         CSW Primary Mirror, use this if you are mirroring OpenCSW (the archive "allpkgs" is now in 'opencsw-full')
>>>>> opencsw-full    CSW Primary Mirror, contains full archive of old packages
>>>>> opencsw-future  The proposed future layout of the OpenCSW Primary, layout may change without notice at any time
>>>> 
>>>> This is done by using the exclude directive in rsyncd.conf for "csw" and "opencsw":
>>>>       exclude = allpkgs HEADER.txt
>>>> 
>>>> Having all packages on the primary mirror is also good IMHO. This way
>>>> each downstream-site can easily select what to offer.
>>> 
>>> I'm not sure what you mean by downstream-sites selecting what to
>>> offer.
>> 
>> Official sites mirroring our packages.
> 
> I was asking what you mean by selecting what to offer. Downstream
> sites I understand.
> 
>>> The primary mirror has a set of files, and that's it.
>> 
>> Not quite. All files in the filesystem are available for download.
>> However, if you rsync "opencsw" you won't get allpkgs/, so almost all
>> of the official mirror sites don't mirror allpkgs/.
> 
> So there's a set of files that you get when you rsync and that's it. If
> you rsync, you don't get to choose, you get what is given to
> you. If you don't, then you're not a full mirror.

You are completely missing the point here. Please try
  rsync rsync://mirror.opencsw.org/opencsw/
and compare this to
  rsync rsync://mirror.opencsw.org/opencsw-full/

The former does not include allpkgs/, the latter does. By choosing the rsync URL
you as downstream mirror decide if you want to include allpkgs or not although
it is on the filesystem of the primary. This is rsync-magic. Trust me! :-)
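
For illustration, a minimal sketch of how two rsync daemon modules can serve
the same directory with different visibility. The path is a hypothetical
placeholder and the real configuration may differ; only the exclude line and
the module names are taken from what is quoted above.

```ini
# rsyncd.conf sketch; /export/mirror/opencsw is an assumed path, not the real one

[opencsw]
    path = /export/mirror/opencsw
    comment = CSW Primary Mirror
    # Hide the archive from clients of this module (line quoted from the thread)
    exclude = allpkgs HEADER.txt

[opencsw-full]
    # Same directory on disk, but no exclude, so allpkgs/ is visible
    path = /export/mirror/opencsw
    comment = CSW Primary Mirror, contains full archive of old packages
```

With a layout like this, the files exist once on the primary, and the module
name in the rsync URL determines whether a downstream mirror sees allpkgs/.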

>>> People
>>> can make snapshots from different points in time, is that what you
>>> mean?
>> 
>> No, that is different. We don't do this ATM, but we do archive catalog
>> files, so if someone has a specific problem we can regenerate everything
>> from that catalog and allpkgs/. This is another reason why I think
>> having allpkgs is a Good Thing™.
> 
> Did we ever do that? Did we even exercise doing this? I think that in
> a real situation we would do something else rather than recreating a
> past catalog. Do you have a specific scenario in mind? For example, we
> have a bad, I don't know, MySQL. What would make us recreate an old
> catalog from archives, rather than selectively solving the problem at
> hand?

I did, a couple of times.

>>> Generally, the oldpkgs archive that people often wanted was there
>>> because of the rolling release of the 'current' catalog. At the time,
>>> the thinking was that first we scrutinize the hell out of the package, and
>>> once we prove to ourselves it's a good package, we push it forward
>>> with no good way of rolling it back. If we pushed something bad, we
>>> had nothing to say to people other than “scavenge oldpkgs”. These
>>> days, we have the testing release; we know that we will every now and
>>> then push something bad to unstable, and that people can still use
>>> the testing release, and we don't have to panic when there's something
>>> broken in unstable. The assumption there is, of course, that the majority
>>> of problems will be caught while in unstable.
>> 
>> Right.
>> 
>>> We also keep the old named releases, for instance the dublin release,
>>> which is a consistent set of packages and their dependencies.
>> 
>> Yes.
>> 
>>> The 'allpkgs' directory does contain a history of packages, but it
>>> doesn't contain all 19 transient (and often broken) versions
>>> of MySQL that happened to be in unstable for 2 days, but only the 3 or
>>> 4 versions that are actually used somewhere in our catalogs.
>> 
>> If it were only for this, we wouldn't need allpkgs at all.
>> 
>>> I thought
>>> that was good enough. If you disagree, I will put the 24GB of junk
>>> back in allpkgs; but I remain unconvinced that they are actually
>>> useful.
>> 
>> 
>> Hopefully I convinced you with the above arguments.
> 
> I see some point in keeping old packages for some time. We had one
> case in which we needed to restore a version of gcc from allpkgs,
> because the version from dublin was too old and the version from
> unstable was still problematic. But I still don't see a point in
> keeping them forever.
> 
> We should focus on realistic scenarios, either ones that we actually
> performed (like the gcc restore) or ones that we anticipate and are
> able to exercise.

Keeping them is almost free and has IMHO no drawbacks. Having a complete
archive is professional behaviour. Why keep all patchdiag.xref ever
released? Because you can never know.


Best regards

  -- Dago

-- 
"You don't become great by trying to be great, you become great by wanting to do something,
and then doing it so hard that you become great in the process." - xkcd #896


