rdiff-backup optimization

Donovan Baarda abo@minkirri.apana.org.au
Mon, 20 May 2002 12:22:35 +1000


On Thu, May 16, 2002 at 10:44:08AM -0700, Ben Escoto wrote:
> >>>>> "DB" == Donovan Baarda <abo@minkirri.apana.org.au>
> >>>>> wrote the following on Thu, 16 May 2002 19:30:34 +1000
> 
>   DB> I have a cleaner version of the rolling checksum code that is
>   DB> 2~3x faster, for a start.
> 
>   DB> I posted a list of things that could be fixed to the rproxy list
>   DB> a while ago. I'm looking at implementing them now. Depending on
>   DB> when/if I get developer access on SF, I'll either post it all as
>   DB> a patch, or release a new version of librsync.
> 
> Anything that makes rdiff faster will help with rdiff-backup, of
> course, but I think the main problem with rdiff-backup is that it uses
> too much CPU time.  For instance, if out/ doesn't exist and manyfiles
> is a directory containing 10000 1 byte files:
> 
> ~/prog/python/rdiff-backup/src $ time rsync -a manyfiles/ out
> real    0m19.684s
> user    0m1.300s
> sys     0m5.260s
> 
> ~/prog/python/rdiff-backup/src $ time rdiff-backup manyfiles out
> real    1m32.337s
> user    0m59.870s
> sys     0m7.980s
[...]

10000 files in one dir is a lot. Many filesystems slow down badly as the
number of files in a directory increases, which is why things like squid use
three levels of directories rather than one directory full of all the objects.
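
For illustration only, here is a minimal sketch of spreading objects over
nested subdirectories so that no single directory gets huge. It is not squid's
actual layout; the function name sharded_path, the use of an MD5 digest, and
the levels/width parameters are all assumptions made up for this example.

```python
import hashlib
import os


def sharded_path(root, name, levels=2, width=2):
    """Return a nested path for name under root (hypothetical scheme).

    The leading hex digits of an MD5 digest of the name pick the
    subdirectories, so files spread roughly evenly across them.
    """
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Take `levels` chunks of `width` hex digits as directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, name)
```

With the defaults each level has at most 256 subdirectories, so 10000 objects
would average well under one file per leaf directory.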

Perhaps rdiff-backup is accessing files by name (causing directory lookups
which is what hurts) more than rsync is (ie, stat, then open)?
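
To show what I mean, a rough sketch that resolves the name once and then works
from the open file descriptor. I have not checked whether rdiff-backup already
does this; read_with_one_lookup is a made-up name for the example.

```python
import os


def read_with_one_lookup(path):
    """Open path once, then stat and read via the open descriptor.

    Calling os.stat(path) followed by open(path) resolves the name
    twice; os.fstat() on the open file avoids the second lookup.
    """
    with open(path, "rb") as fh:
        st = os.fstat(fh.fileno())  # no name lookup, uses the descriptor
        data = fh.read()
    return st, data
```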

Not sure if these lookups would register as sys time or user time though...

Have you used the Python profiler to see where it's spending time? Perhaps
it's just Python overhead.
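
Something along these lines would do, using the standard cProfile and pstats
modules. This is only a sketch: backup_many_small_files is a hypothetical
stand-in for the real backup loop, and make_test_dir just recreates the "many
1 byte files" case on a small scale.

```python
import cProfile
import io
import os
import pstats
import tempfile


def backup_many_small_files(src_dir):
    """Hypothetical stand-in for the backup loop: stat and read each file."""
    total = 0
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        os.stat(path)
        with open(path, "rb") as fh:
            total += len(fh.read())
    return total


def profile_call(func, *args):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()


def make_test_dir(count):
    """Create a temp directory holding `count` one-byte files."""
    d = tempfile.mkdtemp()
    for i in range(count):
        with open(os.path.join(d, "f%05d" % i), "wb") as fh:
            fh.write(b"x")
    return d
```

Calling profile_call(backup_many_small_files, make_test_dir(10000)) and
printing the report should show whether the time goes into system calls like
stat and open or into the interpreter itself.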

>     Unless I'm missing something, there are three options as far
> rdiff-backup optimization goes:
> 
> 1.  Leave it the way it is.
> 2.  Conceptually rejigger the architecture so it somehow comes out
>     much faster.
> 3.  Rewrite substantial portions of it in C.

I dunno if you are checking ctimes and mtimes to minimise checking of
files... this could make a big difference on the "second run" above.
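
As a sketch of the kind of check I mean (I don't know what rdiff-backup
actually stores, so the names file_signature and changed_files and the
dictionary of previous signatures are assumptions for the example):

```python
import os


def file_signature(path):
    """Return (size, mtime, ctime) from a single lstat call."""
    st = os.lstat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ctime_ns)


def changed_files(src_dir, previous):
    """Compare each file's signature with the one from the last run.

    previous maps file name -> signature from the last run (assumed to
    be saved somewhere by the caller). Returns (changed names, current
    signatures); only the changed names need their contents examined.
    """
    changed = []
    current = {}
    for name in sorted(os.listdir(src_dir)):
        sig = file_signature(os.path.join(src_dir, name))
        current[name] = sig
        if previous.get(name) != sig:
            changed.append(name)
    return changed, current
```

On a run where nothing has changed this costs one lstat per file and no reads
at all.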

-- 
----------------------------------------------------------------------
ABO: finger abo@minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------