reversing order of diffs.

Donovan Baarda abo@minkirri.apana.org.au
Thu, 14 Mar 2002 23:03:28 +1100


On Thu, Mar 14, 2002 at 02:16:00AM -0800, Ben Escoto wrote:
> >>>>> "DB" == Donovan Baarda <abo@minkirri.apana.org.au>
> >>>>> wrote the following on Thu, 14 Mar 2002 14:37:00 +1100
[...]
> Yes, I think this is all correct.  There are a few things to be said
> for sticking with the current system though:

This was mainly random thoughts about how rdiff-backup works, not necessarily
suggestions. My _real_ requirements further down can't use xdelta.

> 1.  It IS the current system - unbeatable ease of implementation :-)
> 
> 2.  xdelta would be yet another requirement
> 
> 3.  xdelta, last time I looked at it, was less stable than rdiff

I think the author of xdelta is much more cautious about calling something
"stable" than the authors of rdiff :-). I've looked at the xdelta2 code, and
it is so clean and well structured it's like a machine wrote it.

> 4.  xdelta uses much more memory, and I think is slower, than rdiff

Hmmm. I'm not sure about this. It possibly does use more memory because it
uses a much smaller block size, but it should be faster because it doesn't
use md4sums at all, just the rolling checksum, which it verifies by
comparing the actual data. Each match is compared backwards and forwards
from the matching block to extend it as far as possible, which avoids
rolling checksum calculation for adjacent matches and allows matches to be
arbitrarily aligned.
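
To make that concrete, here is a sketch of the match-extension idea (my own
illustration, not xdelta's actual code). A weak rolling checksum proposes
that dst[d:d+blocksize] matches src[s:s+blocksize]; we verify by comparing
the real bytes, then grow the match in both directions:

```python
def extend_match(src: bytes, dst: bytes, s: int, d: int, blocksize: int):
    """Verify a proposed block match and extend it as far as possible.

    Returns (src_start, dst_start, length), or None on a false match.
    """
    if src[s:s + blocksize] != dst[d:d + blocksize]:
        return None  # weak-checksum collision, not a real match
    # Extend backwards while the preceding bytes still agree.
    while s > 0 and d > 0 and src[s - 1] == dst[d - 1]:
        s -= 1
        d -= 1
    # Extend forwards (re-scanning the verified block is harmless).
    e_s, e_d = s, d
    while e_s < len(src) and e_d < len(dst) and src[e_s] == dst[e_d]:
        e_s += 1
        e_d += 1
    return s, d, e_d - d
```

Because adjacent matching blocks just grow into one long match, the rolling
checksum never has to be recomputed across them, and matches can start at
any byte offset rather than only on block boundaries.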

> 5.  Where is xdelta development going?  It seems to be getting really
>     complicated and/or being submerged into some larger project.

It is being used by a few other projects, some of which are quite
complicated, but I don't believe it is being "submerged". It was originally
developed for version 2 of PRCS, but took on a life of its own and now has
its own project on SF. It was considered for Subversion before xdelta2
existed, but at the time the author was too busy to make any commitments, so
they went down their own path using something called vcdelta, which they
wrote themselves. The vcdelta implementation is similar (I think) to xdelta,
but also uses both the src and dest files as a source of matches, which
means it can take advantage of repeated sections within a file. IMHO,
xdelta2 is now a better stand-alone delta storage repository than what
Subversion is currently using (but they would probably dispute that :-).

> 6.  Benefits from xdelta are so far theoretical.  In some cases xdelta
>     can be much better, but it isn't clear these cases occur in real
>     life enough to justify the change.

The summary report Josh wrote on xdelta2 says it all... the xdelta
algorithm does give marginal but measurable performance and storage
benefits. However, xdelta2 uses bdb3 as its backend storage for its ACID
semantics, and the bdb3 overheads blow away the minor benefits the algorithm
gives. So xdelta2 should give about the same performance as rdiff, but
provides a complete delta-storage API that should be thread-safe, etc.

[...]
> That is an interesting suggestion, and I can see how it solve some
> people's problems, but "that is not what rdiff-backup is for" does
> come to mind.  I would have thought Another Backup Tool would already
> do this.  I suppose what's missing from other tools is the diff'ing
> ability?  Why is that a requirement?  If diffs weren't required, I'd
> assume you could use just about any backup program?

The diffs are the great thing! They allow you to do incremental backups of
a whole partition image without mounting it. I know rdiff can be used for
this by itself, but I have things like directories full of fs images that I
loop-mount and modify, vmware system images, etc.

Also, I know of no other backup tools that let you choose any offline
backup as the basis for a new incremental backup. Most have a strict order
to the full->inc1->inc2->etc sequence. Even worse, most incremental backups
don't include a proper snapshot of the filesystem; they just include files
that are new or modified since the last backup, so restoring a system to its
exact state at that time is impossible, because there is no record of which
files were deleted between backups.
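
A toy sketch (names are mine, just to illustrate the point): if each backup
records a snapshot of the file listing, deletions fall straight out of
comparing two listings, which is exactly what new-and-modified-only
incrementals cannot do:

```python
def snapshot_delta(old: dict, new: dict):
    """old/new map path -> mtime. Returns (added, modified, deleted)."""
    added = [p for p in new if p not in old]
    deleted = [p for p in old if p not in new]
    modified = [p for p in new if p in old and new[p] != old[p]]
    return added, modified, deleted
```

An incremental that only stores `added` and `modified` loses `deleted`, so
replaying it can never reproduce the exact filesystem state.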

>     I think some of rdiff-backup's code could be useful to you in this
> project.  The basic idea behind rdiff-backup is to make a big
> SIGNATURE of the whole mirror directory, and use that on the source
> directory to make a big DIFF, and then bring that back to patch the
> mirror directory.  So really in the main part of rdiff-backup only two
> (big) files get sent, SIGNATURE from the mirror directory, and DIFF
> from the source.  rdiff-backup processes these "files" in a lazy way
> so that they all don't have to be generated or loaded into memory at
> once.

Don't you then generate more big DIFFs for the older backups after you've
patched the mirror directory?
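
In other words, something like this ordering (a hypothetical sketch of the
bookkeeping as I understand it; the "delta" here is a trivial stand-in that
just stores the old content whole, purely to show the data flow, not a real
rdiff delta):

```python
def make_reverse_delta(mirror: bytes, incoming: bytes) -> bytes:
    # Stand-in: a real delta would encode mirror relative to incoming.
    return mirror

def apply_delta(base: bytes, delta: bytes) -> bytes:
    # Stand-in: reconstructs the older version from the reverse delta.
    return delta

mirror = b"version 1"
incoming = b"version 2"
# Before patching the mirror forward, save a delta that goes backwards:
rev = make_reverse_delta(mirror, incoming)
mirror = incoming                 # mirror now holds the latest version
old = apply_delta(mirror, rev)    # older versions recovered via reverse deltas
```

The mirror always stays current, and each older backup becomes a reverse
delta hanging off it.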

>     But all the code is pretty much there if you just wanted to, say,
> write a utility that extended rdiff to directories instead of just
> regular files, and also saved permissions, ownership, etc.  Call this
> rdiff+.  Then each of your incremental backups could just be an rdiff+
> delta, and the stored signature information for each backup could be
> an rdiff+ signature.

That's exactly what I'm after. I think it's worth implementing this as a
stand-alone layer under rdiff-backup, so that it can be common code with
anything else that can use it.

BTW, rdiff-backup currently uses regexp exclude options. I have code for
rsync-style include/exclude lists that does smart pruning of directories to
minimize filesystem walking. Would this be of use to anyone?
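
The pruning trick is the interesting part. A minimal sketch (my own, not
the code offered above, and using simple fnmatch globs rather than full
rsync pattern semantics): in a topdown os.walk, removing a directory from
the `dirs` list in place stops the walk from ever descending into it.

```python
import fnmatch
import os

def walk_pruned(top, excludes):
    """Yield file paths under top, pruning excluded directories entirely."""
    for dirpath, dirs, files in os.walk(top, topdown=True):
        # Prune in place so os.walk never enters excluded directories.
        dirs[:] = [d for d in dirs
                   if not any(fnmatch.fnmatch(os.path.join(dirpath, d), pat)
                              for pat in excludes)]
        for f in files:
            path = os.path.join(dirpath, f)
            if not any(fnmatch.fnmatch(path, pat) for pat in excludes):
                yield path
```

For a backup tree with large excluded subtrees (caches, spools), this saves
a stat and readdir on every file under them, instead of walking everything
and filtering afterwards.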

-- 
----------------------------------------------------------------------
ABO: finger abo@minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------