reversing order of diffs.

Ben Escoto bescoto@stanford.edu
Thu, 14 Mar 2002 02:16:00 -0800


>>>>> "DB" == Donovan Baarda <abo@minkirri.apana.org.au>
>>>>> wrote the following on Thu, 14 Mar 2002 14:37:00 +1100

  DB> As I understand it, rdiff-backup currently uses a full copy of
  DB> the most recent backup, with diffs for older backups. It is also
  DB> capable of efficiently updating a remote backup by sending only
  DB> deltas over the wire.

  DB> This makes it nice and easy to restore the latest backup, and a
  DB> bit slower to restore older backups. This "full-latest +
  DB> old-deltas" architecture at first glance looks like rdiff would
  DB> be less efficient than xdelta, which can calculate optimal
  DB> deltas better than rsync's block-aligned match algorithm.  Also,
  DB> xdelta2 would give you all the "get-a-particular-version" and
  DB> ACID for free.

  DB> However, xdelta alone can't do efficient over-the-wire
  DB> transfers, because it requires access to full copies of both
  DB> versions to calculate the delta.  But... as I understand it,
  DB> rdiff-backup's efficient over-the-wire transfers must involve
  DB> calculating forward deltas to transmit over the wire,
  DB> generating the latest version for the archive, then calculating
  DB> backwards deltas to record older versions in the archive. This
  DB> looks to me like you could still benefit from using xdelta as
  DB> the archive store, and use rdiff for the efficient over-the-wire
  DB> transfers.

Yes, I think this is all correct.  There are a few things to be said
for sticking with the current system though:

1.  It IS the current system - unbeatable ease of implementation :-)

2.  xdelta would be yet another requirement

3.  xdelta, last time I looked at it, was less stable than rdiff

4.  xdelta uses much more memory, and I think is slower, than rdiff

5.  Where is xdelta development going?  It seems to be getting really
    complicated and/or being submerged into some larger project.

6.  Benefits from xdelta are so far theoretical.  In some cases xdelta
    can be much better, but it isn't clear these cases occur in real
    life enough to justify the change.

  DB> But... I question the whole full-latest+old-deltas archive. My
  DB> problem is that it doesn't allow you to make backups that you
  DB> can store offline. You cannot make a full backup, store it
  DB> offline, then make small incremental backups that you also keep
  DB> offline. I know that people are going to say "that is not what
  DB> rdiff-backup is for", but I think it is pretty close and a small
  DB> change or two could add this.

  DB> All you need is to (optionally) reverse things so you have a
  DB> full-oldest+new-deltas archive. For each backup you keep a full
  DB> list of file signatures online. The beauty of keeping this
  DB> signature list online is you can calculate new diffs against any
  DB> backup, without having the full backup online.

  DB> Storing a signature list online saves calculating it for
  DB> remote updates.  Keeping latest deltas saves the forward+reverse
  DB> delta calculation needed when doing efficient over-the-wire
  DB> transfers, as you just keep the transferred delta. This brings
  DB> the whole thing more in line with traditional full+incremental
  DB> backup tools, with the added benefit that _any_ previous backup,
  DB> full or incremental, can be used as a basis for an incremental
  DB> backup. Note that using offline backups with only online
  DB> signatures means you can't use xdelta as the store.

  DB> I'm going to look at rdiff-backup code in more detail to
  DB> implement something like this soon, as I _need_ offline
  DB> backups. I actually have a significant amount of Python code
  DB> already written towards this end, including things like
  DB> rsync-style include/exclude lists with efficient directory
  DB> pruning. I never quite finished it, and now that rdiff-backup is
  DB> here, I'm more interested in "molding/extending" it to my needs
  DB> than releasing Yet Another Backup Tool.

That is an interesting suggestion, and I can see how it would solve some
people's problems, but "that is not what rdiff-backup is for" does
come to mind.  I would have thought Another Backup Tool would already
do this.  I suppose what's missing from other tools is the diff'ing
ability?  Why is that a requirement?  If diffs weren't required, I'd
assume you could use just about any backup program?

    I think some of rdiff-backup's code could be useful to you in this
project.  The basic idea behind rdiff-backup is to make a big
SIGNATURE of the whole mirror directory, and use that on the source
directory to make a big DIFF, and then bring that back to patch the
mirror directory.  So really in the main part of rdiff-backup only two
(big) files get sent, SIGNATURE from the mirror directory, and DIFF
from the source.  rdiff-backup processes these "files" lazily, so
that they don't all have to be generated or loaded into memory at
once.
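
    To make the SIGNATURE / DIFF / patch round trip concrete, here is
a minimal Python sketch.  It is NOT rdiff-backup's or librsync's
actual code: it only matches fixed-size blocks on block boundaries by
SHA-256 digest (no rolling checksum), and it works on byte strings in
memory rather than on a directory tree.  The names signature, delta,
patch and BLOCK are made up for the illustration.  signature and delta
are generators, so their output is produced lazily, although delta
does read the whole signature into a lookup table first.

```python
import hashlib
from typing import Iterable, Iterator, Tuple, Union

BLOCK = 4  # hypothetical, tiny block size so the example is easy to follow

Op = Tuple[str, Union[int, bytes]]


def signature(basis: bytes, block: int = BLOCK) -> Iterator[Tuple[int, str]]:
    """Lazily yield (block_index, digest) for each fixed-size block of basis."""
    for offset in range(0, len(basis), block):
        chunk = basis[offset:offset + block]
        yield offset // block, hashlib.sha256(chunk).hexdigest()


def delta(sig: Iterable[Tuple[int, str]], new: bytes,
          block: int = BLOCK) -> Iterator[Op]:
    """Lazily yield ('copy', index) for blocks found in sig, else ('data', bytes).

    Simplification: only block-aligned matches are found; a real rsync-style
    delta also finds matches at arbitrary offsets via a rolling checksum.
    """
    known = {digest: index for index, digest in sig}  # consumes the signature
    for offset in range(0, len(new), block):
        chunk = new[offset:offset + block]
        index = known.get(hashlib.sha256(chunk).hexdigest())
        if index is not None:
            yield ("copy", index)
        else:
            yield ("data", chunk)


def patch(basis: bytes, ops: Iterable[Op], block: int = BLOCK) -> bytes:
    """Rebuild the new data from the basis plus the delta operations."""
    out = bytearray()
    for kind, arg in ops:
        if kind == "copy":
            out += basis[arg * block:(arg + 1) * block]
        else:
            out += arg
    return bytes(out)


mirror = b"aaaabbbbcccc"   # what the backup side currently holds
source = b"aaaaXXXXcccc"   # what the source side now looks like

sig = signature(mirror)             # travels mirror -> source
ops = list(delta(sig, source))      # travels source -> mirror
updated = patch(mirror, ops)        # applied on the mirror side
```

    Only the signature and the delta cross the wire here.  Swapping
the roles, delta(signature(source), mirror), gives a reverse delta
that rebuilds the old mirror from the new one, which is the kind of
increment the current full-latest + old-deltas layout stores.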

    But all the code is pretty much there if you just wanted to, say,
write a utility that extended rdiff to directories instead of just
regular files, and also saved permissions, ownership, etc.  Call this
rdiff+.  Then each of your incremental backups could just be an rdiff+
delta, and the stored signature information for each backup could be
an rdiff+ signature.
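
    A very rough sketch of what the signature half of such an rdiff+
might look like, in Python.  Everything here is an assumption for
illustration: the names dir_signature and changed, the record layout,
and the use of a whole-file SHA-256 digest in place of a real rdiff
block signature.  None of it is taken from rdiff-backup itself.  The
point is only that a stored signature list carries enough information
(path, type, permissions, ownership, content digest) to decide what a
new incremental has to contain without having the old backup online.

```python
import hashlib
import os
import stat
from typing import Dict, Iterable, Iterator, List


def dir_signature(root: str) -> Iterator[Dict[str, object]]:
    """Lazily yield one metadata record per entry under root (hypothetical format)."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: describe symlinks, do not follow them
            if stat.S_ISDIR(st.st_mode):
                kind = "dir"
            elif stat.S_ISREG(st.st_mode):
                kind = "file"
            else:
                kind = "other"
            record: Dict[str, object] = {
                "path": os.path.relpath(path, root),
                "type": kind,
                "perms": stat.S_IMODE(st.st_mode),
                "uid": st.st_uid,
                "gid": st.st_gid,
            }
            if kind == "file":
                # Stand-in for a real rdiff signature: one digest per file.
                # Reads the whole file at once, which a real tool would avoid.
                with open(path, "rb") as fh:
                    record["sha256"] = hashlib.sha256(fh.read()).hexdigest()
            yield record


def changed(old: Iterable[Dict[str, object]],
            new: Iterable[Dict[str, object]]) -> List[str]:
    """Return paths in new whose record is absent from, or differs from, old."""
    old_by_path = {rec["path"]: rec for rec in old}
    return [rec["path"] for rec in new if old_by_path.get(rec["path"]) != rec]
```

    Saving list(dir_signature(root)) after each backup and feeding it
to changed() on the next run would tell you which entries need a new
delta.  Deleted paths are not reported by this sketch; a real tool
would have to record those as well.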


--
Ben Escoto
