reversing order of diffs.

Ben Escoto bescoto@stanford.edu
Fri, 15 Mar 2002 01:11:12 -0800


>>>>> "DB" == Donovan Baarda <abo@minkirri.apana.org.au>
>>>>> wrote the following on Thu, 14 Mar 2002 23:03:28 +1100

  DB> I think the author of xdelta is much more cautious of calling
  DB> something "stable" than the authors of rdiff :-). I've looked at
  DB> the code of xdelta2 and it is so clean and well structured it's
  DB> like a machine wrote it.

Maybe.  I'm basing my opinions on xdelta1.  It tries to do more, and
maybe as a result has more bugs.  For instance, if run on a gzip
compressed file, it tries to uncompress it and diff the contents.
However, it would fail if the compressed file was corrupted.  When I
reported the error, it was quickly fixed, so maybe that was just an
isolated incident.  Still, I got the impression that rdiff was more
stable (I have never seen it fail).

    I understand what you mean, but your compliment about "like a
machine wrote it" seems odd.  Source code is mostly for humans, so
good source code should look like a human wrote it.  You'd never say
that a novel was so good it seemed like a computer produced it.

  DB> Hmmm. I'm not sure about this. It possibly does use more memory
  DB> because it uses a much smaller block size, but it should be
  DB> faster because it doesn't use md4sums at all, just the rolling
  DB> checksum, which it then verifies by comparing the actual data
  DB> (which is compared backwards and forwards from the match to
  DB> extend it as far as possible, avoiding rolling checksum
  DB> calculation for adjacent matches and allowing matches to be
  DB> arbitrarily aligned).

Investigating, it seems rdiff uses a lot of memory too (at least much
more than rdiff-backup in certain cases).  So you may be right on both
counts.  But I think I remember a user on this list reporting that
xdelta had consumed 1.4GB of RAM.  I hope rdiff doesn't do things like
that.
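
For what it's worth, here is a rough Python sketch of the matching
scheme described in the quote above: index the source by a weak rolling
checksum only (no md4), confirm each hit by comparing the real bytes,
then extend the match backwards and forwards.  This is not xdelta's
code.  The checksum is an rsync-style pair of sums, and the names
(weak_sum, roll, find_matches) and the tiny block size are made up for
illustration.

```python
def weak_sum(block):
    """rsync-style weak checksum (a, b) of a bytes block, 16 bits each."""
    n = len(block)
    a = sum(block) & 0xFFFF
    b = sum((n - i) * x for i, x in enumerate(block)) & 0xFFFF
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide a window of length n one byte to the right."""
    a = (a - out_byte + in_byte) & 0xFFFF
    b = (b - n * out_byte + a) & 0xFFFF
    return a, b


def find_matches(src, tgt, block=4):
    """Return (src_start, tgt_start, length) for regions of tgt found in src."""
    # Index block-aligned source blocks by weak checksum only.
    index = {}
    for off in range(0, len(src) - block + 1, block):
        index.setdefault(weak_sum(src[off:off + block]), []).append(off)

    matches = []
    if len(tgt) < block:
        return matches
    pos, covered = 0, 0  # window start in tgt; end of the previous match
    a, b = weak_sum(tgt[:block])
    while pos + block <= len(tgt):
        hit = None
        for s in index.get((a, b), ()):
            # The weak sum can collide, so verify against the actual data.
            if src[s:s + block] == tgt[pos:pos + block]:
                hit = s
                break
        if hit is None:
            if pos + block < len(tgt):
                a, b = roll(a, b, tgt[pos], tgt[pos + block], block)
            pos += 1
            continue
        # Extend backwards, without overlapping the previous match.
        s_start, t_start = hit, pos
        while s_start > 0 and t_start > covered and src[s_start - 1] == tgt[t_start - 1]:
            s_start -= 1
            t_start -= 1
        # Extend forwards as far as the bytes agree.
        s_end, t_end = hit + block, pos + block
        while s_end < len(src) and t_end < len(tgt) and src[s_end] == tgt[t_end]:
            s_end += 1
            t_end += 1
        matches.append((s_start, t_start, t_end - t_start))
        # Skip the matched region; no checksums are rolled across it.
        covered = pos = t_end
        if pos + block <= len(tgt):
            a, b = weak_sum(tgt[pos:pos + block])
    return matches
```

Because every hit is verified and extended byte by byte, the matches
need not be block aligned, which is the property mentioned in the quote.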

  DB> It is being used by a few other projects, some of which are
  DB> quite complicated, but I don't believe it is being
  DB> "submerged". It was originally developed for version 2 of PRCS,
  DB> but developed a life of its own. It has its own project on
  DB> SF. It was considered for Subversion before xdelta2, but at the
  DB> time the author was too busy to make any commitments, so they
  DB> went down their own path using something called vcdelta which
  DB> they wrote themselves. The vcdelta implementation is similar (I
  DB> think) to xdelta, but also uses both src and dest files as a
  DB> source of matches, which means it can take advantage of repeated
  DB> sections within a file. IMHO, xdelta2 is now a better
  DB> stand-alone delta storage repository than what subversion is
  DB> currently using (but they would probably dispute that :-).

Well, I'm just looking for a simple utility that acts like diff, but
does a good job on binary files.  I suppose then I should look at
xdelta1 instead of xdelta2?  Anyway, looking at the sourceforge page,
it seems that xdelta2 is still marked beta, and there hasn't been a
new release in 9 months.  There is also a FAQ question "Is progress
being made on Xdelta?" from 11 months ago where he says he expects 2.0
final in a couple of months.  I can't find anything on vcdelta.

  >> I think some of rdiff-backup's code could be useful to you in
  >> this project.  The basic idea behind rdiff-backup is to make a
  >> big SIGNATURE of the whole mirror directory, and use that on the
  >> source directory to make a big DIFF, and then bring that back to
  >> patch the mirror directory.  So really in the main part of
  >> rdiff-backup only two (big) files get sent, SIGNATURE from the
  >> mirror directory, and DIFF from the source.  rdiff-backup
  >> processes these "files" in a lazy way so that they all don't have
  >> to be generated or loaded into memory at once.

  DB> Don't you then generate more big DIFFs for the older backups
  DB> after you've patched the mirror directory?

As you pointed out in an earlier message, the increment diffs are
generated on the destination end from the mirror files and forward
diffs.  So it's possible that rdiff-backup could make a big DIFF and
the increments could have been generated from this.

    The effect is the same, but the production of the increment diffs
just isn't conceptualized in this way.  One of the main benefits of
the big DIFF/SIG production is good performance over high-latency
links (similar to, but simpler and slower than, rsync's pipelining).
This doesn't matter once the source end sends over the diffs, because
the increments and the mirror directory are assumed to be on the same
system.  If this changes, maybe we should adopt the big DIFF #2 idea.
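
To make the big SIGNATURE / big DIFF idea quoted above a bit more
concrete, here is a minimal Python sketch of the lazy processing.  It is
only an illustration and not rdiff-backup's actual code: directories are
stood in for by dicts mapping path to bytes, a whole-file md5 replaces a
real rsync-style block signature, "replace" sends the whole file instead
of a delta, and all the function names are hypothetical.

```python
import hashlib


def signature_stream(mirror):
    """Lazily yield (path, digest) for the mirror, in sorted path order."""
    for path in sorted(mirror):
        yield path, hashlib.md5(mirror[path]).hexdigest()


def diff_stream(source, sigs):
    """Lazily merge the sorted signature stream against the source.

    Yields (path, action, data) with action in create/replace/delete.
    Only one signature entry is held at a time.
    """
    sigs = iter(sigs)
    sig = next(sigs, None)
    for path in sorted(source):
        # Mirror paths that sort before this one are gone from the source.
        while sig is not None and sig[0] < path:
            yield sig[0], "delete", None
            sig = next(sigs, None)
        if sig is not None and sig[0] == path:
            if sig[1] != hashlib.md5(source[path]).hexdigest():
                yield path, "replace", source[path]
            sig = next(sigs, None)
        else:
            yield path, "create", source[path]
    # Anything left in the signature stream no longer exists in the source.
    while sig is not None:
        yield sig[0], "delete", None
        sig = next(sigs, None)


def patch(mirror, diffs):
    """Apply the diff stream to the mirror as entries arrive."""
    for path, action, data in diffs:
        if action == "delete":
            del mirror[path]
        else:
            mirror[path] = data
```

Chaining the three, as in patch(mirror, diff_stream(source,
signature_stream(mirror))), never builds the whole SIGNATURE or DIFF in
memory; each entry is produced only when the next stage asks for it.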

  >> But all the code is pretty much there if you just wanted to, say,
  >> write a utility that extended rdiff to directories instead of
  >> just regular files, and also saved permissions, ownership, etc.
  >> Call this rdiff+.  Then each of your incremental backups could
  >> just be an rdiff+ delta, and the stored signature information for
  >> each backup could be an rdiff+ signature.

  DB> That's exactly what I'm after. I think it's worth implementing
  DB> this as a stand-alone layer under rsync-backup, so that it can
  DB> be common-code with anything else that can use it.

Well, I was oversimplifying a bit earlier, because rdiff-backup
doesn't really generate a big SIG right off the bat; first it compares
file timestamps and so forth and then only sigs/diffs the files that
have changed.  So I don't think rdiff-backup would use the "rdiff+"
functions directly.
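
As a rough illustration of that first pass (again a sketch, not
rdiff-backup's code), something like the following Python would pick
out the files worth signing and diffing.  The function name and the
choice to compare only size and whole-second mtime are assumptions.

```python
import os


def changed_files(source_dir, mirror_dir):
    """Yield relative paths in source_dir that look different in mirror_dir.

    A file qualifies if it is missing from the mirror, or if its size or
    modification time (truncated to whole seconds) differs.
    """
    for root, dirs, files in os.walk(source_dir):
        dirs.sort()  # walk in a stable order
        for name in sorted(files):
            src = os.path.join(root, name)
            rel = os.path.relpath(src, source_dir)
            try:
                d = os.stat(os.path.join(mirror_dir, rel))
            except FileNotFoundError:
                yield rel  # new file, nothing to compare against
                continue
            s = os.stat(src)
            if (s.st_size, int(s.st_mtime)) != (d.st_size, int(d.st_mtime)):
                yield rel
```

Only the paths this yields would then go through the signature and
delta steps; everything else is skipped without being read.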

    Maybe there could be two stand-alone utilities in the package:
rdiff-backup, and an "rdiff+" type utility?  They could share a lot of
the same code.

  DB> BTW, rdiff-backup currently uses regexp exclude options. I have
  DB> code for rsync style include/exclude lists that do smart pruning
  DB> of directories to minimize filesystem walking. Would these be of
  DB> use to anyone?

I have gotten requests to add this feature to rdiff-backup.  Maybe
this could help.  Could you explain what smart pruning is?
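
My guess, and it is only a guess since nothing above defines the
term, is that it means not descending into a directory at all once an
exclude pattern matches it.  A minimal Python sketch of that reading
follows; walk_pruned is a made-up name, and fnmatch globs stand in for
real rsync-style include/exclude rules, which are richer than this.

```python
import fnmatch
import os


def walk_pruned(top, excludes):
    """Yield relative file paths under top, skipping excluded entries.

    Excluded directories are removed from the walk itself, so nothing
    beneath them is ever listed or stat'ed.
    """
    def excluded(rel):
        return any(fnmatch.fnmatch(rel, pat) for pat in excludes)

    for root, dirs, files in os.walk(top):
        rel_root = os.path.relpath(root, top)

        def rel(name):
            path = os.path.normpath(os.path.join(rel_root, name))
            return path.replace(os.sep, "/")

        # Assigning to dirs[:] tells os.walk not to descend into the rest.
        dirs[:] = sorted(d for d in dirs if not excluded(rel(d)))
        for name in sorted(files):
            if not excluded(rel(name)):
                yield rel(name)
```

With excludes of ["cache", "*.tmp"], a large top-level cache directory
costs one pattern check instead of a full traversal.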


--
Ben Escoto
