reversing order of diffs.

Donovan Baarda abo@minkirri.apana.org.au
Sat, 16 Mar 2002 11:04:23 +1100


On Fri, Mar 15, 2002 at 01:11:12AM -0800, Ben Escoto wrote:
> >>>>> "DB" == Donovan Baarda <abo@minkirri.apana.org.au>
> >>>>> wrote the following on Thu, 14 Mar 2002 23:03:28 +1100
[...]
>     I understand what you mean, but your compliment about "like a
> machine wrote it" seems odd.  Source code is mostly for humans, so
> good source code should look like a human wrote it.  You'd never say
> that a novel was so good it seemed like a computer produced it.

What I meant was that there appeared to be a total absence of nasty
hacks... just systematically clean code.

>   DB> Hmmm. I'm not sure about this. It possibly does use more memory
>   DB> because it uses a much smaller block size, but it should be
>   DB> faster because it doesn't use md4sums at all, just the rolling
>   DB> checksum, which it then verifies by comparing the actual data
>   DB> (which is compared backwards and forwards from the match to
>   DB> extend it as far as possible, avoiding rolling checksum
>   DB> calculation for adjacent matches and allowing matches to be
>   DB> arbitrarily aligned).
> 
> Investigating, it seems rdiff uses a lot of memory too (at least much
> more than rdiff-backup in certain cases).  So you may be right on both
> counts.  But I think I remember a user on this list reporting that
> xdelta had consumed 1.4GB of ram.  I hope rdiff doesn't do things like
> that.

The memory consumed depends on the file size for both rdiff and xdelta,
because they need to hold the whole signature for the file in memory. In the
case of xdelta, the signature can be large because the block size is
smaller, so there are more rolling checksums. For rdiff, although the block
size is larger, it needs an md4sum for each block as well, whereas xdelta
doesn't. If you tell xdelta to use the same block size as rdiff, it will
actually use less memory.

For xdelta to use 1.4GB, they must have been working on a 10G+ file. I would
be surprised if rdiff used much less.
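
As a back-of-the-envelope sketch (the block and entry sizes here are
illustrative assumptions, not the tools' actual defaults), signature memory
scales linearly with file size:

    def sig_bytes(file_size, block_size, entry_size):
        # one signature entry per block of the input file
        return (file_size // block_size) * entry_size

    GB = 1 << 30
    # e.g. 16-byte entries over 128-byte blocks is ~0.125 signature
    # bytes per file byte, so a 10G+ file costs over a gigabyte:
    print(sig_bytes(10 * GB, 128, 16) / GB)    # -> 1.25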

[...]
> Well, I'm just looking for a simple utility that acts like diff, but
> does a good job on binary files.  I suppose then I should look at
> xdelta1 instead of xdelta2?  Anyway, looking at the sourceforge page,
> it seems that xdelta2 is still marked beta, and there hasn't been a
> new release in 9 months.  There is also a FAQ question "Is progress
> being made on Xdelta?" from 11 months ago where he says he expects 2.0
> final in a couple of months.  I can't find anything on vcdelta.

xdelta2 does more than just act like diff. It provides a full RCS-like
repository, but better: you can check in and out any version, and pull out
deltas between any two versions. It handles all the combining and inverting
of deltas for you, with ACID guarantees. xdelta1 is the simple binary diff
version.
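
As a conceptual sketch of what that combining and inverting buys you (this
is not xdelta2's actual API; deltas are modelled as plain (old, new) pairs
purely for illustration), any version becomes reachable from any other:

    def apply_delta(delta, data):
        old, new = delta
        assert data == old   # only applies to the version it was made from
        return new

    def invert(delta):
        old, new = delta
        return (new, old)

    v1, v2, v3 = b"one", b"two", b"three"
    d12, d23 = (v1, v2), (v2, v3)
    assert apply_delta(d23, apply_delta(d12, v1)) == v3   # compose v1 -> v3
    assert apply_delta(invert(d12), v2) == v1             # invert v2 -> v1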

The development of xdelta appears stagnant, but then each major version
arrives in a big-bang event every 6~12 months, with only bug fixes happening
in between. Generally, each release is pretty much complete, requiring
little to no further work anyway.

However, this is pretty much academic, since my requirements rule out using
xdelta anyway :-)

My big complaint about VCDelta is exactly that you can't find anything
anywhere. The subversion folks invented their own xdelta that is
used/tested/bugfixed only by/for subversion.

[...]
>     The effect is the same, but the production of the increment diffs
> just isn't conceptualized in this way.  One of the main benefits of
> the big DIFF/SIG production is good low latency performance (similar
> to, but simpler and slower than, rsync's pipelining).  This doesn't

Don't you mean you get low latency by _not_ using a big diff/sig, so that
you can pipeline a file at a time?

> matter once the source end sends over the diffs because the increments
> and the mirror directory are assumed to be on the same system.  If
> this changes maybe we should adopt the big DIFF #2 idea.

There is no reason why the big diff/sig can't be pipelined too. Just read
the big diff/sig a file at a time.
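
Something like this would do it (the length-prefixed framing is a made-up
format for illustration, not rdiff-backup's actual layout); the receiver
never holds more than one file's worth of the stream:

    import struct

    def read_entries(stream):
        """Yield (path, payload) pairs one file at a time."""
        while True:
            header = stream.read(8)
            if len(header) < 8:
                return
            path_len, payload_len = struct.unpack("!II", header)
            path = stream.read(path_len).decode()
            payload = stream.read(payload_len)
            yield path, payload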

>   >> But all the code is pretty much there if you just wanted to, say,
>   >> write a utility that extended rdiff to directories instead of
>   >> just regular files, and also saved permissions, ownership, etc.
>   >> Call this rdiff+.  Then each of your incremental backups could
>   >> just be an rdiff+ delta, and the stored signature information for
>   >> each backup could be an rdiff+ signature.
> 
>   DB> That's exactly what I'm after. I think it's worth implementing
>   DB> this as a stand-alone layer under rsync-backup, so that it can
>   DB> be common-code with anything else that can use it.
> 
> Well, I was oversimplifying a bit earlier, because rdiff-backup
> doesn't really generate a big SIG right off the bat; first it compares
> file timestamps and so forth and then only sigs/diffs the files that
> have changed.  So I don't think rdiff-backup would use the "rdiff+"
> functions directly.

Ahh, it uses file timestamps to cull the file-list first. That probably
speeds things up.

>     Maybe there could be two stand-alone utilities in the package:
> rdiff-backup, and an "rdiff+" type utility?  They could share a lot of
> the same code.

Yeah... I'm willing to write it.

>   DB> BTW, rdiff-backup currently uses regexp exclude options. I have
>   DB> code for rsync style include/exclude lists that do smart pruning
>   DB> of directories to minimize filesystem walking. Would these be of
>   DB> use to anyone?
> 
> I have gotten requests to add this feature to rdiff-backup.  Maybe
> this could help.  Could you explain what smart pruning is?

Smart pruning allows you to "walk" through a directory tree finding matching
files/directories, while avoiding descending into any directory that is
totally excluded. For example, "--exclude **/spool/news/** --include **"
will include everything except things in directories matching
"/spool/news/", and will never walk through them. This can save heaps of
time :-)
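
A minimal sketch of the pruning idea (using plain fnmatch patterns rather
than full rsync "**" semantics; not the actual dirscan.py code):

    import fnmatch, os

    def walk_pruned(top, exclude_dirs):
        """Yield files under top, never descending into excluded dirs."""
        for dirpath, dirnames, filenames in os.walk(top):
            # prune in place so os.walk never enters excluded directories
            dirnames[:] = [d for d in dirnames
                           if not any(fnmatch.fnmatch(os.path.join(dirpath, d), p)
                                      for p in exclude_dirs)]
            for name in filenames:
                yield os.path.join(dirpath, name)

    # e.g. skip the whole news spool without ever statting its contents:
    # for f in walk_pruned("/var", ["*/spool/news"]): print(f)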

My code is in two parts: "efnmatch.py", which does the extended
shell-wildcard matching, and "dirscan.py", which does the directory scanning
and filename matching against a list of include/exclude patterns. I'm
thinking I should bundle these into a little project and whack them up on
freshmeat and/or parnassus.
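
The core of the extended matching is translating patterns to regexes,
roughly along these lines (a hypothetical sketch, not the real efnmatch.py):

    import re

    def translate(pattern):
        out = []
        i = 0
        while i < len(pattern):
            if pattern[i:i+2] == "**":
                out.append(".*")       # ** crosses path separators
                i += 2
            elif pattern[i] == "*":
                out.append("[^/]*")    # * stops at a path separator
                i += 1
            elif pattern[i] == "?":
                out.append("[^/]")
                i += 1
            else:
                out.append(re.escape(pattern[i]))
                i += 1
        return re.compile("^" + "".join(out) + "$")

    assert translate("**/spool/news/**").match("var/spool/news/alt/x")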

-- 
----------------------------------------------------------------------
ABO: finger abo@minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------