Too big increment files

Donovan Baarda abo@minkirri.apana.org.au
Thu, 9 May 2002 10:44:11 +1000


On Wed, May 08, 2002 at 05:01:28PM -0700, Ben Escoto wrote:
> >>>>> "IR" == ivan
> >>>>> wrote the following on Wed, 8 May 2002 20:29:22 +0200
> 
>   IR> Hi, I want to use rdiff-backup to keep increment files and send
>   IR> them over the internet, but I have a problem: I'm testing with
>   IR> a 17 MB Word document and changing only a few words, and the
>   IR> increment file is 17 MB.  I know that the problem is in rdiff;
>   IR> I have run the same tests with rdiff alone and the result is
>   IR> the same: a 17 MB increment file (I also tried changing the
>   IR> block size, but nothing changed).  It is very important for me
>   IR> to produce small increment files, so if someone knows where
>   IR> the problem is, I will be very grateful.  If the question should
>   IR> be asked on another mailing list, let me know the address.
>   IR> Thanks for your help.
> 
> Yep, I don't work on rdiff; you might want to try the rproxy mailing
> list at, for instance, rproxy-users@lists.sourceforge.net or
> rproxy-devel@lists.sourceforge.net.
> 
>     Also, if I remember correctly Donovan Baarda has recently fixed a
> bug in rdiff so new versions of rdiff should make significantly
> smaller diffs.  However, a version of rdiff with this bug fixed has
> not been released yet, I think, and also I wouldn't have thought the
> bug would result in behavior as extreme as what you are seeing.

Yes, there is a subtle bug in rdiff 0.9.5 that makes deltas bigger than
they need to be. There is a patch available on the SourceForge rproxy
project called something like "MSVC6 + Cygwin" that includes this fix as
well as compilation support for win32 (yes, you can compile an rdiff.exe on
win32).

I'm not that familiar with Word document formats, but it is possible that
you can never get a small delta file. If Word uses compression, then even a
small change in the document text ends up changing the whole file.
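
As a rough illustration of why compression can defeat a delta tool, here is
a minimal Python sketch. It makes no claim about how Word actually stores
documents: zlib stands in for "some compression", the document text is made
up, and reusable_share() is a hypothetical helper that only approximates
what a delta tool could copy from the old file instead of storing again.

```python
import random
import zlib

WINDOW = 32  # bytes per chunk; an arbitrary choice for this sketch


def reusable_share(old: bytes, new: bytes, window: int = WINDOW) -> float:
    """Fraction of fixed-size chunks of `new` found at any offset in `old`.

    A crude stand-in for the data a delta tool could copy rather than store.
    """
    seen = {old[i:i + window] for i in range(len(old) - window + 1)}
    chunks = [new[i:i + window] for i in range(0, len(new) - window + 1, window)]
    return sum(chunk in seen for chunk in chunks) / len(chunks)


# Build a repetitive, text-like "document" (made-up content).
rng = random.Random(0)
words = ["backup", "delta", "signature", "increment",
         "file", "block", "patch", "mirror"]
old_doc = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

# Insert a short edit near the start of the document.
new_doc = old_doc[:100] + b" [EDITED: XYZ] " + old_doc[100:]

# Compare how much of the new version can be found in the old version,
# first for the plain bytes and then for the zlib-compressed bytes.
raw_share = reusable_share(old_doc, new_doc)
zip_share = reusable_share(zlib.compress(old_doc), zlib.compress(new_doc))

print(f"uncompressed: {raw_share:.1%} of chunks reusable")
print(f"compressed:   {zip_share:.1%} of chunks reusable")
```

In the uncompressed case only the chunks touching the edit are new. In the
compressed case the edit changes the encoding of the data that follows it,
so far fewer chunks can be found in the old file.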

You can check this by using something like xdelta, which produces deltas
pretty close to optimal. If xdelta can't get a small delta, nothing can.

BTW, I have finished and released my Python interface to the librsync
library. The incremental delta calculation (like the zlib Python interface)
is not yet working, but the file level delta API works fine. This lets you
do:

    # File level API
    stats=filesig(oldfile,sigfile)              # create sigfile
    stats=filerdelta(sigfile,newfile,diffile)   # create a rdelta diffile
    stats=filepatch(oldfile,diffile,newfile)    # apply a diffile
        
Where:
    stats       - a statistics object that can be printed
    oldfile     - the source file
    newfile     - the target file
    sigfile     - the signature file
    diffile     - the delta file

This is probably only marginally better than using rdiff itself, except the
files can be any Python object that can be converted into a Unix filehandle,
including sockets etc. This stuff can be found at the pysync project page on
freshmeat.
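
For anyone unfamiliar with the signature/delta/patch cycle that the file
level calls above wrap, here is a toy sketch of the idea in plain Python.
It is not the pysync or librsync API: toy_signature(), toy_delta() and
toy_patch() are hypothetical names, they work on in-memory bytes rather than
files, and they hash every window with SHA-256 where a real tool would use a
cheap rolling checksum first.

```python
import hashlib

BLOCK = 16  # toy block size; real tools use much larger blocks


def toy_signature(old: bytes, block: int = BLOCK) -> dict:
    """Map the hash of each full block of `old` to that block's offset."""
    sig = {}
    for offset in range(0, len(old) - block + 1, block):
        digest = hashlib.sha256(old[offset:offset + block]).digest()
        sig.setdefault(digest, offset)  # keep the first offset seen
    return sig


def toy_delta(sig: dict, new: bytes, block: int = BLOCK) -> list:
    """Describe `new` as ("copy", offset) and ("data", bytes) operations."""
    ops, literal, pos = [], bytearray(), 0
    while pos + block <= len(new):
        digest = hashlib.sha256(new[pos:pos + block]).digest()
        offset = sig.get(digest)
        if offset is None:
            literal.append(new[pos])  # no match here: keep one literal byte
            pos += 1
            continue
        if literal:  # flush pending literal bytes before the copy
            ops.append(("data", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", offset))
        pos += block
    literal += new[pos:]  # tail shorter than one block
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def toy_patch(old: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Rebuild the new data from `old` plus the delta operations."""
    out = bytearray()
    for kind, value in ops:
        out += old[value:value + block] if kind == "copy" else value
    return bytes(out)


old = b"The quick brown fox jumps over the lazy dog. " * 20
new = old[:200] + b"A few changed words. " + old[200:]

sig = toy_signature(old)    # computed from the old data only
ops = toy_delta(sig, new)   # needs only the signature and the new data
literal_bytes = sum(len(v) for k, v in ops if k == "data")

assert toy_patch(old, ops) == new
print(f"{len(new)} bytes rebuilt from {literal_bytes} literal bytes plus copies")
```

The point of the split is that the delta step never needs the old data
itself, only its signature, which is what makes the scheme useful when the
old and new files live on different machines.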

-- 
----------------------------------------------------------------------
ABO: finger abo@minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------