An Interesting thought-maybe...

Ben Escoto bescoto@stanford.edu
Mon, 24 Jun 2002 11:43:18 -0700



>>>>> "KS" == Kevin Spicer <Spicer>
>>>>> wrote the following on Wed, 19 Jun 2002 23:09:05 +0100

  KS> When people zip files they a) change the filename and b) change
  KS> (in a binary sense) the content - but they don't change (in a
  KS> human sense) the file's real content.  I don't think it would be
  KS> unreasonable to guess that rdiff-backup is transferring the
  KS> entire file again.  However 99% of zip files consist of the
  KS> original filename with a suffix, if it was possible to surmise
  KS> which file had been zipped and how and just replicate this
  KS> action on the remote machine this would likely be much quicker.
    ...
  KS> I guess there would be a lot of work here, but I just wondered
  KS> whether others think there might be gains to be made here?

I think this falls under the general category of content-specific
diffing/delta generation.  I don't know the technical term, but I bet
there is a lot of academic literature on it.  Recognizing that a file
isn't just a bunch of bytes could theoretically improve performance on
many different file types.

    For instance, all natural language files could be translated into
Esperanto or something, so if /foo/bible on the source end was an
English translation of the bible, and /foo/bible on the receiving end
was a Spanish translation, it would only take a few bytes (directive
"Translate Spanish->English") to encode the relevant delta.  Well, this
example is fanciful, but I hope the idea is clear...

    In this case I think what you suggest would be hard, because, as
Dean Gaudet pointed out, zipping a file is not a bytewise 1-1
transformation (different versions, different compression levels,
etc.), and it seems we should preserve byte-level identity.
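
    To make the "not 1-1" point concrete, here is a small sketch.  It
uses Python's zlib module (the same DEFLATE algorithm that zip and gzip
use) as a stand-in for a real zip tool; the sample data and the helper
name compress_variants are made up for illustration.

```python
import zlib

def compress_variants(data):
    """Compress the same bytes at several levels (illustrative helper)."""
    return {level: zlib.compress(data, level) for level in (1, 6, 9)}

# Made-up sample input; any reasonably compressible data would do.
sample = b"the quick brown fox jumps over the lazy dog\n" * 200
variants = compress_variants(sample)

# The compressed outputs are not byte-identical across levels...
levels_differ = variants[1] != variants[9]

# ...yet every one of them decompresses to exactly the same content.
same_content = all(zlib.decompress(v) == sample for v in variants.values())
```

So the receiving end cannot reliably reproduce the sender's compressed
bytes from the uncompressed file alone; it would also need to know the
exact tool, version and settings.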

    However, unzipping usually is a 1-1 transformation, so it would be
much easier to notice when a user has unzipped a file, and just tell
the writing end to do that.  It would still be pretty time-consuming
to implement, though.
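
    A rough sketch of the unzip check, assuming gzip-style files named
"original name plus .gz".  This is not rdiff-backup code: the function
find_unzip_source and the in-memory dictionary standing in for the
previous backup are hypothetical, and a real version would probably
compare checksums rather than whole files.

```python
import gzip
import zlib

def find_unzip_source(new_name, new_data, previous_files):
    """Return the name of a gzip file in previous_files that decompresses
    to new_data, or None if there is no such file.

    previous_files maps file names to their bytes in the last backup
    (a hypothetical in-memory stand-in for the mirror directory).
    """
    candidate = new_name + ".gz"      # the common "name plus suffix" case
    archived = previous_files.get(candidate)
    if archived is None:
        return None
    try:
        unpacked = gzip.decompress(archived)
    except (OSError, EOFError, zlib.error):
        return None                   # not valid gzip data after all
    return candidate if unpacked == new_data else None

# Example: the last backup held notes.txt.gz, and notes.txt has now appeared.
old = {"notes.txt.gz": gzip.compress(b"hello world\n")}
source = find_unzip_source("notes.txt", b"hello world\n", old)
```

If a source is found, the reading end would only need to send a short
"gunzip that file" directive instead of the file contents.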

    Another idea along these lines was suggested a while ago by Dean:
keep track of inodes, so rdiff-backup could detect when a file was
moved, and not send over the whole file again.  This also seems like
it could be worth it, but would take a long time to implement, and the
inode info would take up a lot of memory.
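
    The inode idea could look something like the sketch below.  Again
this is only an illustration, not how rdiff-backup works: build_index
and detect_moves are invented names, and the sketch ignores hard links
and inode reuse after deletion (a real version would want to check
size and mtime too).

```python
import os

def build_index(root):
    """Map each file's path (relative to root) to its (device, inode)."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            index[os.path.relpath(full, root)] = (st.st_dev, st.st_ino)
    return index

def detect_moves(old_index, new_index):
    """Return {old_path: new_path} for files whose inode reappears
    under a new path while the old path has disappeared."""
    old_by_inode = {key: path for path, key in old_index.items()}
    moves = {}
    for new_path, key in new_index.items():
        old_path = old_by_inode.get(key)
        if (old_path is not None and old_path != new_path
                and old_path not in new_index
                and new_path not in old_index):
            moves[old_path] = new_path
    return moves
```

The index holds one entry per file and has to be kept from one session
to the next, which is where the memory cost comes from.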


--
Ben Escoto
