An Interesting thought-maybe...
Ben Escoto
bescoto@stanford.edu
Mon, 24 Jun 2002 11:43:18 -0700
>>>>> "KS" == Kevin Spicer <Spicer>
>>>>> wrote the following on Wed, 19 Jun 2002 23:09:05 +0100
KS> When people zip files they a) change the filename and b) change
KS> (in a binary sense) the content - but they don't change (in a
KS> human sense) the file's real content. I don't think it would be
KS> unreasonable to guess that rdiff-backup is transferring the
KS> entire file again. However, 99% of zip files consist of the
KS> original filename with a suffix. If it were possible to surmise
KS> which file had been zipped and how, and just replicate this
KS> action on the remote machine, this would likely be much quicker.
...
KS> I guess there would be a lot of work here, but I just wondered
KS> whether others think there might be gains to be made here?
I think this falls under the general category of content-specific
diffing/delta generation. I don't know the technical term, but I bet
there is a lot of academic literature on it. Recognizing that a file
isn't just a bunch of bytes could theoretically improve performance on
many different file types.
For instance, all natural language files could be translated into
Esperanto or something, so if /foo/bible on the source end was an
English translation of the bible, and /foo/bible on the receiving end
was a Spanish translation, it would take only a few bytes (the
directive "Translate Spanish->English") to encode the relevant delta.
Well, this example is fanciful, but I hope the idea is clear...
In this case I think what you suggest would be hard, because, as
Dean Gaudet pointed out, zipping a file is not a bytewise 1-1
transformation (different versions, different compression levels,
etc.), and it seems we should preserve byte-level identity.
However, unzipping usually is a 1-1 transformation, so it would be
much easier to notice when a user has unzipped a file, and just tell
the writing end to do that. It would still be pretty time-consuming
to implement, though.
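
To illustrate the asymmetry, here is a minimal sketch. It assumes
Python's zlib module as a stand-in for whatever zip tool the user ran;
the function name and sample data are made up for the example. The same
input compressed at different levels gives different byte streams, but
every stream decompresses to the same bytes.

```python
import zlib


def compressed_variants(data):
    """Compress the same bytes at several zlib levels (hypothetical helper)."""
    return {level: zlib.compress(data, level) for level in (1, 6, 9)}


# Made-up sample data, repeated so the compressor has something to work with.
data = b"rdiff-backup mirrors a directory and keeps reverse deltas.\n" * 200
variants = compressed_variants(data)

# Compressing is not 1-1: the level alone changes the output bytes.
distinct_streams = len(set(variants.values()))

# Decompressing is 1-1: every stream maps back to the same original bytes.
all_round_trip = all(zlib.decompress(blob) == data for blob in variants.values())
```

So a receiver holding any one of the compressed streams could rebuild
the uncompressed file exactly, while a receiver holding only the
uncompressed file could not reliably rebuild the sender's compressed
bytes without also knowing the tool, version, and settings.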
Another idea along these lines was suggested a while ago by Dean:
keep track of inodes, so rdiff-backup could detect when a file was
moved, and not send over the whole file again. This also seems like
it could be worth it, but would take a long time to implement, and
the inode info would take up a lot of memory.
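
A rough sketch of the inode idea, under stated assumptions: this is
not rdiff-backup code, and snapshot() and detect_moves() are
hypothetical names. It indexes files by (device, inode) and compares
two snapshots. It ignores hard links and inode reuse after deletion,
both of which a real implementation would have to handle, and the
index holds one entry per file, which is where the memory cost comes
from.

```python
import os


def snapshot(root):
    """Map (device, inode) -> path relative to root for every non-directory entry."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            # Hard links share an inode; this sketch keeps only the last path seen.
            index[(st.st_dev, st.st_ino)] = os.path.relpath(full, root)
    return index


def detect_moves(old, new):
    """Return sorted (old_path, new_path) pairs with the same inode but a new path."""
    return sorted(
        (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    )
```

Taking a snapshot before and after a rename within one filesystem
should report the pair (old path, new path), so only a "rename"
directive would need to cross the wire instead of the file contents.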
--
Ben Escoto