[rproxy-devel] rdiff deltas not very good compared to pysync,
why?
Shirish H. Phatak
shirish@tacitnetworks.com
Wed, 24 Apr 2002 12:52:45 -0400
This is a multi-part message in MIME format.
--------------020601070606010709070409
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Hi all,
I have been really backed up and could not get to this until
today. Sorry for the delay.
I am attaching a patch against anonymous cvs that should fix the
patch size problem. The patch will also cleanly apply to the 0.9.5 tree.
The problem was my fault, I forgot to roll out bytes I was skipping over
as an optimization in rs_delta_scan to avoid recomputing the weak
checksum. With this patch the size of the rdiff delta with Donovan's
samples was 131091 bytes. Please let me know if this solves the problem.
Incidently bugs with rolling checksums are most easily detected by using
the --paranoia option in rdiff which will abort on the first mismatched
sum; thanks to Markus for pointing this out.
This patch also includes a memory leak fix for a scoop buffer leak
(upto 16k per job with default parameters). I had submitted the fix
earlier but it probably got lost. This will only affect those who are
using librsync as a library in long running programs.
Since there appears to be a dedicated group of users and lots of
activity, maybe we can convince Martin to roll in these patches and make
a new release?
-Shirish
Donovan Baarda wrote:
>G'day,
>
>Just been doing some work on librsync for a Python extension, and noticed
>that it is producing deltas more than twice as big as pysync produces, using
>the same block size. I'm using the released v0.9.5 code.
>
>I'm using some test files I generated for pysync testing. These consist of a
>256K random data "oldfile.bin", and a slightly larger "newfile.bin" that
>includes random edits (insert,replace,delete,copy) of "oldfile.bin". Because
>this is all random data, it doesn't compress.
>
>pyproxy can produce both rsync and xdelta style deltas. The xdelta results
>should be pretty close to optimal, so they make a good basis to compare
>against.
>
>The default block size for pyproxy is 1024, so I used "-b 1024" when running
>rdiff to force the same block size. The results I got were;
>
>Operation size
>----------------------------------
>source oldfile.bin 262144
>target newfile.bin 325316
>rdiff signature 3084
>pyproxy sig 8090
>pyproxy xdelta 103463
>pyproxy rdelta 131389
>rdiff delta 319252
>
>As you can see, the "rdiff signature" was less than half the size of than
>"pysync sig". This is understandable, as pysync uses a Python pickled dict
>of dicts for it's sigfile format.
>
>However, the "rdiff delta" is more than two times the size of "pysync
>rdiff", and more than three times the optimal "pyproxy xdelta". Since pysync
>uses a pickled Python list of (offset,length) tupples and insert strings, I
>find this very surprising. None of these are faulty, as a "patch" by any of
>the tools uses the correct result.
>
>Note that pysync does use gzip context compression (compressing the whole
>data stream, including hits, but only including the compressed output of
>misses), and I don't thing rdiff does. However, in this case the input data
>was all random so compression has no effect. Compressing any of the inputs
>or outputs yeilds negligable change.
>
>I haven't examined the librsync code to figure out why yet, but I suspect
>that there might be a bug in the rolling checksums. There is certainly
>something wrong.
>
--------------020601070606010709070409
Content-Type: text/plain;
name="librsync-delta-shift.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="librsync-delta-shift.patch"
? COPYING2
? COPYING_delta
? COPYING_sig
Index: delta.c
===================================================================
RCS file: /cvsroot/librsync/delta.c,v
retrieving revision 1.29
diff -u -r1.29 delta.c
--- delta.c 8 Aug 2001 04:58:17 -0000 1.29
+++ delta.c 24 Apr 2002 16:19:25 -0000
@@ -164,8 +164,8 @@
rs_long_t match_where;
int search_pos, end_pos;
unsigned char *inptr = (unsigned char *) p;
- uint32_t s1 = job->weak_sig & 0xFFFF;
- uint32_t s2 = job->weak_sig >> 16;
+ unsigned s1 = job->weak_sig & 0xFFFF;
+ unsigned s2 = job->weak_sig >> 16;
if (job->basis_len) {
rs_log(RS_LOG_ERR, "somehow got nonzero basis_len");
@@ -190,8 +190,18 @@
for (search_pos = 0; search_pos <= end_pos; search_pos++) {
size_t this_len = job->block_len;
- /* Did we inherit the signature from rs_delta_match?*/
+ /* Did we inherit the signature from rs_delta_match, if
+ * so we know this block won't match and we should simply
+ * skip the first char.
+ */
if (job->have_weak_sig < 0) {
+ /* advance by one; roll out the byte we just skipped over. */
+ unsigned char a = inptr[search_pos];
+ unsigned shift = a + RS_CHAR_OFFSET;
+
+ s1 -= shift;
+ s2 -= this_len * shift;
+ job->weak_sig = (s1 & 0xffff) | (s2 << 16);
job->have_weak_sig = 1;
/* We already know that this block won't match!*/
continue;
Index: job.c
===================================================================
RCS file: /cvsroot/librsync/job.c,v
retrieving revision 1.28
diff -u -r1.28 job.c
--- job.c 9 Apr 2001 15:02:40 -0000 1.28
+++ job.c 24 Apr 2002 16:19:25 -0000
@@ -84,6 +84,9 @@
rs_result rs_job_free(rs_job_t *job)
{
+ if (job->scoop_buf)
+ free(job->scoop_buf);
+
rs_bzero(job, sizeof *job);
free(job);
--------------020601070606010709070409--