reversing order of diffs.

Ben Escoto bescoto@stanford.edu
Sat, 16 Mar 2002 11:36:39 -0800


>>>>> "DB" == Donovan Baarda <abo@minkirri.apana.org.au>
>>>>> wrote the following on Sat, 16 Mar 2002 11:04:23 +1100

  DB> The memory consumed depends on the file size, for both rdiff and
  DB> xdelta, because they need to hold the whole signature for the
  DB> file in memory. In the case of xdelta, it can be large because
  DB> the blocksize is smaller so there are more rolling
  DB> checksums. However, for rdiff, although the blocksize is
  DB> larger, it needs md4sums for each block as well, whereas xdelta
  DB> doesn't. If you tell xdelta to use the same blocksize as rdiff,
  DB> it will actually use less memory.

  DB> For xdelta to use 1.4GB, they must have been working on a 10G+
  DB> file. I would be surprised if rdiff used much less.

Ok, this makes sense.  About the 1.4GB memory usage, I think that
happened when xdelta was run on two large tar.gz files.  The
uncompressed contents may have totaled 10GB.
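
As a rough illustration of the memory argument quoted above, the sketch below estimates how much memory a whole-file signature needs when every block contributes one rolling checksum and, optionally, one strong checksum. The function name and all of the byte counts are assumptions chosen for illustration; they are not measured values from rdiff or xdelta.

```python
import math

def signature_memory(file_size, block_size, weak_bytes=4, strong_bytes=0,
                     overhead_bytes=0):
    """Estimate the bytes needed to hold a whole-file signature in memory.

    weak_bytes:     size of one rolling checksum (assumed 4 bytes)
    strong_bytes:   size of one strong checksum per block, 0 if none is kept
    overhead_bytes: per-entry bookkeeping (hash table slots, pointers, ...)
    All defaults are illustrative assumptions, not values taken from
    rdiff or xdelta.
    """
    blocks = math.ceil(file_size / block_size)
    return blocks * (weak_bytes + strong_bytes + overhead_bytes)

if __name__ == "__main__":
    ten_gb = 10 * 2**30
    # Same block size, with and without a 16-byte strong checksum per block.
    # 16 bytes is the length of a full MD4 digest; a tool may store fewer.
    print(signature_memory(ten_gb, 2048, strong_bytes=16))
    print(signature_memory(ten_gb, 2048, strong_bytes=0))
    # A smaller block size means more entries, even without strong checksums.
    print(signature_memory(ten_gb, 16, strong_bytes=0))
```

With the block size held equal, the variant that keeps a strong checksum per block always needs more memory, which is the comparison made in the quoted text.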

  >> The effect is the same, but the production of the increment diffs
  >> just isn't conceptualized in this way.  One of the main benefits
  >> of the big DIFF/SIG production is good low latency performance
  >> (similar to, but simpler and slower than, rsync's pipelining).
  >> This doesn't

  DB> Don't you mean you get low latency by _not_ using a big
  DB> diff/sig? You can pipeline a file at a time?

What I mean is that transferring a big SIG in pre-set blocks is less
sensitive to latency than dealing with a file at a time.  For
instance, suppose you are dealing with lots of small files.  And also
assume that your protocol has the standard non-pipelined
acknowledgment sequence (don't know what this is called) where one
side sends data, the other side acknowledges, etc.

    If the protocol sends each small file and then waits for an
acknowledgment, it will be slow over a high-latency connection.  Most
of the time will be spent not transferring data but waiting for an
acknowledgment from the other side, since only a few bytes will be
transferred between acknowledgments.

    On the other hand, suppose all the small files are rolled up into
one big file, and this big file is transferred to the other side in
32k blocks.  Here latency still matters, since the protocol is not
totally pipelined (if I understand what this word means correctly),
but at most one side will wait for an acknowledgment every 32k, not
every 50 bytes or whatever, so performance over a high-latency
connection is better.  This is what rdiff-backup currently does.
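
The comparison above can be put into a small model. This is only a sketch under stated assumptions: the sender transmits one chunk, then waits a full round trip for the acknowledgment before sending the next. The function name and the example figures (10,000 files of 50 bytes, a 100 ms round trip, 1 MB/s of bandwidth) are made up for illustration and do not describe rdiff-backup's actual protocol.

```python
import math

def transfer_time(payload_bytes, bytes_per_ack, rtt, bandwidth):
    """Seconds needed to send payload_bytes when the sender waits for one
    acknowledgment (one round trip of rtt seconds) after every
    bytes_per_ack bytes.  bandwidth is in bytes per second."""
    acks = math.ceil(payload_bytes / bytes_per_ack)
    return payload_bytes / bandwidth + acks * rtt

if __name__ == "__main__":
    payload = 10_000 * 50          # 10,000 small files of 50 bytes each
    rtt = 0.1                      # assumed 100 ms round trip
    bandwidth = 1_000_000          # assumed 1 MB/s

    # One acknowledgment per 50-byte file.
    print(transfer_time(payload, 50, rtt, bandwidth))
    # Same data rolled into one stream, acknowledged every 32k.
    print(transfer_time(payload, 32 * 1024, rtt, bandwidth))
```

In this model the time spent on the wire is identical in both cases; only the number of round trips changes, and that term dominates when the files are small.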

  DB> smart pruning allows you to "walk" through a directory tree
  DB> finding matching files/directories, and avoids walking through
  DB> any directories that are totally excluded. For example,
  DB> "--exclude **/spool/news/** --include **" will include
  DB> everything except things in directories matching "/spool/news/",
  DB> and will not walk through them. This can save heaps of time :-)

Ok, I think rdiff-backup does this now.  So if a directory is matched
by an exclude regular expression then no files in the directory will
be examined.
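
A minimal sketch of this kind of pruning, using Python's os.walk. The function walk_pruned and the is_excluded_dir predicate are hypothetical names for illustration; this is not rdiff-backup's selection code, and it does not implement the "**" pattern syntax quoted above.

```python
import os

def walk_pruned(root, is_excluded_dir):
    """Yield file paths under root without descending into any directory
    for which is_excluded_dir(path) returns True."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk (in its default
        # top-down mode) to skip the removed subdirectories entirely.
        dirnames[:] = [d for d in dirnames
                       if not is_excluded_dir(os.path.join(dirpath, d))]
        for name in filenames:
            yield os.path.join(dirpath, name)

def under_spool_news(path):
    """Hypothetical predicate: True for any directory ending in spool/news."""
    return path.replace(os.sep, "/").endswith("/spool/news")

if __name__ == "__main__":
    for path in walk_pruned(".", under_spool_news):
        print(path)
```

Because the excluded directory is dropped before os.walk descends into it, none of its contents are ever listed or examined.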

    I was thinking of the opposite problem.  Say you want to back up
only *foo files, but want to search the whole directory structure for
them.  Directories should be included, but only if they have *foo
files somewhere below them.
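
One way to picture this opposite case is a walk that decides about each directory only after its whole subtree has been visited. The sketch below is an assumption about how it could be done, not a description of rdiff-backup; select_matching is a hypothetical name, and the pattern uses Python's fnmatch syntax.

```python
import fnmatch
import os

def select_matching(root, pattern):
    """Return a sorted list of the files under root whose names match
    pattern, plus every directory that has such a file somewhere below."""
    selected = []

    def visit(path):
        found = False
        for entry in sorted(os.listdir(path)):
            full = os.path.join(path, entry)
            if os.path.isdir(full) and not os.path.islink(full):
                # A directory is kept only if something below it matched.
                if visit(full):
                    found = True
            elif fnmatch.fnmatch(entry, pattern):
                selected.append(full)
                found = True
        if found:
            selected.append(path)
        return found

    visit(root)
    return sorted(selected)

if __name__ == "__main__":
    for path in select_matching(".", "*foo"):
        print(path)
```

Unlike exclude pruning, nothing can be skipped here: every directory has to be searched before it is known whether it belongs in the backup.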


--
Ben Escoto
