bottlenecks

Ben Escoto bescoto@stanford.edu
Tue, 18 Jun 2002 03:07:13 -0700


>>>>> "DG" == dean gaudet <dean-list-rdiff-backup@arctic.org>
>>>>> wrote the following on Mon, 17 Jun 2002 12:43:24 -0700 (PDT)

  DG> and here's why i suspect fork() overhead:

  DG> % /usr/bin/time xargs -0 -n100 cat -- >/dev/null <filelist
  DG> 0.42user 11.92system 1:12.92elapsed 16%CPU (0avgtext+0avgdata
  DG> 0maxresident)k 0inputs+0outputs (23219major+5044minor)pagefaults
  DG> 0swaps

  DG> that puts 100 filenames into each cat rather than the 1 filename
  DG> i used above.

Thank you for the analysis!  I think you are right about the fork
overhead.  I decided to look into it further by comparing many runs
of the external rdiff binary against the same work done through
Donovan Baarda's python wrapper around librsync.

    So, I created two directories, each full of 1000 files whose
names were the numbers 0 to 999.  Each file was 2048 bytes long (a
sketch for recreating this setup appears after the two programs).
Then I wrote the following two python programs:

---------------[ prog1 - "test2.py" ]----------------------------
#!/usr/bin/env python

"""Run rdiff to transform everything in one dir to another"""

import sys, os

dir1, dir2 = sys.argv[1:3]
for i in xrange(1000):
	# Each step forks an external rdiff process; assert a 0 exit status
	assert not os.system("rdiff signature %s/%s sig" % (dir1, i))
	assert not os.system("rdiff delta sig %s/%s diff" % (dir2, i))
	assert not os.system("rdiff patch %s/%s diff %s/%s.out" %
						 (dir1, i, dir1, i))

---------------[ prog2 - "test3.py" ]----------------------------

#!/usr/bin/env python

"""Use librsync to transform everything in one dir to another"""

import sys, os, librsync

dir1, dir2 = sys.argv[1:3]
for i in xrange(1000):
	dir1fn = "%s/%s" % (dir1, i)
	dir2fn = "%s/%s" % (dir2, i)

	# Write signature file
	f1 = open(dir1fn, "rb")
	sigfile = open("sig", "wb")
	librsync.filesig(f1, sigfile, 2048)
	f1.close()
	sigfile.close()

	# Write delta file
	f2 = open(dir2fn, "rb")
	sigfile = open("sig", "rb")
	deltafile = open("delta", "wb")
	librsync.filerdelta(sigfile, f2, deltafile)
	f2.close()
	sigfile.close()
	deltafile.close()

	# Write patched file
	f1 = open(dir1fn, "rb")
	newfile = open("%s/%s.out" % (dir1, i), "wb")
	deltafile = open("delta", "rb")
	librsync.filepatch(f1, deltafile, newfile)
	f1.close()
	deltafile.close()
	newfile.close()
--------------------------------------------------------------------

The idea is that both do the same thing, but test2.py forks an
external rdiff process for every step, while test3.py does everything
in-process, so there is no forking.
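
(For reference, the test directories can be recreated with a few
lines of python.  This is only a sketch -- the exact file contents
don't matter for the fork-overhead comparison, only the count and
size, so it just writes slightly different filler into the two
directories to keep the deltas nontrivial.)

---------------[ sketch - "makefiles.py" ]---------------------------
#!/usr/bin/env python

"""Create two directories of 1000 files, 2048 bytes each"""

import os

for d in ("out1", "out2"):
	os.mkdir(d)
	for i in xrange(1000):
		fp = open("%s/%s" % (d, i), "wb")
		line = "%s file %s\n" % (d, i)
		# Arbitrary contents, differing between the two directories
		fp.write((line * 300)[:2048])
		fp.close()
--------------------------------------------------------------------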

The results:

~/rd/src $ time python test3.py out1 out2
real    0m31.900s
user    0m1.750s
sys     0m1.190s

~/rd/src $ time python test2.py out1 out2
real    0m57.369s
user    0m33.440s
sys     0m23.090s

The internal librsync procedure used only about 1/20th as much CPU
time as the external rdiff procedure, which suggests that roughly 95%
of test2.py's CPU time was spent forking, setting up the external
rdiff process, and so on.  Of course this is for small files; as the
file sizes grow, the forking time would shrink toward 0 as a
percentage of the total.
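
    A quick back-of-envelope figure from those numbers (my own
arithmetic): test2.py makes 3000 external rdiff invocations, three
per file, so the extra CPU time works out to roughly 18ms per
invocation:

internal_cpu = 1.750 + 1.190    # user+sys for test3.py
external_cpu = 33.440 + 23.090  # user+sys for test2.py
invocations = 3 * 1000          # signature, delta, patch per file
overhead = (external_cpu - internal_cpu) / invocations
print "%.1f ms per rdiff invocation" % (overhead * 1000,)
# prints: 17.9 ms per rdiff invocation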

    Oh, and the reason that test3.py had such a large real time is
that it ended up using hundreds of MB of memory and my system
started swapping like crazy.  I guess all this efficiency won't be
worth much if there is a huge memory leak in pysync or librsync...
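
    One quick way to watch the growth (only a sketch, assuming
Linux's /proc layout; I haven't isolated the leak this way) would be
to print the resident set size every so often inside test3.py's loop:

def rss_kb():
	"""Return this process's resident set size in kB (Linux only)"""
	for line in open("/proc/self/status").readlines():
		if line.startswith("VmRSS:"):
			return int(line.split()[1])
	return -1

Adding something like "if i % 100 == 0: print i, rss_kb()" at the top
of the loop should show whether memory climbs steadily with each
file, which would point at the librsync bindings rather than the test
harness itself.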


--
Ben Escoto
