bottlenecks
Ben Escoto
bescoto@stanford.edu
Tue, 18 Jun 2002 03:07:13 -0700
>>>>> "DG" == dean gaudet <dean-list-rdiff-backup@arctic.org>
>>>>> wrote the following on Mon, 17 Jun 2002 12:43:24 -0700 (PDT)
DG> and here's why i suspect fork() overhead:
DG> % /usr/bin/time xargs -0 -n100 cat -- >/dev/null <filelist
DG> 0.42user 11.92system 1:12.92elapsed 16%CPU (0avgtext+0avgdata
DG> 0maxresident)k 0inputs+0outputs (23219major+5044minor)pagefaults
DG> 0swaps
DG> that puts 100 filenames into each cat rather than the 1 filename
DG> i used above.
Thank you for the analysis! I think you are right about the fork
overhead. I decided to look into it further by comparing repeated runs
of the external rdiff binary against the same operations done through
Donovan Bailey's python wrapper around librsync.
So, I created two directories, each full of 1000 files whose names
were the numbers 0 to 999. Each file was 2048 bytes long (a sketch
for generating this layout appears after the two listings). Then I
wrote the following two python programs:
---------------[ prog1 - "test2.py" ]----------------------------
#!/usr/bin/env python
"""Run rdiff to transform everything in one dir to another"""
import sys, os
dir1, dir2 = sys.argv[1:3]
for i in xrange(1000):
    assert not os.system("rdiff signature %s/%s sig" % (dir1, i))
    assert not os.system("rdiff delta sig %s/%s diff" % (dir2, i))
    assert not os.system("rdiff patch %s/%s diff %s/%s.out" %
                         (dir1, i, dir1, i))
---------------[ prog2 - "test3.py" ]----------------------------
#!/usr/bin/env python
"""Use librsync to transform everything in one dir to another"""
import sys, os, librsync
dir1, dir2 = sys.argv[1:3]
for i in xrange(1000):
    dir1fn = "%s/%s" % (dir1, i)
    dir2fn = "%s/%s" % (dir2, i)
    # Write signature file
    f1 = open(dir1fn, "rb")
    sigfile = open("sig", "wb")
    librsync.filesig(f1, sigfile, 2048)
    f1.close()
    sigfile.close()
    # Write delta file
    f2 = open(dir2fn, "rb")
    sigfile = open("sig", "rb")
    deltafile = open("delta", "wb")
    librsync.filerdelta(sigfile, f2, deltafile)
    f2.close()
    sigfile.close()
    deltafile.close()
    # Write patched file
    f1 = open(dir1fn, "rb")
    newfile = open("%s/%s.out" % (dir1, i), "wb")
    deltafile = open("delta", "rb")
    librsync.filepatch(f1, deltafile, newfile)
    f1.close()
    deltafile.close()
    newfile.close()
--------------------------------------------------------------------
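For reference, a directory layout like the one above can be generated
with something along these lines (just a sketch, not part of the
original test; the byte contents are arbitrary, as long as the two
copies differ a little so the deltas are non-trivial):
---------------[ sketch - "gendirs.py" ]---------------------------
#!/usr/bin/env python
"""Generate the out1/out2 layout: 1000 files of 2048 bytes each"""
import os
for d in ("out1", "out2"):
    if not os.path.isdir(d):
        os.mkdir(d)
for i in xrange(1000):
    filler = ("file %d " % i) * 400
    open("out1/%d" % i, "wb").write(filler[:2048])
    # perturb the second copy slightly so rdiff has real work to do
    open("out2/%d" % i, "wb").write(("X" + filler)[:2048])
--------------------------------------------------------------------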
The idea is that both do the same thing, but test2.py runs the
external rdiff on each file, while test3.py does everything
in-process, so there is no forking.
The results:
~/rd/src $ time python test3.py out1 out2
real 0m31.900s
user 0m1.750s
sys 0m1.190s
~/rd/src $ time python test2.py out1 out2
real 0m57.369s
user 0m33.440s
sys 0m23.090s
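To get a feel for how much of test2.py's cost is pure process
creation rather than rdiff's own work, the per-call overhead of
os.system() alone can be estimated with a throwaway loop like this (a
sketch; it times 1000 invocations of /bin/true, so it measures
fork/exec and teardown of a shell plus a trivial command, and nothing
else):
---------------[ sketch - "forkcost.py" ]--------------------------
#!/usr/bin/env python
"""Time 1000 os.system() calls that do no real work"""
import os, time
start = time.time()
for i in xrange(1000):
    assert not os.system("true")
print "1000 os.system() calls took %.2f seconds" % (time.time() - start)
--------------------------------------------------------------------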
The internal librsync procedure used only about 1/20th as much CPU
time as the external rdiff procedure (about 2.9 seconds of user+sys
versus about 56.5 seconds), so I suppose roughly 95% of the CPU time
was spent forking, setting up the external rdiff processes, etc. Of
course this is for small files; as the file sizes grow, the forking
time would go to 0 as a percentage of the total.
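One more note on the fork overhead: each os.system() call in test2.py
goes through /bin/sh, so there is an extra process setup per rdiff
invocation on top of rdiff itself. A variant that execs rdiff
directly, e.g. with os.spawnvp(), would skip the shell entirely (a
sketch, untimed; the behaviour should match test2.py):
---------------[ sketch - "test2b.py" ]----------------------------
#!/usr/bin/env python
"""Like test2.py, but exec rdiff directly instead of via a shell"""
import sys, os
dir1, dir2 = sys.argv[1:3]
def run(*args):
    # spawnvp searches PATH, waits for the child, returns exit status
    assert os.spawnvp(os.P_WAIT, args[0], list(args)) == 0
for i in xrange(1000):
    run("rdiff", "signature", "%s/%s" % (dir1, i), "sig")
    run("rdiff", "delta", "sig", "%s/%s" % (dir2, i), "diff")
    run("rdiff", "patch", "%s/%s" % (dir1, i), "diff",
        "%s/%s.out" % (dir1, i))
--------------------------------------------------------------------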
Oh, and the reason that test3.py had such a large real time is that
it ended up using hundreds of MB of memory and my system started
swapping like crazy. I guess all this efficiency won't be worth much
if there is a huge memory leak in pysync or librsync...
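If the leak needs confirming before blaming pysync or librsync, one
crude check (Linux-specific, and only a sketch) is to watch the
process's resident set across repeated librsync calls on a single
file; the filesig call and block size below just mirror test3.py:
---------------[ sketch - "leakcheck.py" ]-------------------------
#!/usr/bin/env python
"""Watch VmRSS while calling librsync.filesig repeatedly"""
import os, librsync
def rss_kb():
    # resident set size in kB, from /proc/self/status (Linux only)
    for line in open("/proc/%d/status" % os.getpid()).readlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0
fn = "out1/0"   # any of the 2048-byte test files
for pass_num in xrange(5):
    for i in xrange(1000):
        f, sig = open(fn, "rb"), open("sig", "wb")
        librsync.filesig(f, sig, 2048)
        f.close()
        sig.close()
    print "after %d calls: %d kB resident" % ((pass_num + 1) * 1000, rss_kb())
--------------------------------------------------------------------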
--
Ben Escoto