bottlenecks

dean gaudet dean-list-rdiff-backup@arctic.org
Mon, 17 Jun 2002 12:43:24 -0700 (PDT)


i haven't looked at 0.9.0 to see what's new... but i played around with
some parallel rdiff invocations yesterday.  not within rdiff-backup, but
just using command lines such as:

	find mail1 mail2 -type f -print0 >filelist
	xargs -0 -i -P10 rdiff signature \{\} /dev/null <filelist

which would spawn up to 10 rdiffs in parallel, throwing away the
signature file.
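
a quick note on the flags: -0 reads the NUL-separated list produced by
find -print0, -i substitutes each filename for {}, and -P10 caps the
number of concurrent children.  GNU xargs also accepts the POSIX -I
spelling for the substitution, so an equivalent invocation is:

	xargs -0 -I '{}' -P10 rdiff signature '{}' /dev/null <filelist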

i timed only the xargs/rdiff command, not the find (parallelism on
find is a whole different story).

mail1 and mail2 are two copies of my ~/mail folder -- each contains 850MB
across 12700 files (a mixture of maildirs, mailboxes, and gzipped mailboxes
spread over handfuls of directories... i archive everything i receive).
this is enough data to defeat caching on my systems and force disk
i/o for all the data.
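
to reproduce those per-copy counts, something like the following works:

	du -sm mail1
	find mail1 -type f | wc -l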

it appears that fork()ing a new rdiff for every file accounts for around
20% of the overhead, and fixing that is worth more than trying to get
things going in parallel.

i've got three SMP boxes to play with -- the main difference between
them is the disk subsystem:

(1) has a single IDE disk with no TCQ (tagged command queueing) support,
so only one disk command can be outstanding at any time.  this disk is
mounted noatime.

(2) has a single SCSI disk with a TCQ depth of 8 ... so up to 8 commands
can be outstanding.  atimes are active.

(3) has 4 IDE disks in a RAID5 -- no TCQ, but with 4 disks, up to 4
commands can be outstanding.  atimes are active.

(1) and (2) showed no improvement from the parallel rdiffs.
(3) showed a 16% to 20% improvement.

(2) surprised me -- but i'm guessing it might have more to do with it
being a linux 2.2.19 SMP system... linux 2.2.x SMP doesn't scale nearly
as well as 2.4.x, and the other two systems are 2.4.x.  next time i'm
doing machine surgery i'll see about putting a 2.4.x scsi system together
(or maybe i'll start investigating the IDE TCQ stuff for linux).

i suspect there's more improvement available from parallelism, but
i'm running into SMP scalability problems.

here are the specs for (3):

	SMP dual Athlon 1.4GHz MP
	1GB DDR266 memory
	2x Promise Ultra100/TX2 ide controllers
	4x Maxtor D740X 80GB 7200rpm disks
	Linux-2.4.19-pre7-ac4
	software RAID5, 32KB stripe
	LVM on top of the RAID5

this is a live system -- it's the "primary" i run rdiff-backup against
every day... and there's a fair amount of other activity going on
while i run these measurements.

first, here are the timings for parallel rdiffs.  (-Pn indicates how
many were run in parallel).
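each row times the rdiff invocation shown at the top, generalized to
-Pn -- i.e. something like:

	/usr/bin/time xargs -0 -Pn -i rdiff signature \{\} /dev/null <filelist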

rdiff	-P1	39.90user 36.24system 2:07.32elapsed 59%CPU
rdiff	-P1	39.71user 36.38system 2:05.89elapsed 60%CPU
rdiff	-P2	38.80user 37.33system 1:59.08elapsed 63%CPU
rdiff	-P2	40.14user 38.75system 1:58.06elapsed 66%CPU
rdiff	-P5	40.29user 41.00system 1:44.31elapsed 77%CPU
rdiff	-P5	39.88user 39.65system 1:47.14elapsed 74%CPU
rdiff	-P10	41.10user 48.30system 1:47.32elapsed 83%CPU
rdiff	-P10	40.88user 44.24system 1:37.41elapsed 87%CPU

i wanted to eliminate the CPU component of rdiff, so that i could see
what the disk array was capable of.  so i tried the following:

	/usr/bin/time xargs -0 -Pn -i cat -- \{\} >/dev/null <filelist

cat	-P1	13.91user 29.25system 1:36.06elapsed 44%CPU
cat	-P1	13.63user 30.21system 1:34.28elapsed 46%CPU
cat	-P2	13.21user 31.16system 1:32.75elapsed 47%CPU
cat	-P2	14.07user 31.38system 1:36.56elapsed 47%CPU
cat	-P5	13.44user 32.44system 1:32.87elapsed 49%CPU
cat	-P5	13.72user 35.35system 1:32.95elapsed 52%CPU
cat	-P10	14.12user 34.73system 1:38.92elapsed 49%CPU
cat	-P10	13.58user 39.00system 1:36.99elapsed 54%CPU

and here's why i suspect fork() overhead:

% /usr/bin/time xargs -0 -n100 cat -- >/dev/null <filelist
0.42user 11.92system 1:12.92elapsed 16%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (23219major+5044minor)pagefaults 0swaps

that puts 100 filenames into each cat invocation rather than the single
filename used above -- one fork/exec per 100 files instead of one per file.

and the natural next thing to try would be adding -Pn to that... except
that performance drops off seriously (2x to 3x worse), and i'm not sure
why yet.  i suspect it's SMP scalability issues; i'm gonna try oprofile
when i get a chance.
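
the combination in question being something like:

	/usr/bin/time xargs -0 -n100 -Pn cat -- >/dev/null <filelist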

-dean