Directory statistics questions...

Jason Piterak Jason_Piterak@c-i-s.com
Thu, 23 May 2002 15:57:00 -0400


Hi Ben,

> -----Original Message-----
> From: ben@stanford.edu [mailto:ben@stanford.edu]On Behalf Of 
> Ben Escoto
> Sent: Thursday, May 23, 2002 6:23 AM
> To: Jason Piterak
> Cc: rdiff-backup@keywest.Stanford.EDU
> Subject: Re: Directory statistics questions... 
> 
> 
> >>>>> "JP" == Jason Piterak <Jason>
> >>>>> wrote the following on Wed, 22 May 2002 21:20:05 -0400
> 
>   JP> Hi Ben, Some more ideas from a lazy admin...
> 
> Great, keep them coming!
  Cool! It's great to be able to help, even if I can't help code :-P

> 
>   JP> o How long did it take?
> 
> What would be the most convenient format to parse for absolute times
> and time intervals?  Everything in seconds?
  Everything in seconds would be fine... Or even better, the epoch start and
stop times. Not terribly human-readable, but very useful for scripting :-)
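
  As a rough illustration of why epoch times script so easily, here is a
minimal Python sketch. It assumes only that the start and stop times are
available as integer seconds since the epoch; the function name is made up
for the example and is not part of rdiff-backup.

```python
import time

def summarize_run(start_epoch, stop_epoch):
    """Return (elapsed seconds, human-readable UTC start time) for one run."""
    if stop_epoch < start_epoch:
        raise ValueError("stop time precedes start time")
    elapsed = stop_epoch - start_epoch
    # gmtime() keeps the result independent of the local time zone.
    started = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(start_epoch))
    return elapsed, started

# Hypothetical start/stop values, for illustration only.
elapsed, started = summarize_run(1022137200, 1022137560)
print(f"started {started} UTC, took {elapsed} seconds")
```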

> 
>   JP> o How does this compare to yesterday?
>   JP> o How does this compare to an average of the last week?
> 
> Are you suggesting that rdiff-backup itself calculate these, or would
> it be sufficient to provide information from which these could be
> calculated?

  No, not at all... Just providing the information would be more than
sufficient. Since you are already keeping the historical statistics files in
rdiff-backup-data, the calculations are pretty easy to do.
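
  For example, once one value per session (say, elapsed seconds or changed
bytes) has been collected from the historical files, the two comparisons
could be sketched like this in Python. The function name and the
oldest-first list layout are assumptions for the example.

```python
def compare_to_history(today, history):
    """Compare today's value with yesterday's and with the last week's mean.

    history holds one value per earlier session, oldest first.
    Returns (delta_vs_yesterday, delta_vs_week_average), or (None, None)
    when there is no history yet.
    """
    if not history:
        return None, None
    yesterday = history[-1]
    last_week = history[-7:]          # at most the seven most recent sessions
    week_average = sum(last_week) / len(last_week)
    return today - yesterday, today - week_average

# Hypothetical figures, for illustration only.
print(compare_to_history(120, [100, 110]))
```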

>  
>   JP>   But I've got some questions... The information in the
>   JP> directory statistics files is perfect, but they don't seem to
>   JP> work as I would expect:
> 
> I'm not sure this will answer your questions, but the way things are
> currently set up, TotalFiles and the like refer to what was in the
> mirror directory at the start of the session.  For example, suppose
> empty_dir is empty and 10files contains ten files.  Then:
> 
> ~ $ rdiff-backup empty_dir/ out
> ~ $ cat out/rdiff-backup-data/increments/directory_stat*
> cat: No such file or directory
> ~ $ rdiff-backup 10files/ out
> ~ $ cat out/rdiff-backup-data/increments/directory_statistics.2002-05-23T00\:05\:57-07\:00.data
> TotalFiles 1
> TotalFileSize 4096
> ChangedFiles 11
> ChangedFileSize 4096
> IncrementFileSize 0
> ~ $ rdiff-backup empty_dir/ out
> ~ $ cat out/rdiff-backup-data/increments/directory_statistics.2002-05-23T00\:06\:03-07\:00.data
> TotalFiles 11
> TotalFileSize 4106
> ChangedFiles 11
> ChangedFileSize 4106
> IncrementFileSize 732
> 
> So as you can see, there can be more ChangedFiles than there are
> TotalFiles if new files are added.  Also, if a file inside (directly
> or indirectly) a directory changes, then the directory is considered
> changed, and the ChangedFiles count is incremented.  So maybe this
> accounts for the unexpected ChangedFiles result.
  Aaahh... I see, now. The problem is that what I'm really looking for is
information with respect to the _current_ backup -- How did it go, what did
it look like, and what did it change? The historical information is useful,
but current information is crucial. 
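
  For what it's worth, the files quoted above are plain "Name value" lines,
so reading them from a script takes very little code. A minimal Python
sketch, assuming every statistic is a name followed by one integer (the
function name is made up for the example):

```python
def parse_statistics(text):
    """Parse 'Name value' lines into a dict of integers.

    Lines that do not consist of a name and one integer are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        try:
            stats[key] = int(value)
        except ValueError:
            continue
    return stats

# Contents of the second statistics file quoted above.
sample = """TotalFiles 11
TotalFileSize 4106
ChangedFiles 11
ChangedFileSize 4106
IncrementFileSize 732
"""
stats = parse_statistics(sample)
print(stats["ChangedFiles"], stats["IncrementFileSize"])
```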

> 
>     But I can see how more useful and less confusing statistics could
> be provided.  How about:
> 
> SourceFiles
> SourceFileSize
> MirrorFiles
> MirrorFileSize
> NewFiles
> NewFileSize
> DeletedFiles
> DeletedFileSize
> ChangedFiles
> ChangedSourceSize
> ChangedMirrorSize
> IncrementFileSize
  This would be much better... especially if you include statistics files
for the current backup and backup subdirectories.

> ?  These categories would be pretty unambiguous?  Finally, what is the
> right way to count directories?  Should their reported sizes be added
> to the Size statistics (currently they are)?  And when should a
> directory be considered changed so that it is included in the
> ChangedFiles count?
 Hmmm... Size information on folders makes sense, because the reported size
actually takes up space on the disk. On the other hand, counting directories
as changed files is less useful (and possibly confusing). If you only count
the directory when its size changes (e.g., files or subdirectories are
added), this ignores the fact that (as far as I know) directory sizes don't
generally shrink as you delete items from them. If you count the directory
any time its contents change, you skew the data when you, for instance,
delete a file from a deeply nested directory.
  My preference would be to:
    o  Not count directories in change counts.
    o  List changed directories' sizes in the changed bytes count.
    o  Or... list them as a separate statistic.


Though that is a preference, and doesn't necessarily make any more sense
than anything else :-P
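
To make the preference above concrete, here is a small Python sketch of the
tally I have in mind. It is only an illustration: the input format and the
ChangedDirSize name are hypothetical, not anything rdiff-backup produces.

```python
def tally_changes(entries, separate_dirs=True):
    """Tally changed items without counting directories as changed files.

    entries is an iterable of (is_directory, size_in_bytes) pairs for the
    items that changed. Directory sizes go either into a separate
    ChangedDirSize statistic (hypothetical name) or into ChangedFileSize.
    """
    tally = {"ChangedFiles": 0, "ChangedFileSize": 0, "ChangedDirSize": 0}
    for is_directory, size in entries:
        if is_directory:
            # Directories add bytes but never increment the change count.
            key = "ChangedDirSize" if separate_dirs else "ChangedFileSize"
            tally[key] += size
        else:
            tally["ChangedFiles"] += 1
            tally["ChangedFileSize"] += size
    return tally

# One changed directory and two changed files, with made-up sizes.
changed = [(True, 4096), (False, 10), (False, 20)]
print(tally_changes(changed))
print(tally_changes(changed, separate_dirs=False))
```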

> 
> 
> --
> Ben Escoto
>