fastest possible data transfer

Thu Feb 14 09:51:03 EET 2008

On 2/11/08, Christos Ricudis <ricudis at itc.auth.gr> wrote:
> [...] to rsync den eimai sigouros ti kanei [...]

είναι βασικά ένα diff & patch που δουλεύει γρήγορα για κάθε τύπου
αρχείου (text & binary). Έχει νόημα ΜΟΝΟ όταν έχεις μια μορφή του
αρχείου και στις δύο πλευρές. Επίσης:
 - Όσο πιο πολλές  ομοιότητες έχουν τα δύο αρχεία τόσο πιο συμφέρουσα
η χρήση του rsync.
 - Όσο πιο γρήγορο το κανάλι μεταφοράς τόσο λιγότερο συμφέρουσα ...
 - Όσο πιο λιγότερη CPU έχουμε να αφιερώσουμε τόσο λιγότερο συμφέρουσα
(όχι κάτι φοβερό αλλά ειδηκά το ένα από τα δύο PC πρέπει να υπολογίσει
ένα 128bit MD4 checksum για ΟΛΟ το αρχείο)

Για backup ενός filesystem μέσω ADSL π.χ. το rsync είναι
*καταπληκτικό* αλλά ακόμα και μέσω LAN 100Mbps όταν τα αρχεία είναι
μεγάλα και αλλάζουν λίγο κάθε μέρα (π.χ. mbox, mdb) είναι πολύ πιο
γρήγορος απο ένα απλό copy.

Ο αλγόριθμος με λίγα λόγια:

===The rsync algorithm===

Suppose we have two general purpose computers "pcA" and "pcB".
Computer "pcA" has access to a file A and "pcB" has access to file B,
where A and B are ``similar''. There is a slow communications link
between "pcA" and "pcB".

The rsync algorithm consists of the following steps:

1. "pcB" splits the file B into a series of non-overlapping
fixed-sized blocks of size S bytes1.

2.  For each of these blocks "pcB" calculates two checksums: a weak
``rolling'' 32-bit checksum (easy to calculate) and a strong 128-bit
MD4 checksum.

3.  "pcB" sends these checksums to "pcA".

4.  "pcA" searches through A to find all blocks of length S bytes (at
any offset, not just multiples of S) that have the same weak checksum
as one of the blocks of B. This can be done in a single pass very
quickly using a special property of the rolling checksum described
below. Each positive hit is tested again this time against the strong
checksum so that we can be sure that Block-A and Block-B are the same.

5.    "pcA" sends "pcB" a sequence of instructions for constructing a
copy of A. Each instruction is either a reference to a block of B, or
literal data. Literal data is sent only for those sections of A which
did not match any of the blocks of B.

The end result is that "pcB" gets a copy of A, but only the pieces of
A that are not found in B (plus a small amount of data for checksums
and block indexes) are sent over the link. The algorithm also only
requires one round trip, which minimises the impact of the link
latency.

The most important details of the algorithm are the rolling checksum
and the associated multi-alternate search mechanism which allows the
all-offsets checksum search to proceed very quickly. These will be
discussed in greater detail below.

 -- http://samba.anu.edu.au/rsync/tech_report/node2.html