Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with Python 2.6.6 #1

Open
pavel-odintsov opened this issue Jul 24, 2014 · 13 comments
Open

Issues with Python 2.6.6 #1

pavel-odintsov opened this issue Jul 24, 2014 · 13 comments

Comments

@pavel-odintsov
Copy link

Hello!

I read your rdiff implementation in pure Python and it's amazing! It's really awesome work! Did you execute performance tests? I do it with original rdiff and they are really worst - librsync/librsync#6 :(

Could you add compatibility with Python 2.6?

python pyrdiff.py 
  File "pyrdiff.py", line 344
    print("Usage: {0:s} <command> [options ...]".format(sys.argv[0]), file=sys.stderr)
                                                                          ^
SyntaxError: invalid syntax

I fix this lines but scripts still not working:

python pyrdiff.py signature /vz/private/1202/root.hdd/root.hdd /tmp/result.signature
Traceback (most recent call last):
  File "pyrdiff.py", line 399, in <module>
    main()
  File "pyrdiff.py", line 375, in main
    Signatures.from_basis_file(basisfd).write(sigfd)
  File "pyrdiff.py", line 184, in write
    write_int(fd, RS_SIG_MAGIC, 4)
  File "pyrdiff.py", line 142, in write_int
    buf = i.to_bytes(nbytes, 'big')
AttributeError: 'int' object has no attribute 'to_bytes'

Thank you!

@pavel-odintsov
Copy link
Author

I fixed compilation bugs and compare your Python rdiff with original rdiff.

Python rdiff:

 time python pyrdiff.py signature /vz/template/cache/debian-7.0-x86_64-minimal.tar.gz /tmp/signature

real    0m16.551s
user    0m16.431s
sys 0m0.118s

Original rdiff:

time rdiff signature  /vz/template/cache/debian-7.0-x86_64-minimal.tar.gz /tmp/signature

real    0m0.422s
user    0m0.376s
sys 0m0.044s

Maybe you can try to do same in Golang? It's really fast and cute :)

@pavel-odintsov
Copy link
Author

But signatures is really same :)

md5sum /tmp/signature..original
3d823f471f88f54c046c5f2408cecd7f  /tmp/signature.original

md5sum  /tmp/signature.python
3d823f471f88f54c046c5f2408cecd7f  /tmp/signature.python

Nice work! :)

For Python 2.6. compatibility I'm used this recipe http://stackoverflow.com/questions/16022556/has-python-3-2-to-bytes-back-ported-to-python-2-7

def to_bytes(n, length, endianess='big'):
    h = '%x' % n
    s = ('0'*(len(h) % 2) + h).zfill(length*2).decode('hex')
    return s if endianess == 'big' else s[::-1]

And add ord conversion to faster_rollsum function:

def faster_rollsum(data):
        A = 0
        B = 0
        for d in data:
                A += ord(d) + 31
                B += A
        return (A & 0xffff) | ((B & 0xffff) * 65536)

I'm really sure you can do it better because I'm not an python-guy :)

@therealmik
Copy link
Owner

Hi,

This was originally written as part of an effort to understand rdiff properly - to ensure I wasn't talking crap when I logged the other bugs against rdiff (including a critical security bug!).

I write all my code for python3 these days - I could make a python2 branch if it were useful, but I'd replace the md4 signatures with sha256.

I'll try to make a python2 branch for you, but no guarantees of speed (maybe in pypy it would be ok) - I was aiming for clarity.

@therealmik
Copy link
Owner

Try grabbing the python2 branch, and let me know how it goes. I tested against python2.7.

The delta files will come out slightly different to real rdiff - this is due to an rdiff bug though - my ones will be smaller (but still compatible).

I REALLY don't recommend using rdiff (or pyrdiff) on untrusted data - MD4 is horribly broken.

It's easy enough to convert it to use sha256, so if you have a real application for this I'll happily do so. I was waiting to see how librsync handles this though (apparently ignoring me is the solution so far).

@pavel-odintsov
Copy link
Author

Thank you so much, Michael!

I checked version for Python 2 and all works fine but signature is not equal to original rdiff:

rdiff signature /vz/private/1214/root.hdd/root.hdd /tmp/1214_signature
./pyrdiff.py signature /vz/private/1214/root.hdd/root.hdd  /tmp/1214_pysignature
LANG=C cmp  /tmp/1214_pysignature /tmp/1214_signature
/tmp/1214_pysignature /tmp/1214_signature differ: char 17, line 1

md5sum  /tmp/1214_pysignature
1aa907bca16f00f3dd7d4aa4677e180f  /tmp/1214_pysignature

 md5sum  /tmp/1214_signature
7e30877026141179281a19cfa240bc8f  /tmp/1214_signature

ls -al /tmp/1214_signature 
-rw-r--r-- 1 root root 18382860 Июл 25 10:57 /tmp/1214_signature

ls -al /tmp/1214_pysignature 
-rw-r--r-- 1 root root 18382860 Июл 25 10:56 /tmp/1214_pysignature

@pavel-odintsov
Copy link
Author

I'm not an hash expert but I'm really worry about collisions because my data in 1MB blocks (it's OpenVZ ploop disk images) and any collision will destroy my data.

I checked sha1 implementation from openssl package and standard sha1sum and found openssl version is really very fast: https://translate.google.com/translate?sl=auto&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fwww.stableit.ru%2F2014%2F04%2Fsha1sum-openssl-sha1.html&edit-text=

And second issue with rdiff is speed. I have servers with 40 cores modern CPU with and ssd drives but rdiff uses only one core for 100%. Maybe you have ideas about parallel version of pyrdiff?

Do you have any ideas about future of pyrdiff? Is it only educations interests or you plans do really new project for data backup for drop-out replace rdiff?

Absence of really reliable and fast solution for incremental backups is a real pain of almost every system administrator.

@therealmik
Copy link
Owner

Hi,

I'm unable to reproduce the problem on my system, but I suspect it could be related to my abuse of ord() and chr(). Try pulling the latest version.

For speed, python2 is really slow. I ran it with pypy and it was reasonable, but still slower than the original rdiff.

I think the original rdiff should use blake2b, which would make the signature files bigger, but would avoid the collision problem and would use a modern core very well (SSE etc).

If I can't convince the authors of rdiff and rsync to do something, I will roll my own - it won't take me long now that I've done this prototype.

@pavel-odintsov
Copy link
Author

Amazing! All works fine now :)

pyrdiff.py signature /vz/private/1214/root.hdd/root.hdd /tmp/pysignature
rdiff signature /vz/private/1214/root.hdd/root.hdd /tmp/signature
 md5sum /tmp/pysignature
7e30877026141179281a19cfa240bc8f  /tmp/pysignature

md5sum /tmp/signature
7e30877026141179281a19cfa240bc8f  /tmp/signature

Thank you so much!

It will be really nice if you build brand new modern rdiff and rsync. I read about blake2b and it's looks like good.

What do you think about librsync source code? I test it and can't understand why it's so slow :(

What language you plans use for new rdiff realization?

@therealmik
Copy link
Owner

I was originally hoping to target duplicity, which is written in python.

I just committed mksigs.py - this is 3x slower than rdiff when run in pypy (I tried on a 4G iso).

It should be trivial enough to modify normal rdiff to use blake2 signatures, but I don't intend on maintaining that codebase. This is unlikely to speed it up much.

If I don't hear anything from the rdiff maintainer soon, I might consider writing something.

@pavel-odintsov
Copy link
Author

mksigs looks really nice!

I checked it for 150M file and got not so good results unfortunately :(

time python mksigs.py /vz/template/cache/debian-6.0-x86_64.tar.gz /tmp/resultsignature
real    0m21.468s
user    0m21.426s
sys 0m0.053s


time rdiff signature /vz/template/cache/debian-6.0-x86_64.tar.gz /tmp/resultsignature
real    0m0.323s
user    0m0.281s
sys 0m0.042s

I read your code and found very interesting md4 truncation which they used in original rdiff... Worst idea at ever. I can find only one description for this - space efficiency for deltas but it's really strange.

I will try to translate mksigs into C or Go and execute performance tests. But C++ is best approach for patch and delta code because hashes/maps and dynamic arrays used widely.

@pavel-odintsov
Copy link
Author

I tried to do very basic and dirty version in Go: https://gist.github.com/pavel-odintsov/a8f677d76213bddbccd6 It can't generate compatible signatures now but all hash generation work is same.

And profiling results for 700MB file is not so bad for first try:

time rdiff --block-size 1048576  signature /vz/template/cache/debian-7.0-x86_64-isppro.tar.gz /tmp/rdiff

real    0m3.464s
user    0m3.251s
sys 0m0.217s

time ./gordiff 

real    0m3.920s
user    0m3.702s
sys 0m0.223s

As you can see result is very closer :)

But when I changed md4 to standard md5 from Golang standard library I got amazing speedup:

time ./gordiff 

real    0m2.040s
user    0m1.828s
sys 0m0.215s

And with blake2 from here https://godoc.org/github.com/codahale/blake2 I got:

time ./gordiff 

real    0m2.194s
user    0m2.026s
sys 0m0.223s

@therealmik
Copy link
Owner

It seems like rdiff is going to get fixed. Perhaps it's best to wait to see how fast that goes.

If you're backing up raw disk images, you could easily get better performance by simply getting rid of the rolling sum, and simply creating signatures for each block. Then the delta file would simply be (blocknum, data) for any block that's changed.

@vps2fast
Copy link

vps2fast commented Oct 9, 2014

I did my own fastrdiff signature in plain C with OpenSSL md4 realization and got following results:

time rdiff signature --block-size 1048576 root.hdd root.hdd.signature
real    0m24.051s
user    0m23.025s
sys 0m1.022s

time ./fastrdiff signature
real    0m19.266s
user    0m18.261s
sys 0m0.987s

:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants