MD5 Partial Calculations – save and restore the calculation of large files
Calculating an MD5 checksum of files transmitted over a network is pretty important for both security and integrity. For large files, though, calculating an MD5 digest can be time and CPU intensive.
One way to ease the time constraint is to compute the MD5 as the file is received over the network. This works great until you decide the transfer should be resumable. In fact, making the transfer resumable makes the checksum even more valuable for integrity. Ruby provides a fairly straightforward API for calculating an MD5 incrementally, one chunk of bytes at a time. The only part that is missing is the ability to serialize the state of Ruby's Digest::MD5 class.
Looking at the internal implementation, I decided it was easiest to just provide an alternative class. The main reason for doing this instead of extending the existing Ruby class is that the existing class switches between two different backends, either md5.c or OpenSSL's. Both are of comparable speed, so I decided to take the md5.c implementation because serializing its MD5_CTX structure is relatively easy. Here's the structure:
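To make the incremental API concrete, here is a minimal sketch using only Ruby's standard Digest::MD5. The helper name md5_in_chunks and the chunk sizes are assumptions made for illustration, and a StringIO stands in for the network stream.

```ruby
require 'digest/md5'
require 'stringio'

# Feed an IO to Digest::MD5 in fixed-size chunks, as if the bytes were
# arriving over a network. chunk_size is an arbitrary choice for this sketch.
def md5_in_chunks(io, chunk_size = 16 * 1024)
  hasher = Digest::MD5.new
  # IO#read(n) returns nil at end of input, which ends the loop
  while (chunk = io.read(chunk_size))
    hasher.update(chunk)
  end
  hasher.hexdigest
end

data = 'x' * 100_000
incremental = md5_in_chunks(StringIO.new(data), 4096)
one_shot = Digest::MD5.hexdigest(data)
puts incremental == one_shot
```

The digest does not depend on how the input is split into chunks, which is what makes hashing during the transfer possible in the first place.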
typedef struct md5_state_s {
    uint32_t count[2];   /* message length in bits, lsw first */
    uint32_t state[4];   /* digest buffer */
    uint8_t  buffer[64]; /* accumulate block */
} MD5_CTX;
The only downside is the duplicated code. I am fairly confident the md5.c code will not change much over time, and it's easy enough to keep up with any changes.
I added two serialization methods, save and restore. These allow an implementation to receive, compute, pause and resume file transfers while doing only a small amount of checksum work per chunk of file received. This means the load on the server can remain small whether a user is transmitting a large or small file. Here's the gist of how it works:
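To see why the MD5_CTX structure is easy to serialize, note that it is just 88 bytes of fixed-size fields. The sketch below packs and unpacks that layout in plain Ruby. The little-endian byte order, the pack_ctx and unpack_ctx helper names, and the format itself are assumptions for illustration, not necessarily the format the extension writes.

```ruby
# Hypothetical wire format for this sketch: little-endian 32-bit integers
# for count[2] and state[4], followed by the raw 64-byte block buffer.
CTX_FORMAT = 'V2V4a64'
CTX_SIZE = (2 * 4) + (4 * 4) + 64 # 88 bytes

def pack_ctx(count, state, buffer)
  (count + state + [buffer]).pack(CTX_FORMAT)
end

def unpack_ctx(bytes)
  raise ArgumentError, "expected #{CTX_SIZE} bytes" unless bytes.bytesize == CTX_SIZE

  fields = bytes.unpack(CTX_FORMAT)
  { count: fields[0, 2], state: fields[2, 4], buffer: fields[6] }
end

# Context after hashing the 3 bytes "abc": 24 bits counted, no full block
# processed yet, so the state still holds the standard MD5 initial values.
count = [24, 0]
state = [0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476]
buffer = 'abc'.b.ljust(64, "\0")

blob = pack_ctx(count, state, buffer)
restored = unpack_ctx(blob)
puts blob.bytesize
puts restored[:count] == count && restored[:state] == state && restored[:buffer] == buffer
```

A fixed byte order matters if the saved state might be restored on a different machine than the one that wrote it.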
require 'digest/md5'
# Digest::MD5Partial comes from the md5-partial extension linked below and must be loaded first.

hasher = Digest::MD5Partial.new
buf_size = 1024 # chunk size; not defined in the original snippet, any positive value works
offset = 0
total = File.size(__FILE__)
until offset >= total
  buf = nil
  # read the next chunk and feed it to the hasher
  File.open(__FILE__, 'rb') do |io|
    io.seek(offset, IO::SEEK_SET)
    buf = io.readpartial(buf_size)
    hasher.update(buf)
  end
  # save the partial
  File.open("partial", "wb") do |io|
    io << hasher.save
  end
  # restore the partial
  hasher.restore(File.binread("partial"))
  # advance the offset
  offset += buf.bytesize
end
from_partial = hasher.hexdigest
directly = Digest::MD5.hexdigest(File.binread(__FILE__))
assert_equal directly, from_partial
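In a real resumable transfer, the saved state is only useful together with the byte offset it corresponds to. One possible way to keep the two in sync is sketched below. The checkpoint layout and the write_checkpoint and read_checkpoint helpers are hypothetical, and the state is treated as an opaque string such as the one returned by save above.

```ruby
require 'tmpdir'

# Hypothetical checkpoint layout for this sketch:
# an 8-byte big-endian offset followed by the opaque hasher state bytes.
def write_checkpoint(path, offset, state)
  tmp = "#{path}.tmp"
  File.binwrite(tmp, [offset].pack('Q>') + state.b)
  # Rename over the old checkpoint so a crash never leaves a half-written file
  # (the rename is atomic on POSIX filesystems).
  File.rename(tmp, path)
end

def read_checkpoint(path)
  data = File.binread(path)
  offset = data[0, 8].unpack1('Q>')
  [offset, data[8..-1]]
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'upload.checkpoint')
  # Stand-in bytes; in practice this would be the string from hasher.save
  write_checkpoint(path, 4096, "opaque-state-bytes".b)
  offset, state = read_checkpoint(path)
  puts offset
  puts state
end
```

On resume, the receiver would read the checkpoint, pass the state to restore, and ask the sender to continue from the stored offset.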
Check it out: http://github.com/taf2/md5-partial/tree/master
