Why is piping 'dd' through gzip so much faster than a direct copy?
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Peaceful Mind
--
Chapters
00:00 Why is piping 'dd' through gzip so much faster than a direct copy?
01:08 Accepted Answer Score 108
01:48 Answer 2 Score 6
02:21 Answer 3 Score 0
03:12 Answer 4 Score 0
04:25 Thank you
--
Full question
https://superuser.com/questions/760097/w...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#backup #performance #dd #pipe #gzip
#avk47
ACCEPTED ANSWER
Score 108
dd by default uses a very small block size -- 512 bytes (!!). That is, a lot of small reads and writes. It seems that dd, used naively in your first example, was generating a great number of network packets with a very small payload, thus reducing throughput.

On the other hand, gzip is smart enough to do I/O with larger buffers. That is, a smaller number of big writes over the network.

Can you try dd again with a larger bs= parameter and see if it works better this time?
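
As a concrete retest, here is a minimal sketch, assuming the source disk is /dev/sda and the destination is the mounted backup path from the question (both are placeholders, not the asker's exact names):

# same direct copy, but with 1 MiB blocks instead of the 512-byte default
dd if=/dev/sda of=/mnt/backup/sda.img bs=1M

With larger blocks, each write that crosses the network carries a full payload, so far fewer packets are needed to move the same data.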
ANSWER 2
Score 6
Bit late to this but might I add...

In an interview I was once asked what would be the quickest possible method for cloning bit-for-bit data, and of course responded with the use of dd or dc3dd (DoD funded). The interviewer confirmed that piping dd to dd is more efficient, as this simply permits simultaneous read/write, or in programmer terms stdin/stdout, thus ultimately doubling write speeds and halving transfer time.
dc3dd verb=on if=/media/backup.img | dc3dd of=/dev/sdb
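
The same pipe with plain dd would look roughly like this (a sketch only; the image path and target device mirror the dc3dd example above, and bs=1M is an assumed block size):

dd if=/media/backup.img bs=1M | dd of=/dev/sdb bs=1M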
ANSWER 3
Score 0
Cong is correct. You are streaming the blocks off of disk uncompressed to a remote host. Your network interface, network, and your remote server are the limitation.

First you need to get dd's performance up. Specifying a bs= parameter that aligns with the disk's buffer memory will get the most performance from the disk -- say bs=32M, for instance. This will then fill gzip's buffer at SATA or SAS line rate straight from the drive's buffer, and the disk will be more inclined to sequential transfer, giving better throughput.

Gzip will compress the data in the stream and send it to your location. If you are using NFS, the NFS transmission will be minimal. If you are using SSH, then you incur the SSH encapsulation and encryption overhead. If you use netcat, then you have no encryption overhead.
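
A hedged sketch of the netcat variant described here (hostname, port, device, and block size are all assumptions, and nc flag syntax varies between netcat implementations):

# on the receiving host: listen on an arbitrary port, write the compressed stream to disk
nc -l 19000 > sda.img.gz
# on the source host: large reads from the disk, compressed in-stream, sent unencrypted
dd if=/dev/sda bs=32M | gzip -c | nc backup-host 19000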
ANSWER 4
Score 0
I assume here that the "transfer speed" you're referring to is being reported by dd. This actually does make sense, because dd is transferring 10x the amount of data per second! However, dd is not transferring over the network -- that job is being handled by the gzip process.

Some context: gzip will consume data from its input pipe as fast as it can clear its internal buffer. The speed at which gzip's buffer empties depends on a few factors:
- The I/O write bandwidth (which is bottlenecked by the network, and has remained constant)
- The I/O read bandwidth (which is going to be far higher than 1MB/s reading from a local disk on a modern machine, thus is not a likely bottleneck)
- Its compression ratio (which I will assume from your 10x speedup to be around 10%, indicating that you're compressing some kind of highly-repetitive text like a log file or some XML)
So in this case, the network can handle 100kB/s, and gzip is compressing the data around 10:1 (and isn't being bottlenecked by the CPU). This means that while it is outputting 100kB/s, gzip can consume 1MB/s, and the rate of consumption is what dd can see.
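
A small sketch to reproduce this effect locally, assuming pv is installed and using /dev/null as a stand-in for the remote disk (the input file and the 100k rate limit, which simulates the slow network link, are both placeholder assumptions):

# pv -L throttles the compressed stream to ~100 kB/s, mimicking the network
dd if=/var/log/syslog bs=1M | gzip -c | pv -q -L 100k > /dev/null

If the input compresses around 10:1, dd's final status line reports roughly 1MB/s -- ten times the rate at which compressed bytes actually leave the pipe.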