Quick Reality Check: Latency

Nowadays, almost any NFV-related improvement to existing technology involves latency in some way: either reducing it outright, or making it more deterministic (as in real-time systems).

[Image: Intel CPU]
In order to reduce latency, one first needs to realize how terribly slow some things are compared to CPU speed. As you know, a CPU is basically an ultra-fast robot that executes simple operations (math, bit transformations, pushing/popping data to and from memory, etc.). It fetches its instructions from code and its data from different memory locations: some very fast but small (the L1 cache), others slow but very big (RAM, and hard drives reached over system buses like SATA or PCIe).

[Image: motherboard diagram]
Now let’s focus on memory: when the CPU needs to fetch some data to manipulate, it cannot proceed until that data arrives, so it either sits idle or spends the waiting time on other pieces of work. Once the data has been fetched (from RAM, for instance), the code resumes execution. If the data is already in L1, that’s a cache hit; anything else is a miss, and the CPU has to wait while the data is fetched from L2, L3, RAM, and so on.
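To make the cost of a miss tangible, here is a minimal C sketch (my own illustration, assuming a POSIX system with clock_gettime and a buffer size chosen to be larger than a typical L3 cache) that reads the same data twice: once sequentially, where the prefetcher keeps hits coming, and once in random order, where almost every read misses and has to go out to RAM:

/* cache_demo.c - sequential vs random access over a buffer larger
 * than the CPU caches. Build with: gcc -O2 cache_demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)   /* 16M ints = 64 MB, well beyond L3 */

static double ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    int *data = malloc((size_t)N * sizeof *data);
    size_t *idx = malloc((size_t)N * sizeof *idx);
    if (!data || !idx) return 1;

    for (size_t i = 0; i < N; i++) { data[i] = 1; idx[i] = i; }

    /* Fisher-Yates shuffle of the index array, so the second loop
       visits the same elements in an unpredictable order. */
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    struct timespec t0, t1;
    volatile long sum = 0;

    /* Sequential walk: the hardware prefetcher streams data into the
       caches ahead of the loop, so nearly every access is a hit. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("sequential: %8.1f ms\n", ms(t0, t1));

    /* Random walk: same number of reads, but each lands on an
       unpredictable cache line, so most are misses served from RAM. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) sum += data[idx[i]];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("random:     %8.1f ms\n", ms(t0, t1));

    free(data); free(idx);
    return 0;
}

On a typical machine the random walk is an order of magnitude slower than the sequential one, even though both loops perform exactly the same number of additions.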

So, to understand why it’s important to minimize the amount of data that has to be fetched from “distant” parts, like a PCIe device or a SATA hard drive, have a look at the time it takes to access those parts, including network devices. The numbers come from https://gist.github.com/jboner/2841832. Nanoseconds don’t mean much to our little brains, but the problem becomes obvious on a more human scale, where an L1 cache hit takes 1 second and slower operations take up to years (a small sketch to reproduce the conversion follows the table).

If an L1 access takes one second, then:

L1 cache reference                  : 0:00:01
Branch mispredict                   : 0:00:10
L2 cache reference                  : 0:00:14
Mutex lock/unlock                   : 0:00:50
Main memory reference               : 0:03:20
Compress 1K bytes with Zippy        : 1:40:00
Send 1K bytes over 1 Gbps network   : 5:33:20
Read 4K randomly from SSD           : 3 days, 11:20:00
Read 1 MB sequentially from memory  : 5 days, 18:53:20
Round trip within same datacenter   : 11 days, 13:46:40
Read 1 MB sequentially from SSD     : 23 days, 3:33:20
Disk seek                           : 231 days, 11:33:20
Read 1 MB sequentially from disk    : 462 days, 23:06:40
Send packet CA->Netherlands->CA     : 3472 days, 5:20:00
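If you want to verify these conversions or rescale them yourself, here is a small C sketch of the arithmetic: the raw nanosecond figures (including the 0.5 ns L1 latency) come straight from the gist linked above, and each one is simply multiplied so that one L1 hit maps to one second:

/* scale.c - rescale the latency numbers so that one L1 cache hit
 * (0.5 ns) equals one second, reproducing the table above. */
#include <stdio.h>

struct op { const char *name; double ns; };

int main(void) {
    /* Raw numbers from https://gist.github.com/jboner/2841832 */
    struct op ops[] = {
        {"L1 cache reference",                 0.5},
        {"Branch mispredict",                  5},
        {"L2 cache reference",                 7},
        {"Mutex lock/unlock",                  25},
        {"Main memory reference",              100},
        {"Compress 1K bytes with Zippy",       3000},
        {"Send 1K bytes over 1 Gbps network",  10000},
        {"Read 4K randomly from SSD",          150000},
        {"Read 1 MB sequentially from memory", 250000},
        {"Round trip within same datacenter",  500000},
        {"Read 1 MB sequentially from SSD",    1000000},
        {"Disk seek",                          10000000},
        {"Read 1 MB sequentially from disk",   20000000},
        {"Send packet CA->Netherlands->CA",    150000000},
    };
    double scale = 1.0 / 0.5;  /* "human" seconds per real nanosecond */

    for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++) {
        long long s = (long long)(ops[i].ns * scale);
        if (s >= 86400)
            printf("%-36s: %lld days, %lld:%02lld:%02lld\n", ops[i].name,
                   s / 86400, (s % 86400) / 3600, (s % 3600) / 60, s % 60);
        else
            printf("%-36s: %lld:%02lld:%02lld\n", ops[i].name,
                   s / 3600, (s % 3600) / 60, s % 60);
    }
    return 0;
}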

In a future post, I’ll compare how DPDK and SR-IOV give VNFs much faster access to network resources, processing data packets without even touching RAM (everything happens in the L1 to L3 caches). With those solutions, network-processing latency can be brought down to near-bare-metal levels, measured in nanoseconds, instead of being stuck with the millisecond-level performance of regular software switches in virtual environments.

(Image credits:
https://en.wikipedia.org/wiki/Central_processing_unit
http://www.tomshardware.com/reviews/dual-xeon-duo,664-3.html
https://en.wikipedia.org/wiki/Motherboard)