Thousand TCP flows with iperf
Abstract
The classic iperf and, more recently, iperf3 are widely used for various network performance measurements. The original iperf is written in C++ and iperf3 in C, but that's not the only difference. From a performance measurement standpoint there is a big one: iperf3 is single-threaded, while iperf is multi-threaded. This is quite important for TCP measurements with multiple flows. In many cases, iperf3 fails to utilize the whole network capacity due to a CPU bottleneck: all TCP flows share one thread and therefore one CPU core. In contrast, iperf creates a new thread for every TCP flow. Nevertheless, handling many TCP flows is still tricky even with the multi-threaded iperf. In this blog post, we will investigate the pitfalls of many-flow measurements.
The naive approach
First, we have to install iperf. In my Ubuntu 18.04 environment, this looks like the following:
sudo apt install iperf
To verify that it works, we can test the TCP performance on the loopback interface. We need to start an iperf server and a client in separate terminal sessions. First, the server:
iperf -s
Then the client:
iperf -c 127.0.0.1
I got the following output from the client on my notebook:
That was a single TCP flow measurement. We can repeat the test with 2 flows by passing the -P 2 parameter to the iperf client:
iperf -c 127.0.0.1 -P 2
The problem of the small backlog
Unfortunately, this approach does not scale: if we try it with 1000 flows, it fails. In my case, iperf tries to spawn the threads, but this happens very slowly, and some flows even fail with the following error: write failed: Connection reset by peer. That's not a very helpful error message, but it means that the remote peer received so many incoming connections that the backlog queue of its listening TCP socket filled up. When the backlog is full, the kernel refuses new incoming TCP connections and sends a TCP reset back to the iperf client. At the same time, the iperf server keeps accepting connections from the backlog, so some new TCP flows still manage to connect as slots in the queue are freed. To check how big the listener's backlog is, we can use the perf trace utility. perf trace can monitor the system calls made by iperf, so we monitor the listen system call of the iperf server and check the backlog length:
sudo perf trace -e listen iperf -s
In my case the backlog is big enough (MAX_INT, i.e. 2147483647) because I use iperf version 2.0.10. Older versions, however, set the backlog to 5, which is too small for many connections. Since the resets still occur with the large value, there must be another limit that prevents a successful measurement: the SOMAXCONN kernel parameter. The kernel silently truncates any backlog passed to listen() to this value, which is 128 by default (kernels since 5.4 default to 4096). We can increase it to 65536:
sudo sysctl -w net.core.somaxconn=65536
Now the write failed: Connection reset by peer error should be gone.
iperf fails to report the aggregated throughput
With -P 100 we can see the aggregated throughput in the last row, [SUM], at the end of the measurement:
However, this row disappears at -P 1000, or, to be more precise, already at -P 128. This is really suspicious to anyone who has coded in C/C++ and run into integer overflow with the signed char type. I tracked down the problem in the iperf source code, and it turned out that somewhere in the aggregating code a char is compared with an int, and both are incremented at every iteration (for every thread). Once the char overflows, the condition if(char == int) can no longer be true. I reported the bug to the maintainer of iperf; you can find the commit here. I recommend using the latest version of iperf to avoid this bug.
Summary
- iperf is better suited than iperf3 for multi-flow TCP measurements because it is multi-threaded
- Older versions of iperf explicitly set the backlog length to 5, which is insufficient, so use iperf version >= 2.0.10 or manually change the value in the iperf source code to a larger one
- On Linux, raise the SOMAXCONN value from the default 128 to 65536: sudo sysctl -w net.core.somaxconn=65536
- If the aggregated throughput row [SUM] disappears with large parallel-flow values (-P > 127), use the latest version of iperf: iperf master branch at Sourceforge