
I'm running a perf test in which a simple TCP server with 4 IP addresses listens on a port and accepts connections from several other computers on the local network. Everything works fine up to just under 120,000 active connections: clients are able to get messages from the server and create new connections. At just under 120,000, new connections simply stop appearing. There is no log activity on the server, and clients start getting timeouts after a bit. There is no firewall that would be getting in the way. I have already tweaked a bunch of settings:

/etc/sysctl.conf

net.core.netdev_max_backlog = 1000000

net.core.netdev_budget = 50000
net.core.netdev_budget_usecs = 5000

net.core.somaxconn = 1024000

net.core.rmem_default = 1048576
net.core.rmem_max = 16777216

net.core.wmem_default = 1048576
net.core.wmem_max = 16777216

net.core.optmem_max = 65536

net.ipv4.tcp_rmem = 4096 1048576 2097152
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_max_syn_backlog = 3000000
net.ipv4.tcp_max_tw_buckets = 2000000

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_rfc1337 = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1

/etc/security/limits.conf

* soft nofile 6553600
* hard nofile 6553600

cat /proc/sys/fs/file-max
1621708
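
One check worth doing alongside these settings (a sketch; SERVER_PID is a placeholder for the actual server process ID): limits.conf only applies to sessions started after the change, so it's worth confirming the running server really inherited the raised limit.

# Per-process limit actually in effect for the server; limits.conf does not
# affect already-running sessions, only new logins.
grep 'open files' /proc/SERVER_PID/limits

# System-wide file handles: allocated, unused, and the file-max ceiling.
cat /proc/sys/fs/file-nr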

The limits are intentionally complete overkill because it's just a test. Are there any other settings I'm missing that would enable more connections? Neither the CPU nor the RAM is being stressed, so I would like to keep pushing the hardware. The server and clients are all running on AWS EC2 t3a.xlarge instances, if that makes any difference.
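
For anyone reproducing this, here is a quick way to watch where new connections stall (a sketch only; eth0 and port 5000 are placeholders for the actual interface and test port):

# Live count of established connections to the test port (header row stripped).
watch -n1 "ss -tn state established '( sport = :5000 )' | tail -n +2 | wc -l"

# When new connections stop appearing, check whether SYNs still reach the host.
sudo tcpdump -ni eth0 'tcp[tcpflags] & tcp-syn != 0 and port 5000'

If SYNs stop showing up on the server while the clients are still sending them, the drop is happening somewhere between the machines rather than in the server's stack.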

Denis
  • Unrelated test: I wrote a script last night to load Nautilus File Manager that was supposed to quit at 20,000 instances, but it never made it that far before the video on my second monitor froze. RAM didn't run out and the CPU didn't hit 100% (about 60% on average across 8 cores). So there are limits on any single machine... – WinEunuuchs2Unix Oct 12 '19 at 01:51
  • There may be [theoretical limits](https://stackoverflow.com/questions/2332741/what-is-the-theoretical-maximum-number-of-open-tcp-connections-that-a-modern-lin). How many clients are sending these requests? You said you have four different IP addresses. How is that set up? Do you have four separate network interfaces on your server, or four different clients? – darksky Oct 12 '19 at 03:53
  • There are 4 clients sending requests, each with a single IP, creating connections in a round-robin fashion across all 4 of the server's IPs. The server's IP addresses are all aliases on the same network interface. They are configured through netplan [(example)](https://www.ostechnix.com/how-to-configure-ip-address-in-ubuntu-18-04-lts/) and are definitely reachable; a sketch of that layout follows these comments. When the server stops accepting new connections, the old ones are still active and usable. For example, if I have an SSH session open, I can keep using it, but another one will not connect. – Denis Oct 12 '19 at 16:09
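
For readers wondering what that alias setup looks like, a minimal netplan sketch putting four addresses on one interface (the interface name and addresses here are placeholders, following the approach in the linked article):

network:
  version: 2
  ethernets:
    eth0:
      addresses:
        - 192.168.1.10/24
        - 192.168.1.11/24
        - 192.168.1.12/24
        - 192.168.1.13/24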

1 Answer


Turns out it was an AWS limitation. Apparently traffic between EC2 instances inside the same VPC has an active-connection limit of around 120,000. Making them communicate over a public IP got rid of the limit. I wasn't getting any errors in Ubuntu because the OS wasn't limiting anything.
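
A way to see this from the instance itself (a sketch; since the guest OS reports no errors, the only visible symptom is the plateau):

# The established-connection count climbs, then flatlines just under 120,000
# with nothing in dmesg or the server logs to explain it.
ss -tn state established | tail -n +2 | wc -l

After pointing the clients at the server's public IP instead of its private VPC address, the same counter kept climbing past 120,000.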

Denis