
I'm currently using Nginx as a load balancer to distribute network traffic across 3 nodes, each running a NodeJS API.

The Nginx instance runs on node1 and every request is made to node1. I have a peak of about 700k requests in 2 hours, and nginx is configured to distribute them, in a round-robin manner, between node1, node2 and node3. Here is conf.d/deva.conf:

upstream deva_api {
    server 10.8.0.30:5555 fail_timeout=5s max_fails=3;
    server 10.8.0.40:5555 fail_timeout=5s max_fails=3;
    server localhost:5555;
    keepalive 300;
}

server {
    listen 8000;

    location /log_pages {
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        add_header 'Access-Control-Allow-Origin' '*';
        add_header 'Access-Control-Allow-Methods' 'GET, POST, PATCH, PUT, DELETE, OPTIONS';
        add_header 'Access-Control-Allow-Headers' 'Authorization,Content-Type,Origin,X-Auth-Token';
        add_header 'Access-Control-Allow-Credentials' 'true';

        if ($request_method = OPTIONS ) {
            return 200;
        }

        proxy_pass http://deva_api;
        proxy_set_header Connection "Keep-Alive";
        proxy_set_header Proxy-Connection "Keep-Alive";

        auth_basic "Restricted";                    # For Basic Auth
        auth_basic_user_file /etc/nginx/.htpasswd;  # For Basic Auth
    }
}
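As a side note on this config: for the upstream keepalive 300 to be effective, the nginx docs for the keepalive directive say the proxied request should use HTTP/1.1 with an empty Connection header, which the first proxy_set_header Connection "" does; I'm not sure whether the later Connection "Keep-Alive" and Proxy-Connection headers undo that. The keepalive-relevant directives, isolated (everything else as above):

    location /log_pages {
        # keep connections to the upstream open, as the upstream keepalive directive expects
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # ... CORS headers, auth_basic, etc. as in the full config above ...

        proxy_pass http://deva_api;
        # no Connection "Keep-Alive" / Proxy-Connection headers set here
    }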

and here is the nginx.conf configuration:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
worker_rlimit_nofile 65535;

events {
    worker_connections 65535;
    use epoll;
    multi_accept on;
}

http {
    ##
    # Basic Settings
    ##
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 120;
    send_timeout 120;
    types_hash_max_size 2048;
    server_tokens off;
    client_max_body_size 100m;
    client_body_buffer_size 5m;
    client_header_buffer_size 5m;
    large_client_header_buffers 4 1m;

    open_file_cache max=200000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    reset_timedout_connection on;

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    ##
    # SSL Settings
    ##
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
    ssl_prefer_server_ciphers on;

    ##
    # Logging Settings
    ##
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;

    ##
    # Gzip Settings
    ##
    gzip on;

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
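One sanity check on this file (not directly related to the errors): worker_rlimit_nofile 65535 has to actually apply to the worker processes for worker_connections 65535 to be usable. A quick way to verify on a running Linux host, for what it's worth:

    # PID of one nginx worker
    pgrep -f 'nginx: worker' | head -n 1

    # its effective open-files limit should be >= 65535
    grep 'Max open files' /proc/$(pgrep -f 'nginx: worker' | head -n 1)/limits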

The problem is that, with this configuration, I get hundreds of errors in error.log like the following:

upstream prematurely closed connection while reading response header from upstream 

but only on node2 and node3. I have already tried the following tests:

  1. increased the number of concurrent API processes on each node (I'm using PM2 as an intra-node balancer)
  2. removed one node to make nginx's job easier
  3. applied weights to the nginx upstream servers (see the sketch after this list)
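For reference, test 3 looked roughly like the following; the weight values are example numbers, not the exact ones I used:

    upstream deva_api {
        # example weights: favour the local node, send less traffic to the remote ones
        server localhost:5555 weight=3;
        server 10.8.0.30:5555 weight=1 fail_timeout=5s max_fails=3;
        server 10.8.0.40:5555 weight=1 fail_timeout=5s max_fails=3;
        keepalive 300;
    }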

Nothing improved the result. During those tests I noticed that the errors occurred only on the 2 remote nodes (node2 and node3), so I tried to remove them from the equation. The result was that I no longer got that error, but I started to get 2 different ones:

recv() failed (104: Connection reset by peer) while reading response header from upstream 

and

writev() failed (32: Broken pipe) while sending request to upstream 

It seems the problem was due to a lack of API processes on node1: the APIs probably cannot handle all the inbound traffic before the client's timeout expires (this was, and still is, my guess). With that in mind, I increased the number of concurrent API processes on node1 and the result was better than before, but I continue to get the latter 2 errors and I cannot increase the number of concurrent API processes on node1 any further.
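One knob related to that guess is how long nginx itself waits for the upstream before giving up. A minimal sketch of the relevant directives (the values are examples, not what I currently run; the defaults are 60s each):

    location /log_pages {
        # ... existing directives from conf.d/deva.conf ...

        # how long nginx waits to connect to / read from / send to the upstream API
        proxy_connect_timeout 5s;
        proxy_read_timeout    120s;
        proxy_send_timeout    120s;
    }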

So the question is: why can't I use nginx as a load balancer across all my nodes? Am I making mistakes in the nginx configuration? Are there other problems I have not noticed?

EDIT: I ran some network tests between the 3 nodes. The nodes communicate with each other via OpenVPN:

PING:

node1 -> node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.85 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=1.85 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=3.17 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=3.21 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.68 ms

node1 -> node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=3.08 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=10.9 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=3.11 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=3.25 ms

node2 -> node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=2.30 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=8.30 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.37 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.42 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.37 ms

node2 -> node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.86 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=4.01 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=5.37 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=2.78 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.87 ms

node3 -> node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=8.24 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=2.72 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.63 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.91 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.14 ms

node3 -> node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.73 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=2.38 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=3.22 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=2.76 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=2.97 ms

Bandwidth check, via IPerf:

node1 -> node2
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   229 MBytes   192 Mbits/sec

node2 -> node1
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   182 MBytes   152 Mbits/sec

node3 -> node1
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   160 MBytes   134 Mbits/sec

node3 -> node2
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   260 MBytes   218 Mbits/sec

node2 -> node3
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   241 MBytes   202 Mbits/sec

node1 -> node3
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   187 MBytes   156 Mbits/sec
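These were plain single-stream runs; the invocations below are illustrative of the form used, not the exact commands:

    # on the receiving node
    iperf -s

    # on the sending node, e.g. node1 -> node2, 10-second test
    iperf -c 10.8.0.30 -t 10

    # optionally, several parallel streams to check whether a single TCP stream is the bottleneck
    iperf -c 10.8.0.30 -t 10 -P 4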

It seems there is a bottleneck in the OpenVPN tunnel, because the same test over the physical interface (eth) gives about 1 Gbit/s. That said, I followed this guide community.openvpn.net but only got roughly twice the bandwidth measured before.
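For context, the kind of OpenVPN server-side options typically tuned for throughput (and the kind that guide discusses) look like this; the values below are illustrative, not my exact config:

    proto udp              # avoid TCP-over-TCP overhead
    cipher AES-256-GCM     # AEAD cipher that can use AES-NI
    sndbuf 0               # 0 = let the OS pick the socket buffer sizes
    rcvbuf 0
    push "sndbuf 0"
    push "rcvbuf 0"
    fast-io                # UDP only
    tun-mtu 1500
    mssfix 1450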

I would like to keep OpenVPN, so: are there any other tweaks that would increase the network bandwidth, or any adjustments to the nginx configuration, that would make it work properly?

1 Answer


The problems were caused by the slowness of the OpenVPN network. By routing the requests over the public internet, after adding authentication on each server, we got the errors down to 1-2 per day; those are probably caused by some other issue now.
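For anyone landing here later, the change boiled down to pointing the upstream at each node's public endpoint instead of its VPN address, with authentication enabled on every node. A rough sketch, with placeholder hostnames:

    upstream deva_api {
        # hostnames/ports are placeholders, not the real endpoints
        server node1.example.com:5555 fail_timeout=5s max_fails=3;
        server node2.example.com:5555 fail_timeout=5s max_fails=3;
        server node3.example.com:5555 fail_timeout=5s max_fails=3;
        keepalive 300;
    }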
