
I am trying to troubleshoot some odd, intermittent connection failures with Apache. I noticed the issue when users complained that parts of the web application we're hosting weren't working. Debugging revealed that AJAX requests were not returning the XML or JSON data the JavaScript application was expecting. The application is served over SSL.

When I tested myself, I would see intermittent failures, and Firebug would show that either the response length was zero, or the connection seemed to fail completely. Application logs on the server showed no problems, including when Firebug reported the response was empty -- the application log on the server showed data had been sent.

On a hunch I fired up apachebench (ab) and was surprised to find some connection failures:

[jnet@Stan ~]$ ab -v 1 -n 1000 -c 10 $url
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking workingman.smart-safe-secure.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        Apache/2.2.3
Server Hostname:        workingman.smart-safe-secure.com
Server Port:            443
SSL/TLS Protocol:       TLSv1/SSLv3,DHE-RSA-AES256-SHA,1024,256

Document Path:          /
Document Length:        659 bytes

Concurrency Level:      10
Time taken for tests:   104.086 seconds
Complete requests:      1000
Failed requests:        2
   (Connect: 2, Receive: 0, Length: 0, Exceptions: 0)
Write errors:           0
Total transferred:      945000 bytes
HTML transferred:       659000 bytes
Requests per second:    9.61 [#/sec] (mean)
Time per request:       1040.855 [ms] (mean)
Time per request:       104.086 [ms] (mean, across all concurrent requests)
Transfer rate:          8.87 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      356  844 215.7    840    2268
Processing:    68  194 138.9    128    1483
Waiting:       67  178 122.0    116    1426
Total:        494 1039 241.8    993    2623

Percentage of the requests served within a certain time (ms)
  50%    993
  66%   1039
  75%   1101
  80%   1162
  90%   1407
  95%   1492
  98%   1626
  99%   1718
 100%   2623 (longest request)

These requests were for a static HTML page, so my PHP application doesn't seem to be the issue here. Running the tests over normal HTTP (non-SSL) produced no failures at all. I am at a loss as to what could be happening, and I'm not even sure how to troubleshoot from here. I will gladly post my httpd.conf configuration; just let me know which parts would help. The server is Apache/2.2.3 (CentOS), with mpm_worker and mod_fastcgi...

UPDATE: I just had my first ab test return 2 connection failures over normal HTTP, for the same HTML page. So it looks like SSL isn't the problem after all...

UPDATE 2: It's possible this is some sort of network issue, because I am not able to replicate this using ab on a server in the same data center, nor am I able to replicate it using ab on localhost. However, pinging the server in question from my workstation shows 0% packet loss... So I am unsure of what steps to take next.

UPDATE 3: In case it helps, if I run ab to benchmark over an SSH tunnel, I get no failures... so maybe this is a networking issue instead of an Apache issue...

1 Answer

Since you say it works fine when requests are made from within the same data center or through an SSH tunnel, I think there could be some kind of traffic shaping between your remote site and the data center.
For example, ICMP and SSH (and others) may be prioritized above HTTP, so if the WAN link becomes overloaded, the router can drop HTTP traffic. Generally, SSH is prioritized because it needs high interactivity, while FTP gets the lowest priority since it is bulk file transfer.
Ask your network team whether any shaping or QoS is in place.

Another thing that points in this direction is the connect time, which ranges from 356 ms to 2268 ms. 356 ms is quite slow; I guess that when you tunnel over SSH it's less than that. Such a large difference between min and max tells me that some packets are probably being dropped (due to QoS/shaping) and retransmits are needed, which makes the connect time longer.
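To compare connect times from different vantage points (workstation, same data center, through the tunnel), something like the following could be used. It is only a rough sketch: the host and port are placeholders, the function names are made up for illustration, and it times the plain TCP connect without any SSL handshake, so the figures may not be directly comparable with ab's Connect column over HTTPS.

```python
import socket
import statistics
import time


def time_connect(host, port, timeout=5.0):
    """Return the TCP connect time in milliseconds, or None if the connect failed."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return None


def summarize(samples):
    """Summarize connect times (ms); None entries are counted as failures."""
    ok = [s for s in samples if s is not None]
    summary = {"attempts": len(samples), "failed": len(samples) - len(ok)}
    if ok:
        summary.update(min=min(ok), median=statistics.median(ok), max=max(ok))
    return summary


if __name__ == "__main__":
    # Placeholder target: replace with the server under test.
    host, port = "example.com", 443
    samples = [time_connect(host, port) for _ in range(50)]
    print(summarize(samples))
```

A wide gap between min and max from the workstation, but not from a host in the same data center, would be consistent with loss or queuing somewhere on the path in between.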

  • I am the network manager for both our local office as well as the systems admin for the servers :-) We do have a pfsense router here in the office and it does use shaping, and we do prioritize SSH highly. But we also prioritize HTTP highly. And I only have this issue with this server, not other servers we have... Commented Jan 27, 2010 at 12:54
  • Also, those connect times are on par with my tests of other servers in my data center. What could be done to lower those? I do see a large (seems to be large anyway) number of retransmits via netstat -s but I'm not sure what's normal... Commented Jan 27, 2010 at 12:56
  • Retransmits are not normal. Check your bandwidth usage along the full path (but the problem is more likely to be on the WAN). You could also test late in the evening to see if you can reproduce it when there is less network traffic. If the bandwidth is not saturated and the problem also occurs during the evening, you might have packet loss on your WAN link. Commented Jan 27, 2010 at 15:00
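For putting a number on the retransmits mentioned in the comments, one option on Linux is to read the TCP counters from /proc/net/snmp (netstat -s reports similar statistics). The sketch below assumes a Linux server and the usual two-line "Tcp:" layout (header line, then values); the function names are illustrative. The counters are cumulative since boot, so sampling before and after an ab run and comparing the difference is more telling than a single reading, and no particular threshold for "normal" is assumed here.

```python
def tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines (header, then values) from /proc/net/snmp text."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) != 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = rows
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(counters):
    """Retransmitted segments as a fraction of segments sent (0.0 if none sent)."""
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0


if __name__ == "__main__":
    # Linux only: this file does not exist on other platforms.
    with open("/proc/net/snmp") as f:
        counters = tcp_counters(f.read())
    print("OutSegs:", counters["OutSegs"])
    print("RetransSegs:", counters["RetransSegs"])
    print("ratio: {:.4%}".format(retransmit_ratio(counters)))
```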
