TL;DR: use NTP if you have high request queueing values from New Relic and you have multiple servers in your infrastructure.
I had heard about New Relic before but just recently starting monitoring some Broadleaf deployments with it. While New Relic was trivial to set up, I starting noticing a very disturbing trend with increasingly high 'Request Queueing' time. This chronicles our journey into how I resolved it.
The infrastructure that I used at Rackspace looked about like this:
Using the New Relic documentation, obtaining metrics from the Tomcat servers was pretty painless. I simply added the newrelic.jar as a -javaagent argument into CATALINA_OPTS and away I went. Seeing data come in was pretty cool, especially because it validated our scalability white paper with even more data. We were seeing good 10-20ms response times from the server, good cache hits at the database layer and all-around expected performance:
After hooking the application monitoring up and seeing the data come across to New Relic, I came across this article on how to track request queueing, realizing that my New Relic metrics were not showing this data. I am an avid reader of Hacker News and followed along very closely with the Heroku/Rap Genius debacle last year in regards to New Relic request queueing data (see http://news.rapgenius.com/James-somers-herokus-ugly-secret-lyrics for the post that started it). I was also very anxious to see how the httpd layer would perform with mod_proxy_balancer and our chosen 'byrequest' balancing algorithm, mainly because I was a bit skeptical of this layer from the start but wanted to reserve judgement until I had hard data.
Getting New Relic to show the request queueing metric was pretty easy. I simply enabled mod_header and added this line in the
RequestHeader set X-Request-Start "%t"
In a perfect world I would set the X-Request-Start header at the load balancer level so that I could track the request time the moment that it hit my infrastructure. However, Rackspace load balancers do not currently support setting custom headers so that option was out.
And that's when the trouble started. Our application response time shot through the roof. Those 10-15ms numbers turned more into 1-2 seconds spent at the application layer. Even more disturbing was that as I watched it throughout the week the numbers were steadily increasing!
The front part of this graph is the same as the first graph (low numbers where the blue blocks are). That giant green block? That's what New Relic interprets as the request queueing.
I scoured the internet to try to find articles of people that had a similar issue, but could not find anything substantial. And while I could obviously just hide this data from being shown on the graph and only look at the JVM data it still ate at me quite a bit. I shot an email over to New Relic to see if they had any ideas. They came back with:
...it may be a case of getting the clocks synced up for all these servers. For example, if one of your Tomcat servers is just a few seconds behind the load balancer, it can affect this metric greatly.
I went back and looked at the request queueing documentation page. Sure enough, a little blurb at the bottom mentions clock skew and recommends using NTP. I logged into one of the Tomcat servers and one of the httpd servers and they were definitely a few seconds off. Looks like I had found my culprit!
This makes sense given the request lifecycle:
- Request hits the Apache server
- Apache sets the X-Request-Start header using a timestamp (from the time on the server)
- Request hits the Tomcat server
- The New Relic Java agent reads the X-Request-Start header timestamp
- The agent gets the current time (which is from the Tomcat server), subtracts the current time from the start time set by Apache and reports the difference as request queueing.
The request could have only spent a couple of milliseconds in httpd but if those servers are a couple of seconds off then New Relic has no choice but to report the multiple second difference (and be none the wiser).
1 Timelord to rule them all
Lucky for me, NTP (Network Time Protocol) was created to solve just such a problem. The hierarchy of how NTP works is very similar to DNS. NTP servers are classified by stratum, where a stratum-1 server represents the equivalent of a DNS root server. These servers hold true time, and distribute out this time to stratum-2 servers, which then distribute the time to stratum-3 servers, etc. Normally clients do not connect to a specific time server but rather a pool of servers referenced from http://www.pool.ntp.org. For instance, to get time from the US pool you would use 0.us.pool.ntp.org, 1.us.pool.ntp.org, etc.
For our particular group of servers we will use NTP a bit differently. When you install NTP you can also use that machine as an NTP server and not just an NTP client. Since we care more about the synchronization between this group of servers specifically and less about what true time really is for all of the servers, we will designate one of the servers as the Time Lord that the other servers will obtain the time from. The Time Lord will synchronize time from some external pool and distribute it to the rest of the clients. In the above infrastructure, I designated one of the Apache servers as the Time Lord.
How would you configure this? First, install NTP on all of the machines:
root@apache-1:~# sudo apt-get install ntp ## Repeat on all other servers
And then modify /etc/ntp.conf on the Time Lord and replace the default NTP servers (ntp.ubuntu.org) with the following:
# If you are using Rackspace, use the time servers in your specific datacenter # This example is for servers within the Chicago data center time.ord1.rackspace.com time.ord2.rackspace.com # Use US time servers after that server 0.us.pool.ntp.org server 1.us.pool.ntp.org server 2.us.pool.ntp.org server 3.us.pool.ntp.org # Use Ubuntu's ntp server as a fallback. server ntp.ubuntu.com # And use the current local time as a fallback of that server 127.127.1.0 fudge 127.127.1.0 stratum 10 # Allow this server to act as TIMELORD: KEEPER OF ALL TIME # This operates on a local VPN. You can obviously modify this for your specific network restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
Then modify /etc/ntp.conf on all other machines to use the Time Lord server to obtain time (notice the absence of any other NTP servers on the clients):
# Use iburst here so that time will be obtained very quickly after a reboot. Otherwise time syncing could take 10-15 minutes after a reboot server <ip_of_time_lord> iburst
If you are running a firewall (I personally recommend UFW because of how dirt simple and awesome it is compared to raw iptables) then you'll want to open incoming port 123 (used by NTP) on the Time Lord server. In the below example I am only opening port 123 on a common VPN. I'm not quite ready to accept time requests from just anybody :)
root@apache-1:~# ufw allow from 192.168.1.0/255.255.255.0 to any port 123
The NTP clients will take a bit to synchronize with the Time Lord (15-30 minutes). You can expedite the process by running this command on all of the clients:
root@apache-1:~# sudo service ntp stop && ntpdate -s <ip_address_of_time_lord> && sudo service ntp start
You can check your time settings with ntpq. An example of a client configuration might look like:
root@tomcat-1:~# ntpq -pn remote refid st t when poll reach delay offset jitter ============================================================================== *192.168.1.1 126.96.36.199 3 u 495 1024 377 1.588 -0.831 0.262
Whereas the Time Lord configuration might look like:
root@apache-1:~# ntpq -pn remote refid st t when poll reach delay offset jitter ============================================================================== +188.8.131.52 184.108.40.206 2 u 115 1024 377 1.969 -0.967 0.276 -220.127.116.11 18.104.22.168 2 u 782 1024 377 30.492 -3.078 1.908 -22.214.171.124 126.96.36.199 3 u 3 1024 377 49.384 1.283 1.093 -2600:3c03::f03c 188.8.131.52 2 u 18 1024 377 44.868 1.650 0.823 *184.108.40.206 220.127.116.11 2 u 78 1024 377 25.006 -0.234 0.637 +18.104.22.168 22.214.171.124 2 u 834 1024 377 108.764 -1.570 1.191 127.127.1.0 .LOCL. 10 l 11d 64 0 0.000 0.000 0.000
And that's it! You should now see your New Relic graphs back to normal! After getting NTP configured, our latency immediately went away as you can see by the giant response time cliff:
By the way, if you use Ansible for your configuration management (which I also recommend), I uploaded an Ansible playbook that sets up NTP for you using the suggestions above. Check out the gist on GitHub