The old saying "look before you leap" still holds in today's world of high-tech computing. If anything, yesterday showed us that it applies to computers as much as it does to us humans. As you may have heard, the keepers of world time added an extra second to June 30th, 2012 to keep clock time aligned with Earth’s rotation. This extra "leap" second brought many servers to their knees all around the digital world. Before I get into the details of what happened, let’s talk about what was impacted.
What was impacted?
Many prominent websites were impacted by this leap second; you can read all about them on TechCrunch and other sites. Here is a list of the impacted systems that I am aware of:
- Linux kernels older than 2.6.9-89.EL (RHEL4) or 2.6.18-164.el5 (RHEL5)
- Some versions of OpenManage from Dell on Linux systems. If you run primarily Dell servers, this might be your issue
- Apparently any multi-threaded application that relied on Linux’s futex for synchronizing thread access
- Tomcat server
- Potentially, MySQL
- Firefox, Thunderbird and Chrome
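If you are not sure whether a machine falls into the first category, comparing kernel version strings can help. Here is a minimal sketch, assuming GNU sort is available (for its -V version ordering). The helper name version_lt is made up for illustration, and the threshold is the RHEL5 one from the list above, so the result only means something on that distribution.

```shell
# Hypothetical helper: succeeds if version $1 sorts strictly before version $2.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example with literal RHEL5 kernel strings.
if version_lt "2.6.18-128.el5" "2.6.18-164.el5"; then
    echo "2.6.18-128.el5 is older than the patched RHEL5 kernel"
fi

# On a real RHEL5 box you would compare the running kernel instead:
#   version_lt "$(uname -r)" "2.6.18-164.el5" && echo "kernel needs patching"
```

On other distributions the version numbers follow different schemes, so check the vendor's advisory instead of reusing these thresholds.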
From what I can tell, the leap second hit a couple of different systems, each for its own reason:
- Linux – On some older (or unpatched) kernels, the code that applies Network Time Protocol (NTP) leap second adjustments had a bug that could cause a livelock. The concept of a leap second is not new to most modern servers; in fact, NTP has built-in functionality to handle it. On the day of the leap second, NTP informs the servers that they need to strike 23:59:59 twice. Unpatched kernels use an "hrtimer" to apply the leap second, and this is known to cause a livelock under some circumstances. The patch handles it in the "second_overflow()" function instead, which deals with the situation more gracefully. If you were not running a patched kernel, the leap second would have been applied through the hrtimer, which could have caused issues on the server. This also seems to have affected futex, which most multi-threaded applications use. Because of this, plenty of Java-based servers (JBoss, Tomcat, etc.) were impacted.
- OpenManage – Some versions of Dell's OpenManage software, which is used to manage Dell servers, were impacted because of a bug in OpenManage itself. Restarting the servers solved the problem.
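To see why the clock has to strike 23:59:59 twice in the first place, it helps to look at how Unix time counts. Unix time assumes every day has exactly 86,400 seconds, so there is no timestamp that means 23:59:60. The sketch below assumes GNU date (for the -d @N syntax); epoch_to_utc is just an illustrative helper name.

```shell
# Print a Unix timestamp as a UTC wall-clock time (GNU date syntax).
epoch_to_utc() {
    date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

last=1341100799                 # last ordinary second of June 30th, 2012 (UTC)
epoch_to_utc "$last"            # prints 2012-06-30T23:59:59Z
epoch_to_utc "$((last + 1))"    # prints 2012-07-01T00:00:00Z
```

The two timestamps are consecutive, yet in real time the leap second sat between them. With nowhere to put it, the system clock has to repeat a second, and that is the adjustment the kernel code described above is responsible for.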
If you were hit by this leap second issue, the typical symptoms are:
- Total crash of Linux server
- Livelock within Java-based systems that rely on futex. This causes high CPU utilization, in the range of 150% to 250%, and a high load average on Linux
- High CPU utilization on Linux servers by system processes
- The systems continue to respond but might be sluggish
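A quick way to check a box for these symptoms is sketched below. Treat it as a rough heuristic and not a diagnosis: it assumes a Linux system with /proc/loadavg and nproc, and it assumes the kernel logged a line containing "inserting leap second" when it applied the adjustment (as far as I know that is the wording, but verify it on your own systems). Both function names are made up for this example.

```shell
# Succeeds if the kernel log text on stdin mentions an inserted leap second.
leap_second_logged() {
    grep -q 'inserting leap second'
}

# Succeeds if the load average in $1 is greater than the CPU count in $2.
load_is_high() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load + 0 > cpus + 0) }'
}

# dmesg may need root on some systems; its errors are silenced here.
if dmesg 2>/dev/null | leap_second_logged; then
    echo "kernel log shows a leap second insertion"
fi

if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg     # first field is the 1-minute load average
    if load_is_high "$load1" "$(nproc)"; then
        echo "1-minute load average ($load1) exceeds the CPU count"
    fi
fi
```

A high load average has many possible causes, so the two checks are only suggestive when they show up together right after a leap second.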
There are a few things you can do to fix this issue:
- Shut down NTP, set the time manually, and start NTP again:
sudo /etc/init.d/ntp stop
sudo date -s "`date -u`"
sudo /etc/init.d/ntp start
- Reboot your system
- Patch the kernel with the latest code, which is always a good idea