How New Relic Migrated Their Site to Ruby 1.9 - Lessons Learned
Originally authored by Jonathan Owens
A few weeks ago, we switched the New Relic website to
run on Ruby 1.9.3. This was an enormous project that spanned many months
and required the effort of nearly every engineer in the company. But the
results were excellent – improved speed, reduced memory usage and an
infrastructure ready for future Ruby versions.
Moving such a large and long-lived Rails 2.3 application as New Relic required a very careful and thorough approach. Nearly every aspect of how our site is tested, deployed and run was affected. We learned a lot during the process and want to share the most important lessons.
It’s Not Just Debt
On a codebase as large as New Relic, the engineering time required to
upgrade to 1.9 was large enough to treat as a serious feature. As we
later learned, the performance improvements were significant enough to
take it very seriously indeed.
Work to get the New Relic code ready for 1.9 began years ago, before I joined the company. Taking it from exploratory to production ready took eight months of calendar time, distributed among several engineers for different upgrade tasks.
When you get started with such a migration, it’s important to give at least one engineer the task of chasing down every dependency and weird bug until it’s shipped.
Use a Ruby Version Management Tool
Using a Ruby version management tool helped us in all stages of the upgrade. Both RVM and rbenv
are excellent tools for managing and installing Ruby versions all the
way from the laptop to the server. We went with rbenv for our servers
because of its simple installation, and used the puppet-rbenv
module to set it up. It’s worth deciding on which tool to use early on
in the process and getting everyone on the team comfortable with it.
Once you’ve picked one, you can use it to do cross version testing on laptops, set up multi-version build configurations on your CI server, and easily change patchlevels or major versions on your production servers.
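As a rough illustration of cross-version testing with rbenv, the sketch below builds one test invocation per Ruby version. It relies on rbenv's RBENV_VERSION environment variable, which tells the rbenv shims which installed Ruby to run. The patchlevels, the `bundle exec rake test` command and the helper names are assumptions made for the example, not the configuration described in this article. The script itself needs Ruby 1.9 or later, because it passes an environment hash to `system`.

```ruby
# Versions to test against. These patchlevels are placeholders for the example.
RUBY_VERSIONS = ['1.8.7-p358', '1.9.3-p194']

# Builds the environment and argv for running the test suite under one Ruby.
# The rbenv shims read RBENV_VERSION to decide which installed Ruby to use.
# The rake task name is an assumption; substitute your own test command.
def test_invocation(version, task = 'test')
  [{ 'RBENV_VERSION' => version }, 'bundle', 'exec', 'rake', task]
end

# Runs the suite once per version and returns { version => passed? }.
# system returns nil when the command cannot be started, so normalize to a boolean.
def run_matrix(versions = RUBY_VERSIONS)
  results = {}
  versions.each do |version|
    env, *argv = test_invocation(version)
    results[version] = system(env, *argv) ? true : false
  end
  results
end
```

Calling `run_matrix` from a laptop or a CI job gives a quick pass/fail summary per version, provided each listed version has already been installed with rbenv.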
During our upgrade process, there was a long season where the codebase had to be cross-compatible between 1.8 and 1.9. By making version switching easy, we ensured that it wasn’t too great a burden.
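To give a flavor of what cross-compatible code can look like, here is a minimal sketch of two helpers that behave the same on 1.8 and 1.9. The `Compat` module and its method names are hypothetical and are not taken from the New Relic codebase. They cover two well-known differences: indexing a string with a single integer returns a number on 1.8 but a string on 1.9, and strings only carry an encoding from 1.9 onward.

```ruby
# Hypothetical compatibility helpers for code that must run on 1.8 and 1.9.
module Compat
  # Returns the first character of an ASCII string as a String.
  # "abc"[0] gives the integer 97 on 1.8 but "a" on 1.9; the two-argument
  # form returns a one-element String on both.
  def self.first_char(str)
    str[0, 1]
  end

  # Returns a copy of the string tagged as UTF-8 where encodings exist (1.9+).
  # On 1.8 strings have no encoding, so the input is returned unchanged.
  # Feature detection keeps the check independent of version-string parsing.
  def self.as_utf8(str)
    return str unless str.respond_to?(:force_encoding)
    str.dup.force_encoding('UTF-8')
  end
end
```

Checking for a method with `respond_to?` rather than comparing RUBY_VERSION tends to age better, since the same code keeps working on later Ruby versions.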
Make Your Test Server Do Most of the Work
We use Jenkins to build our code every time a push occurs. So to get
started, we simply added a build that ran our tests in Ruby 1.9. With
the first build we had over 200 failing tests. But that gave us a
target. We held some bug bashes to help drive down the failures, then
made fixing the ones that remained a sustaining task like any other.
This wasn’t a fast process, but it was sustainable. And it allowed the
upgrade to fit into our existing bug tracking and test process.
But Tests Aren’t Everything
It was a very exciting day when our 1.9 test job went green. The first
thing I did was switch my laptop to 1.9 and try to run the site. It
didn’t work at all. Whoops!
Turns out there’s a lot more to running the code than just the tests. We had lots of development-mode-only code that set everything up to run on a laptop, none of which was tested by our CI tasks. This meant several more days of chasing down errors we had no idea existed.
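One way to catch this kind of gap earlier is a smoke check that boots the application in development mode as part of the CI run. The sketch below is an assumption about how such a check could look, not something this article describes doing. The `boots?` helper is hypothetical, and the default command assumes a Rails 2.3 layout with `script/runner`. It requires Ruby 1.9 or later for `Open3.capture2e`.

```ruby
require 'open3'

# Boots the app once in the given environment and reports whether it loaded.
# Returns [succeeded, combined_output]. The default command is an assumption
# based on a Rails 2.3 project layout; replace it with whatever loads your app.
def boots?(env_name, command = ['script/runner', 'puts :booted'])
  output, status = Open3.capture2e({ 'RAILS_ENV' => env_name }, *command)
  [status.success?, output]
end
```

A CI step could call `boots?('development')` and fail the build when the first element is false, printing the captured output to show which development-only code path broke.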
Partial, Reversible Deploys Are Essential
We had several preproduction environments in which to test our 1.9
performance. But none of them receives even a fraction of the traffic our
production site does, nor does any have even a fraction of the dataset
to work with. So when the time came to deploy the upgrade, we decided to
upgrade one server at a time and see how each fared.
We quickly discovered two things. First, 1.9 was performing about 80% slower than 1.8. And second, our load balancers didn’t think this was a problem and gave the upgraded server just as much traffic as the others. Then things started to get ugly.
We scrambled to fix the load balancer by switching from round robin to least-connections as our balance strategy. This reduced the load on the now poorly performing server so we could troubleshoot the performance problem.
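To see why the balancing strategy matters when one server is slower, here is a toy simulation. Everything in it is hypothetical: the server names, the tick counts (9 ticks versus 5, roughly the 80% slowdown) and the arrival rate of one request per tick. It is a sketch of the two strategies, not a model of our actual load balancers.

```ruby
# Hands out servers in a fixed rotation, ignoring how busy each one is.
class RoundRobin
  def initialize(servers)
    @servers = servers
    @index = 0
  end

  def pick(_in_flight)
    server = @servers[@index % @servers.size]
    @index += 1
    server
  end
end

# Picks the server with the fewest requests currently in flight.
class LeastConnections
  def initialize(servers)
    @servers = servers
  end

  def pick(in_flight)
    @servers.min_by { |server| in_flight[server].size }
  end
end

# One request arrives per tick and occupies its server for
# service_ticks[server] ticks. Returns how many requests each server received.
def simulate(balancer, service_ticks, requests)
  in_flight = {}
  assigned = {}
  service_ticks.each_key do |server|
    in_flight[server] = []
    assigned[server] = 0
  end

  requests.times do
    # Age every in-flight request by one tick and drop the finished ones.
    service_ticks.each_key do |server|
      aged = in_flight[server].map { |left| left - 1 }
      in_flight[server] = aged.reject { |left| left <= 0 }
    end
    server = balancer.pick(in_flight)
    in_flight[server] << service_ticks[server]
    assigned[server] += 1
  end
  assigned
end

# Hypothetical fleet: two healthy servers and one that takes 80% longer.
SERVICE_TICKS = { 'fast_a' => 5, 'fast_b' => 5, 'slow' => 9 }

servers = SERVICE_TICKS.keys
round_robin = simulate(RoundRobin.new(servers), SERVICE_TICKS, 90)
least_conn = simulate(LeastConnections.new(servers), SERVICE_TICKS, 90)
puts "round robin:       #{round_robin.inspect}"
puts "least connections: #{least_conn.inspect}"
```

Round robin gives every server an equal share regardless of how long requests take, so work piles up on the slow one. Least-connections sends the slow server fewer requests because its requests stay in flight longer.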
After many hurried code changes, we discovered that our own Ruby agent had a poorly performing garbage collection instrumentation strategy under 1.9, which we patched right away and later released as version 3.5. With the patched agent, 1.9 went from 80% slower to 30% faster. High fives were had all around.
We would have got that 30% improvement much more quickly if we had actually done the fire drill of taking a server out of rotation before introducing the Ruby version change.
It Really Works
We have a bias for measurement here, and when you make a big change
such as this one, having charts on your side can be a tremendous help.
Especially when you’re trying to make the project about more than just
debt, load charts that go down and throughput charts that go up are a
tremendous asset. In our case, 1.9 was so much faster that it was like
getting a free web server.
[Chart: this machine is delivering more traffic with less CPU.]
We can look back now and see that this was an upgrade that delivered real user happiness. We can reliably serve pages to our users in less than two seconds, any time of the day.
Closing Comments
The switch to Ruby 1.9.3 represented a major feature upgrade to New
Relic. While it was an enormous project, it has improved every aspect of
how our code is run and managed. We hope our lessons learned help you
achieve the results you’re looking for.