From The Atlantic:
The Obama campaign’s technologists were tense and tired. It was game day and everything was going wrong.
Josh Thayer, the lead engineer of Narwhal, had just been informed that they’d lost another one of the services powering their software. That was bad: Narwhal was the code name for the data platform that underpinned the campaign and let it track voters and volunteers. If it broke, so would everything else.
They were talking with people at Amazon Web Services, but all they knew was that they had packet loss. Earlier that day, they lost their databases, their East Coast servers, and their memcache clusters. Thayer was ready to kill Nick Hatch, a DevOps engineer who was the official bearer of bad news. Another of their vendors, PalominoDB, was fixing databases, but needed to rebuild the replicas. It was going to take time, Hatch said. They didn’t have time.
They’d been working 14-hour days, six or seven days a week, trying to reelect the president, and now everything had been broken at just the wrong time. It was like someone had written a Murphy’s Law algorithm and deployed it at scale.
And that was the point. “Game day” was October 21. The election was still 17 days away, and this was a live action role playing (LARPing!) exercise that the campaign’s chief technology officer, Harper Reed, was inflicting on his team. “We worked through every possible disaster situation,” Reed said. “We did three actual all-day sessions of destroying everything we had built.”
Hatch was playing the role of dungeon master, calling out devilishly complex scenarios that were designed to test each and every piece of their system as they entered the exponential traffic-growth phase of the election. Mark Trammell, an engineer who Reed hired after he left Twitter, saw a couple game days. He said they reminded him of his time in the Navy. “You ran firefighting drills over and over and over, to make sure that you not just know what you’re doing,” he said, “but you’re calm because you know you can handle your shit.”
The team had elite and, for tech, senior talent — by which I mean that most of them were in their 30s — from Twitter, Google, Facebook, Craigslist, Quora, and some of Chicago’s own software companies such as Orbitz and Threadless, where Reed had been CTO. But even these people, maybe *especially* these people, knew enough about technology not to trust it. “I think the Republicans fucked up in the hubris department,” Reed told me. “I know we had the best technology team I’ve ever worked with, but we didn’t know if it would work. I was incredibly confident it would work. I was betting a lot on it. We had time. We had resources. We had done what we thought would work, and it still could have broken. Something could have happened.”
In fact, the day after the October 21 game day, Amazon services — on which the whole campaign’s tech presence was built — went down. “We didn’t have any downtime because we had done that scenario already,” Reed said. Hurricane Sandy hit on another game day, October 29, threatening the campaign’s whole East Coast infrastructure. “We created a hot backup of all our applications to US-west in preparation for US-east to go down hard,” Reed said.
“We knew what to do,” Reed maintained, no matter what the scenario was. “We had a runbook that said if this happens, you do this, this, and this. They did not do that with Orca.”