When Graz University of Technology researcher Michael Schwarz first reached out to Intel, he thought he was about to ruin the company’s day. His team had found a problem with Intel’s chips, a vulnerability that was both profound and immediately exploitable. They finished the exploit on December 3rd, a Sunday afternoon. Realizing the gravity of what they’d found, they emailed Intel immediately.
It would be nine days until Schwarz heard back. But when he got on the phone with someone from Intel, Schwarz got a surprise: the company already knew about the CPU problems and was desperately figuring out how to fix them. Moreover, the company was doing its best to make sure no one else found out. They thanked Schwarz for his contribution, but told him what he had found was top secret, and gave him a precise day when the secret could be revealed.
The flaw that Schwarz and, he learned, many others had discovered was potentially devastating: a design-level chip flaw that could slow down every processor in the world, with no perfect fix short of a gut redesign. It affected almost every major tech company in the world, from Amazon’s server farms to chipmakers like Intel and ARM. But Schwarz had also come up against a secondary problem: how do you keep a flaw this big a secret long enough for everyone involved to fix it?
Disclosure is an old problem in the security world. Whenever a researcher finds a bug, the custom is to give vendors a few months to fix the problem before it goes public and bad guys have a chance to exploit it. But as those bugs affect more companies and more products, the dance becomes more complex. More people need to be told and kept in confidence as more software needs to be quietly developed and pushed out. With Meltdown and Spectre, that multi-party coordination broke down and the secret spilled out before anyone was ready.
That early breakdown had consequences. After the release, basic questions of fact became muddled, like whether AMD chips are vulnerable to Spectre attacks (they are), or whether Meltdown is specific to Intel. (ARM chips are also affected.) Antivirus systems were caught off guard, unintentionally blocking many of the crucial patches from being deployed. Other patches had to be stopped mid-deployment after crashing machines. One of the best tools available for dealing with the vulnerability has been Retpoline, developed by Google’s incident response team and initially planned for release alongside the bug itself. But while the Retpoline team says they weren’t caught off guard, the code for the tool wasn’t made public until the day after the official announcement of the flaw, in part because of the haphazard break in the embargo.
Perhaps most alarming, some crucial outside response groups were left out of the loop entirely. The most authoritative alert about the flaw came from Carnegie Mellon’s CERT division, which works with Homeland Security on vulnerability disclosures. But according to senior vulnerability analyst Will Dormann, CERT wasn’t aware of the issue until the Meltdown and Spectre websites went live, which led to even more chaos. The initial report recommended replacing the CPU as the only solution. For a processor design flaw, the advice was technically true, but only stoked panic as IT managers imagined prying out and replacing the central processor for every device in their care. A few days later, Dormann and his colleagues decided the advice wasn’t actionable and changed the recommendation to simply installing patches.
“I would have liked to have known,” Dormann says. “If we’d known about it earlier, we would have been able to produce a more accurate document, and people would have been more educated right off the bat, as opposed to the current state, where we’ve been testing patches and updating the document for the past week.”
Still, maybe that damage was inevitable? Even Dormann isn’t sure. “This happens to be the largest multi-party vulnerability we’ve ever been part of,” he told me. “With a vulnerability of this magnitude, there’s no way that it’s going to come out cleanly and everyone’s going to be happy.”
The first step in the Meltdown and Spectre disclosures came six months before Schwarz’s discovery, with a June 1st email from Google Project Zero’s Jann Horn. Sent to Intel, AMD and ARM, the message laid out the flaw that would become Spectre, with a demonstrated exploit against Intel and AMD processors and troubling implications for ARM. Horn was careful to give just enough information to get the vendors’ attention. He had reached out to the three chipmakers on purpose, calling on each company to figure out its own exposure and notify any other companies that might be affected. At the same time, Horn warned them not to spread the information too far or too fast.
“Please note that so far, we have not notified other parts of Google,” Horn wrote. “When you notify other parties about this issue, please don’t share information unnecessarily.”
Figuring out who was affected would prove difficult. There were chipmakers to start, but soon it became clear that operating systems would need to be patched, which meant looping in another round of researchers. Browsers would be implicated, too, along with the massive cloud platforms run by Google, Microsoft, and Amazon, arguably the most tempting targets for the new bug. By the end, dozens of companies from every corner of the industry would be compelled to issue a patch of some kind.
Project Zero’s official policy is to offer only 90 days before going public with the news, but as more companies joined, Zero seems to have backed down, more than doubling the patch window. As months ticked by, companies began deploying their own patches, doing their best to disguise what they were fixing. Google’s Incident Response Team was notified in July, a month after the initial warning from Project Zero. The Microsoft Insiders program sent out a quiet, early patch in November. (Intel CEO Brian Krzanich was making more controversial moves during the same period, arranging an automated stock sell-off in October to be executed on November 29th.) On December 14th, Amazon Web Services customers got a warning that a wave of reboots on January 5th might affect performance. Another Microsoft patch was compiled and deployed on New Year’s Eve, suggesting the security team was working through the night. In each case, the reasons for the change were vague, leaving users with little clue as to what was being fixed.
Still, you can’t rewrite the basic infrastructure of the internet without someone getting suspicious. The strongest clues came from Linux. Powering most of the cloud servers on the internet, Linux had to be a big part of any fix for Spectre and Meltdown. But because Linux is open source, any changes had to be made in public. Every update was posted to a public Git repository, and all official communications took place on a publicly archived mailing list. When kernel patches started to roll out for a mysterious “page table isolation” feature, close observers knew something was up.
The biggest hint came on December 18th, when Linus Torvalds merged a late-breaking patch that changed the way the Linux kernel interacts with x86 processors. “This, besides helping fix KASLR leaks (the pending Page Table Isolation (PTI) work), also robustifies the x86 entry code,” Torvalds explained. The most recent kernel release had come just one day earlier. Normally a patch would wait to be bundled into the next release, but for some reason, this one was too important. Why…