Experts in the inner workings of the Internet's Domain Name System, which matches domain names with their corresponding IP addresses, say the 27-year-old communications protocol does not appear to be the cause of Facebook's high-profile outage last week.
Facebook's service was unavailable to its 500 million active users for 2.5 hours on Thursday -- the company's worst failure in more than four years. Initial news reports blamed the outage on DNS because end users received a DNS error message when they couldn't reach the site.
"There's probably a lesson here that the problem at various times looked like DNS, but ultimately proved not to be," said Cricket Liu, vice president of architecture at Infoblox, which sells DNS appliances. "In my experience, users are quick to point fingers at DNS (perhaps because Web browsers like to implicate DNS when they can't get somewhere) but DNS often isn't at fault."
Facebook gave little detail about the cause of the outage except to say that it was the result of a misconfiguration in one of its databases, which prompted a flood of traffic from an automated system trying to fix the error.
"We made a change to a persistent copy of a configuration value that was interpreted as invalid," explained Robert Johnson in Facebook's blog post about the incident. "This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries per second."
The feedback loop created so much traffic that Facebook was forced to turn off the database cluster, which meant turning off the Web site.
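Johnson's description amounts to a feedback loop: a client that sees an invalid value issues a database query to repair it, and because the repaired value is itself still invalid, every subsequent read repeats the query. A minimal sketch of that dynamic (all names and values here are illustrative, not Facebook's actual code):

```python
VALID_VALUES = {"on", "off"}

class Database:
    """Stands in for the cluster of configuration databases."""
    def __init__(self, value):
        self.value = value
        self.queries = 0

    def query(self):
        self.queries += 1
        return self.value

class ConfigClient:
    """A client holding a persistent copy of the configuration value."""
    def __init__(self, db):
        self.db = db
        self.cached = "bogus"        # the persisted copy is invalid

    def read_config(self):
        if self.cached not in VALID_VALUES:
            # The "fix" is itself a database query -- this is the loop.
            self.cached = self.db.query()
        return self.cached

db = Database("bogus")               # root cause: the stored value is invalid
clients = [ConfigClient(db) for _ in range(1000)]

for c in clients:
    c.read_config()                  # every client queries the database
print(db.queries)                    # 1000

for c in clients:
    c.read_config()                  # the answer was still invalid, so the
print(db.queries)                    # storm repeats: 2000
```

Because the database returns the same invalid value, no client ever stops querying, which is why Facebook had to break the loop by taking the cluster down.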
"Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site," Johnson said. He added that "for now we've turned off the system that attempts to correct configuration values."
Experts said the Facebook outage was not the result of DNS because they were able to partially log into the site during the incident, which would have been impossible if DNS were to blame.
"It looked like it was a configuration issue on their end with state information, what was cached versus what was authoritative," said Richard Hyatt, co-founder and CTO of BlueCat Networks, which sells DNS appliances and a cloud-based managed DNS service. "I think it was a configuration issue on their end that might have been connected to DNS, but they were very vague about it."
Infoblox attributed the Facebook outage to a problem with change management in a blog post on Friday.
"When dealing with network change and configurations, organizations must be more proactive in testing, validating and monitoring the ongoing status of both the critical network infrastructure (routers, switches, firewalls, etc.) as well as the Web services (applications, databases and servers)," Infoblox said, referring to change management as the biggest problem for IT teams worldwide.
"It's not at all clear that Facebook had a DNS problem. There's no indication of that from the official information they have published," said Jim Galvin, director of strategic relationships and technical standards at Afilias, which operates more than a dozen top-level domains including .info and .org. "It's pretty clear they had a distributed database problem. From our point of view, it's an issue of how to make sure you have good data in the database to start with."
Galvin said the problem Facebook experienced was akin to a distributed denial-of-service attack, where a Web site is overwhelmed by traffic from a hacker. In Facebook's case, the excess traffic was created by its own automated system for verifying configuration values.
"With Facebook, the interim system knew it had bad data and wanted to get the right data, so it will keep asking until it gets the right answer," Galvin said. "The analogy is a DDoS attack. You have more and more resolvers suddenly figuring out that they have bad data in the cache, and they're constantly requesting the right data. The servers that have bad data in them are seeing more requests, and everything slows down."
Galvin said confusion surrounding the Facebook outage stems from the fact that DNS has similar properties. If there were bad data in an authoritative DNS database, DNS resolvers would continue to ask for the correct data and flood the system with traffic. Also, bad data would continue to reside in resolver caches for a certain number of hours after the error was fixed because of the time-to-live (TTL) feature of DNS. Many Web sites have a TTL of one day, which means bad data will live in DNS caches for 24 hours.
"This is what DNS does by default," Galvin said. "If it gets bad data in the cache, that is where the TTL comes into play. You may or may not be able to do something about that depending on how long your TTL is."
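That caching behavior can be sketched in a few lines. This is a toy model under assumed names, not any real resolver's implementation: the resolver keeps returning whatever it has cached until the TTL expires, even after the authoritative data has been corrected.

```python
import time

class ResolverCache:
    """Toy caching resolver: honors TTL, knows nothing about data quality."""
    def __init__(self):
        self.cache = {}              # name -> (record, expiry_time)

    def resolve(self, name, authoritative_lookup, ttl):
        entry = self.cache.get(name)
        if entry and entry[1] > time.time():
            return entry[0]          # cached answer, good or bad
        record = authoritative_lookup(name)
        self.cache[name] = (record, time.time() + ttl)
        return record

# The authoritative data starts out wrong, then the operator fixes it...
records = {"www.example.com": "192.0.2.99"}    # bad address
cache = ResolverCache()

cache.resolve("www.example.com", records.get, ttl=86400)   # caches bad data
records["www.example.com"] = "192.0.2.1"                   # fix published
# ...but with a one-day TTL, the cached bad copy is what users keep getting.
print(cache.resolve("www.example.com", records.get, ttl=86400))  # 192.0.2.99
```

With the article's one-day TTL (86,400 seconds), the corrected record would be invisible to this resolver's users for up to 24 hours.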
While the Facebook outage does not appear related to DNS, similar misconfigurations of DNS data have prompted massive outages, most recently for Germany and Sweden.
In May, Germany suffered an outage of its .de top-level domain zone servers that kicked millions of Web sites such as www.ford.de offline for several hours because of a truncated zone file.
A year ago, all Web sites with Swedens .se extension were unavailable for an hour or more because an incorrect script used to update the .se domain was missing a dot.
"These types of outages happen frequently," Hyatt said. "They happen through poorly managed systems. The one that happened in Germany and the one that happened in Sweden -- those were mistakes or errors in automated scripts that should never happen. They could have been avoided."
Hyatt said DNS appliances, including BlueCat's, feature configuration-checking software that can alert administrators that the DNS data change they are making is invalid.
"We have data-checking rules that look at the configuration you're trying to deploy and won't push it out if the system doesn't exist or the system isn't configured right," Hyatt said. "Our system has a lot of smarts. It will give you an alert and tell you what's wrong."
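A pre-deployment check of the kind Hyatt describes can be sketched as follows. The rules, host names, and return shape are illustrative assumptions, not BlueCat's actual product logic: validate the change first, and raise an alert instead of deploying when something is wrong.

```python
# Illustrative pre-deployment DNS data checking: reject a change that
# references a host the system doesn't know about, and report what is
# wrong rather than pushing the change out.

KNOWN_HOSTS = {"web1.example.com", "web2.example.com"}  # assumed inventory
MANAGED_ZONE = ".example.com"

def validate_change(record_name, target_host):
    """Return a list of problems; an empty list means the change may deploy."""
    errors = []
    if target_host not in KNOWN_HOSTS:
        errors.append(f"{record_name}: target {target_host} does not exist")
    if not record_name.endswith(MANAGED_ZONE):
        errors.append(f"{record_name}: outside managed zone")
    return errors

def deploy(record_name, target_host):
    errors = validate_change(record_name, target_host)
    if errors:
        return ("alert", errors)     # tell the admin what's wrong
    return ("deployed", [])

print(deploy("www.example.com", "web1.example.com"))  # ('deployed', [])
print(deploy("www.example.com", "web9.example.com"))  # alert: target missing
```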
BlueCat's appliances have featured DNS configuration checking since they were introduced in 2001.
"We're looking for anomalies, logical errors that don't make sense," Hyatt said. "We definitely would have caught the Germany and Sweden errors because those were logic errors."
Similarly, Afilias checks zone file changes for the top-level domains that it operates before the changes get published to prevent errors like those experienced by the operators of .de and .se.
"We notice when zone files are changed. It pops an alert so it gets investigated," Galvin said. "We check the percentage of change. It would have helped prevent the Germany and Sweden problems, where there were very dramatic zone file changes."
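A "percentage of change" gate of the kind Galvin describes could be sketched as below. The 20% threshold, the record layout, and the names are assumptions for illustration, not Afilias's actual process; the point is that a truncated zone differs so dramatically from its predecessor that a simple ratio check refuses to publish it.

```python
# Hypothetical percentage-of-change gate for zone file publication.

def change_ratio(old_records, new_records):
    """Fraction of the zone that was added or removed."""
    old, new = set(old_records), set(new_records)
    changed = len(old ^ new)                 # symmetric difference
    return changed / max(len(old), 1)

def safe_to_publish(old_records, new_records, max_ratio=0.2):
    """Allow publication only if the zone changed by at most max_ratio."""
    return change_ratio(old_records, new_records) <= max_ratio

current = [f"host{i}.example.se" for i in range(1000)]
truncated = current[:10]                     # e.g. a truncated zone file

print(safe_to_publish(current, current + ["new.example.se"]))  # True
print(safe_to_publish(current, truncated))                     # False
```

A routine update touching a handful of records passes; a zone that suddenly loses 99% of its records trips the alert and gets investigated instead of published.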
But Galvin added that there's not much a service provider like Afilias can do if a customer has bad data in its DNS database, much like the scenario Facebook experienced.
"You're wholly responsible for your own data; all we guarantee is that your data is available," Galvin said. "You cannot recover faster [from your bad data] than your TTL allows recovery to occur."
Hyatt added that even the best error-checking systems can't prevent system administrators from making every type of mistake that could cause an outage. "If they are doing something risky and overriding best practices, we can't prevent that," he said.