How Google Flu Trends Works

It’s a natural part of living in the information age: You start to feel sick, so you Google your symptoms.

Both the common cold and the flu will make you feel miserable, and because both are respiratory infections with similar symptoms — coughing, aching, headache, you know the drill — it can be difficult to know which one has you in its grip.

Every year, anywhere from five to 20 percent of the U.S. population will get the flu, mostly during that wintry stretch between December and February, give or take a few months [source: CDC]. While many sufferers will find relief in over-the-counter medicines, influenza can be serious. Influenza-related complications may require hospitalization, and sometimes complications can be fatal. Influenza, together with pneumonia (both are lower respiratory infections), ranked as the eighth leading cause of death in the U.S. in 2010, and respiratory infections were the third leading cause of death worldwide that year (as many as 3.2 million people) [source: CDC, CNBC].


Because seasonal influenza can cause serious complications, the Centers for Disease Control and Prevention (CDC) monitors influenza-like illness (ILI) across the U.S., tracking and analyzing flu activity to get a good picture of the incidence rate, prevalence proportion and occurrence rate of ILI throughout the year [source: Harvard Health Publications]. For its tracking purposes, the CDC considers a fever of at least 100 degrees Fahrenheit (37.8 degrees Celsius) with a cough and/or a sore throat to be an ILI.
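The CDC's tracking definition above is simple enough to write down as a rule. Here's a minimal sketch of it; the function and constant names are made up for illustration and are not taken from any CDC software.

```python
# Hypothetical helper names; only the rule itself comes from the text above:
# fever of at least 100 degrees F, plus a cough and/or a sore throat.
ILI_FEVER_THRESHOLD_F = 100.0


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def meets_ili_definition(temp_f: float, has_cough: bool, has_sore_throat: bool) -> bool:
    """Return True if the symptoms match the ILI tracking definition."""
    has_fever = temp_f >= ILI_FEVER_THRESHOLD_F
    return has_fever and (has_cough or has_sore_throat)


if __name__ == "__main__":
    print(round(fahrenheit_to_celsius(ILI_FEVER_THRESHOLD_F), 1))  # 37.8
    print(meets_ili_definition(101.2, has_cough=True, has_sore_throat=False))
```

Note that a cough without a fever doesn't count, which is one reason a search for "cough" is a much looser signal than a clinic report.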

The CDC monitors these numbers with data collected through multiple sources, including local and state health departments, 122 public health and vital statistics offices, nearly 3,000 outpatient health care facilities, more than 270 laboratories, and reports from the FluSurv-NET surveillance system [source: CDC]. All those pieces are broken down into five categories of usable information:

  1. Viral Surveillance — laboratory reports on the number of respiratory specimens taken that week and what percentage were, in fact, confirmed flu
  2. Mortality — data on the proportion of pneumonia and influenza (P&I)-related deaths and reports of influenza-associated pediatric deaths
  3. Hospitalizations — confirmed influenza-related hospitalizations
  4. Outpatient Illness Surveillance — tracking the number of outpatient visits for ILI
  5. Geographic Spread of Illness — the estimated level of flu activity by state, which could be widespread, regional, local, sporadic or no activity

Beginning on the 40th week of the year — which is the beginning of the October to May flu season — the CDC distributes weekly influenza activity reports.
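To make "the 40th week" concrete, here's a small sketch that finds when week 40 begins in a given year. One assumption to flag: it uses the ISO 8601 week numbering built into Python's standard library, while the CDC numbers its surveillance weeks on its own calendar, so the dates may not line up exactly.

```python
from datetime import date

# Assumption: ISO 8601 week numbers (weeks start on Monday). The CDC's own
# surveillance week numbering may differ slightly from this.
FLU_REPORTING_START_WEEK = 40


def start_of_week_40(year: int) -> date:
    """Return the Monday that opens ISO week 40 of the given year."""
    return date.fromisocalendar(year, FLU_REPORTING_START_WEEK, 1)


def is_in_week_40(day: date) -> bool:
    """Return True if the given day falls in ISO week 40 of its ISO year."""
    return day.isocalendar()[1] == FLU_REPORTING_START_WEEK


if __name__ == "__main__":
    print(start_of_week_40(2014))           # 2014-09-29
    print(is_in_week_40(date(2014, 10, 1)))
```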

The information the CDC circulates is intended to be a snapshot of current flu trends, not specific numbers of people who caught the flu during that flu season or year. The focus is on whether flu outbreaks are occurring, where flu is being reported, when it was reported and which influenza viruses are to blame.

While the data released by the CDC provides an accurate picture of flu trends, that data, once it's compiled and analyzed, is also one to two weeks old. It can't tell you whether a new pocket of flu emerged in a specific city over the previous weekend, but it's good for measuring the overall impact of flu on the U.S. population, in addition to making flu-related public health recommendations.

For instance, by monitoring which strains of influenza were circulating in the 2014 flu season, CDC epidemiologists were able to tell with data collected between Oct. 1 and Nov. 22 that one of the three chosen strains included in that year's flu shot had mutated, and the vaccine would be less effective that season.

But what if you wanted to know more about a flu epidemic that's spreading through a nearby city? Google would like to help with that.


About Google Flu Trends

Almost three-quarters of Americans searched health information online in the last year.

As many as 72 percent of American adults admit they've looked up health information online in the past year — that's about 90 million people, mostly searching for information about specific conditions such as a cough or flu, or treatments such as antibiotics. And more than three-quarters of those who search online health information begin their inquiry at Google, Bing or Yahoo [sources: Fox, Ginsberg]. Think about what kind of information is sitting in those search engine databases. Well, Google did.

Google Flu Trends (GFT) is an Internet-based influenza surveillance tool that uses aggregated search query data to predict flu trends in more than 25 countries, including the U.S. The project began in 2008 as an initiative under Google's philanthropic arm, after the idea sprang from observed seasonal spikes of certain types of search terms.


For example, when springtime allergies strike, we're more likely to search for antihistamines than during the winter flu season, when we're more likely to search for information about our cold and flu symptoms such as fever or chills.

Google engineers used five years of historical big data — and we mean big. They tapped into their database of 50 million of the most commonly used prefiltered search queries to establish a baseline of general flu activity. The initial algorithm for the prediction tool relied solely on regional flu-related search query data (regional based on IP address), including overarching topics such as general influenza symptoms, cold remedies and antiviral medications.

The algorithm compares real-time search query data (the word or phrase you used as your search term, such as "sore throat") against the baseline to determine the level of regional flu activity, which falls into one of five classifications ranging from minimal to intense. In theory, GFT could report flu activity in near real time and flag influenza outbreaks weeks before the CDC compiles a report.
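The comparison against a baseline can be pictured with a toy example. This is only a sketch of the general idea: the cutoff values are invented, and the three middle level names are an assumption (the text above names only "minimal" and "intense"). Google's real thresholds aren't given here.

```python
from bisect import bisect_right

# Five levels from minimal to intense. The middle names are assumed.
ACTIVITY_LEVELS = ["minimal", "low", "moderate", "high", "intense"]

# Hypothetical cutoffs on the ratio of current flu-query share to baseline share.
RATIO_CUTOFFS = [1.0, 1.5, 2.0, 3.0]


def classify_activity(current_share: float, baseline_share: float) -> str:
    """Map the current flu-query share, relative to baseline, to a level."""
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    ratio = current_share / baseline_share
    # bisect_right counts how many cutoffs the ratio has reached or passed.
    return ACTIVITY_LEVELS[bisect_right(RATIO_CUTOFFS, ratio)]


if __name__ == "__main__":
    # Illustrative shares of all regional searches that are flu-related.
    print(classify_activity(current_share=0.008, baseline_share=0.010))
    print(classify_activity(current_share=0.035, baseline_share=0.010))
```

Working with shares of total search volume, rather than raw counts, keeps a region with more Internet users from automatically looking sicker.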

According to GFT's inventors, though, its real-time reporting is meant to complement the clinical and virological data in traditional surveillance (the CDC and its networks). GFT's speed is intended to help with early detection of not only flu epidemics, but also viral strain identification and the potential for pandemics.


GFT: Model Updates, Accuracy and the Big Data Trap

One of the problems with analyzing search data to determine illness trends is that it doesn’t account for people who aren’t sick, but are fretful about coming down with something.

Prior to each new year's flu season, the Google Flu Trends model is refreshed with 45 of the most useful influenza-related queries from years prior (those special search terms are chosen using logistic regression, but the exact queries and how they're weighted against others are kept top secret).
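The exact queries and weights are secret, so any code here can only be a stand-in. As a hedged illustration of what fitting such a model can look like, the sketch below fits a straight line on the log-odds (logit) scale, linking the share of searches that are flu-related to the share of doctor visits that are for ILI. The single-predictor form, the function names and the numbers are all assumptions, not Google's actual model.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Inverse of logit: map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_logit_model(query_shares, ili_shares):
    """Least-squares fit of logit(ili) = intercept + slope * logit(query)."""
    xs = [logit(q) for q in query_shares]
    ys = [logit(p) for p in ili_shares]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_ili(query_share: float, intercept: float, slope: float) -> float:
    """Estimate the ILI visit share from a flu-query share."""
    return inv_logit(intercept + slope * logit(query_share))


if __name__ == "__main__":
    # Made-up weekly values for illustration only, not real GFT or CDC data.
    past_query_shares = [0.010, 0.015, 0.022, 0.030, 0.018]
    past_ili_shares = [0.012, 0.019, 0.031, 0.044, 0.024]
    b0, b1 = fit_logit_model(past_query_shares, past_ili_shares)
    print(predict_ili(0.025, b0, b1))
```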

Additionally, GFT's post-season estimates are assessed against the traditional data surveillance reports used by the CDC to see how well the two match. Based on the prediction tool's ability to accurately estimate when that year's flu season begins, when the season will peak, and how severe it will be, the model may be updated. When it first launched in 2008, GFT had a mean correlation of 97 percent with CDC data [source: Ginsberg].
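A "mean correlation of 97 percent" refers to how closely two series rise and fall together. Assuming the standard Pearson correlation coefficient is what's meant, a minimal version of the calculation looks like this. The weekly numbers are invented for illustration and are not real GFT or CDC data.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical weekly ILI percentages: search-based estimate vs. clinic reports.
    search_estimate = [1.1, 1.4, 2.0, 3.2, 4.1, 3.0, 1.8]
    clinic_reports = [1.0, 1.3, 2.2, 3.0, 4.4, 2.9, 1.7]
    print(pearson(search_estimate, clinic_reports))
```

A high correlation only says the two curves move together. An estimate that is consistently twice the true value can still correlate almost perfectly with it.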


In September 2009, the model for the U.S. version of Google Flu Trends got its first update to include search query data from the H1N1 outbreak. This was because GFT's model had completely underestimated the H1N1 swine flu pandemic (which happened in the summertime). And then it continued to miss the mark.

During the 2011/2012 flu season, GFT overestimated the prevalence of flu by 50 percent. GFT also overestimated the 2012/2013 flu season, predicting as many as double the number of outpatient visits relating to ILI as the CDC actually reported. At the peak of the 2013/2014 flu season, GFT estimated that as many as 11 percent of the U.S. population had the flu. If that seems like a lot, it's because it is — the CDC, in comparison, reported 6 percent that season. Researchers report that the tool's accuracy may actually be much worse; they found that beginning in August 2011 GFT had overestimated in 100 out of 108 weeks [sources: Hodson, Walsh, Lazer].
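To put those misses on a common scale, here's the arithmetic as a short sketch, using only figures quoted above. The function name is just a label for this example.

```python
def relative_overestimate(estimate: float, reference: float) -> float:
    """How far an estimate sits above a reference, as a fraction of the reference."""
    return (estimate - reference) / reference


# Peak of 2013/2014: GFT's 11 percent vs. the CDC's 6 percent.
peak_overshoot = relative_overestimate(11.0, 6.0)

# Weeks in which GFT overestimated, out of the 108 weeks researchers examined.
share_of_weeks_too_high = 100 / 108

if __name__ == "__main__":
    print(f"Peak estimate was {peak_overshoot:.0%} above the CDC figure")
    print(f"GFT ran high in {share_of_weeks_too_high:.0%} of weeks")
```

By this measure the peak estimate was roughly 83 percent above the CDC's number, and GFT ran high in about 93 percent of the weeks examined.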

The most common explanation for Google's overestimation of flu prevalence is nothing more than our own jitteriness when flu season rolls around. You know, like when you search the word "cough" in an effort to figure out if you're coming down with the flu, a cold or, maybe, wait, could it be pneumonia? Media use of phrases like "the worst flu season in years" and seasonal flu media reports also contribute to our cough-obsessed searches. The problem is that GFT doesn't know whether you're sick or just worried about getting sick; consider that only about 10 percent of all the people who seek medical care for the flu actually have influenza [source: Salzberg]. Google searches don't have context, and they don't know your intent.

But that might not be the complete answer.

In addition to ILI-related media hype inflating flu searches, working with big data can lead to making correlations that may not be accurate. It's the big data trap. Mining the data may turn up a relationship between seasonal search queries and, say, doctor visits, but when millions of candidate queries are tested against a relatively short record of flu data, some will match by chance alone, so a strong correlation isn't necessarily one that can be trusted.
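The trap is easy to reproduce. The sketch below is a toy simulation, not anything from GFT: it generates a short random "flu" series, then checks thousands of equally random "query" series against it and keeps the best match. The series counts, lengths and seed are arbitrary choices for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_series: int, n_weeks: int, seed: int = 0) -> float:
    """Strongest absolute correlation found among purely random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_weeks)]  # stand-in for flu data
    best = 0.0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(n_weeks)]  # unrelated noise
        best = max(best, abs(pearson(candidate, target)))
    return best


if __name__ == "__main__":
    print(best_chance_correlation(n_series=1, n_weeks=10))
    print(best_chance_correlation(n_series=2000, n_weeks=10))
```

None of the random series has anything to do with the target, yet the best of a few thousand typically looks like an impressive match. The more candidates you screen and the shorter the record, the easier it is to be fooled.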

Another possible cause of GFT's overestimation lies in Google's own search engine algorithm updates. Researchers propose that the introduction of the autosuggest feature in Google Search changed user behavior in a way that inflated GFT's estimates: users searching for one flu symptom were now being encouraged to search for more (Google-recommended) flu-related terms, increasing the overall volume of ILI-related searches.

In 2012, the search engine began including possible conditions related to the symptoms queried, also potentially adding to the overestimation problem.

However, after another poor performance in the 2012/2013 flu season, GFT's algorithm was updated again. It would now downplay any media-driven irregularities and make its forecasts using a statistical method called elastic net (ElasticNet), a form of regularized linear regression that combines the L1 (lasso) and L2 (ridge) penalties. But there was still room for improvement; the revised algorithm still overestimated by as much as 30 percent [source: Lohr].
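For the curious, here's what an elastic net is trying to minimize, written as a small sketch. It uses one common textbook parameterization, where `lam` sets the overall penalty strength and `alpha` sets the mix between the two penalty types; how Google configured its own model isn't described here, and the function name and numbers are made up.

```python
def elastic_net_objective(rows, targets, weights, lam, alpha):
    """Squared-error loss plus a blend of L1 and L2 penalties on the weights.

    rows:    one list of feature values per observation
    targets: the observed value for each row
    weights: one coefficient per feature
    lam:     overall penalty strength (0 means plain least squares)
    alpha:   1.0 gives a pure L1 (lasso) penalty, 0.0 a pure L2 (ridge) penalty
    """
    n = len(rows)
    squared_error = sum(
        (y - sum(w * x for w, x in zip(weights, row))) ** 2
        for row, y in zip(rows, targets)
    )
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return squared_error / (2 * n) + lam * (alpha * l1 + 0.5 * (1 - alpha) * l2)


if __name__ == "__main__":
    # Toy data: two features, two observations, fit perfectly by weights [1, 2].
    rows = [[1.0, 0.0], [0.0, 1.0]]
    targets = [1.0, 2.0]
    print(elastic_net_objective(rows, targets, [1.0, 2.0], lam=0.1, alpha=0.5))
```

The penalty is the point: it makes large weights costly, which discourages a model from leaning too hard on any one search term, such as one that spikes because of news coverage.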

In 2014, GFT engineers updated the GFT tool to include not only refreshed search data but also the traditional clinical and virological so-called small data from the CDC for the 2014/2015 flu season. Both engineers and scientists agree a combination of this information should lead to more accurate results.


Lots More Information

Author's Note: How Google Flu Trends Works

What a week to immerse yourself in influenza; the day I was writing about how the CDC monitors and analyzes flu data was the same day CDC health officials announced that this year's flu season could be severe — because one of the virus strains (and the one that's most dominant so far this season) used in this year's vaccine has mutated. Keep your eye on Google Flu Trends.

Sources

  • Arce, Nicole. "Google Flu Trends got it wrong: Flu prediction tool gets updated." Tech Times. Nov. 1, 2014. (Dec. 5, 2014)
  • Arthur, Charles. "Google Flu Trends is no longer good at predicting flu, scientists find." The Guardian. March 27, 2014. (Dec. 5, 2014)
  • Butler, Declan. "When Google got flu wrong." Nature. Feb. 13, 2013. (Dec. 5, 2014)
  • Centers for Disease Control and Prevention. "Deaths: Final Data for 2011." (Dec. 5, 2014)
  • Centers for Disease Control and Prevention. "Influenza (Flu)." Dec. 4, 2014. (Dec. 5, 2014)
  • CNBC. "The world's 10 leading causes of death." (Dec. 5, 2014)
  • Copeland, Patrick. "Google Disease Trends: An Update." (Dec. 5, 2014)
  • Fox, Susannah. "The social life of health information." Pew Research Center. Jan. 15, 2014. (Dec. 5, 2014)
  • Fung, Kaiser. "Google Flu Trends' Failure Shows Good Data > Big Data." Harvard Business Review. March 25, 2014. (Dec. 5, 2014)
  • Ginsberg, Jeremy. "Letter: Detecting influenza epidemics using search engine query data." Nature. Vol. 457. Pages 1012-1014. Feb. 19, 2009. (Dec. 5, 2014)
  • Goldschmidt, Debra. "CDC: Flu shot less effective this year because current virus has mutated." CNN. Dec. 4, 2014. (Dec. 5, 2014)
  • "Flu Trends." 2014. (Dec. 5, 2014)
  • Harvard Medical School - Harvard University. "10 flu myths." (Dec. 5, 2014)
  • Hodson, Hal. "Google Flu Trends gets it wrong three years running." NewScientist. March 13, 2014. (Dec. 5, 2014)
  • Lazer, David. "Google Flu Trends Still Appears Sick: An Evaluation of the 2013-2014 Flu Season." Social Science Research Network. March 13, 2014. (Dec. 5, 2014)
  • Lazer, David. "The Parable of Google Flu: Traps in Big Data Analysis." Science. Vol. 343, No. 6176, Pages 1203-1205. March 14, 2014. (Dec. 5, 2014)
  • Lohr, Steve. "Google Flu Trends: The Limits of Big Data." The New York Times. March 28, 2014. (Dec. 5, 2014)
  • Oremus, Will. "Going Viral." Slate. Jan. 9, 2013. (Dec. 5, 2014)
  • Salzberg, Steven. "Why Google Flu Is A Failure." Forbes. March 23, 2014. (Dec. 5, 2014)
  • Stefansen, Christian. "Google Flu Trends gets a brand new engine." Google Research Blog - Google. Oct. 31, 2014. (Dec. 5, 2014)
  • Stromberg, Joseph. "Why Google Flu Trends Can't Track the Flu (Yet)." Smithsonian Magazine. March 13, 2014. (Dec. 5, 2014)
  • Walsh, Bryan. "Google's Flu Project Shows the Failings of Big Data." Time. March 13, 2014. (Dec. 5, 2014)