The Google Flu Trends algorithm, as it is known, performed poorly. For instance, it continually overestimated doctor visits, later evaluations found, because of limitations of the data and the influence of outside factors such as media attention, which can drive up searches that are unrelated to actual illness.
Since then, researchers have made multiple adjustments to this approach, combining Google searches with other kinds of data. Teams at Carnegie Mellon University, University College London and the University of Texas, among others, have built models that incorporate some real-time data analysis.
“We know that no single data stream is useful in isolation,” said Madhav Marathe, a computer scientist at the University of Virginia. “The contribution of this new paper is that they have a good, wide variety of streams.”
In the new paper, the team analyzed real-time data from four sources, in addition to Google: Covid-related Twitter posts, geotagged for location; doctors’ searches on a physician platform called UpToDate; anonymous mobility data from smartphones; and readings from the Kinsa Smart Thermometer, which uploads to an app. It integrated those data streams with a sophisticated prediction model developed at Northeastern University, based on how people move and interact in communities.
The team tested the predictive value of trends in each data stream by looking at how each correlated with case counts and deaths over March and April, in each state.
In New York, for instance, a sharp uptrend in Covid-related Twitter posts began more than a week before case counts exploded in mid-March; relevant Google searches and Kinsa measures spiked several days beforehand.
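For readers who want to see what such a lead-time check might look like, here is a minimal sketch. It is not the team's method: it assumes daily series of equal length, uses a plain Pearson correlation, and the helper names (`pearson`, `best_lead`) and the 14-day search window are hypothetical choices made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def best_lead(signal, cases, max_lead=14):
    """Find the shift (in days) at which the signal best matches later case counts.

    Compares signal[t] with cases[t + lead] for each candidate lead and
    returns (lead_days, correlation) for the strongest match.
    """
    best = (0, pearson(signal, cases))
    for lead in range(1, max_lead + 1):
        if len(signal) - lead < 3:  # too few overlapping days to compare
            break
        r = pearson(signal[:-lead], cases[lead:])
        if r > best[1]:
            best = (lead, r)
    return best
```

Applied to a stream like Twitter posts and a state's daily case counts, `best_lead` would return the number of days by which the stream appears to run ahead, under the assumptions above.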
The team combined all its data sources, in effect weighting each according to how strongly it was correlated with a coming increase in cases. This “harmonized” algorithm anticipated outbreaks by 21 days, on average, the researchers found.
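The article does not spell out how the weighting was done, so the following is only an illustrative sketch of one simple scheme, not the researchers' algorithm. It assumes each stream has already been shifted to line up with the later case counts, puts the streams on a common scale, and weights each by its correlation with cases, ignoring streams that correlate negatively. The function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def zscore(x):
    """Rescale a series to mean 0 and standard deviation 1 (all zeros if constant)."""
    n = len(x)
    mean = sum(x) / n
    sd = sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [0.0] * n if sd == 0 else [(a - mean) / sd for a in x]

def combine(streams, cases):
    """Blend several streams into one signal, weighted by correlation with cases.

    streams: dict mapping a stream name to a list of daily values,
             assumed already aligned with `cases` (an assumption of this sketch).
    Returns (weights, combined_signal).
    """
    # Negative correlations are clipped to zero so they carry no weight.
    raw = {name: max(pearson(series, cases), 0.0) for name, series in streams.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no stream is positively correlated with cases")
    weights = {name: value / total for name, value in raw.items()}
    scaled = {name: zscore(series) for name, series in streams.items()}
    combined = [
        sum(weights[name] * scaled[name][t] for name in streams)
        for t in range(len(cases))
    ]
    return weights, combined
```

In this scheme a stream that tracks cases closely dominates the combined signal, while one that moves independently of cases contributes little or nothing.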