James Baker, the Food Standards Agency’s Social Media Manager, explains how social media can help us identify norovirus outbreaks much earlier than labs confirm the cases.
by James Baker
If you're part of the digital and communication revolution currently happening all around us, you might have already seen how the FSA is embracing social and digital media to help communicate and build relationships with you, with food businesses, with other government departments and with our stakeholders.
Beyond answering questions and concerns, providing updates quickly during incidents, and raising awareness of dangerous substances and food alerts, however, we’ve discovered how we can go even further and use social media to warn us of what’s on the horizon.
We recently carried out some research with our Analysis and Research Division (ARD) to learn more about how our social media listening could help give us an early warning of norovirus outbreaks (aka the dreaded winter vomiting bug).
What we did
We looked at Twitter data from the last big outbreak, hunting for spikes in symptom keywords and related terms used in tweets, and compared them with the number of lab-confirmed norovirus cases for the same periods.
What we found
Not only were we able to accurately pinpoint the most common symptom keywords being used, we also found a number of high correlations between the tweets and the lab reports themselves.
We also found some interesting patterns when we looked at correlations between tweets and lab report data over different time periods. While a small set of core words was strongly correlated with lab cases across all the periods we tested, terms involving ‘virus’ and ‘winter’ correlated with lab cases in the same and past weeks. Everyday terms for symptoms were strongly correlated with lab cases in future weeks (as well as the current week in most cases).
Same Week and Past Weeks
Note: This table summarises keywords where correlations were significant (at the 0.01 level).
The Past and Future weeks columns include only those terms that showed significant correlations across all future/past weeks tested (+/- 6 weeks).
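The week-by-week comparison described above can be sketched roughly as follows. This is only an illustration, not the analysis the team ran: the weekly counts are made up, the function name lagged_correlation is hypothetical, and the significance test (the 0.01 level mentioned in the note) is left out.

```python
import numpy as np

def lagged_correlation(tweets, lab_cases, lag):
    """Pearson correlation between tweets in week t and lab cases in week t + lag.

    A positive lag compares tweets with FUTURE lab cases;
    a negative lag compares them with PAST lab cases.
    """
    tweets = np.asarray(tweets, dtype=float)
    lab_cases = np.asarray(lab_cases, dtype=float)
    if lag > 0:
        x, y = tweets[:-lag], lab_cases[lag:]
    elif lag < 0:
        x, y = tweets[-lag:], lab_cases[:lag]
    else:
        x, y = tweets, lab_cases
    return float(np.corrcoef(x, y)[0, 1])

# Made-up weekly counts: lab cases follow the tweet curve two weeks later.
tweets = [4, 8, 20, 44, 60, 40, 22, 10, 6, 4, 4, 6]
lab_cases = [2, 2] + [t // 2 for t in tweets[:-2]]

# Test every lag from six weeks in the past to six weeks in the future.
correlations = {lag: lagged_correlation(tweets, lab_cases, lag)
                for lag in range(-6, 7)}
best_lag = max(correlations, key=correlations.get)
print(best_lag, round(correlations[best_lag], 3))
```

In this toy data the strongest correlation appears at a lag of two weeks, because that is how the lab case series was constructed. Real data would be far noisier.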
You might think – well isn't that what we'd expect? Yes, but what was really interesting was that when we started to explore this further using 'predictive analytics', we found that changes in the number of tweets using symptom keywords predicted the increase in lab reports at the start of the annual peak in human cases. In some cases, spikes in the use of these keywords appeared up to four weeks before we received the Public Health England (PHE) confirmed lab reports.
In addition to the correlation between Twitter conversations and the lab reports, we also looked at Google Trends to see if there was any increase or pattern in how people searched for the keywords we had identified.
Although there were close correlations with lab-reported cases, Google Trends didn't seem to be as good as Twitter when used as an 'early alert tool'. We did, however, notice an increase in people searching for symptom keywords or looking for more information about norovirus during the increase in cases.
These differences seem to make sense when we think about how we use different platforms. We tend to tweet about our feelings, life events, and symptoms as soon as they happen to us as individuals, whereas we often use a search engine to find out more about something after we have heard about it in the wider media or from a third party, or while an event or incident is actually happening.
What we’re now doing
We are now using this data to explore a range of different predictive analytic approaches to test Twitter’s potential for early alerts to this type of risk. The initial results are encouraging, with multiple linear regression models successfully predicting the weekly volume of lab cases, and classification tree models detecting when lab cases are escalating. (If you’re like me this might sound like a bit of a ‘mind squeeze’, but you can find a great translation by the rather clever Ceri Cooper at the bottom of this post.)
In short though, this means there's the potential for us to identify outbreaks of norovirus much earlier than before, giving us the opportunity to proactively share our advice and guidance with those who might be affected, alert other government departments and industry, and perhaps even help to reduce its spread.
Of course, as with any unsolicited social data shared publicly, there are caveats on how trustworthy it actually is. For example, people might be self-diagnosing, and search terms can be so broad that there's always going to be a bit of noise. And of course, early reports of an outbreak resulting in increased awareness means more chatter, which might skew the data. But what we’d hope is that as this online chatter increases, the subsequent number of cases would plateau or decrease because of an earlier intervention.
Into the future… And beyond!
The next step for us now is to refine our models and include a threshold alert into our monitoring systems so we can put our findings to the test, and try to build this monitoring into our internal processes for responding to any future incidents. We are also looking at other areas where we might be able to apply a similar approach.
There are, however, a few things we might want to think about for the future. 'Big data' tends to focus more on the 'what' than the 'why', and measuring social data accurately and robustly is quite difficult and very resource intensive. And because the social media landscape changes so rapidly, how and what we measure now might not be the same in the future.
What is important to us though is that we've started our journey and we're hopeful that it will lead us to a place where we can use more social media, for even more social good.
James Baker is Social Media Manager for the Food Standards Agency
Analysis was undertaken by the analysts in the Government Operational Research Service and Beliz Bahar, an O.R. MSc student at London School of Economics.
The really clever stuff:
Thanks to Ceri Cooper from the FSA Analysis and Research Division for the translation below.
Multiple linear regression is a statistical method that attempts to model the relationship between two or more explanatory variables (norovirus keywords from Twitter in this instance) and a response variable (weekly volume of lab reports) by fitting a linear equation to observed data.
For example, weekly volume of lab reports = 1 x (keyword1_lag2weeks) + 1.5 x (keyword2_lag1week) + 5 x (keyword3_lag4weeks)
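As a rough sketch of what 'fitting a linear equation to observed data' means, the snippet below builds made-up lab report counts from the illustrative equation above and then recovers the weights by ordinary least squares. The keyword counts are random numbers, not Twitter data, and the weights 1, 1.5 and 5 are only the example values from this post, not the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks = 40

# Hypothetical lagged keyword counts, one row per week. Columns stand for
# keyword1_lag2weeks, keyword2_lag1week and keyword3_lag4weeks.
X = rng.integers(0, 100, size=(n_weeks, 3)).astype(float)

# Build lab reports from the illustrative equation in the text (no noise added).
example_coefficients = np.array([1.0, 1.5, 5.0])
lab_reports = X @ example_coefficients

# Fit by ordinary least squares; the column of ones allows for an intercept.
design = np.column_stack([np.ones(n_weeks), X])
fitted, *_ = np.linalg.lstsq(design, lab_reports, rcond=None)
intercept, coefficients = fitted[0], fitted[1:]

print(np.round(coefficients, 3), round(float(intercept), 3))
```

Because the toy data contains no noise, the fit returns the same weights that were used to build it. With real tweets and lab reports the fit would only be approximate.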
Classification trees are a bit more complicated to explain, but are commonly used in data mining with the objective of creating a model that predicts the value of a target variable (in this case whether laboratory reports are escalating or not) based on the values of several input variables (Twitter keywords).
This is achieved by searching across all possible explanatory variables (keywords) and all possible values in historic data in order to find the best split – the point that splits the data into two parts with greatest gain. The process is then repeated for each of the resulting branches. The results can be written as a series of rules.
For example, if ‘Change in Keyword 1_lag4 weeks’ >= 50 and ‘Change in Keyword 3_lag2 weeks’ > 100 then probability Laboratory cases escalating = 80%
If ‘Change in Keyword 1_lag4 weeks’ >= 50 and ‘Change in Keyword 3_lag2 weeks’ <= 100 then probability Laboratory cases escalating = 20%
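The search for the best split can be sketched in a few lines. This is a simplified illustration with made-up data: the post does not say how 'gain' was measured, so the sketch assumes the reduction in Gini impurity, which is one common choice. The names gini and best_split, and the keyword column names, are hypothetical. A full tree would repeat the same search on each resulting branch.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means all labels agree)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Try every feature and every observed value as a threshold.

    rows is a list of dicts (feature name -> value); labels are 0/1.
    Returns (feature, threshold, gain) for the rule 'value >= threshold'
    that reduces Gini impurity the most.
    """
    parent = gini(labels)
    best = (None, None, 0.0)
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] >= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] < threshold]
            if not left or not right:
                continue  # a split must put at least one week on each side
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent - weighted
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

# Made-up history: weekly keyword changes and whether lab cases escalated (1) or not (0).
rows = [
    {"change_keyword1_lag4weeks": 10, "change_keyword3_lag2weeks": 150},
    {"change_keyword1_lag4weeks": 20, "change_keyword3_lag2weeks": 30},
    {"change_keyword1_lag4weeks": 40, "change_keyword3_lag2weeks": 120},
    {"change_keyword1_lag4weeks": 50, "change_keyword3_lag2weeks": 110},
    {"change_keyword1_lag4weeks": 60, "change_keyword3_lag2weeks": 90},
    {"change_keyword1_lag4weeks": 70, "change_keyword3_lag2weeks": 130},
    {"change_keyword1_lag4weeks": 80, "change_keyword3_lag2weeks": 140},
    {"change_keyword1_lag4weeks": 90, "change_keyword3_lag2weeks": 80},
]
escalating = [0, 0, 0, 1, 1, 1, 1, 1]

feature, threshold, gain = best_split(rows, escalating)
print(feature, threshold, gain)
```

In this toy history the search settles on 'change in keyword 1 >= 50', because that rule separates the escalating weeks from the rest perfectly.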