Open online data such as microblogs and discussion board messages has the potential to be an incredibly valuable source of information about health in populations. Such data is growing rapidly, is low-cost and available in real time, and seems likely to cover a significant proportion of the population. To take two examples, PatientsLikeMe has enjoyed 10% growth and now has over 200,000 users covering over 1,500 health conditions; the general-purpose Twitter service is expanding at a rate of 30% annually, with over 200 million active users. Going beyond simple keyword search and harnessing this data for public health represents both an opportunity and a challenge for natural language processing (NLP).
The EPSRC SIPHS project (grant no. EP/M005089/1) aims to help health experts leverage social media for their own clinical and scientific studies through automatic techniques that encode messages according to a machine-understandable semantic representation. The project seeks to address three major challenges: (1) knowledge brokering: to develop algorithms that identify informal descriptions of conditions, treatments, medications, behaviours and attitudes and code them to standard ontologies such as the UMLS; (2) knowledge management: to create a structured resource of the patient vocabulary used in blog texts and link it to existing coding systems; and (3) adding insight to evidence: to work with domain experts to use the coded information to automatically generate meaningful summaries for follow-up investigation.
At the technological level SIPHS seeks to pioneer new methods for NLP and machine learning (ML). Social media remains a challenging area for NLP for a variety of reasons: short, de-contextualised messages; high levels of ambiguity and out-of-vocabulary words; use of slang and an evolving vocabulary; and an inherent bias towards sensational topics. The fellowship seeks to harness the progress made so far in NLP for social media analysis in the commercial domain and develop it further to provide meaningful public health evidence. One key aspect not previously addressed is the clinical coding of patient messages. Although knowledge brokering systems exist for clinical and scientific texts (e.g. MetaMap), their performance on social media messages has been poor. SIPHS aims to combine the rich ontological resources available in biomedicine with ML on annotated message data to disambiguate informal language. Research will also aim to understand the communicative function of messages, for example whether a message reports direct experience or is related to news, humour or marketing. If these problems are successfully overcome, an important barrier to integrating social media data with other types of clinical data will be removed.
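To make the knowledge brokering task concrete, the following is a minimal toy sketch of mapping informal patient phrases to concept labels by approximate string matching. It is not the method used by SIPHS (the publications listed below describe the project's actual approaches), and everything in it is an assumption made for illustration: the `LAY_LEXICON` entries, the `normalise` function, the `threshold` value, and the `CONCEPT:` labels, which are placeholders rather than real UMLS identifiers.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon: informal phrase -> placeholder concept label.
# The labels are illustrative only and are NOT real UMLS identifiers.
LAY_LEXICON = {
    "head is pounding": "CONCEPT:HEADACHE",
    "cant sleep": "CONCEPT:INSOMNIA",
    "feel sick": "CONCEPT:NAUSEA",
    "no energy": "CONCEPT:FATIGUE",
}


def normalise(message, threshold=0.8):
    """Return concept labels whose lay phrase approximately occurs in the message."""
    # Lowercase, drop ASCII apostrophes ("can't" -> "cant"), split on whitespace.
    tokens = message.lower().replace("'", "").split()
    found = []
    for phrase, concept in LAY_LEXICON.items():
        n = len(phrase.split())
        # Slide a window with the phrase's token length over the message.
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            # Character-level similarity tolerates small misspellings.
            if SequenceMatcher(None, phrase, window).ratio() >= threshold:
                found.append(concept)
                break
    return found


print(normalise("My head is pounding and I can't sleep"))
# ['CONCEPT:HEADACHE', 'CONCEPT:INSOMNIA']
```

A lookup of this kind only finds phrases that are already in the lexicon or close misspellings of them. It cannot resolve ambiguity or handle novel slang, which is the gap that learning from annotated message data is intended to fill.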
SIPHS is being led by Dr. Nigel Collier, Principal Research Associate and Co-Director of the Language Technology Lab at the Department of Theoretical and Applied Linguistics, University of Cambridge.
##Publications:
[1] Limsopatham, N. and Collier, N. (2015), “Adapting phrase-based machine translation to normalise medical terms in social media messages”, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17-21 September 2015, pp. 1675-1680. Available at https://www.repository.cam.ac.uk/handle/1810/249295
[2] Limsopatham, N. and Collier, N. (2015), “Towards the semantic interpretation of personal health messages from social media”, in Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), Workshop on Understanding the City with Urban Informatics (UCUI 2015), Melbourne, Australia, 19-23 October 2015. Available at https://www.repository.cam.ac.uk/handle/1810/249275
Reference 2 above is the canonical reference for the SIPHS project.
##Related information:
[1] EPSRC SIPHS project grant details: http://gow.epsrc.ac.uk/NGBOViewGrant.aspx?GrantRef=EP/M005089/1
[2] Nigel Collier's Web site: https://sites.google.com/site/nhcollier/
[3] Nut Limsopatham's Web site: http://www.mml.cam.ac.uk/nl347
[4] The Language Technology Lab Web site: http://ltl.mml.cam.ac.uk/