Epidemiological Disease Surveillance Using Public Media Text Mining
Andrea Villanes Arellano, North Carolina State University
Despite improvements in health conditions across the world, communicable diseases remain among the leading causes of mortality in many countries. Combating communicable diseases depends on surveillance, preventive measures, outbreak investigation, and the establishment of control mechanisms. However, delays in obtaining country-level data on confirmed cases of communicable diseases, such as dengue fever, are prompting new efforts to obtain short- to medium-term data. In this paper, we propose a surveillance tool for communicable diseases, with a focus on dengue fever, based on the analysis of public media data. Our research offers the following novel contributions to the areas of text analytics, sentiment analysis, epidemiology, and visualization: (1) an alternative method for near real-time estimation of disease outbreak, spread, and response based on text analytics of public media sources such as newspapers and social media; (2) identification of topics in epidemiological news articles using text mining cluster analysis and topic analysis, an approach that has not been used before in public health surveillance systems; (3) a comparison of existing text mining classification techniques for accurately predicting news article topics; (4) creation of a communicable disease sentiment dictionary, built by extending an existing dictionary with epidemiological terms and their associated sentiments, which can be used to estimate sentiment in the area of public health; (5) creation of a streamgraph-inspired technique to display the evolution of topics over time, incorporating known trends to allow for comparison; and (6) integration of our clustering, categorization, sentiment analysis, and visualization techniques into an interactive web-based tool that allows domain experts to monitor dengue fever. This tool can be used as the basis for monitoring other communicable diseases in the future.