Handling Missing Values in Joint Sequence Analysis
Alexandra Ballow (Youngstown State University, Lawrence Berkeley National Laboratory)
SRC - Undergraduate
This study develops methodologies to minimize the effects of incomplete data. Specifically, it aims to reduce the noise and bias that data gaps introduce into categorical sequence data. The strategies investigated include choosing a substitution “cost” for missing values and deleting the missing values at the end of a sequence. Cluster validity metrics are used to assess the quality of the unsupervised clustering results, and t-SNE is employed to visualize clusters and age biases. In these tests, deleting missing values provided the best results, although the best strategy may differ from one data set to another. This study therefore recommends applying the studied procedures before analyzing longitudinal sequence data, to help ensure the results are unbiased. Once the data have been prepared in this way, clustering is conducted to understand the relationship between a person’s state in life and their travel.
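The two gap-handling strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: the `None` gap marker, the function names, the state labels, and all cost values are hypothetical, and the distance is a generic edit distance with insertion/deletion and substitution costs, chosen here only because it is a common way to compare categorical sequences.

```python
MISSING = None  # hypothetical marker for a data gap


def trim_trailing_missing(seq):
    """Strategy 1: drop the run of missing values at the end of a sequence."""
    end = len(seq)
    while end > 0 and seq[end - 1] is MISSING:
        end -= 1
    return seq[:end]


def substitution_cost(a, b, missing_cost=1.0, mismatch_cost=2.0):
    """Strategy 2: charge a dedicated cost when one state is missing.

    Assumption: two identical states, including two gaps, cost nothing.
    """
    if a == b:
        return 0.0
    if a is MISSING or b is MISSING:
        return missing_cost
    return mismatch_cost


def sequence_distance(s, t, indel=1.0, missing_cost=1.0, mismatch_cost=2.0):
    """Edit distance between two state sequences (dynamic programming)."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel,       # delete a from s
                curr[j - 1] + indel,   # insert b into s
                prev[j - 1] + substitution_cost(a, b, missing_cost, mismatch_cost),
            ))
        prev = curr
    return prev[-1]


# Hypothetical yearly states: "S" = student, "W" = working.
observed = ["S", "W", "W", MISSING, MISSING]
complete = ["S", "W", "W"]

raw = sequence_distance(observed, complete)
trimmed = sequence_distance(trim_trailing_missing(observed), complete)
```

In this toy case the trailing gaps alone make two otherwise identical sequences look different, while trimming removes that difference. Lowering `missing_cost` has a related effect for gaps in the middle of a sequence, where trimming does not apply.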