Network Aided Classification and Detection of Data

2018-12-13T18:24:00Z (GMT) by Vinay Uday Prabhu
Two important technological aspects of the big data paradigm have been the emergence of massive-scale Online Social Networks (OSNs), such as Facebook and Twitter, and the rise of the open data movement, which has resulted in the creation of richly structured online datasets such as Wikipedia, Citeseer and the US federal government’s data.gov initiative. The OSNs and online datasets cited above share a common feature: they can be thought of as Online Information Graphs, in the sense that the information embedded in them has a natural graph structure.

In this thesis, we consider using this underlying Online Information Graph as a statistical prior to enhance classification accuracy in some hard machine learning problems. Specifically, we look at instances where the graph is undirected and propose using the graph to define an Ising Markov Random Field (MRF) prior. To begin with, we validate the Ising prior using a novel approach based on a hypothesis testing framework. Having validated the Ising prior, we demonstrate its utility by showcasing Network Aided Vector classification (NAC) of real-world data from fields as varied as vote prediction in the US Senate, movie earnings level classification (using the IMDb dataset) and county crime-level classification (using US census data). We then consider a special case of the classification problem that involves Network Aided Detection (NAD) of a global sentiment in an OSN. To this end, we consider Latent Sentiment (LS) detection as well as Majority Sentiment detection. We analyze the performance of the trivial sentiment detector for LS detection using a novel communications-oriented viewpoint, in which we view the underlying network as providing a weak channel code that transmits one bit of information (the binary sentiment), and perform error exponent analysis for various underlying graph models.
We also address the problem of optimal Maximum A Posteriori probability (MAP) detection of majority sentiment in the highly noisy labels, weak network effect (NW) scenario. We derive the High Temperature (HT) expansion formula for the partial partition function of the Ising model using the code-puncturing idea from coding theory, and then propose an approximate MAP detector that outperforms both the Maximum Likelihood (ML) detector and the trivial detector.
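
To make the detection setting concrete, the following is a minimal illustrative sketch and not the thesis's own method. It assumes a zero-field homogeneous Ising prior with a single coupling parameter `theta`, labels in {-1, +1}, and independent label noise with flip probability `flip_prob`. All function names are hypothetical. The exact MAP detector here is computed by brute-force enumeration, which is only feasible for very small graphs; the thesis instead proposes an approximate detector based on the HT expansion. The `majority_vote` function is one plausible reading of the "trivial detector", which is an assumption.

```python
import itertools
import math


def ising_log_weight(x, edges, theta):
    """Unnormalized log-probability of labels x (each +1 or -1) under an
    assumed zero-field Ising prior on an undirected graph."""
    return theta * sum(x[i] * x[j] for i, j in edges)


def majority_vote(y):
    """Assumed 'trivial' detector: sign of the sum of the noisy labels."""
    return 1 if sum(y) >= 0 else -1


def map_majority_sentiment(y, edges, theta, flip_prob):
    """Brute-force MAP detection of the majority sentiment (+1 or -1).

    Accumulates posterior weight over every label configuration x,
    grouped by the sign of sum(x). Exponential in len(y), so this is
    for illustration on tiny graphs only.
    """
    n = len(y)
    # P(y_i | x_i) is proportional to exp(h * x_i * y_i) for this noise model.
    h = 0.5 * math.log((1.0 - flip_prob) / flip_prob)
    weight = {1: 0.0, -1: 0.0}
    for x in itertools.product((-1, 1), repeat=n):
        total = sum(x)
        if total == 0:
            continue  # ties carry no majority; cannot occur for odd n
        agreement = sum(a * b for a, b in zip(x, y))
        log_w = ising_log_weight(x, edges, theta) + h * agreement
        weight[1 if total > 0 else -1] += math.exp(log_w)
    return 1 if weight[1] >= weight[-1] else -1


# Three-node path graph with one dissenting noisy label.
edges = [(0, 1), (1, 2)]
noisy_labels = (1, 1, -1)
decision = map_majority_sentiment(noisy_labels, edges, theta=0.5, flip_prob=0.2)
```

On this toy input both the brute-force MAP detector and the majority vote decide on positive sentiment. The regime of interest in the abstract (highly noisy labels, weak network effect) is one where such detectors can disagree, and where exact enumeration is replaced by an approximation.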