Cleaning Up the Neighborhood: Duplicate Detection and Community Analysis of Hollenbeck Gangs
Ryan de Vera, Anna Ma, Daniel Moyer, Brendan Schneiderman
August 8, 2012

Outline
◮ Introduction
  ◮ Background
  ◮ Our Problem
◮ Data Cleaning
  ◮ String Cleaning
  ◮ Results and Data Sparsity
◮ Spectral Clustering
  ◮ Implementation
  ◮ Results
◮ Modularity and Multi-Slice Modularity
  ◮ Multiplex Methods
◮ Intergang Relations
  ◮ Intergang Analysis
◮ Future Work
◮ Acknowledgements
Hollenbeck
◮ 200,000 residents, 15.2 square miles
◮ 19 miles east of UCLA
◮ Home to 31 distinct gangs
◮ Bordered by the Los Angeles River, Vernon, and several freeways
  ◮ Creates social insulation, making it desirable for sociological study
Data Collection
◮ Every time the police stop to talk to someone, they fill out a "Field Interview (FI) Card".
◮ Includes Name, Address, SSN, Gang Affiliation, Moniker, Location of stop, etc.
◮ Gang members are typically honest about gang affiliation.
◮ This data was collected, stored, and given to us by the LAPD.
Task 1: Data Cleaning
◮ Miscommunications, mistakes, and inconsistencies in the data
  ◮ e.g. "Aug 18 2007" vs "18-08-07"
◮ Need to eliminate any duplicates to create the most accurate social data
◮ Very large initial data set: over 34,000 entries!
Task 2: Data Analysis
◮ Spectral Clustering
  ◮ Our runs are modeled after Van Gennip and Hunter et al. and the 2011 UCLA REU
◮ Modularity:
  ◮ Implement another clustering algorithm and compare its results to spectral clustering
◮ Intergang Communities:
  ◮ Analyze incidents involving different gangs
Data Cleaning
◮ Initially provided with a large Excel sheet
  ◮ 34303 entries, 71 fields
  ◮ Each entry is a single entry on an FI Card
◮ Want to identify duplicate entries of people

Last    First   M.I.  OLN      GangAff
Bruin   Joe                    C.E. Young Crew
Bruin   Joseph  D.    E123456  Charles E. Young Crew
Trojan  Tommy   A.    N654321  SoCal Uni
Matching People
◮ Want to match Joe, Joey, and Jeoy, but also Shadow, Ghost Shadow, and Shadow/Killer
◮ Jaro-Winkler distance
\[ \mathrm{JaroDist}_{1,2} = \frac{1}{3}\left( \frac{\lambda}{|S_1|} + \frac{\lambda}{|S_2|} + \frac{\lambda - t}{\lambda} \right) \]
  where \(\lambda\) is the number of matching characters, \(t\) is half the number of transposed matches, and \(|S_1|\), \(|S_2|\) are the string lengths
◮ Tokenization via softTFIDF scheme and then application of Jaro-Winkler
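As a rough illustration of the Jaro formula on this slide, the Python sketch below computes the Jaro similarity and the Winkler prefix adjustment. The function names, the match window of max(|S1|, |S2|)/2 - 1, and the prefix scale p = 0.1 with a 4-character cap are the conventional textbook choices and are assumptions here, not details taken from the slides. The softTFIDF tokenization step is not shown.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity: (1/3) * (m/|s1| + m/|s2| + (m - t)/m)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Assumption: the usual match window, half the longer length minus one.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0  # number of matching characters (lambda on the slide)
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear out of order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix (up to 4 chars)."""
    sim = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


if __name__ == "__main__":
    for a, b in [("Joe", "Joey"), ("Joey", "Jeoy"), ("Shadow", "Ghost Shadow")]:
        print(a, b, round(jaro_winkler(a, b), 4))
```

Pairs such as Joey/Jeoy score highly because a transposition only reduces the third term. Applied to whole strings, multi-word monikers such as "Ghost Shadow" are handled poorly, which is presumably why the slide pairs the measure with token-level softTFIDF matching.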
Matching People - cont.
Results
◮ 34303 entries → 8834 self-reported gang members → 3163 unique gang members
◮ 22610 distinct FI card numbers → 2987 events (with at least one gang member)
◮ Sparsity of Data
  ◮ 1633 singletons (never seen with another gang member)
  ◮ ∼0.5% of expected intragang connections observed
    ◮ Last year: 2.66%
  ◮ Average degree per person: 1.65 ± 3.17
Spectral Clustering
Why Spectral Clustering?
◮ It is simple to implement
◮ Can be solved efficiently
◮ Applications range across statistics, computer science, biology, and the social sciences
◮ Determining the communities into which gang members in Hollenbeck organize themselves is an important step toward determining their behavior
◮ Extend last year's REU paper, with hopes of less sparse data and therefore better results
How it works
◮ Goal: divide data points into distinct clusters
◮ Create a normalized affinity matrix that includes both geographic and social data
◮ Compute the eigenvectors of the affinity matrix
◮ Use k-means to separate the data into distinct clusters
  ◮ Embed data points in the space spanned by the first k eigenvectors
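The steps listed on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenters' implementation: it assumes the symmetric normalization D^(-1/2) W D^(-1/2), row-normalized eigenvectors, and a small hand-written k-means with deterministic farthest-point initialization. All function names are hypothetical.

```python
import numpy as np


def spectral_embedding(W, k):
    """Embed points using the top-k eigenvectors of D^-1/2 W D^-1/2."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    positive = deg > 0
    inv_sqrt[positive] = 1.0 / np.sqrt(deg[positive])
    # Assumption: symmetric normalization of the affinity matrix.
    L = W * inv_sqrt[:, None] * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    X = vecs[:, -k:]             # eigenvectors of the k largest eigenvalues
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)


def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clusters(W, k):
    """Cluster the rows of the spectral embedding with k-means."""
    return kmeans(spectral_embedding(W, k), k)


if __name__ == "__main__":
    # Toy affinity: two tight groups of three with weak cross-group links.
    W = np.full((6, 6), 0.01)
    W[:3, :3] = 1.0
    W[3:, 3:] = 1.0
    print(spectral_clusters(W, 2))
```

On the toy matrix the two groups of three receive different labels. With data as sparse as the FI-card network, the choice of k and of the normalization would matter far more than in this example.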
Normalized Affinity Matrix
\[ W_{i,j} = \alpha S_{i,j} + (1 - \alpha)\, e^{-d_{i,j}^{2} / (\sigma_i \sigma_j)} \]
\[ S_{i,j} = \begin{cases} 1 & \text{if } i \text{ has met } j \\ 0 & \text{otherwise} \end{cases} \]
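The affinity formula on this slide translates directly into array operations. In the sketch below, `S` is the 0/1 social matrix, `D` holds the pairwise distances d_{i,j}, and `sigma` holds one scale per person. How the scales σ_i are chosen is not stated on the slide, so they are taken as an input. The function name is hypothetical.

```python
import numpy as np


def affinity_matrix(S, D, sigma, alpha):
    """W = alpha * S + (1 - alpha) * exp(-D**2 / (sigma_i * sigma_j))."""
    S = np.asarray(S, dtype=float)
    D = np.asarray(D, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Outer product gives the sigma_i * sigma_j scaling for every pair.
    geographic = np.exp(-(D ** 2) / np.outer(sigma, sigma))
    return alpha * S + (1.0 - alpha) * geographic


if __name__ == "__main__":
    S = [[0, 1], [1, 0]]          # persons 0 and 1 have met
    D = [[0.0, 2.0], [2.0, 0.0]]  # distance between their locations
    print(affinity_matrix(S, D, sigma=[2.0, 2.0], alpha=0.5))
```

Setting alpha = 1 recovers the purely social matrix S, and alpha = 0 gives the purely geographic kernel, so alpha controls how much weight the observed meetings carry.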
Clustering Structures Embedded in the Eigenvectors
Results of Spectral Clustering Algorithm
\[ \mathrm{Purity} = \frac{1}{N} \sum_{k} \max_{j} \left| \omega_k \cap c_j \right| \]
Z-Rand: the number of standard deviations by which \(\omega_{1,1}\) is removed from its mean value under a hypergeometric distribution of equally likely assignments
Reference Z-Rand: 1030
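The purity score can be computed by counting, for each cluster, its most common true class. The short sketch below assumes the clusters ω_k and the true classes c_j (here, gang affiliations) are given as two parallel label sequences. The function name and the toy labels are illustrative only. Z-Rand is not covered.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Purity = (1/N) * sum over clusters of the largest class overlap."""
    if len(cluster_labels) != len(class_labels):
        raise ValueError("label sequences must have the same length")
    n = len(cluster_labels)
    if n == 0:
        raise ValueError("at least one data point is required")
    counts_by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[cls] += 1
    # For each cluster, keep the size of its best-matching class.
    return sum(max(c.values()) for c in counts_by_cluster.values()) / n


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    gangs = ["a", "a", "b", "b", "b", "c"]
    print(purity(clusters, gangs))
```

Purity rewards many small clusters (one cluster per point scores 1.0), which is one reason to report it alongside a chance-corrected measure such as Z-Rand.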