Neural Embeddings for Populated GeoNames Locations Mayank Kejriwal, Pedro Szekely USC Information Sciences Institute
Motivation: feature extraction from locations • Essential for machine learning problems involving locations
Machine learning applications • Toponym resolution, e.g., "Boston" in England, UK vs. "Boston" in Massachusetts, USA • Much more likely to be Boston, MA if 'New York' and 'Martha's Vineyard' were also extracted in a similar context • Features are hybrid, i.e., must encode both location and 'context' (e.g., text) • Named entity disambiguation, e.g., was 'Charlotte' extracted as a name or a location?
Motivation: feature extraction from locations • Essential for machine learning problems involving locations • Why not use latitude-longitude directly?
What makes for a 'good' feature space? • Captures proximity semantics • Real-valued, not very high-dimensional • Not too sensitive (1.0 vs. 1.001) • Extensible • Does not necessarily require manual tuning • Generic, i.e., can be visualized in some way
Do lat-long points capture proximity semantics? • Only in a very dense, non-linear space
More formally... • dist(lat1, long1, lat2, long2) is well-approximated using the Haversine formula
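The Haversine approximation above can be sketched as follows; the function name and the mean-Earth-radius constant are illustrative choices, not from the slides.

```python
# Minimal sketch of the Haversine great-circle distance between two
# lat-long points given in degrees. Returns kilometers.
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (illustrative constant)

def haversine_km(lat1, long1, lat2, long2):
    """Approximate geodesic distance in km between two lat-long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(long2 - long1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

For example, a quarter of the equator, `haversine_km(0, 0, 0, 90)`, comes out near 10,007 km.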
Do lat-long points capture proximity semantics? • Discontinuous (in linear space)!
Do lat-long points capture proximity semantics? • Sensitive (more than other features typically used in machine learning pipelines)
Idea: 'Embed' GeoNames as a weighted, directed network... • ...in a vector space! • Vector similarities (using dot product similarity) depend inversely on geodesic distances • Figure: 2-dimensional un-normalized embeddings (latitude-longitude) in a complex, sensitive space vs. 100-dimensional normalized embeddings in dot product space
Step 1: Determine set of nodes in network • Nodes in GeoNames identified by the following feature codes: 'PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLCH', 'PPLF', 'PPLG', 'PPLH', 'PPLL', 'PPLQ', 'PPLR', 'PPLS', 'PPLW', 'PPLX', 'STLMT' • ~4.4 million nodes
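The node-selection step above can be sketched as a filter over a GeoNames-style dump; this assumes the tab-separated layout of GeoNames' allCountries.txt, where the feature code is the 8th column, and the function name is an illustrative choice.

```python
# Sketch: keep only GeoNames rows whose feature code marks a populated place.
# Assumes tab-separated rows in the allCountries.txt column order
# (geonameid, name, asciiname, alternatenames, lat, long, feature class,
# feature code, ...); feature code is at index 7.
POPULATED_PLACE_CODES = {
    'PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLCH', 'PPLF',
    'PPLG', 'PPLH', 'PPLL', 'PPLQ', 'PPLR', 'PPLS', 'PPLW', 'PPLX', 'STLMT',
}

def select_nodes(rows):
    """Return only the rows that describe populated places."""
    return [row for row in rows
            if len(row) > 7 and row[7] in POPULATED_PLACE_CODES]
```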
Step 2: Determine edges and weights • Pairwise in the worst case • Slide a window over nodes sorted by latitude or longitude; only form edges between nodes in the same window • Postprocess by removing nodes with 0 population • Result: ~357,000 nodes, ~9 million edges
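The sliding-window trick above can be sketched as follows; the window size is an illustrative parameter, and edge weights (which the slides say depend on geodesic distance) are omitted for brevity.

```python
# Sketch: form edges only between nodes that fall in the same sliding window
# over the latitude-sorted order, avoiding the pairwise worst case.
# `nodes` is a list of (node_id, latitude) pairs; window size is illustrative.
def window_edges(nodes, window=2):
    """Connect each node to the next (window - 1) nodes in sorted order."""
    ordered = sorted(nodes, key=lambda n: n[1])  # sort by latitude
    edges = set()
    for i, (nid, _) in enumerate(ordered):
        for other_id, _ in ordered[i + 1:i + window]:
            edges.add((nid, other_id))
    return edges
```

With ~4.4 million nodes this yields far fewer edges than the ~10^13 of the pairwise worst case, at the cost of missing some long-range pairs.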
Step 3: Run DeepWalk on network • DeepWalk (Perozzi et al., 2014) is a neural algorithm for embedding the nodes of a graph in a vector space; has achieved strong results on network tasks • Very fast!
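DeepWalk's first phase, truncated random walks over the graph, can be sketched as below; the walks would then be fed as "sentences" to a skip-gram model (e.g., gensim's Word2Vec) to produce the node vectors. The graph, walk length, and function name here are illustrative, not the paper's implementation.

```python
# Sketch of DeepWalk phase 1: a truncated random walk over an
# adjacency-list graph. Each walk is later treated as a "sentence"
# for skip-gram training.
import random

def random_walk(graph, start, length, rng=random):
    """One truncated random walk of up to `length` nodes from `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # dead end: truncate the walk early
        walk.append(rng.choice(neighbors))
    return walk
```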
Example in paper: North Dakota
Vectors, code and raw data all on GitHub (also, figshare) https://github.com/mayankkejriwal/Geonames-embeddings