  1. Neural Embeddings for Populated GeoNames Locations
     Mayank Kejriwal, Pedro Szekely
     USC Information Sciences Institute

  2. Motivation: feature extraction from locations
     • Essential for machine learning problems involving locations

  3. Machine learning applications
     • Toponym resolution, e.g. "Boston" in England, UK vs. "Boston" in Massachusetts, USA
     • Much more likely to be Boston, MA if ‘New York’ and ‘Martha’s Vineyard’ were also extracted in a similar context
     • Features are hybrid, i.e. they must encode both location and ‘context’ (e.g. text)

  4. Machine learning applications (cont.)
     • Named entity disambiguation, e.g. was ‘Charlotte’ extracted as a name or a location?

  5. Motivation: feature extraction from locations (cont.)
     • Essential for machine learning problems involving locations
     • Why not use latitude-longitude directly?

  6. What makes for a ‘good’ feature space?
     • Captures proximity semantics
     • Real-valued, not very high-dimensional
     • Not too sensitive (1.0 vs. 1.001)
     • Extensible
     • Does not necessarily require manual tuning
     • Generic, i.e. can be visualized in some way

  7. Do lat-long points capture proximity semantics?
     • Only in a very dense, non-linear space

  8. More formally...
     • dist(lat1, long1, lat2, long2) is well-approximated by the Haversine formula
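For reference, a minimal Python sketch of the Haversine approximation (not from the paper; the function name and spherical-Earth radius constant are illustrative):

```python
# Minimal Haversine sketch: great-circle distance between two points
# given in degrees. EARTH_RADIUS_KM assumes a spherical Earth.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, long1, lat2, long2):
    """Approximate geodesic distance in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(long2 - long1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```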

  9. Do lat-long points capture proximity semantics?
     • Discontinuous (in linear space)!
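The discontinuity is easiest to see at the antimeridian: two points that are geodesically close look maximally far apart in raw (lat, long) space. A quick check, reusing the haversine_km sketch above:

```python
# Two points straddling the 180° meridian: ~22 km apart on the globe,
# but nearly 360 "degrees" apart under naive Euclidean distance.
p, q = (0.0, 179.9), (0.0, -179.9)

euclidean = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
print(euclidean)             # 359.8  -- looks maximally distant
print(haversine_km(*p, *q))  # ~22.2  -- actually very close
```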

  10. Do lat-long points capture proximity semantics?
     • Sensitive (more than other features typically used in machine learning pipelines)

  11. What makes for a good feature space? (recap)
     • Captures proximity semantics
     • Real-valued, not very high-dimensional
     • Not too sensitive (1.0 vs. 1.001)
     • Extensible
     • Does not necessarily require manual tuning
     • Generic, i.e. can be visualized in some way

  12. Idea: ‘Embed’ GeoNames as a weighted, directed network...
     • ...in a vector space!
     • Vector similarities (using dot-product similarity) depend inversely on geodesic distances
     [Figures: 2-dimensional un-normalized embeddings (latitude-longitude) in complex, sensitive space; 100-dimensional normalized embeddings in dot-product space]
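As a sketch of the intended geometry (illustrative only, not the paper's code): after L2-normalization, dot-product similarity is cosine similarity, and the embeddings are trained so that it decays as geodesic distance grows:

```python
import numpy as np

def dot_similarity(u, v):
    """Dot product of L2-normalized vectors, i.e. cosine similarity."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v)))
```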

  13. Step 1: Determine set of nodes in network
     • Nodes in GeoNames identified by the following feature codes: ['PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLCH', 'PPLF', 'PPLG', 'PPLH', 'PPLL', 'PPLQ', 'PPLR', 'PPLS', 'PPLW', 'PPLX', 'STLMT']
     • ~4.4 million nodes
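A sketch of this node-selection step, assuming the standard tab-separated GeoNames allCountries.txt dump (column positions follow the GeoNames readme; function and variable names are illustrative):

```python
import csv

# Populated-place feature codes from the slide above.
POPULATED_CODES = {
    'PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLCH', 'PPLF',
    'PPLG', 'PPLH', 'PPLL', 'PPLQ', 'PPLR', 'PPLS', 'PPLW', 'PPLX', 'STLMT',
}

def read_populated_nodes(path='allCountries.txt'):
    """Yield (geonameid, lat, long, population) for populated places."""
    with open(path, encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
            # GeoNames columns: 0=geonameid, 4=latitude, 5=longitude,
            # 7=feature code, 14=population.
            if row[7] in POPULATED_CODES:
                yield (row[0], float(row[4]), float(row[5]), int(row[14] or 0))
```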

  14. Step 2: Determine edges and weights
     • Pairwise in the worst case
     • Slide a window over nodes sorted by latitude or longitude; only form edges between nodes in the same window
     • Postprocess by removing nodes with 0 population
     • Resulting network: ~357,000 nodes, ~9 million edges
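A sketch of the sliding-window construction; the window size and the inverse-distance edge weight are assumptions, since the slide does not give the exact weighting scheme (haversine_km and the node tuples come from the earlier sketches):

```python
def window_edges(nodes, window=50):
    """nodes: iterable of (node_id, lat, long, population) tuples.
    Slide a window over nodes sorted by latitude and form weighted
    edges only between nodes that fall inside the same window."""
    # Drop 0-population nodes (the deck does this as a postprocessing
    # step; filtered up front here for brevity).
    by_lat = sorted((n for n in nodes if n[3] > 0), key=lambda n: n[1])
    edges = []
    for i, a in enumerate(by_lat):
        for b in by_lat[i + 1 : i + window]:
            d = haversine_km(a[1], a[2], b[1], b[2])
            edges.append((a[0], b[0], 1.0 / (1.0 + d)))  # closer => heavier
    return edges
```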

  15. Step 3: Run DeepWalk on network
     • DeepWalk (Perozzi et al., 2014) is a neural algorithm for embedding the nodes of a graph; it has achieved strong results on network-mining tasks
     • Very fast!
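A minimal DeepWalk-style sketch: truncated random walks over the weighted graph, fed to a skip-gram Word2Vec model (gensim 4.x here; walk counts, lengths, and dimensionality are illustrative, and the released code may differ):

```python
import random
from collections import defaultdict
from gensim.models import Word2Vec

def random_walks(edges, num_walks=10, walk_len=40):
    """Weight-biased truncated random walks, DeepWalk-style."""
    adj = defaultdict(list)
    for u, v, w in edges:
        # The deck describes a directed network; both directions are
        # added here so walks in this sketch cannot dead-end.
        adj[u].append((v, w))
        adj[v].append((u, w))
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _ in range(walk_len - 1):
                nbrs, weights = zip(*adj[walk[-1]])
                walk.append(random.choices(nbrs, weights=weights)[0])
            walks.append(walk)
    return walks

# 100-dimensional skip-gram embeddings, as on slide 12, continuing
# from the read_populated_nodes / window_edges sketches above.
walks = random_walks(window_edges(read_populated_nodes()))
model = Word2Vec(walks, vector_size=100, window=5, sg=1, min_count=0, workers=4)
```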

  16. Example in paper: North Dakota

  17. Vectors, code and raw data all on GitHub (also on figshare):
     https://github.com/mayankkejriwal/Geonames-embeddings
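The release format is not specified on this slide; assuming a word2vec-style text file keyed by GeoNames id, a hypothetical loading-and-querying sketch with gensim:

```python
from gensim.models import KeyedVectors

# Hypothetical filename; check the GitHub/figshare release for the real one.
vectors = KeyedVectors.load_word2vec_format('geonames-embeddings.txt',
                                            binary=False)

# Nearby populated places should rank highest under cosine similarity.
print(vectors.most_similar('5059163', topn=5))  # '5059163' is a made-up key
```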
