Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages
Rahul Kapoor, Mayank Kejriwal and Pedro Szekely
Information Sciences Institute, USC Viterbi School of Engineering
Domain-specific Insight Graphs (DIG)
Geotagging HT webpages
• Disambiguation problem: is "Charlotte" a name or a city? It depends on context!
Geotagging HT webpages
• Toponym Resolution
• Examples
  – "Kansas City" is a city in the state of Missouri as well as in Kansas
  – "Los Angeles" is also a town in Texas, apart from being a city in California
Potential approach: use Geonames
• Open database of geolocations
• Contains 2.8 million populated places in the world, along with 5.5 million alternate names
• Each has a unique id and details of the state, country, latitude, longitude and population
• Due to the large size, we use a Trie-based approach for high-recall dictionary extraction (see the sketch below)
More information at http://www.geonames.org/
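For illustration, a minimal Python sketch of trie-based dictionary matching over a tokenized lexicon; the helper names (build_trie, extract_places) and the tiny sample lexicon are hypothetical, not the actual DIG implementation.

```python
# Minimal sketch of trie-based, high-recall gazetteer extraction.
# build_trie/extract_places and the sample lexicon are illustrative only.

END = "__end__"  # marker for a complete place name

def build_trie(names):
    """Build a token-level trie from an iterable of place names."""
    root = {}
    for name in names:
        node = root
        for token in name.lower().split():
            node = node.setdefault(token, {})
        node[END] = name
    return root

def extract_places(text, trie):
    """Greedily find the longest place-name matches in tokenized text."""
    tokens = text.lower().split()
    matches, i = [], 0
    while i < len(tokens):
        node, j, last = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                last = (i, j, node[END])  # remember the longest match so far
        if last:
            matches.append(last)
            i = last[1]  # skip past the matched span
        else:
            i += 1
    return matches

trie = build_trie(["Kansas City", "Kansas", "Los Angeles", "Minnesota"])
print(extract_places("water falls near Minnesota and Los Angeles", trie))
# [(3, 4, 'Minnesota'), (5, 7, 'Los Angeles')]
```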
Using Geonames lexicon for extractions
• Common words like "the", "makes", "falls" are city names as well
• Some abbreviations used in the text are also marked as cities

Webpage Text                            | High-Recall City Extractions | Actual Extractions
"Want to be the girl that makes you.." | the, makes                   | —
"water falls near Minnesota"            | falls, Minnesota             | Minnesota
"This Cali girl.."                      | Cali                         | —
"AMBER CHASE FEMDOM AVN"                | AVN                          | —
"We provide NOM, DP, ATM, C2C.."        | ATM                          | —
Contexts and constraints are both important
• Constraints reflect domain knowledge ('semantics' of the domain, e.g. that a city is in a state and a state is in a country; also, a priori knowledge)
• Context reflects statistical (aka data-driven) knowledge
Use context to train word embeddings
• Many options in the literature (word2vec, random indexing...)
• Random indexing was found to work well for HT in previous work (see the sketch below)
Example context: "Hi gentlemen Charlotte visiting next ..."
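A minimal sketch of random indexing: each word gets a sparse ternary "index vector", and a word's embedding is accumulated by summing the index vectors of its neighbors. The dimension, density and window size here are illustrative, not the settings used in this work.

```python
# Minimal random-indexing sketch (not the exact setup used in DIG).
import numpy as np

DIM, NONZERO, WINDOW = 200, 4, 2
rng = np.random.default_rng(0)

def index_vector():
    """Sparse ternary vector: a few random +1/-1 entries, rest zero."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def train(corpus):
    """Accumulate each word's context vector from its neighbors' index vectors."""
    index, context = {}, {}
    for sentence in corpus:
        tokens = sentence.lower().split()
        for t in tokens:
            index.setdefault(t, index_vector())
            context.setdefault(t, np.zeros(DIM))
        for i, t in enumerate(tokens):
            for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
                if j != i:
                    context[t] += index[tokens[j]]
    return context

vecs = train(["hi gentlemen charlotte visiting next week",
              "new to charlotte this weekend"])
print(vecs["charlotte"][:5])
```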
Useful for assigning probabilities to extractions
[Figure: t-SNE projection of extractions in a 200-dimensional vector space]
Context-based classifier
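The slides do not name the classifier, so the sketch below assumes a scikit-learn logistic regression over the averaged embeddings of the words surrounding each candidate extraction; featurize and the synthetic training data are hypothetical.

```python
# Hypothetical context classifier: logistic regression over averaged
# context-word embeddings (the actual DIG model may differ).
import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 200

def featurize(tokens, position, context, window=2):
    """Average the embedding vectors of the words around the candidate token."""
    lo, hi = max(0, position - window), min(len(tokens), position + window + 1)
    neighbors = [context[t] for i, t in enumerate(tokens[lo:hi], start=lo)
                 if i != position and t in context]
    return np.mean(neighbors, axis=0) if neighbors else np.zeros(DIM)

# Train on labeled candidates (y = 1 if the token really is a city mention);
# predict_proba then supplies the context probabilities fed into the ILP.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, DIM)), rng.integers(0, 2, size=40)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1]))
```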
How do we encode constraints?
• By itself, context is not enough; more can be done to improve performance!
• Integer Linear Programming (ILP) is an established framework
• Requires manual crafting of:
  • Objective functions
  • Linear constraints
  • Weights
OBJECTIVE FUNCTION WEIGHTS
Token Source Weight
• Captures the relative importance of the source of an extraction
• A city appearing in the title is more important than one appearing in the footer
Context Weight
• Captures which extraction is more likely to be correct, depending on the context
• "I am new to Charlotte" vs. "My name is Charlotte" - in the first sentence, the same word is more likely to be a city than in the second
Population Weight
• Larger cities are more likely to be referred to than smaller cities
• When someone mentions "Los Angeles", they are most likely not referring to the small town in TX but to the much larger city in CA
CONSTRAINTS
Semantic Type Exclusivity
• An extraction marked with multiple semantic types can be only one of them
• Charlotte_City + Charlotte_Name <= 1 means "Charlotte" can be either a city or a person's name, but not both at a time
Extractions of a Semantic Type
• Limits the number of extractions of a given semantic type on a page
• LosAngeles_City + Seattle_City + Houston_City <= 1 means at most one of the cities can be selected
Valid City-State-Country Combination
• The selected city should be in the selected country/state
• LosAngeles_US + NewYorkCity_US <= US means that if one of the cities on the left is selected, the country on the right must be selected
City-State/Country Exclusivity
• The chosen city has a corresponding state/country selected
• Portland_Oregon + Portland_Maine = Portland means that if Portland is selected, one of its corresponding states must be selected
Putting it together...
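A sketch of how the weights and constraints above could be assembled with the PuLP library. The weight values are made up for illustration, and the variables mirror the examples on the preceding slides rather than the actual model.

```python
# Illustrative ILP for geotagging: maximize weighted extraction scores
# subject to the exclusivity/combination constraints from the slides.
import pulp

prob = pulp.LpProblem("geotagging", pulp.LpMaximize)
b = lambda name: pulp.LpVariable(name, cat="Binary")

charlotte_city, charlotte_name = b("Charlotte_City"), b("Charlotte_Name")
portland_or, portland_me = b("Portland_Oregon"), b("Portland_Maine")
oregon, maine = b("Oregon"), b("Maine")

# Objective: product of token-source, context and population weights,
# collapsed here into single illustrative coefficients per candidate.
prob += (0.9 * charlotte_city + 0.4 * charlotte_name
         + 0.7 * portland_or + 0.3 * portland_me)

# Semantic type exclusivity: "Charlotte" is a city or a person's name, not both.
prob += charlotte_city + charlotte_name <= 1
# City-state exclusivity: at most one disambiguation of "Portland",
# and selecting it forces the matching state.
prob += portland_or + portland_me <= 1
prob += portland_or <= oregon
prob += portland_me <= maine

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for v in prob.variables():
    print(v.name, int(v.value()))
```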
EXPERIMENTS
Dataset
• Word embeddings trained on a corpus of 90,000 web pages, using Random Indexing
• Context classifier trained on 75 webpages
• Ground truth for the ILP contained a smaller corpus of 20 webpages drawn from 10 different domains, with 175 geolocation annotations
Comparison
The extractions from the ILP are compared to:
• Random: a random selection from the extractions
• Top Ranked: the highest-ranked extraction according to the context probabilities
Metrics: precision and recall of extractions (computed as in the sketch below)
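A minimal sketch of set-based precision and recall over extracted vs. ground-truth geolocations; the paper's exact matching criteria may differ.

```python
# Set-based precision/recall for extraction evaluation (illustrative).
def precision_recall(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: correct extractions
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

print(precision_recall({"Charlotte", "Houston"}, {"Charlotte", "Seattle"}))
# (0.5, 0.5)
```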
Results

Model      | Precision | Recall
Random     | 0.500     | 0.357
Top Ranked | 0.615     | 0.571
ILP        | 0.786     | 0.786
Future Work
Using Probabilistic Soft Logic (PSL) as an alternative way to model the problem

ILP                                                        | PSL
As the factors affecting selection increase, weights must be combined by hand for the objective function | A probabilistic model with continuous random variables allows multiple factors to be captured
Not possible to model complex relations which affect extraction selection | Can model them using a First-Order Logic representation
Each extraction is either selected or not selected          | Each extraction can be assigned an expectation value
May take time to optimize                                   | Soft truth values enable faster convergence

Refer: http://psl.linqs.org/