Information Extraction in Illicit Web Domains

Mayank Kejriwal, Information Sciences Institute, USC Viterbi School of Engineering, kejriwal@isi.edu
Pedro Szekely, Information Sciences Institute, USC Viterbi School of Engineering, pszekely@isi.edu


ABSTRACT

Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have ‘long tails’ and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment.

Keywords

Information Extraction; Named Entity Recognition; Illicit Domains; Feature-agnostic; Distributional Semantics

1. INTRODUCTION

Building knowledge graphs (KG) over Web corpora is an important problem that has galvanized effort from multiple communities over two decades [12], [29]. Automated knowledge graph construction from Web resources involves several different phases. The first phase involves domain discovery, which constitutes identification of sources, followed by crawling and scraping of those sources [7]. A contemporaneous ontology engineering phase is the identification and design of key classes and properties in the domain of interest (the domain ontology) [33].

Once a set of (typically unstructured) data sources has been identified, an Information Extraction (IE) system needs to extract structured data from each page in the corpus [11], [14], [21], [15]. In IE systems based on statistical learning, sequence labeling models like Conditional Random Fields (CRFs) can be trained and used for tagging the scraped text from each data source with terms from the domain ontology [24], [15]. With enough data and computational power, deep neural networks can also be used for a range of collective natural language tasks, including chunking and extraction of named entities and relationships [10].

While IE has been well-studied both for cross-domain Web sources (e.g. Wikipedia) and for traditional domains like biomedicine [32], [20], it is less well-studied (Section 2) for dynamic domains that undergo frequent changes in content and structure. Such domains include news feeds, social media, advertising, and online marketplaces, but also illicit domains like human trafficking. Automatically constructing knowledge graphs containing important information like ages (of human trafficking victims), locations, prices of services and posting dates over such domains could have widespread social impact, since law enforcement and federal agencies could query such graphs to glean rapid insights [28].

Illicit domains pose some formidable challenges for traditional IE systems, including deliberate information obfuscation, non-random misspellings of common words, high occurrences of out-of-vocabulary and uncommon words, frequent (and non-random) use of Unicode characters, sparse content and heterogeneous website structure, to only name a few [28], [1], [13]. While some of these characteristics are shared by more traditional domains like chat logs and Twitter, both information obfuscation and extreme content heterogeneity are unique to illicit domains. While this paper only considers the human trafficking domain, similar kinds of problems are prevalent in other illicit domains that have a sizable Web (including Dark Web) footprint, including terrorist activity, and sales of illegal weapons and counterfeit goods [9].

As real-world illustrative examples, consider the text fragments ‘Hey gentleman im neWYOrk and i’m looking for generous...’ and ‘AVAILABLE NOW! ?? - (4 two 4) six 5 two - 0 9 three 1 - 21’. In the first instance, the correct extraction for a Name attribute is neWYOrk, while in the second instance, the correct extraction for an Age attribute is 21. It is not obvious what features should be engineered in a statistical learning-based IE system to achieve robust performance on such text.

To compound the problem, wrapper induction systems from the Web IE literature cannot always be applied in such domains, as many important attributes can only be found in text descriptions, rather than template-based Web extractors that wrappers traditionally rely on [21]. Constructing

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License.
WWW ’17 Perth, Australia
ACM 978-1-4503-4913-0/17/04.
http://dx.doi.org/10.1145/3038912.3052642
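As a rough illustration of why hand-engineered features struggle on the two text fragments quoted in the introduction, the following Python sketch applies two naive rules to them: a capitalization rule for Name candidates, and a rule that maps number words to digits and strips separators to recover a phone number. The function names and both rules are hypothetical, chosen only for illustration; they are not taken from the paper and are not its method.

```python
import re
from typing import List

# Hypothetical, illustrative rules only; nothing here comes from the paper.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
_NUMBER_WORD_RE = re.compile(
    r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE
)

NAME_FRAGMENT = "Hey gentleman im neWYOrk and i'm looking for generous..."
PHONE_FRAGMENT = "AVAILABLE NOW! ?? - (4 two 4) six 5 two - 0 9 three 1 - 21"


def naive_name_candidates(text: str) -> List[str]:
    """Return tokens with conventional name casing (one capital, then lowercase)."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def digits_after_normalization(text: str) -> str:
    """Map spelled-out digits to numerals, then drop every non-digit character."""
    normalized = _NUMBER_WORD_RE.sub(
        lambda m: NUMBER_WORDS[m.group(1).lower()], text
    )
    return re.sub(r"\D", "", normalized)


if __name__ == "__main__":
    # The casing rule proposes 'Hey' and never sees the obfuscated 'neWYOrk'.
    print(naive_name_candidates(NAME_FRAGMENT))        # ['Hey']
    # Stripping separators fuses the trailing age '21' into the phone digits.
    print(digits_after_normalization(PHONE_FRAGMENT))  # 424652093121
```

On these two fragments the casing rule returns only the wrong token, and the separator-stripping rule yields a 12-digit string in which the Age value 21 can no longer be told apart from the phone number. Each rule could be patched for these inputs, but the patches would be specific to one obfuscation style.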
