Chinese Event Extraction 复旦大学大数据学院 School of Data Science, Fudan University 杨依莹 2017.11.22
Outline
1. ACE program (current section)
2. Assignment 3: Chinese event extraction
3. CRF++: Yet Another CRF toolkit
ACE program
Automatic Content Extraction (ACE) program:
• The objective of the ACE program was to develop extraction technology to support automatic processing of source language data (in the form of natural text and of text derived from ASR and OCR).
• The program covers English, Arabic and Chinese texts.
• The ACE corpus is one of the standard benchmarks for testing new information extraction algorithms.
ACE program
Automatic Content Extraction (ACE) program: given a text in natural language, the ACE challenge is to detect:
1. Entities mentioned in the text, such as persons, organizations, locations, facilities and weapons.
2. Relations between entities, such as "person A is the manager of company B". Relation types include role, part, located, near and social.
3. Events mentioned in the text, such as interaction, movement, transfer, creation and destruction.
ACE program
Automatic Content Extraction (ACE) program: an example of text [figure]
ACE program: entity
Entity Detection and Tracking (EDT)
• ACE tasks identified seven types of entities: Person, Organization, Location, Facility, Weapon, Vehicle and Geo-Political Entity (GPE). Each type was further divided into subtypes.
• For every mention, the annotator identified the maximal extent of the string that represents the entity and labeled the head of each mention.
• Nested mentions were also captured.
ACE program: relation
• Relation Detection and Characterization (RDC) involved the identification of relations between entities.
• For every relation, annotators identified two primary arguments (namely, the two ACE entities that are linked) as well as the relation's temporal attributes.
ACE program: relation
• Create new structured knowledge bases, useful for any application
• Augment current knowledge bases
  • Add words to the WordNet thesaurus, or facts to Freebase or DBpedia
  • DBpedia: an ontology derived from Wikipedia containing over 2 billion RDF triples.
  • Freebase: a dataset derived from Wikipedia infoboxes.
  • On 16 December 2015, Google officially announced the Knowledge Graph API, which is meant to replace the Freebase API.
• Support question answering
  • The granddaughter of which actor starred in the movie “E.T.”?
  • (acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
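The question-answering bullet can be made concrete with a small sketch. The code below is only an illustration, not how Freebase or DBpedia are actually queried: it stores a handful of toy (subject, predicate, object) triples in a Python list and answers the conjunctive query by binding the variables ?x and ?y. The names TRIPLES, match and query are hypothetical helpers introduced here.

```python
# Toy triple store: (subject, predicate, object). Hand-written facts for illustration.
TRIPLES = [
    ("Drew Barrymore", "acted-in", "E.T."),
    ("Henry Thomas", "acted-in", "E.T."),
    ("John Barrymore", "is-a", "actor"),
    ("Drew Barrymore", "granddaughter-of", "John Barrymore"),
]


def match(pattern, triple, bindings):
    """Unify one query pattern with one triple.

    Returns the extended variable bindings, or None if they conflict.
    """
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # Bind the variable, or check it against its existing value.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(patterns, triples):
    """Answer a conjunction of patterns by extending bindings pattern by pattern."""
    results = [{}]
    for pattern in patterns:
        extended = []
        for bindings in results:
            for triple in triples:
                b = match(pattern, triple, bindings)
                if b is not None:
                    extended.append(b)
        results = extended
    return results


answers = query(
    [("?x", "acted-in", "E.T."), ("?y", "is-a", "actor"), ("?x", "granddaughter-of", "?y")],
    TRIPLES,
)
print(answers)
```

The point of the sketch is that once relations are extracted into triples, a question reduces to pattern matching over structured facts.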
ACE program: relation
Automatic Content Extraction (ACE) program:
• 7 types and 17 subtypes of relations from the “Relation Extraction Task”. The figure lists the following types and subtypes:
  • PERSON-SOCIAL: Family, Lasting Personal, Business
  • PHYSICAL: Near, Located
  • GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
  • PART-WHOLE: Subsidiary, Geographical
  • ORG AFFILIATION: Investor, Founder, Student-Alum, Ownership, Employment, Membership, Sports-Affiliation
  • ARTIFACT: User-Owner-Inventor-Manufacturer
ACE program: relation
• Physical-Located (PER-GPE): He was in Tennessee
• Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC
• Person-Social-Family (PER-PER): John’s wife Yoko
• Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…
ACE program: relation
• Using patterns to extract relations
• Lexico-syntactic patterns (lexical-syntactic rules)
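As a rough illustration of pattern-based extraction, the sketch below applies one hand-written lexico-syntactic pattern of the form "X such as A, B and C" to produce is-a relations. The pattern choice, the regular expression and the function name extract_is_a are assumptions made for this example; the slide does not say which patterns were used, and a real system would match over parsed noun phrases rather than single words.

```python
import re

# One hand-written pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Single-word noun phrases only; this is a deliberate simplification.
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")


def extract_is_a(sentence):
    """Return (hyponym, 'is-a', hypernym) triples found by the pattern."""
    triples = []
    for m in SUCH_AS.finditer(sentence):
        hypernym = m.group(1)
        for hyponym in re.split(r", | and | or ", m.group(2)):
            if hyponym:
                triples.append((hyponym, "is-a", hypernym))
    return triples


found = extract_is_a("The program covers languages such as English, Arabic and Chinese.")
print(found)
```

Patterns like this tend to be precise but cover few sentences, which is one motivation for the learning-based approaches on the following slides.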
ACE program: relation
• Supervised learning
1. Find all pairs of named entities
2. Decide whether the two entities are related
3. If yes, classify the relation
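The three steps can be sketched as a small pipeline. This is a skeleton under stated assumptions: entities are (text, type) pairs from one sentence, and the two classifiers are passed in as plain functions. The stand-in functions toy_is_related and toy_classify are hypothetical placeholders for trained models, not real classifiers.

```python
from itertools import combinations


def extract_relations(entities, is_related, classify):
    """Run the three-step supervised pipeline over the entities of one sentence."""
    relations = []
    for e1, e2 in combinations(entities, 2):      # step 1: all entity pairs
        if is_related(e1, e2):                    # step 2: binary decision
            relations.append((e1[0], classify(e1, e2), e2[0]))  # step 3: relation type
    return relations


# Hypothetical stand-ins for trained classifiers, for illustration only.
def toy_is_related(e1, e2):
    return e1[1] == "ORG" and e2[1] == "ORG"


def toy_classify(e1, e2):
    return "Part-Whole-Subsidiary"


entities = [("American Airlines", "ORG"), ("AMR", "ORG"), ("Tim Wagner", "PER")]
relations = extract_relations(entities, toy_is_related, toy_classify)
print(relations)
```

Splitting the binary decision from the type classification lets the cheap filter discard most unrelated pairs before the more expensive multi-class step.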
ACE program: relation
• Supervised learning
• The most important step: classification
• e.g. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
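For the classification step, each entity pair is usually turned into a feature dictionary first. The sketch below shows one plausible feature set for the example sentence; the specific features (last token as head word, entity-type pair, words between the mentions) and the function name pair_features are assumptions for illustration, since the slide does not list the features used.

```python
def pair_features(tokens, m1, m2):
    """Build features for a mention pair.

    m1 and m2 are (start, end, entity_type) token spans, end exclusive,
    with m1 occurring before m2.
    """
    s1, e1, t1 = m1
    s2, e2, t2 = m2
    between = tokens[e1:s2]
    return {
        "m1_head": tokens[e1 - 1],   # last token as a crude head-word heuristic
        "m2_head": tokens[e2 - 1],
        "types": t1 + "-" + t2,
        "words_between": sorted(set(between)),
        "num_between": len(between),
    }


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",", "immediately",
          "matched", "the", "move", ",", "spokesman", "Tim", "Wagner", "said", "."]
features = pair_features(tokens, (0, 2, "ORG"), (6, 7, "ORG"))
print(features)
```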
ACE program: relation
• Semi-supervised learning
1. Start from a few high-precision seed patterns or seed tuples.
2. Find sentences that contain the entities of a seed pair.
3. Extract and generalize the context to learn new patterns.
• This may cause semantic drift.
ACE program: relation
• Semi-supervised learning
• To avoid semantic drift, we introduce a confidence value.
• Set conservative confidence thresholds for the acceptance of new patterns and tuples.
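A minimal bootstrapping loop with a confidence threshold might look like the sketch below. Everything here is an assumption for illustration: sentences are pre-reduced to (entity1, context, entity2) triples, a pattern is just the literal context string, and a pattern's confidence is the fraction of its extractions that are already known tuples. The slide does not specify a confidence formula, and the toy sentences are made up.

```python
def bootstrap(sentences, seeds, threshold=0.5, rounds=2):
    """Grow a tuple set from seeds, accepting only patterns above the threshold."""
    tuples, patterns = set(seeds), set()
    for _ in range(rounds):
        # Candidate patterns: contexts that connect an already known tuple.
        candidates = {ctx for e1, ctx, e2 in sentences if (e1, e2) in tuples}
        for ctx in candidates:
            extracted = {(e1, e2) for e1, c, e2 in sentences if c == ctx}
            confidence = len(extracted & tuples) / len(extracted)
            if confidence >= threshold:
                patterns.add(ctx)
        # Accept every tuple matched by an accepted pattern.
        tuples |= {(e1, e2) for e1, c, e2 in sentences if c in patterns}
    return tuples, patterns


# Made-up toy contexts for a founder relation.
sentences = [
    ("Steve Jobs", "co-founder of", "Apple"),
    ("Bill Gates", "co-founder of", "Microsoft"),
    ("Steve Jobs", "visited", "Apple"),
    ("Tim Cook", "visited", "China"),
    ("Barack Obama", "visited", "Japan"),
]
seeds = {("Steve Jobs", "Apple")}

strict_tuples, strict_patterns = bootstrap(sentences, seeds, threshold=0.5)
loose_tuples, loose_patterns = bootstrap(sentences, seeds, threshold=0.3)
print(strict_patterns, loose_patterns)
```

With the stricter threshold only "co-founder of" is accepted; with the looser one the noisy "visited" pattern gets in and pulls in unrelated pairs, which is the semantic drift the slide warns about.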
ACE program: event
Automatic Content Extraction (ACE) program: Event Detection and Characterization (EDC)
Outline
1. ACE program
2. Assignment 3: Chinese event extraction (current section)
3. CRF++: Yet Another CRF toolkit
Description
• In this assignment, you will use sequence labeling models for Chinese event extraction.
• Event information is defined in two parts:
  • Trigger: the main word that most clearly expresses the occurrence of an event.
  • Argument: an entity, temporal expression or value that plays a certain role in the event.
• For example: “英特尔在中国成立了研究中心” (“Intel established a research center in China”)
  • “成立” (“established”) is the trigger, of type Business
  • “英特尔” (“Intel”), “中国” (“China”) and “研究中心” (“research center”) are the arguments, of types Agent, Place and Org respectively
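The example sentence can be written as a labeled word sequence, using the 'T_type' / 'A_type' / 'O' label scheme from the Formal Definition slide. The segmentation below is an assumption made for illustration, and split_event is a hypothetical helper that reads the trigger and arguments back out of the labels. In the assignment itself, triggers and arguments are labeled in separate files.

```python
def split_event(words, labels):
    """Recover the trigger and the role-to-argument mapping from word labels."""
    trigger, arguments = None, {}
    for word, label in zip(words, labels):
        if label.startswith("T_"):
            trigger = (word, label[2:])
        elif label.startswith("A_"):
            arguments[label[2:]] = word
    return trigger, arguments


# Assumed segmentation of the example sentence.
words = ["英特尔", "在", "中国", "成立", "了", "研究中心"]
labels = ["A_Agent", "O", "A_Place", "T_Business", "O", "A_Org"]

trigger, arguments = split_event(words, labels)
print(trigger, arguments)
```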
Description
• This task is separated into two subtasks:
  • Trigger labeling: identify the trigger word in the sentence and classify it into one of 8 types.
  • Argument labeling: identify all the arguments in the sentence and classify them into 35 types (all types can be found in the training file).
• You are required to use both HMM and CRF models for this task. You can use any toolkit for their implementation.
• Note that the performance of the HMM can be very poor.
Formal Definition
• Input: a sequence of segmented Chinese words.
• Output: label each word with ‘T_type’ (trigger), ‘A_type’ (argument) or ‘O’ (neither trigger nor argument). Write your predicted label after the gold label, separated by a tab.
• Figures: Fig. 1: input; Fig. 2: training instance; Fig. 3: testing result
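One way to produce the result layout described above is sketched here. It assumes each output line is "word, gold label, predicted label" separated by tabs, with a blank line between instances; this layout is inferred from the slide text, so check it against the provided files and eval.py. The function name format_results is hypothetical.

```python
def format_results(instances, predictions):
    """Render predictions as tab-separated lines: word, gold label, predicted label.

    instances: list of [(word, gold_label), ...]
    predictions: list of predicted label lists, aligned with instances
    """
    blocks = []
    for instance, predicted in zip(instances, predictions):
        rows = ["\t".join((word, gold, pred))
                for (word, gold), pred in zip(instance, predicted)]
        blocks.append("\n".join(rows))
    # Instances are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


instances = [[("成立", "T_Business"), ("了", "O")], [("中国", "O")]]
predictions = [["T_Business", "O"], ["A_Place"]]
output = format_results(instances, predictions)
print(output)
```

The returned string can then be written with `open("trigger_result.txt", "w", encoding="utf-8")`.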
Provided Files
• trigger_train.txt & trigger_test.txt:
  • These two files contain 1,918 and 669 instances for training and testing, respectively.
  • Each line contains one word and its label, separated by a tab.
  • Instances are separated by a blank line.
• argument_train.txt & argument_test.txt:
  • These two files contain 2,131 and 997 instances for training and testing, respectively.
• Your job is to predict the label sequence for each instance in the test files and write your predictions to the result files. The labels in the test files are only for evaluation.
• eval.py
  • This script reports your model’s accuracy, precision, recall and F1-score.
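The file format above can be parsed with a few lines of Python, and the metrics can be approximated as in the sketch below. This is not the provided eval.py, whose exact definitions are not shown here; the sketch assumes precision and recall are computed over non-'O' labels at the word level. The names read_instances and precision_recall_f1 are hypothetical.

```python
def read_instances(text):
    """Parse tab-separated lines into instances separated by blank lines."""
    instances, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                instances.append(current)
                current = []
        else:
            current.append(tuple(line.split("\t")))
    if current:
        instances.append(current)
    return instances


def precision_recall_f1(gold, pred, null_label="O"):
    """Word-level scores over labels other than null_label (an assumed definition)."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p and p != null_label)
    predicted = sum(1 for p in pred if p != null_label)
    actual = sum(1 for g in gold if g != null_label)
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


sample = "英特尔\tA_Agent\n在\tO\n\n成立\tT_Business\n"
parsed = read_instances(sample)

gold = ["A_Agent", "O", "A_Place", "T_Business", "O", "A_Org"]
pred = ["A_Agent", "O", "O", "T_Business", "A_Org", "A_Org"]
scores = precision_recall_f1(gold, pred)
print(parsed, scores)
```

Reading real files would pass `open(path, encoding="utf-8").read()` to read_instances.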
Submission
• Generate a zip file and name it “sid_homework-3.zip”.
• It should include a Python file named “extraction.py”, two output files named “trigger_result.txt” and “argument_result.txt”, and a written report named “chinese event extraction.pdf”.
• Program: the code should be written in Python.
• Report: the report must be written in English and be no longer than 4 pages.
Evaluation
• We will mark your homework on four criteria:
  • Final accuracy (20%)
  • Program (30%)
  • Report (40%)
  • HMM implementation (10%)
Due
• Submit your homework via the E-learning system.
• Deadline: midnight on December 8th, 2017
• If you have any questions about this homework, send an email to the TA or to our course mailbox.
• TA in charge: 杨依莹 (zoeyangyy@163.com)
Outline
1. ACE program
2. Assignment 3: Chinese event extraction
3. CRF++: Yet Another CRF toolkit (current section)
CRF++: Yet Another CRF toolkit
• CRF++ (http://taku910.github.io/crfpp/) is a simple, customizable, open source implementation of Conditional Random Fields (CRFs) for segmenting and labeling sequential data.
• CRF++ is designed for general-purpose use and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
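To connect CRF++ to the assignment, the sketch below generates a small feature template and builds the training and testing command lines from Python. The template uses the CRF++ `%x[row,col]` macro syntax with `U` (unigram) and `B` (bigram) feature prefixes. The window size, the feature choices and the file names template.txt and model are assumptions for illustration, and the commands are only constructed here, not executed.

```python
def build_template(window=2):
    """Return a CRF++ feature template using words in a +/-window context (column 0)."""
    lines = ["# Unigram features over the word column"]
    for i, offset in enumerate(range(-window, window + 1)):
        lines.append("U%02d:%%x[%d,0]" % (i, offset))
    # One combined feature: previous word together with the current word.
    lines.append("U%02d:%%x[-1,0]/%%x[0,0]" % (2 * window + 1))
    lines.append("")
    lines.append("# Bigram feature over adjacent output labels")
    lines.append("B")
    return "\n".join(lines) + "\n"


def crf_commands(train, test, template="template.txt", model="model"):
    """Build argument lists for crf_learn and crf_test (file names are placeholders)."""
    learn_cmd = ["crf_learn", template, train, model]
    test_cmd = ["crf_test", "-m", model, test]
    return learn_cmd, test_cmd


template = build_template(window=1)
learn_cmd, test_cmd = crf_commands("trigger_train.txt", "trigger_test.txt")
print(template)
print(" ".join(learn_cmd))
print(" ".join(test_cmd))
```

After writing the template to disk, the two argument lists could be passed to `subprocess.run`, assuming the CRF++ binaries are installed and on the PATH. crf_test prints each input line with the predicted tag appended as the last column, which fits the "predicted label after the gold label" layout of the result files.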