Text Learning and Information Extraction

Machine Learning and Text
• Textual data is ubiquitous and ever-important
  – WWW, digital libraries, LexisNexis, Medline, news, …
• Machine learning is required for high performance on key tasks for textual data
  – Retrieval (search, question answering, extraction)
    • Learn to (accurately) compute relevance between query and documents
  – Classification
    • Learn to (accurately) categorize documents
  – Clustering
    • Learn to (accurately) group documents
  – Object identification
    • Learn to (accurately) determine whether textual strings are equivalent

Text as Data
• Representing documents: a continuum of richness
  – Vector-space: text is a |V|-dimensional vector (V is the vocabulary of all possible words), order is ignored (“bag-of-words”)
  – Sequence: text is a string of contiguous tokens/characters
  – Language-specific: text is a sequence of contiguous tokens along with various syntactic, semantic, and pragmatic properties (e.g. part-of-speech features, semantic roles, discourse models)
• Higher representation richness leads to higher computational complexity, more parameters to learn, etc., but may lead to higher accuracy

Natural Language Processing
• An entire field focused on tasks involving syntactic, semantic, and pragmatic analysis of natural language text
  – Examples: part-of-speech tagging, semantic role labeling, discourse analysis, text summarization, machine translation
• Using machine learning methods for automating these tasks is a very active area of research, both for ML and NLP researchers
  – Text-related tasks rely on learning algorithms
  – Text-related tasks present great challenges and research opportunities for machine learning

Information Extraction
• Identify specific pieces of information (data) in an unstructured or semi-structured textual document
• Transform unstructured information in a corpus of documents or web pages into a structured database
• Can be applied to different types of text
  – Newspaper articles, web pages, scientific articles, newsgroup messages, classified ads, medical notes, …
• Can employ output of Natural Language Processing tasks for enriching the text representation (“NLP features”)

Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$mrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with communicating with and controlling voice cards; preferable Dialogic, however, experience with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a Senior level person who can come on board and pick up code with very little training. Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
kimander@memphisonline.com
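To make the step from unstructured text to a structured record concrete, here is a minimal sketch that fills a few slots from the header of the posting above. It assumes hand-written regular expressions, which is a stand-in and not the learned extractors these slides are about; the function name `extract_header_fields` and the slot names chosen here are illustrative. Slots that depend on the free-text body (language, platform, years of experience) are left out because fixed patterns handle them poorly, which is where learning comes in.

```python
import re

# Header of the sample posting, copied from the slide.
POSTING_HEADER = """\
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$mrs@bilbo.reference.com>
"""


def extract_header_fields(text):
    """Fill a few template slots from the posting header (hypothetical helper)."""
    record = {}

    # Subject lines in this newsgroup look like COUNTRY-STATE-TITLE (assumed format).
    subject = re.search(r"^Subject:\s*([A-Z]{2})-([A-Z]{2})-(.+)$", text, re.MULTILINE)
    if subject:
        record["country"] = subject.group(1)
        record["state"] = subject.group(2)
        record["title"] = subject.group(3).strip()

    # Keep only the day, month, and year of the Date header.
    date = re.search(r"^Date:\s*(\d{1,2} \w{3} \d{4})", text, re.MULTILINE)
    if date:
        record["post_date"] = date.group(1)

    # The message identifier is the text between the angle brackets.
    msg_id = re.search(r"^Message-ID:\s*<([^>]+)>", text, re.MULTILINE)
    if msg_id:
        record["id"] = msg_id.group(1)

    return record


record = extract_header_fields(POSTING_HEADER)
for slot, value in record.items():
    print(f"{slot}: {value}")
```

Patterns like these break as soon as a posting deviates from the assumed header layout, so they serve only as a baseline for comparison with learned extractors.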
Extracted Job Template
computer_science_job
id: 56nigp$mrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996

Medline Corpus
TI - Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein
AB - Originally identified as a ‘mitotic cyclin’, cyclin A exhibits properties of growth factor sensitivity, susceptibility to viral subversion and association with a tumor-suppressor protein, properties which are indicative of an S-phase-promoting factor (SPF) as well as a candidate proto-oncogene …
Moreover, cyclin D1 was found to be phosphorylated on tyrosine residues in vivo and, like cyclin A, was readily phosphorylated by pp60c-src in vitro.
In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.
Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity …

Medline Corpus: Relation Extraction
(same abstract as above)

Named Entity Recognition and Linkage
• Without an underlying database, simply recognizing certain named entities can be very useful for better searching and indexing
• Identifying co-referent entities is important for performance: instance of record linkage problem

The Senate Democratic leader, Harry Reid of Nevada, said Tuesday that he would oppose the confirmation of Judge John G. Roberts Jr. as chief justice, surprising both the White House and fellow Democrats still conflicted about how to vote.
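The record linkage view of co-reference can be sketched as a pairwise decision: given two mentions, are they the same entity? The example below assumes token-level Jaccard similarity with a fixed threshold of 0.5, both chosen purely for illustration; a learned approach would fit the similarity function and the threshold from labeled pairs. The helper names and the shortened mention "Harry Reid" are hypothetical; the longer mentions are taken from the news excerpt.

```python
def name_tokens(mention):
    """Lowercase a mention and split it into a set of tokens, dropping periods and commas."""
    cleaned = mention.lower().replace(".", "").replace(",", "")
    return set(cleaned.split())


def jaccard(a, b):
    """Token-level Jaccard similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def same_entity(a, b, threshold=0.5):
    """Declare two mentions co-referent when their similarity reaches the threshold.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return jaccard(a, b) >= threshold


pairs = [
    ("Judge John G. Roberts Jr.", "John G. Roberts Jr."),
    ("Harry Reid of Nevada", "Harry Reid"),
    ("Harry Reid", "John G. Roberts Jr."),
]
for left, right in pairs:
    print(left, "|", right, "|", same_entity(left, right))
```

Token overlap ignores word order and cannot link a pronoun or a title such as "the Senate Democratic leader" to a name, which is one reason richer representations and learned similarity functions are of interest here.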