Information Extraction from the World Wide Web
CSE 454

Based on slides by William W. Cohen (Carnegie Mellon University) and Andrew McCallum (University of Massachusetts Amherst), from KDD 2003

Quick Review

Bayes Theorem (Thomas Bayes, 1702-1761)

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

Bayesian Categorization
• Let set of categories be $\{c_1, c_2, \ldots, c_n\}$
• Let $E$ be description of an instance.
• Determine category of $E$ by determining for each $c_i$

$$P(c_i \mid E) = \frac{P(c_i)\,P(E \mid c_i)}{P(E)}$$

• $P(E)$ can be determined since categories are complete and disjoint.

$$\sum_{i=1}^{n} P(c_i \mid E) = \sum_{i=1}^{n} \frac{P(c_i)\,P(E \mid c_i)}{P(E)} = 1$$

$$P(E) = \sum_{i=1}^{n} P(c_i)\,P(E \mid c_i)$$

Naïve Bayesian Motivation
• Problem: Too many possible instances (exponential in $m$) to estimate all $P(E \mid c_i)$
• If we assume features of an instance are independent given the category $c_i$ (conditionally independent):

$$P(E \mid c_i) = P(e_1 \wedge e_2 \wedge \cdots \wedge e_m \mid c_i) = \prod_{j=1}^{m} P(e_j \mid c_i)$$

• Therefore, we then only need to know $P(e_j \mid c_i)$ for each feature and category.
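The Bayesian categorization review can be made concrete with a few lines of Python. This is a minimal sketch, not code from the slides: the function name `naive_bayes_posterior`, the two categories, and every probability value are hypothetical and chosen only to keep the arithmetic easy to follow. It multiplies the prior by the per-feature likelihoods (the conditional independence assumption) and then normalizes by P(E), computed as the sum over all categories.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods, instance):
    """Return P(c_i | E) for every category c_i.

    priors:      {category: P(c_i)}
    likelihoods: {category: {feature: P(e_j | c_i)}}
    instance:    list of features e_1..e_m that describe E
    """
    # Numerator of Bayes' rule under the naive assumption:
    # P(c_i) * prod_j P(e_j | c_i)
    joint = {
        c: priors[c] * prod(likelihoods[c][e] for e in instance)
        for c in priors
    }
    # P(E) = sum_i P(c_i) P(E | c_i); this relies on the categories
    # being complete and disjoint.
    p_e = sum(joint.values())
    return {c: joint[c] / p_e for c in joint}


# Hypothetical parameters (assumptions for illustration only).
priors = {"job_posting": 0.5, "other": 0.5}
likelihoods = {
    "job_posting": {"salary": 0.8, "apply": 0.5},
    "other": {"salary": 0.2, "apply": 0.5},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["salary", "apply"])
for category, p in posterior.items():
    print(f"P({category} | E) = {p:.2f}")
```

With these made-up numbers the unnormalized scores are 0.5 · 0.8 · 0.5 = 0.20 and 0.5 · 0.2 · 0.5 = 0.05, so P(E) = 0.25 and the posteriors are 0.80 and 0.20. Only one P(e_j | c_i) per feature and category is stored, which is the saving the naive assumption buys.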
Information Extraction

Example: The Problem
• Martin Baker, a person
• Genomics job
• Employers job posting form

Slides from Cohen & McCallum

Example: A Solution

Extracting Job Openings from the Web

foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: www.foodscience.com/jobs_midwest.htm
  OtherCompanyJobs: foodscience.com-Job1

Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.

What is “Information Extraction”

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE →

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
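To make "filling slots in a database from sub-segments of text" concrete, here is a toy sketch in Python. The slide does not say how the slots are filled, so the hand-written regular expressions below are purely an assumption for illustration. The names `PATTERNS` and `extract_slots` are hypothetical, and `TEXT` is an abridged version of the news excerpt above.

```python
import re

# Abridged from the news excerpt on the slide.
TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates railed against the "
    "economic philosophy of open-source software. "
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered."
)

# Building blocks (assumptions): a name is two capitalized words, an
# organization is one or more capitalized words, titles come from a tiny list.
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
ORG = r"(?P<org>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
TITLE = r"(?P<title>CEO|VP|founder)"

PATTERNS = [
    re.compile(rf"{ORG} {TITLE} {NAME}"),          # "<org> CEO <name>"
    re.compile(rf"{NAME}, an? {ORG} {TITLE}"),     # "<name>, a <org> VP"
    re.compile(rf"{NAME}, {TITLE} of the {ORG}"),  # "<name>, founder of the <org>"
]


def extract_slots(text):
    """Return one {NAME, TITLE, ORGANIZATION} record per pattern match."""
    rows = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            rows.append(
                {"NAME": m["name"], "TITLE": m["title"], "ORGANIZATION": m["org"]}
            )
    return rows


rows = extract_slots(TEXT)
for row in rows:
    print(f'{row["NAME"]:<18}{row["TITLE"]:<10}{row["ORGANIZATION"]}')
```

Each match turns a sub-segment of the text into one database row. Patterns like these are brittle: they miss any phrasing not anticipated by the three templates, and they return "Microsoft Corporation" where the slide's table shows the shorter "Microsoft".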