Finding Gene and Chemical Names in Patent Text
Supervised Learning with no Training Data
Slides by Kuzman Ganchev
The present invention relates to the use of immunomodifying imidazoquinoline amines, imidazopyridine amines, 6,7-fused cycloalkylimidazopyridine amines, and 1,2-bridged imidazoquinoline amines to inhibit T helper-type 2 (TH2) immune response and thereby treat TH2 mediated diseases. It also relates to the ability of these compounds to inhibit induction of interleukin (IL)-4 and IL-5, and to suppress eosinophilia.
– US Patent Number 6,610,319
1 Introduction
  • Why Tag Patents for Entities?
  • Example output
  • Demonstration
2 How the tagging was done
  • The Gene Tagger
  • The Chemical Tagger
3 Future Work
  • Much (more) Ado about Patents
  • Emerging Research Questions
Motivation
Why do we care about patents?
• Large collection of public information
• Primary resource for drug publications
• Drug patents often have long lists of chemicals that can be a good start for research (e.g. inhibitors of proteins similar to the patent protein)
Motivation
Why would you want to tag genes/chemicals?
• Restrict search to specific classes. E.g. “ace” as a gene name:
  • It inhibits the angiotensin converting enzyme (ACE) that catalyses the conversion of . . . ACE inhibitors include benazepril, captopril, . . .
• rather than as something else:
  • . . . a wound healing agent, for example Ace Mannan (or other components of Aloe vera) . . .
  • On the next day, Lipofect Ace reagent was mixed with . . .
• Scan a list of named entities to find out if a patent is relevant to your search.
Motivation
What else could we do?
• Find the chemical most often mentioned with some gene
• Find the chemical most specific to a particular gene
• Give me a ranked list of genes related to a chemical
Extracted Examples: The Good
• The compound was found to be not cross-resistant with other inhibitors of DHFR, such as pyrimethamine and cycloguanil.
• U.S. Pat. No. 5,703,092 discloses the use of dydroxamic acid compounds and carbocyclic acids as metalloproteinase and TNF inhibitors, and in particular in treatment of arthritis and other related inflammatory diseases.
• N,N-diethyl-m-toluamide (DEET) is an active ingredient in many insect repellants.
Extracted Examples: The Bad
• the N-acyl homoserine lactones, which are composed of derivatives of amino acid and fatty acid molecules.
• at least one additional solvent chosen from C -C alkyl acetates and C -C alkyl alcohols
• Patch (Nail treatment sheet pack)
• 10. Immunizing kit.
• . . . a method for the selective extraction of a . . .
• Thus, in view of a work by Bushnell et al. (Pacific Sci. 4, 167-83 (1950)) showing . . .
Demonstration
http://fling-l.seas.upenn.edu/~kuzman/cgi-bin/search.pl
The Gene/Protein Tagger
• Used Ryan McDonald’s GeneTaggerCRF mostly out of the box
• Trained on MEDLINE abstracts (patent text is out-of-domain)
• Reported performance on MEDLINE abstracts: 0.84 precision and 0.74 recall
• Performance on patent text is not as good, but is probably good enough to be useful
The Chemical Tagger
Available Resources
• Large list of chemicals
• Large list of genes
• Smaller lists of other types of non-chemicals
• Lots of untagged patent text in the drug domain
• A small amount of non-expert human time
Available Resources
List of Chemicals
• Obtained from the aliases of PubChem compounds
• About 8.14 × 10^5 names (nowhere near exhaustive)
• Some names are not usually names of chemicals:
  • organisms: Wormwood, Sunflower, Bitterweed
  • colors: Wool Yellow, Whortleberry Red, Acetyl Red G
  • names: Wolfram, Wang, Sweet grass
• Lots of complicated systematic names: 0-(3-(N-Cyclopentyl-N-methyl)aminopropyl)-2-chlorophenothiazine
Available Resources
Other Lists
• Gene list: 1.1 × 10^6 entries
• Gene list: 0.15 × 10^6 entries
• List of common newswire words: 23.9 × 10^3 entries
• Smaller lists of person, location and organization names (thanks to Partha Talukdar)
Available Resources
Unlabeled Patent Text
• All patents issued in 2002-2003
• 8280 patents in the drugs domain
• Most patents are on the order of 2,000-10,000 words
The Chemical Tagger
Initial strategy
• Use membership in the list of chemicals
• Need to match the full multi-word name (otherwise we match e.g. “Red”)
• Train CRF tagger only on sentences where we found chemicals
• Very poor recall on systematic names
• Precision is not perfect
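The full-name matching step could look roughly like the sketch below. It is only an illustration, not the code used for the slides: the function name, the B-CHEM/I-CHEM/O label scheme, whitespace tokenization, and the `max_len` cap are all assumptions.

```python
def tag_with_gazetteer(tokens, chemical_names, max_len=6):
    """Label tokens B-CHEM/I-CHEM/O using only full-name list matches.

    Assumption: names and text are whitespace-tokenized and compared
    case-insensitively; max_len caps the name length we try to match.
    """
    names = {tuple(name.lower().split()) for name in chemical_names}
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest span first so a full name beats any shorter name inside it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in names:
                match_len = n
                break
        if match_len:
            labels[i] = "B-CHEM"
            for j in range(i + 1, i + match_len):
                labels[j] = "I-CHEM"
            i += match_len
        else:
            i += 1
    return labels


chemicals = ["Acetyl Red G", "captopril"]
tokens = "The red dye Acetyl Red G and captopril were mixed".split()
labels = tag_with_gazetteer(tokens, chemicals)
```

Because only complete names are matched, the stray "red" stays unlabeled while "Acetyl Red G" is tagged as one three-token mention.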
The Chemical Tagger
Next Step
• Create a simple character language model to distinguish words that sound like chemicals
• Label words that sound a lot like chemicals as chemicals
• Improves performance on systematic names
Chemical Language Model
Generative Incarnation
• Naive Bayes model classifies each word
• Features were character 3-grams
• Pro: Easy to implement
• Con: Some 3-grams were poorly estimated (so smoothing required)
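A minimal sketch of such a word classifier follows, assuming add-alpha (Laplace) smoothing and `^`/`$` word-boundary markers; the slides say smoothing was needed but not which kind, and the class name and toy training words are made up for illustration.

```python
import math
from collections import Counter


def trigrams(word):
    """Character 3-grams of a word, with ^ and $ as boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNaiveBayes:
    """Two-class Naive Bayes over character 3-grams (add-alpha smoothing assumed)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.gram_counts = {"chem": Counter(), "other": Counter()}
        self.word_counts = Counter()
        self.vocab = set()

    def train(self, word, label):
        self.word_counts[label] += 1
        for gram in trigrams(word):
            self.gram_counts[label][gram] += 1
            self.vocab.add(gram)

    def log_score(self, word, label):
        total = sum(self.gram_counts[label].values())
        size = len(self.vocab) + 1  # one extra bucket for unseen 3-grams
        score = math.log(self.word_counts[label] / sum(self.word_counts.values()))
        for gram in trigrams(word):
            count = self.gram_counts[label][gram]
            score += math.log((count + self.alpha) / (total + self.alpha * size))
        return score

    def prob_chem(self, word):
        chem = self.log_score(word, "chem")
        other = self.log_score(word, "other")
        top = max(chem, other)  # subtract the max before exponentiating
        return math.exp(chem - top) / (math.exp(chem - top) + math.exp(other - top))


model = TrigramNaiveBayes()
for w in ["methyl", "ethyl", "methanol", "ethanol", "acetate"]:
    model.train(w, "chem")
for w in ["invention", "means", "present", "method", "length"]:
    model.train(w, "other")
```

On this toy data `model.prob_chem("butanol")` comes out well above 0.5 and `model.prob_chem("mention")` well below it, even though neither word was seen in training.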
Chemical Language Model
Discriminative Incarnation
• Maximum entropy classifies each word
• Features were 2, 3 and 4 character N-grams
• Now we get backoff for free
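For two classes, a maximum entropy classifier is logistic regression, so the idea can be sketched as below. This is an illustration under assumptions, not the original setup: plain SGD, no bias term, no regularization, and toy training words are all choices made here.

```python
import math


def char_ngrams(word, sizes=(2, 3, 4)):
    """Set of character n-grams, with ^ and $ marking word boundaries."""
    padded = "^" + word.lower() + "$"
    return {padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)}


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_maxent(examples, epochs=100, rate=0.1):
    """Fit binary maxent weights with plain SGD (assumed optimizer, no bias term)."""
    weights = {}
    for _ in range(epochs):
        for word, label in examples:
            feats = char_ngrams(word)
            p = sigmoid(sum(weights.get(f, 0.0) for f in feats))
            for f in feats:
                # Gradient of the log-likelihood for a binary indicator feature.
                weights[f] = weights.get(f, 0.0) + rate * (label - p)
    return weights


def prob_chem(word, weights):
    return sigmoid(sum(weights.get(f, 0.0) for f in char_ngrams(word)))


examples = [(w, 1) for w in ["methyl", "ethyl", "methanol", "ethanol", "acetate"]]
examples += [(w, 0) for w in ["invention", "means", "present", "method", "length"]]
weights = train_maxent(examples)
```

The "backoff for free" shows up in `prob_chem`: a 4-gram never seen in training contributes nothing, while the 2-grams and 3-grams inside it still carry their learned weights.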
Chemical Language Model
Training
• Positive examples from list of chemicals:
  • 814,293 entries, 1,813,343 tokens
  • “acid” occurs 57,000 times as a token
  • “methyl” occurs 17,000 times as a token
• Negative examples:
  • paragraphs of patent text
  • removed lines that contain common chemical 4-grams like meth, thyl, nol$ (by hand, using grep)
  • 459,694 lines, 4,720,732 tokens
  • most common non-stop words: “invention”, “means”, “present”
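The negative-example filtering was done by hand with grep; an equivalent filter might be sketched as follows. The pattern covers only the three 4-grams named above, and reading `nol$` as "nol at the end of a word" is an assumption.

```python
import re

# Only the three 4-grams named on the slide; the real hand-built list was longer.
# Assumption: "$" in "nol$" marks the end of a word, hence \b here.
CHEMICAL_NGRAMS = re.compile(r"meth|thyl|nol\b", re.IGNORECASE)


def keep_as_negative(line):
    """True if the line contains none of the chemical-looking 4-grams."""
    return CHEMICAL_NGRAMS.search(line) is None


lines = [
    "The present invention relates to a device",
    "dissolved in ethanol and stirred",
    "a method for extraction",
    "added methyl acetate",
]
negatives = [line for line in lines if keep_as_negative(line)]
```

A filter this blunt also throws away harmless lines (here the one containing "method"), which is acceptable when unlabeled text is plentiful.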
Chemical Language Model
Example Results
P(CHEM)      Examples
0.0 - 0.1    length, soybean, shrink
0.1 - 0.2    Fetal, adapter, casein
0.2 - 0.3    Cataract, diet, extrachromasomal
0.3 - 0.4    phytotoxicity, altrose, oral, drop
0.4 - 0.5    lipid, anestetized, copper
0.5 - 0.6    Serum, alba, nifedipine
0.6 - 0.7    schema, Sephadex, thiamine
0.7 - 0.8    acetate/hexane, proline, amphotericin
0.8 - 0.9    proazulene, acne, enterotoxins
0.9 - 1.0    alicyclic, bromide, dihydroxyethylstearylamine
1.0 - 1.0    2-(1-hydroxyethyl)-4,5-dimethyl-cyclohexanone
Problems with References
• Inline references are confusing for taggers
• Gene tagger has not been trained on text with long references
• Chemical tagger often tags names of journals or authors as chemicals: . . . by Bushnell et al. (Pacific Sci. 4, 167-83 (1950))
• Solution: make a separate tagger for references
Reference Tagger
• Trained on 40 instances (12,814 tokens)
• About 30 features (an extra 20,000 features add about 1 point to precision/recall)
• Tested on 25 instances (6,372 tokens)
• About 87% token precision/recall
• Good enough to eliminate most errors on references
• Total time to develop: 3-4 hours
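Token precision and recall, as reported for the reference tagger, can be computed as in this small sketch. The `REF`/`O` label names and the toy label sequences are assumptions for illustration, not data from the experiment.

```python
def token_precision_recall(gold, predicted, positive="REF"):
    """Token-level precision and recall for one label (label name is hypothetical)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    pred_pos = sum(1 for p in predicted if p == positive)
    gold_pos = sum(1 for g in gold if g == positive)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall


gold = ["REF", "REF", "REF", "O", "O", "REF"]
pred = ["REF", "REF", "O", "O", "REF", "REF"]
precision, recall = token_precision_recall(gold, pred)
```

Here three of the four predicted `REF` tokens are correct and three of the four gold `REF` tokens are found, so both values are 0.75.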
The Chemical Tagger
Current incarnation
• Train a CRF tagger on training data:
  • Use list membership in the chemicals and non-chemicals lists
  • Use high-confidence positive predictions of language model
  • Label text in references as non-chemical
  • One extra rule: “*-ic acid” is a chemical, except “nucleic acid”
• Train only on instances that contain chemical names.
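One way the automatic labeling rules might be combined is sketched below. The order in which the rules are applied, the 0.9 confidence threshold, the single-token list lookup, and the stand-in `lm_prob` function are all assumptions; the slides list the rules but not how conflicts between them are resolved.

```python
def weak_label(tokens, chemicals, non_chemicals, lm_prob, in_reference, threshold=0.9):
    """Assign CHEM/O training labels from lists, rules, and a language model.

    Assumptions: rule priority is references, then the "-ic acid" rule, then
    the lists, then the language model; list lookup is per token for brevity.
    """
    labels = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        prev = tokens[i - 1].lower() if i > 0 else ""
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if in_reference[i]:
            label = "O"  # text inside references is never a chemical
        elif low.endswith("ic") and nxt == "acid" and low != "nucleic":
            label = "CHEM"  # first word of "*-ic acid"
        elif low == "acid" and prev.endswith("ic") and prev != "nucleic":
            label = "CHEM"  # second word of "*-ic acid"
        elif low in non_chemicals:
            label = "O"
        elif low in chemicals:
            label = "CHEM"
        elif lm_prob(tok) >= threshold:
            label = "CHEM"  # high-confidence language model prediction
        else:
            label = "O"
        labels.append(label)
    return labels


def has_chemical(labels):
    """Keep an instance for CRF training only if it contains a chemical."""
    return "CHEM" in labels


# Hypothetical stand-in for the character language model.
def toy_lm(word):
    return 0.95 if word.lower().endswith("amine") else 0.1


tokens = ["acetic", "acid", "binds", "nucleic", "acid", "(", "Pacific", "Sci", ")",
          "captopril", "dihydroxyethylstearylamine"]
in_ref = [False] * 5 + [True] * 4 + [False] * 2
labels = weak_label(tokens, {"captopril"}, {"binds"}, toy_lm, in_ref)
```

The resulting labels then serve as the training data for the CRF in place of hand annotation.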
Future work
Aggregation
As we said in the introduction, we would like to be able to e.g.
• Find the chemical most often mentioned with some gene
• Find the chemical most specific to a particular gene
• Find a ranked list of genes related to a chemical
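These queries could be prototyped over per-patent sets of tagged entities, for example as below. This is a sketch of one possible approach, not something from the slides: using patent-level co-occurrence, measuring "most specific" by pointwise mutual information (PMI), and the toy data are all assumptions.

```python
import math
from collections import Counter


def cooccurrence_stats(docs):
    """docs: list of (genes, chemicals) set pairs, one pair per patent."""
    gene_df, chem_df, pair_df = Counter(), Counter(), Counter()
    for genes, chems in docs:
        gene_df.update(genes)
        chem_df.update(chems)
        pair_df.update((g, c) for g in genes for c in chems)
    return len(docs), gene_df, chem_df, pair_df


def most_frequent_chemical(gene, stats):
    _, _, _, pair_df = stats
    counts = {c: k for (g, c), k in pair_df.items() if g == gene}
    return max(counts, key=counts.get) if counts else None


def pmi(gene, chem, stats):
    """Pointwise mutual information of a gene and a chemical (assumed measure)."""
    n, gene_df, chem_df, pair_df = stats
    joint = pair_df[(gene, chem)]
    if joint == 0:
        return float("-inf")
    return math.log(joint * n / (gene_df[gene] * chem_df[chem]))


def most_specific_chemical(gene, stats):
    _, _, chem_df, _ = stats
    return max(chem_df, key=lambda c: pmi(gene, c, stats))


def rank_genes(chem, stats):
    """Genes ordered by how many patents mention them with the chemical."""
    _, _, _, pair_df = stats
    counts = {g: k for (g, c), k in pair_df.items() if c == chem}
    return sorted(counts, key=lambda g: -counts[g])


# Toy, made-up patents: (genes mentioned, chemicals mentioned).
docs = [
    ({"ACE"}, {"captopril", "ethanol"}),
    ({"ACE"}, {"captopril", "ethanol"}),
    ({"DHFR"}, {"pyrimethamine", "ethanol"}),
    ({"ACE"}, {"ethanol"}),
    ({"DHFR"}, {"ethanol", "cycloguanil"}),
]
stats = cooccurrence_stats(docs)
```

In the toy data ethanol is the chemical most often mentioned with ACE, but captopril is the most specific to it, which is the distinction the first two bullets draw.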
Future work
Normalization
All of the following (and many others) refer to the same chemical:
• flavone acetic acid
• Mitoflaxone
• 2-Phenyl-8-(carboxymethyl)-benzopyran-4-one
• 4-Oxo-2-phenyl-4H-1-benzopyran-8-acetic acid
Searching for one should bring up results for the others also.
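A simple form of normalization maps every alias to one canonical identifier before indexing and searching. The sketch below assumes that a synonym table is available (for instance from the PubChem aliases); the identifier `CHEM-0001`, the patent ids, and the function names are hypothetical.

```python
def normalize_key(name):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def build_alias_index(synonym_groups):
    """Map each alias to its canonical id. synonym_groups: {canonical_id: [aliases]}."""
    return {normalize_key(alias): cid
            for cid, aliases in synonym_groups.items()
            for alias in aliases}


def canonical(name, alias_index):
    key = normalize_key(name)
    return alias_index.get(key, key)  # unknown names stand for themselves


def build_postings(patents, alias_index):
    """patents: {patent_id: [tagged chemical names]} -> {canonical id: set of patent ids}."""
    postings = {}
    for pid, names in patents.items():
        for name in names:
            postings.setdefault(canonical(name, alias_index), set()).add(pid)
    return postings


def search(query, alias_index, postings):
    return postings.get(canonical(query, alias_index), set())


# "CHEM-0001" and the patent ids are made up for illustration.
synonyms = {"CHEM-0001": [
    "flavone acetic acid",
    "Mitoflaxone",
    "2-Phenyl-8-(carboxymethyl)-benzopyran-4-one",
    "4-Oxo-2-phenyl-4H-1-benzopyran-8-acetic acid",
]}
alias_index = build_alias_index(synonyms)
patents = {"P1": ["flavone acetic acid"], "P2": ["Mitoflaxone"], "P3": ["captopril"]}
postings = build_postings(patents, alias_index)
```

With this index, `search("mitoflaxone", alias_index, postings)` and `search("flavone acetic acid", alias_index, postings)` return the same two patents.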