
A simple method for citation metadata extraction using hidden Markov models

Erik Hetzner (California Digital Library), JCDL 2008


  1. A simple method for citation metadata extraction using hidden Markov models. Erik Hetzner (California Digital Library), JCDL 2008.

  2. Advantages of our method: good performance on homogeneous citations; reasonable performance on heterogeneous citations; the extractor can be implemented in a few pages of code.

  3. Improving HMM performance: reduce the size of the alphabet by mapping words to a smaller set of symbols; use two states for each label, first and rest; use ‘separator states’, one for each possible transition between labels.

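  4. Hidden Markov models (figure: example model diagram over the symbols a and b, annotated with the probabilities 0, .25, .5, .75 and 1; the layout is not recoverable from the transcript)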
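As a rough illustration of how a hidden Markov model assigns states to a symbol sequence, here is a minimal Viterbi decoding sketch. The two states `s1` and `s2` and all probabilities below are hypothetical values chosen for the example; they are not read off the slide's figure, and the presentation does not say which decoding code the author used.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed symbols."""
    if not observations:
        return []

    def log(p):
        # Work in log space; a zero probability becomes -inf.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path ending in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r].get(s, 0.0))) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical two-state model over the symbols "a" and "b".
states = ["s1", "s2"]
start_p = {"s1": 0.5, "s2": 0.5}
trans_p = {"s1": {"s1": 0.75, "s2": 0.25}, "s2": {"s1": 0.25, "s2": 0.75}}
emit_p = {"s1": {"a": 0.75, "b": 0.25}, "s2": {"a": 0.25, "b": 0.75}}

path = viterbi(["a", "a", "b", "b"], states, start_p, trans_p, emit_p)
```

With these made-up numbers the decoder stays in `s1` while it sees `a` and switches to `s2` once it sees `b`, which is the behaviour a citation labeller relies on when it moves from one field to the next.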

  5. Alphabet of symbols: words? exorcised throed deposed roil vaporized rattletrap mocking prohibit sleetier effectual tweeter decremented atrophied nearby captor earn oboe ticked inoculate algorithmic extremist inherited burping silenced harassment doctrinaire emptiest tarting freewheeled parqueting gentlewoman optimal dashboard taskmaster acceptance mucky prototyping virtual recapture perpetrate junking rewrote goody cooperated mottling yahoo gridiron successfully bumper siphoned witchcraft jettison capering grouchier disallowed eyeballing medic sullen certitude tearier parlor becoming morphological cognomen saddening apprenticed signpost lignite wishing boldface postage audibility jingoistic lousy reacted rivulet arboreal primping eddy belatedly necessity ordinance retrogressed perverting sponging neutralizer deadlier inferential easel aptly trapeze circumlocution descanted caressing redeemable entice thunderstruck lectured postmarking twanged bellowing rainier grouching cozier flimsiest grizzly decorously jawboning tinier crookeder liberation sleeting heehawed puffin paisley daunt screenwriter …

  6. Alphabet of symbols: keywords wAND wAPPEAR wCOMMUNICATIONS wCONFERENCE wDE wDISSERTATION wEDITOR wIN wINC wJOURNAL wNOTICES wNUMBER wPAGES wPHD wPRESS wPROCEEDINGS wREPORT wSUBMITTED wTECHNICAL wTHESIS wTRANSACTIONS wUNIVERSITY wVAN wVOLUME

  7. Alphabet of symbols: punctuation pPERIOD pCOMMA pLEFTPAREN pRIGHTPAREN pLEFTBRACKET pRIGHTBRACKET pHYPEN pCOLON pSEMICOLON pQUESTIONMARK pMISC pAPOSTROPHE pDOUBLEQUOTE pSINGLEQUOTE

  8. Alphabet of symbols: word classes wMONTH wSEASON

  9. Alphabet of symbols: features fINITIAL fTC fUPPER fLOWER fNUMERAL4 fNUMERAL fMIXED

  10. Tokens → symbols
      1. ^[aA][nN][dD]$ → wAND
      2. ^[Jj]an(uary)?$ → cMONTH
      3. ^\.$ → pPERIOD
      4. ^,$ → pCOMMA
      5. ^[A-Z]$ → fINITIAL
      6. ^[A-Z][A-Z]+$ → fUPPER
      …
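The rule table on this slide can be sketched as an ordered list of regular expressions that is tried top to bottom. The six patterns and symbol names are copied from the slide; the function name `to_symbol` and the fallback symbol `fOTHER` are assumptions added for illustration, since the slide elides the remaining rules.

```python
import re

# Ordered (pattern, symbol) rules. The six rules follow the slide;
# the slide's list continues with rules that are not shown here.
RULES = [
    (re.compile(r"^[aA][nN][dD]$"), "wAND"),
    (re.compile(r"^[Jj]an(uary)?$"), "cMONTH"),
    (re.compile(r"^\.$"), "pPERIOD"),
    (re.compile(r"^,$"), "pCOMMA"),
    (re.compile(r"^[A-Z]$"), "fINITIAL"),
    (re.compile(r"^[A-Z][A-Z]+$"), "fUPPER"),
]

def to_symbol(token, default="fOTHER"):
    """Return the symbol of the first rule that matches the token.

    `default` is a hypothetical catch-all, not a symbol from the slides.
    """
    for pattern, symbol in RULES:
        if pattern.search(token):
            return symbol
    return default

symbols = [to_symbol(t) for t in ["P", ".", ",", "and", "MIT"]]
```

Rule order matters in this sketch: `AND` matches both the `wAND` and the `fUPPER` pattern, and the earlier keyword rule wins.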

  11. Tokens → symbols: Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

  12. Tokens → symbols: fTC , Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

  13. Tokens → symbols: fTCpCOMMA Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995.

  14. Tokens → symbols: fTCpCOMMA fTC fINITIALpPERIODpCOMMA wAND fTC fTCpPERIOD wTHE fTC fTCpPERIOD fMIXED wEDITIONpPERIOD fTCpCOMMA fTCpPERIODpCOLON wTHE fUPPER fTCpCOMMA fNUMERAL4pPERIOD
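The worked example above treats each word and each punctuation mark as its own token (`Friedman,` becomes a title-case symbol followed by a comma symbol). A minimal tokenizer with that behaviour might look like the sketch below. The regular expression and the name `tokenize` are assumptions; the slides do not show how the author's tokenizer is written.

```python
import re

# Assumed tokenization: a run of word characters is one token,
# and every other non-space character is a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(citation):
    """Split a citation string into word and punctuation tokens."""
    return TOKEN_RE.findall(citation)

tokens = tokenize("Friedman, Daniel P., and Matthias Felleisen.")
```

Each resulting token can then be mapped to one symbol, so `P.,` yields three tokens and therefore three symbols, as in the slide's `fINITIALpPERIODpCOMMA`.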

  15. Label states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a:f fTC

  16. Label states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a:f a:r pCOMMA

  17. Label states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a:f a:r fTC

  18. Label states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a:f a:r fINITIAL

  19. Label states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a:f a:r pPERIOD

  20. Separator states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a:f a:r a|a pCOMMA

  21. Separator states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a:f a:r a|a wAND

  22. Label states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a:r a|a a:f fTC

  23. Label states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a|a a:f a:r fTC

  24. Separator states Friedman, Daniel P., and Matthias Felleisen. The Little Schemer. 4th Edition. Cambridge, Mass.: The MIT Press, 1995. a|a a:f a:r a|t pPERIOD
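The state layout walked through in the label and separator slides (a first and a rest state per label, plus one separator state per transition between labels) can be sketched as follows. The naming scheme `a:f`, `a:r`, `a|a`, `a|t` follows the slides; the function names, the label `t`, and the default of allowing every ordered pair of labels as a transition are assumptions.

```python
from itertools import product

def build_states(labels, transitions=None):
    """Two states per label plus one separator state per label transition."""
    if transitions is None:
        # Assumption: every ordered pair of labels is a possible transition,
        # including a label followed by itself (as in the slides' a|a).
        transitions = list(product(labels, repeat=2))
    label_states = [f"{label}:{part}" for label in labels for part in ("f", "r")]
    separator_states = [f"{src}|{dst}" for src, dst in transitions]
    return label_states + separator_states

def state_to_label(state):
    """Map a label state back to its label; separator states map to None."""
    if "|" in state:
        return None
    return state.split(":")[0]

states = build_states(["a", "t"])
```

In a real extractor the transition list would more likely be restricted to transitions seen in training data, which keeps the number of separator states small.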

  25. Results on the Cora dataset: token .944; field .892; whole instance .613
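The slide reports scores at three granularities without defining them. One plausible reading, shown here purely as an assumption, is: token = fraction of tokens given the correct label, field = fraction of gold fields (maximal runs of one label) reproduced with exactly the same span and label, whole instance = fraction of citations labelled without any error. The label names and the toy data are hypothetical and unrelated to the Cora figures.

```python
from itertools import groupby

def fields(labels):
    """Split a label sequence into (label, start, end) runs."""
    runs, start = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, start, start + length))
        start += length
    return runs

def evaluate(gold, predicted):
    """Score predictions under the assumed token/field/instance definitions."""
    token_hits = token_total = field_hits = field_total = instance_hits = 0
    for g, p in zip(gold, predicted):
        token_hits += sum(x == y for x, y in zip(g, p))
        token_total += len(g)
        predicted_fields = set(fields(p))
        gold_fields = fields(g)
        field_hits += sum(f in predicted_fields for f in gold_fields)
        field_total += len(gold_fields)
        instance_hits += int(g == p)
    return {
        "token": token_hits / token_total,
        "field": field_hits / field_total,
        "whole instance": instance_hits / len(gold),
    }

# Toy data: two citations with per-token labels (a, t, d are hypothetical).
gold = [["a", "a", "t", "t"], ["a", "t", "t", "d"]]
pred = [["a", "a", "t", "t"], ["a", "a", "t", "d"]]
scores = evaluate(gold, pred)
```

A single mislabelled token costs little at the token level but breaks two fields and the whole citation, which is consistent with the whole-instance score on the slide being much lower than the token score.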

  26. Improving HMM performance: reduce the size of the alphabet by mapping words to a smaller set of symbols; use two states for each label, first and rest; use ‘separator states’, one for each possible transition between labels.

  27. Erik Hetzner erik.hetzner@ucop.edu http://purl.net/net/egh/hmm cite parser/
