Improving IR performance from OCRed text using cooccurrence
RISOT 2012
Kripabandhu Ghosh and Anirban Chakraborty
Indian Statistical Institute, Kolkata, India

Key Terms
- Co-occurrence: two words co-occur if they appear within a window of a certain number of words of each other in a document.
- LCS similarity: LCS stands for Longest Common Subsequence. The LCS similarity of two words is the length of their LCS divided by the length of the longer word.
  LCS(industry, industrial) = industr
  LCS similarity(industry, industrial) = |LCS(industry, industrial)| / max(|industry|, |industrial|) = 7/10 = 0.7
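The two key terms can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the symmetric window, and the default window size of 3 are assumptions, since the slides do not state the window size used.

```python
from collections import defaultdict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def cooccurrences(tokens, window=3):
    """Map each word to the set of other words within `window` positions of it.

    The window size is a hypothetical default chosen for illustration.
    """
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] != word:
                neighbours[word].add(tokens[j])
    return neighbours


print(lcs_similarity("industry", "industrial"))  # 0.7
```

The example call reproduces the value on the slide: the LCS "industr" has 7 characters and the longer word has 10.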

RISOT task

OCRed Resources
- Legal documents as hard copies
- IIT CDIP 1.0 corpus - TREC Legal Ad Hoc
- SIGIR Digital Museum - Cleverdon, Salton, Sparck Jones

RISOT task - Without Original Corpus

Social Networks: Direct Connections

Social Networks: Indirect Connections

Clustering Algorithm

Clustering Algorithm (Phase I)
For each word w in the OCRed corpus, let S_w^1 and S_w^2 be empty sets, and let S_w = S_w^1 ∪ S_w^2.
1. For each word w_1 co-occurring with w, calculate the LCS similarity between w and w_1. Store w_1 in S_w^1 if LCS similarity(w, w_1) > some threshold T.
2. For each w′ in S_w^1, find the words w_2 co-occurring with w′ such that LCS similarity(w, w_2) > T. Include all these words in S_w^1.
3. Repeat step (2) until no new word is added to S_w^1.
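A rough sketch of Phase I, assuming the co-occurrence relation is already available as a dictionary from each word to the set of words it co-occurs with. The name `phase_one`, the toy data, and the threshold value 0.8 are illustrative assumptions; the slides do not give the value of T.

```python
def lcs_similarity(a, b):
    """LCS length divided by the length of the longer word."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def phase_one(w, neighbours, threshold):
    """Collect words reachable from w through co-occurrence links that are
    also string-similar to w (LCS similarity above `threshold`)."""
    cluster = set()
    frontier = [w]  # step 1 expands w itself; step 2 expands each new member
    while frontier:  # step 3: stop when no new word was added
        current = frontier.pop()
        for cand in neighbours.get(current, ()):
            if cand != w and cand not in cluster and lcs_similarity(w, cand) > threshold:
                cluster.add(cand)
                frontier.append(cand)
    return cluster


# Hypothetical co-occurrence data: "lndustry" never co-occurs with "industry"
# directly, only with the OCR variant "industrv".
neighbours = {
    "industry": {"industrv", "steel"},
    "industrv": {"lndustry", "steel"},
    "lndustry": {"industrv"},
    "steel": {"industry"},
}
print(phase_one("industry", neighbours, threshold=0.8))
```

The worklist replaces the explicit "repeat until no change" loop; both reach the same fixed point, because a word is expanded exactly once, when it is first added.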

Clustering Algorithm (Phase I)

Clustering Algorithm (Phase II)
1. Consider the top m words (in terms of frequency in the corpus) co-occurring with w. For each such word w_3, find the words w_4 co-occurring with w_3 such that LCS similarity(w, w_4) > T. Include all these words in S_w^2.
2. For each w′′ in S_w^2, find the words w_5 co-occurring with w′′ such that LCS similarity(w, w_5) > T. Include all these words in S_w^2.
3. Repeat step (2) until no new word is added to S_w^2.
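Phase II can be sketched the same way. Here the most frequent co-occurring words act as context words, and similar words are collected from their neighbourhoods. The function name, the toy frequencies, the alphabetical tie-break among equally frequent words, and the values of m and T are all assumptions made for illustration.

```python
def lcs_similarity(a, b):
    """LCS length divided by the length of the longer word."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def phase_two(w, neighbours, freq, threshold, m):
    """Collect words similar to w that share frequent context words with it."""
    cluster = set()
    frontier = []

    def absorb(source):
        # Add every neighbour of `source` that is string-similar to w.
        for cand in neighbours.get(source, ()):
            if cand != w and cand not in cluster and lcs_similarity(w, cand) > threshold:
                cluster.add(cand)
                frontier.append(cand)

    # Step 1: the m most frequent words co-occurring with w
    # (ties broken alphabetically, an assumption).
    top = sorted(neighbours.get(w, ()), key=lambda x: (-freq.get(x, 0), x))[:m]
    for context_word in top:
        absorb(context_word)
    # Steps 2-3: expand from the collected words until nothing new is added.
    while frontier:
        absorb(frontier.pop())
    return cluster


# Hypothetical data: "industry" and "lndustry" share the context word "steel".
neighbours = {
    "industry": {"steel", "the"},
    "steel": {"industry", "lndustry"},
    "the": {"industry"},
    "lndustry": {"steel", "industrv"},
    "industrv": {"lndustry"},
}
freq = {"the": 100, "steel": 40, "industry": 30, "lndustry": 2, "industrv": 1}
print(phase_two("industry", neighbours, freq, threshold=0.8, m=2))
```

With m = 1 only "the" is used as a context word and the result is empty, which shows how the choice of m controls how much context is explored.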

Clustering Algorithm (Phase II)

Query word cluster mapping
1. Calculate LCS similarity(w_q, w_C), where w_C is a word in cluster C, for each word in the corpus.
2. Choose all the clusters C for which LCS similarity(w_q, w_C) is greater than a high threshold.
3. For each cluster C obtained from step (2), define C′ = C ∪ {w_q}.
4. Create complete-linkage clusters from each C′ of step (3) and keep those clusters containing w_q.

Query word cluster mapping (continued)
5. For a cluster C, consider the LCS similarity between each pair of words in it. Let GM_C denote the geometric mean of the LCS similarities of all the pairs. Compute GM_C for each cluster given by step (4).
6. Select the cluster C with maximum GM_C as the appropriate cluster for w_q.
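The mapping steps can be sketched as follows, under two stated assumptions: a cluster qualifies in step (2) if at least one of its words exceeds the high threshold, and the complete-linkage re-clustering of step (4) is skipped, so the augmented clusters are scored directly. The function names and the threshold value are hypothetical.

```python
from itertools import combinations
from math import exp, log


def lcs_similarity(a, b):
    """LCS length divided by the length of the longer word."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def candidate_clusters(query_word, clusters, high_threshold):
    """Steps 1-3: keep clusters with a word similar to the query, then add the query."""
    return [
        set(cluster) | {query_word}
        for cluster in clusters
        if any(lcs_similarity(query_word, w) > high_threshold for w in cluster)
    ]


def geometric_mean_similarity(cluster):
    """Step 5: geometric mean of LCS similarity over all word pairs."""
    sims = [lcs_similarity(a, b) for a, b in combinations(sorted(cluster), 2)]
    if not sims or min(sims) == 0.0:
        return 0.0
    return exp(sum(log(s) for s in sims) / len(sims))


def select_cluster(candidates):
    """Step 6: the candidate with the largest geometric mean, or None."""
    return max(candidates, key=geometric_mean_similarity) if candidates else None


clusters = [{"industrv", "lndustry"}, {"industrial", "ministry"}]
candidates = candidate_clusters("industry", clusters, high_threshold=0.7)
print(select_cluster(candidates))
```

The geometric mean penalises a single dissimilar pair more heavily than an arithmetic mean would, which favours compact clusters: here the cluster of OCR variants wins over the looser one containing "ministry".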

Query word clusters

Results
Run                              MAP          P5
Original text                    0.2567       0.3485
OCRed text (baseline)            0.1791       0.2738
Proposed method on OCRed text    0.1974 [1]   0.2831
[1] Not significant

Querywise Performance

Failure Analysis: clusters

Failure Analysis
- Compact clusters over all-inclusive clusters
- Re-clustering based on string match and co-occurrence
- Chance co-occurrence - harmful
- Incorporation of co-occurrence frequencies - essential

Brighter side
- Practical utility
- Language independent
- Context information - reliable
- Captures both erroneous and inflectional variants (effect of stemming)

THANK YOU
