Improving IR performance from OCRed text using cooccurrence
RISOT 2012
Kripabandhu Ghosh and Anirban Chakraborty
Indian Statistical Institute, Kolkata, India
Key Terms
Co-occurrence: We say that two words co-occur if they appear within a window of a certain number of words of each other in a document.
LCS similarity: LCS stands for Longest Common Subsequence.
LCS(industry, industrial) = industr
LCS similarity(industry, industrial) = |LCS(industry, industrial)| / max(|industry|, |industrial|) = 7/10 = 0.7
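The two definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the token-list input, and the `window` parameter are assumptions made for the example, and the window size itself is not given on the slide.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def cooccur(tokens: List[str], u: str, v: str, window: int) -> bool:
    """True if u and v occur within `window` token positions of each other."""
    pos_u = [i for i, t in enumerate(tokens) if t == u]
    pos_v = [i for i, t in enumerate(tokens) if t == v]
    return any(i != j and abs(i - j) <= window for i in pos_u for j in pos_v)


print(lcs_similarity("industry", "industrial"))  # 7 / 10 = 0.7
```

Dividing by the longer word's length keeps the score in [0, 1] and penalises pairs that share a stem but differ greatly in length.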
RISOT task
OCRed Resources
- Legal documents as hard copies
- IIT CDIP 1.0 corpus - TREC Legal Ad Hoc
- SIGIR Digital Museum - Cleverdon, Salton, Sparck Jones
RISOT task - Without Original Corpus
Social Networks: Direct Connections
Social Networks: Indirect Connections
Clustering Algorithm
Clustering Algorithm (Phase I)
For each word w in the OCRed corpus:
Let S_w^1 and S_w^2 be empty sets. Let S_w = S_w^1 ∪ S_w^2.
1. For each word w_1 co-occurring with w, calculate the LCS similarity between w and w_1. Store w_1 in S_w^1 if LCS similarity(w, w_1) > some threshold T.
2. For each w' in S_w^1, find the words w_2 co-occurring with w' such that LCS similarity(w, w_2) > T. Include all these words in S_w^1.
3. Repeat step (2) until no new word is added to S_w^1.
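The Phase I steps can be sketched as a fixed-point expansion over a co-occurrence map. This is a hedged illustration under stated assumptions: `cooc` (a precomputed mapping from each word to the set of words co-occurring with it), the function name `phase_one`, and the choice to exclude w itself from its own set are all assumptions of this sketch, not details given on the slide.

```python
from typing import Dict, Set


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)


def phase_one(w: str, cooc: Dict[str, Set[str]], threshold: float) -> Set[str]:
    """Grow the Phase I set for w from direct and chained co-occurrences."""
    # Step 1: direct co-occurrences of w that are similar enough to w.
    cluster = {w1 for w1 in cooc.get(w, set())
               if w1 != w and lcs_similarity(w, w1) > threshold}
    # Steps 2-3: expand through co-occurrences of members until nothing
    # new is added. Only newly added words need to be revisited.
    frontier = set(cluster)
    while frontier:
        new_words = set()
        for w_prime in frontier:
            for w2 in cooc.get(w_prime, set()):
                if (w2 != w and w2 not in cluster
                        and lcs_similarity(w, w2) > threshold):
                    new_words.add(w2)
        cluster |= new_words
        frontier = new_words
    return cluster


# Toy co-occurrence map (invented for illustration only).
cooc = {
    "industry": {"industrv", "steel"},
    "industrv": {"industry", "lndustry"},
    "lndustry": {"industrv"},
    "steel": {"industry"},
}
print(sorted(phase_one("industry", cooc, threshold=0.7)))
```

In the toy map, "lndustry" never co-occurs with "industry" directly but is reached through "industrv", which is the chained behaviour that steps 2 and 3 describe. Similarity is always measured against the seed word w, not against the intermediate word.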
Clustering Algorithm (Phase I)
Clustering Algorithm (Phase II)
1. Consider the top m words (in terms of frequency in the corpus) co-occurring with w. For each such word w_3, find the words w_4 co-occurring with w_3 such that LCS similarity(w, w_4) > T. Include all these words in S_w^2.
2. For each w'' in S_w^2, find the words w_5 co-occurring with w'' such that LCS similarity(w, w_5) > T. Include all these words in S_w^2.
3. Repeat step (2) until no new word is added to S_w^2.
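Phase II differs from Phase I only in its seeding: it starts from the co-occurrences of the m most frequent context words of w rather than from w's own similar neighbours. A minimal sketch follows, assuming a precomputed co-occurrence map `cooc` and a corpus frequency table `freq`; both names, the function name `phase_two`, and the alphabetical tie-break among equally frequent words are assumptions of this example.

```python
from typing import Dict, Set


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)


def phase_two(w: str, cooc: Dict[str, Set[str]], freq: Dict[str, int],
              threshold: float, m: int) -> Set[str]:
    """Grow the Phase II set for w via its m most frequent context words."""
    # Step 1: take the top-m co-occurring words by corpus frequency
    # (ties broken alphabetically, an assumption of this sketch).
    context = sorted(cooc.get(w, set()),
                     key=lambda x: (-freq.get(x, 0), x))[:m]
    cluster = {w4 for w3 in context for w4 in cooc.get(w3, set())
               if w4 != w and lcs_similarity(w, w4) > threshold}
    # Steps 2-3: same fixed-point expansion as in Phase I.
    frontier = set(cluster)
    while frontier:
        new_words = set()
        for w_pp in frontier:
            for w5 in cooc.get(w_pp, set()):
                if (w5 != w and w5 not in cluster
                        and lcs_similarity(w, w5) > threshold):
                    new_words.add(w5)
        cluster |= new_words
        frontier = new_words
    return cluster


# Toy data (invented for illustration only).
cooc = {
    "industry": {"steel", "growth"},
    "steel": {"industry", "lndustry"},
    "lndustry": {"steel", "industrv"},
    "industrv": {"lndustry"},
    "growth": {"industry"},
}
freq = {"steel": 50, "growth": 30}
print(sorted(phase_two("industry", cooc, freq, threshold=0.7, m=1)))
```

Here "lndustry" is found because it shares the frequent context word "steel" with "industry", even though the two variants never co-occur with each other.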
Clustering Algorithm (Phase II)
Query word cluster mapping
1. Calculate LCS similarity(w_q, w_C), where w_C is a word in cluster C, for each word in the corpus.
2. Choose all the clusters C for which LCS similarity(w_q, w_C) is greater than a high threshold.
3. For each cluster C obtained from step (2), define C' = C ∪ {w_q}.
4. Create complete-linkage clusters from each C' of step (3) and keep those clusters containing w_q.
Query word cluster mapping
5. For a cluster C, consider the LCS similarity between each pair of words in it. Let GM_C denote the geometric mean of the LCS similarities of all the pairs. Compute GM_C for each cluster given by step (4).
6. Select the cluster C with maximum GM_C as the appropriate cluster for w_q.
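The mapping steps can be sketched as below, with two loudly stated simplifications. First, the slides do not say whether one or all words of a cluster must exceed the high threshold; this sketch assumes at least one word is enough. Second, the complete-linkage re-clustering of step (4) is left out, so GM_C is computed directly on each C'. All function names are hypothetical.

```python
from itertools import combinations
from math import exp, log
from typing import Iterable, List, Optional, Set


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)


def geometric_mean_similarity(cluster: Set[str]) -> float:
    """GM_C: geometric mean of LCS similarity over all word pairs in C."""
    sims = [lcs_similarity(a, b) for a, b in combinations(sorted(cluster), 2)]
    if not sims or min(sims) == 0.0:
        return 0.0
    return exp(sum(log(s) for s in sims) / len(sims))


def candidate_clusters(w_q: str, clusters: Iterable[Set[str]],
                       high_threshold: float) -> List[Set[str]]:
    """Steps 1-3: keep clusters with a word close to w_q, then add w_q."""
    return [set(c) | {w_q} for c in clusters
            if any(lcs_similarity(w_q, w_c) > high_threshold for w_c in c)]


def best_cluster(w_q: str, clusters: Iterable[Set[str]],
                 high_threshold: float) -> Optional[Set[str]]:
    """Steps 5-6: choose the candidate with the largest GM_C (step 4 omitted)."""
    candidates = candidate_clusters(w_q, clusters, high_threshold)
    return max(candidates, key=geometric_mean_similarity, default=None)


# Toy clusters (invented for illustration only).
clusters = [
    {"industry", "industrv", "lndustry"},
    {"industrial", "industrialist"},
    {"steel", "steal"},
]
print(sorted(best_cluster("industry", clusters, high_threshold=0.65)))
```

The geometric mean rewards clusters whose members are uniformly close to each other: a single dissimilar pair pulls GM_C down more sharply than it would an arithmetic mean, which favours compact clusters.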
Query word clusters
Results
Run                              MAP          P5
Original text                    0.2567       0.3485
OCRed text (baseline)            0.1791       0.2738
Proposed method on OCRed text    0.1974 [1]   0.2831
[1] Not significant
Querywise Performance
Failure Analysis: clusters
Failure Analysis
- Compact clusters over all-inclusive clusters
- Re-clustering based on string match and co-occurrence
- Chance co-occurrence - harmful
- Incorporation of co-occurrence frequencies - essential
Brighter side
- Practical utility
- Language independent
- Context information - reliable
- Captures both erroneous and inflectional variants (effect of stemming)
THANK YOU