INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/
IR 4: Scoring, Term Weighting, The Vector Space Model I
Paul Ginsparg
Cornell University, Ithaca, NY
6 Sep 2011

Administrativa
Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/
Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep
Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452
Instructor’s Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment
Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email, use cs4300-l@lists.cs.cornell.edu
Course text at: http://informationretrieval.org/
  Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze
  see also Information Retrieval, S. Büttcher, C. Clarke, G. Cormack
  http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307

Overview
1 Recap
2 Why ranked retrieval?
3 Term frequency
4 tf.idf weighting
5 The vector space model

Heaps’ law for Reuters
Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1.
For these data, the dashed line $\log_{10} M = 0.49 \cdot \log_{10} T + 1.64$ is the best least squares fit.
Thus, $M = 10^{1.64}\, T^{0.49}$, with $k = 10^{1.64} \approx 44$ and $b = 0.49$:
$M = k T^{b} = 44\, T^{0.49}$
[Figure: log10 M versus log10 T, with the dashed least squares fit line]

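The fitted relation can be evaluated directly. The following is a minimal Python sketch, assuming the Reuters-RCV1 values k = 44 and b = 0.49 quoted above; the function name heaps_vocabulary_size is illustrative and not part of the course materials.

```python
def heaps_vocabulary_size(num_tokens, k=44.0, b=0.49):
    """Estimate vocabulary size M = k * T**b (Heaps' law).

    The defaults k = 44 and b = 0.49 are the Reuters-RCV1 fit quoted on the
    slide; another collection would need its own fitted values.
    """
    return k * num_tokens ** b


if __name__ == "__main__":
    # The intercept 1.64 of the log-log fit corresponds to k = 10**1.64.
    print(f"k = 10**1.64 = {10 ** 1.64:.1f}")
    for num_tokens in (10**3, 10**6):
        estimate = heaps_vocabulary_size(num_tokens)
        print(f"T = {num_tokens:>9,}  ->  M is roughly {estimate:,.0f}")
```

Because b is close to 0.5, the estimated vocabulary grows roughly like the square root of the number of tokens.
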
http://en.wikipedia.org/wiki/Zipf's_law
Zipf’s law: the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.
Brown Corpus:
  “the”: 7% of all word occurrences (69,971 of > 1M)
  “of”: ∼3.5% of words (36,411)
  “and”: 2.9% (28,852)
Only 135 vocabulary items account for half the Brown Corpus.
The Brown University Standard Corpus of Present-Day American English is a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources . . . for many years among the most-cited resources in the field.

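To see how closely the Brown Corpus counts follow the 1/rank rule, the following Python sketch compares the observed counts quoted above with the idealized prediction count(rank) = count(1) / rank. The names BROWN_COUNTS and zipf_prediction are illustrative, and the pure 1/rank form is the simplest reading of the law, not a fitted model.

```python
# Brown Corpus counts quoted on the slide: rank -> (word, observed occurrences).
BROWN_COUNTS = {1: ("the", 69971), 2: ("of", 36411), 3: ("and", 28852)}


def zipf_prediction(top_count, rank):
    """Predicted count at `rank` if frequency is exactly proportional to 1/rank."""
    return top_count / rank


if __name__ == "__main__":
    top_count = BROWN_COUNTS[1][1]
    for rank, (word, observed) in sorted(BROWN_COUNTS.items()):
        predicted = zipf_prediction(top_count, rank)
        print(f"rank {rank} {word!r}: observed {observed}, predicted {predicted:.0f}")
```

The predictions for ranks 2 and 3 come out at roughly 35,000 and 23,300, so the observed counts (36,411 and 28,852) sit somewhat above the idealized curve.
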
Zipf’s law for Reuters
[Figure: log10 cf versus log10 rank]
Fit far from perfect, but nonetheless key insight: Few frequent terms, many rare terms.

more from http://en.wikipedia.org/wiki/Zipf's_law
“A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates. x is rank of a word in the frequency table; y is the total number of the word’s occurrences. Most popular words are “the”, “of” and “and”, as expected. Zipf’s law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line.”

Another Wikipedia count, 15 May 2010
http://imonad.com/seo/wikipedia-word-frequency-list/
“Word frequency distribution follows Zipf’s law”

Assignment 1 posted
See http://www.infosci.cornell.edu/Courses/info4300/2011fa/assignment1.html
40 text files from four sources: nasa.gov, biologynews.net, news.cnet.com, sciencedaily.com
file00.txt . . . file39.txt
Due 18 Sep 2011

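Collection Freq in perl

    #!/usr/bin/perl
    # Usage: cat test/file*.txt | collection_freq.pl
    # Reads text on stdin and prints "rank word count", most frequent first.
    while (<>) {
        chomp;
        tr/A-Z/a-z/;                       # lowercase the line
        foreach my $word (split) {         # split on whitespace
            next if $word !~ /^[a-z]/;     # skip tokens that do not start with a letter
            ++$cf{$word};                  # collection frequency of this token
        }
    }
    foreach my $word (sort { $cf{$b} <=> $cf{$a} } keys %cf) {
        print ++$m, " $word $cf{$word}\n"; # $m is the running rank
    }
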
Collection Frequencies (from 30 file dataset 2010)
Each cell: rank, term, collection frequency. Ranks between the column groups are omitted.

 1 the   918 | 37 more        38 | 129 off           12 | 190 before      10 | 3307 circular      1
 2 of    551 | 38 which       38 | 130 understanding 12 | 191 going        9 | 3308 division      1
 3 to    476 | 39 about       37 | 131 several       12 | 192 video        9 | 3309 range         1
 4 and   444 | 40 but         36 | 132 interactions  12 | 193 working      9 | 3310 quarter       1
 5 in    364 | 41 been        35 | 133 if            12 | 194 business     9 | 3311 informed      1
 6 a     363 | 42 first       35 | 134 neurons       12 | 195 mobile       9 | 3312 transmitter   1
 7 that  212 | 43 had         33 | 135 carbon        12 | 196 sequences    9 | 3313 jury          1
 8 for   170 | 44 brain       32 | 136 spacecraft    12 | 197 map          9 | 3314 medium        1
 9 is    152 | 45 these       31 | 137 science       12 | 198 increase     9 | 3315 root          1
10 on    135 | 46 cells       29 | 138 must          12 | 199 own          9 | 3316 severe        1
11 are   106 | 47 who         26 | 139 launch        12 | 200 possible     9 | 3317 beta          1
12 as    100 | 48 space       26 | 140 them          12 | 201 later        9 | 3318 transmission  1
13 with   98 | 49 up          26 | 141 gene          12 | 202 lead         9 | 3319 repair        1
14 at     94 | 50 what        25 | 142 human         12 | 203 stories      9 | 3320 implies       1
15 from   91 | 51 development 25 | 143 mammalian     12 | 204 professor    9 | 3321 remained      1
16 will   84 | 52 genes       25 | 144 he            11 | 205 each         9 | 3322 declined      1
17 by     79 | 53 also        25 | 145 known         11 | 206 your         9 | 3323 doldrums      1
18 have   78 | 54 all         24 | 146 both          11 | 207 court        9 | 3324 sudden        1
19 said   77 | 55 than        24 | 147 get           11 | 208 report       9 | 3325 perspective   1
20 it     74 | 56 data        24 | 148 appear        11 | 209 robot        9 | 3326 community     1
21 be     71 | 57 into        24 | 149 many          11 | 210 analysis     9 | 3327 catalyze      1
22 this   67 | 58 some        24 | 150 life          11 | 211 next         9 | 3328 answers       1
23 an     62 | 59 now         24 | 151 say           11 | 212 similar      9 | 3329 represents    1
24 was    57 | 60 million     23 | 152 together      11 | 213 same         9 | 3330 primary       1
25 new    53 | 61 you         23 | 153 observations  11 | 214 since        9 | 3331 statistically 1
26 has    53 | 62 over        23 | 154 radio         11 | 215 done         9 | 3332 absent        1
27 not    53 | 63 most        22 | 155 feature       11 | 216 early        9 | 3333 availability  1
28 its    52 | 64 between     22 | 156 her           11 | 217 used         9 | 3334 modifications 1
29 they   50 | 65 found       22 | 157 where         11 | 218 event        9 | 3335 picture       1
30 were   50 | 66 like        22 | 158 percent       11 | 219 his          9 | 3336 competition   1
31 their  48 | 67 time        22 | 159 then          11 | 220 methylation  9 | 3337 requests      1
32 we     46 | 68 cell        22 | 160 changes       11 | 221 while        9 | 3338 thin          1
33 other  45 | 69 way         21 | 161 aphids        11 | 222 buckyballs   9 | 3339 seriously     1
34 or     42 | 70 when        21 | 162 make          11 | 223 browser      9 | 3340 analyze       1
35 one    41 | 71 may         21 | 163 news          11 | 224 health       9 | 3341 candidate     1
36 can    38 | 72 how         21 | 164 do            11 | 225 temperature  9 | 3342 clearer       1

Collection Frequency vs Rank
30 document test data for Assignment 1 (Aug ’10 from nasa/cnet/bio-news, T=17428 total tokens, M=3342 distinct, 1856 appear once [1487–3342])
[Figure: collection frequency versus rank, plotted on linear axes and on log-log axes; labeled points include the, of, to, and, in, a, that, for, is, on, are, and brain (44,32)]

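The numbers in the caption can be cross-checked with a few lines of arithmetic. The Python sketch below uses only figures quoted on the slides (T, M, the singleton rank range, the top count 918 for "the", and the point brain (44,32)); the variable names are illustrative, and the 1/rank comparison is the idealized form of Zipf's law rather than a fit to this data.

```python
TOTAL_TOKENS = 17428         # T, total tokens in the 30-document test data
DISTINCT_TERMS = 3342        # M, distinct terms
FIRST_SINGLETON_RANK = 1487  # ranks 1487..3342 hold the terms that appear once
LAST_SINGLETON_RANK = 3342
TOP_COUNT = 918              # collection frequency of "the" (rank 1)
BRAIN_RANK, BRAIN_COUNT = 44, 32

# Number of ranks in the inclusive range of terms that appear exactly once.
singletons = LAST_SINGLETON_RANK - FIRST_SINGLETON_RANK + 1
singleton_share = singletons / DISTINCT_TERMS
mean_occurrences = TOTAL_TOKENS / DISTINCT_TERMS
# Idealized Zipf prediction for the rank of "brain": top count divided by rank.
brain_predicted = TOP_COUNT / BRAIN_RANK

if __name__ == "__main__":
    print(f"terms appearing once: {singletons} ({singleton_share:.1%} of the vocabulary)")
    print(f"mean occurrences per distinct term: {mean_occurrences:.2f}")
    print(f"brain: observed {BRAIN_COUNT}, 1/rank prediction {brain_predicted:.1f}")
```

The rank range reproduces the 1856 singletons stated in the caption, which is a bit more than half of the vocabulary: few frequent terms, many rare terms.
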
Document Frequencies (from 30 file dataset 2010)
Each cell: rank, term, document frequency. Ranks between the column groups are omitted.

             |                 | 110 you           10
 1 a      30 | 37 not       20 | 111 already        9 | 164 addition    7 | 1351 aaron      1
 2 and    30 | 38 they      20 | 112 am             9 | 165 cell        7 | 1352 abandoning 1
 3 in     30 | 39 about     19 | 113 another        9 | 166 changes     7 | 1353 absent     1
 4 is     30 | 40 been      19 | 114 based          9 | 167 company     7 | 1354 absolutely 1
 5 of     30 | 41 first     19 | 115 during         9 | 168 didnt       7 | 1355 abundant   1
 6 the    30 | 42 when      19 | 116 get            9 | 169 director    7 | 1356 academy    1
 7 to     30 | 43 which     18 | 117 go             9 | 170 discovered  7 | 1357 accelerate 1
 8 by     29 | 44 also      17 | 118 he             9 | 171 discovery   7 | ...
 9 for    29 | 45 can       17 | 119 however        9 | 172 do          7 | ...
10 on     29 | 46 these     17 | 120 information    9 | 173 done        7 | 3987 wolfson    1
11 that   29 | 47 may       16 | 121 institute      9 | 174 dont        7 | 3988 woman      1
12 with   29 | 48 now       16 | 122 international  9 | 175 each        7 | 3989 wondering  1
13 are    28 | 49 who       16 | 123 journal        9 | 176 few         7 | 3990 words      1
14 as     28 | 50 between   15 | 124 just           9 | 177 focus       7 | 3991 worker     1
15 from   28 | 51 most      15 | 125 large          9 | 178 going       7 | 3992 workers    1
16 this   28 | 52 well      15 | 126 life           9 | 179 having      7 | 3993 worst      1
17 at     26 | 53 had       14 | 127 nasas          9 | 180 help        7 | 3994 worth      1
18 have   26 | 54 like      14 | 128 science        9 | 181 internet    7 | 3995 wreak      1
19 an     25 | 55 over      14 | 129 scientists     9 | 182 later       7 | 3996 writing    1
20 be     25 | 56 some      14 | 130 similar        9 | 183 launch      7 | 3997 wrong      1
21 it     25 | 57 those     14 | 131 since          9 | 184 likely      7 | 3998 year-old   1
22 will   24 | 58 up        14 | 132 space          9 | 185 make        7 | 3999 yearlong   1
23 has    23 | 59 what      14 | 133 system         9 | 186 many        7 | 4000 yield      1
24 its    23 | 60 would     14 | 134 understanding  9 | 187 member      7 | 4001 york       1
25 but    22 | 61 all       13 | 135 use            9 | 188 might       7 | 4002 youd       1
26 one    22 | 62 including 13 | 136 used           9 | 189 national    7 | 4003 youll      1
27 other  22 | 63 only      13 | 137 way            9 | 190 news        7 | 4004 younger    1
28 said   22 | 64 our       13 | 138 where          9 | 191 patterns    7 | 4005 youngest   1
29 their  22 | 65 so        13 | 139 aug            8 | 192 percent     7 | 4006 youre      1
30 was    22 | 66 such      13 | 140 basic          8 | 193 possible    7 | 4007 youtube    1
31 we     22 | 67 than      13 | 141 before         8 | 194 power       7 | 4008 zamponi    1
32 new    21 | 68 time      13 | 142 different      8 | 195 program     7 | 4009 zarya      1
33 or     21 | 69 world     13 | 143 down           8 | 196 published   7 | 4010 zdnet      1
34 were   21 | 70 according 12 | 144 even           8 | 197 researchers 7 | 4011 zhao       1
35 august 20 | 71 both      12 | 145 future         8 | 198 say         7 | 4012 zuckerberg 1
36 more   20 | 72 could     12 | 146 his            8 | 199 several     7 | 4013 zune       1

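Document frequency differs from collection frequency only in that each term is counted at most once per document. The Python sketch below illustrates the distinction on a made-up three-document collection. The tokenization (lowercase, split on whitespace, keep tokens starting with a-z) is an assumption borrowed from the Perl script for collection frequency; the slides do not say how the document frequency list was tokenized, and the function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, keep tokens that start with a-z (assumed rule)."""
    return [tok for tok in text.lower().split() if re.match(r"[a-z]", tok)]


def collection_frequencies(documents):
    """Total number of occurrences of each term across all documents."""
    cf = Counter()
    for text in documents:
        cf.update(tokenize(text))
    return cf


def document_frequencies(documents):
    """Number of documents that contain each term at least once."""
    df = Counter()
    for text in documents:
        df.update(set(tokenize(text)))  # set(): count a term once per document
    return df


if __name__ == "__main__":
    # Hypothetical toy documents, not taken from the assignment data.
    docs = ["the brain and the cells", "the launch of the spacecraft", "brain cells"]
    cf = collection_frequencies(docs)
    df = document_frequencies(docs)
    for term, count in df.most_common():
        print(f"{term}: df={count}, cf={cf[term]}")
```

A term's document frequency can never exceed the number of documents, which is why the top of the list above is capped at 30 while collection frequencies run into the hundreds.
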