Historic Goals (lecture slides, 10/27/10)

Historic Goals
• “A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”
  – Vannevar Bush, “As We May Think,” Atlantic Monthly, July 1945 (assigned week 2)
• “Google’s mission is to organize the world’s information and make it universally accessible and useful”
  – Google’s mission statement, ~1998

Prophetic: Hypertext
• Vannevar Bush (1890-1974), Director of the Office of Scientific Research and Development (1941-1947)
• End of WW2: what is the next big challenge for scientists?
• Vannevar Bush’s 1945 vision: “associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.”
• “This is a much larger matter than merely the extraction of data for the purposes of scientific research; it involves the entire process by which man profits by his inheritance of acquired knowledge”

Prophetic: Wikipedia et al
• “Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.”

How have we achieved search capability?
• Vannevar Bush envisioned a personal index
• General open collections:
  – keyword/subject-based search → full-text search → hypertext-enhanced search

Full-text search: beginnings
• Gerald Salton (1927-1995), founding father of information retrieval
  – SMART retrieval system
• Although search has changed, classic techniques still provide foundations: our starting point
  – Gerald Salton’s SMART project, 1960’s

Think first about text documents
• Early digital searches: digital card catalog
  – subject classifications, keywords
• “Full text”: words + English structure
  – no “meta-structure”

Information Retrieval
• User wants information from a collection of “objects”: the information need
• User formulates the need as a “query”
  – language of the information retrieval system
• System finds objects that “satisfy” the query
• System presents objects to the user in a “useful form”
• User determines which objects from among those presented are relevant
• Define each of the words in quotes:
  – information object
  – query
  – satisfying objects
  – useful presentation

Information Retrieval, cont.
• Notion of relevance is critical
  – What does the user really want?
  – Insufficient structure for exact retrieval
• Develop algorithms for the search and retrieval tasks

Modeling: “satisfying”
• What determines if a document satisfies a query? That depends on:
  – the document model
  – the query model
  – the definition of “satisfying” can still vary
• START SIMPLE: better understanding; use components of the simple model later

AND Model
• Document: set of terms
• Query: set of terms
• Satisfying: a document satisfies the query if all terms of the query appear in the document
• Currently used by Web search engines
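The AND model above can be sketched in a few lines of Python. This is a minimal sketch under the slides’ assumptions: documents and queries are modeled as plain lists of words, and the example document is a made-up stand-in, not taken from the slides.

```python
def satisfies_and(document, query):
    """AND model: a document satisfies the query only if
    every query term appears in the document."""
    return set(query) <= set(document)

# Hypothetical document and queries, for illustration only.
doc = "an introduction to computer science".split()
print(satisfies_and(doc, ["computer", "science"]))   # True
print(satisfies_and(doc, ["computer", "records"]))   # False
```

The set-subset test ignores word order and repetition, which is exactly what the “set of terms” model on the slide does.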

OR Model
• Document: set of terms
• Query: set of terms
• Satisfying: a document satisfies the query if one or more terms of the query appear in the document
• Original IR model – why?

Scaling
• What attributes are changing from 1960’s searches to the online searches of today?
• How do they change the problem?

Introducing Ranking
• Order the documents that satisfy a query by how well they match the query
• How do we capture relevance to the user by an algorithmic method of ordering?
• Gerald Salton’s major idea: score documents by the frequency with which the words of the query occur
  – take into account document length and the frequency of words in the collection
  – one of many major contributions

Frequency Model example: scoring documents (vector model)
• Doc 1: “Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new -- underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies… Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.” (cos 116 description)
  – Frequencies: science 1; knowledge 2; principles 0; engineering 0
• Doc 2: “An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science …” (cos 126 description)
  – Frequencies: science 2; knowledge 0; principles 1; engineering 1

                       frequency         frequency-adjusted word value
                     Doc 1   Doc 2          Doc 1   Doc 2
  “science”            1       2             .51     1.02
  “engineering”        0       1             0       1.6
  “principles”         0       1             0       1.6
  “knowledge”          2       0             3.2     0
  Combined             3       4             3.71    4.22   SCORE
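The frequency-adjusted scoring above can be sketched as follows. The slide does not give the exact weighting formula behind its numbers, so this sketch uses a common tf × log-idf style stand-in (rare words in the collection count for more), and the document contents are shortened stand-ins for the two course descriptions.

```python
import math

def score(query, doc, collection):
    """Sum, over the query terms, of term frequency in the document
    weighted by the term's rarity in the collection.
    (The slide's exact 'frequency-adjusted' weights are unspecified;
    tf * log(1 + N/df) is used here as a stand-in.)"""
    n = len(collection)
    total = 0.0
    for term in query:
        tf = doc.count(term)                          # occurrences in this doc
        df = sum(1 for d in collection if term in d)  # docs containing the term
        if tf and df:
            total += tf * math.log(1 + n / df)        # rarer terms weigh more
    return total

# Shortened stand-ins for the cos 116 / cos 126 descriptions.
doc1 = ["science", "knowledge", "knowledge"]
doc2 = ["science", "science", "principles", "engineering"]
docs = [doc1, doc2]
query = ["science", "knowledge", "principles", "engineering"]
# As on the slide, Doc 2 outscores Doc 1.
print(score(query, doc1, docs) < score(query, doc2, docs))  # True
```

With these weights, “science” (appearing in both documents) contributes less per occurrence than “principles” or “engineering” (each in only one document), reproducing the slide’s ordering even though the exact numbers differ.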

Review
• Current collections
  – millions to trillions of documents, plain text, petabytes of information
• Queries of a few words
• Create an index of the collection organized by word
  – words sorted alphabetically
  – binary search look-up
• Design a ranking method
  – classic: word frequency
  – mark-up (e.g. HTML): attributes of words in the document
    • word position, appearance, …

Using word occurrence and features
• word frequency in documents
• positions of words in documents
• appearance in special parts of documents
  – title
  – abstract
  – section header
  – marked-up text
  – …
• special features of a word
  – bold font
  – larger font
  – …

Revisit the index with ranking in mind
• Retrieval systems record all the information they will use about a document in the index
• Index organized by word
  – all words in all documents = lexicon
• For each word, the index records a list of:
  – documents in which it appears (SORTED!)
    • positions at which it occurs in each document (SORTED!)
    • attributes for each occurrence
• Record summary information for documents
• Record summary information for words
• Means the index is about as big as the combination of documents!

Processing a query
• Find the index entry for each word in the query
• Each index entry gives a list of documents containing the word, usually sorted by document ID
• Scan through the lists in parallel looking for documents containing all the query words
  – Sorting makes this linear time!
• Process the positions and other attributes of the query words for the documents containing all query words
  – Allows looking at how near the different query words are to each other in a document

Along came the Web
• Major new element: links
  – hypertext
• Early Web search engines did not use links
  – Excite, WebCrawler, Lycos, Infoseek, AltaVista, Inktomi, Ask Jeeves (now Ask) and more
• How to use links?
  – anchor text
  – link analysis

Anchor text
• The words used when making a link to another document: “Assignment 1 is now available.”
• Add the words of the anchor text (“Assignment 1”) to the text of the document pointed to
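The index and query-processing slides above can be sketched as an inverted index with sorted posting lists and a linear-time merge. This is a minimal sketch: a real index would also record per-occurrence positions and attributes, which are omitted here, and the sample documents are invented for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: word -> sorted list of IDs of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def intersect(a, b):
    """Merge two sorted posting lists in linear time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, words):
    """IDs of docs containing all query words (AND model)."""
    lists = [index.get(w.lower(), []) for w in words]
    lists.sort(key=len)              # start from the shortest posting list
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return result

# Hypothetical three-document collection.
docs = ["full text search", "text search engines", "historic goals"]
idx = build_index(docs)
print(and_query(idx, ["text", "search"]))     # [0, 1]
print(and_query(idx, ["search", "engines"]))  # [1]
```

Because both posting lists are sorted, the merge advances each pointer at most once per entry, which is the “sorting makes this linear time” point on the slide.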

Example / Example 2
• 2nd result and 14th result for a search on Google: “toxin” appears in the URL
  [slides show search-result screenshots]

Link analysis
• Intuition: when a Web page points to another Web page, it confers status/authority/popularity on that page
• Find a scoring of pages that captures this intuition

Graph model
• Capture relationships between things
• Nodes represent things (Web pages)
• Edges between nodes represent that things are related (links between pages)
  – directed or undirected (here: directed)
• Many, many applications
• Studied mathematically
• Many algorithms for graph problems

PageRank
• The algorithm that gave Google its “killer” performance
  – Larry Page, Sergey Brin, R. Motwani, T. Winograd (1998)
• Random walk model + random leap
  [slide shows a small directed graph on nodes 1-4]

More graph examples
• Social network: node = person, edge if friends
  – directed or undirected?
• Rail system: node = station, edge if non-stop train service between stations
• Tournament: node = contestant / team, edge from team A to team B if A beat B
  – directed
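The random-walk-plus-random-leap model above can be sketched by power iteration. This is a minimal sketch: the 0.85 damping value is the commonly cited choice rather than something given on the slide, and the toy graph below is hypothetical (the slide’s own graph did not survive extraction).

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank: with probability `damping` the random
    surfer follows an outgoing link of the current page; otherwise it
    leaps to a uniformly random page. `links` maps each page to the
    list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}  # leap contribution
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical 3-page graph: page 3 has two in-links, so it ranks highest.
graph = {1: [2, 3], 2: [3], 3: [1]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # 3
```

The total rank stays at 1 on every iteration (the leap term redistributes exactly what the damping factor withholds), so the result can be read directly as a probability distribution over pages.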
