
TDT4215 Web Intelligence: Introduction



  1. Slide 1: TDT4215 Web Intelligence
     Main topics:
     • Information retrieval
     • Large textual document collections
     • Text mining
     • NLP for document analysis
     • Ontologies for document management
     How to extract knowledge from large document collections?
     TDT4215 - Introduction

     Slide 2: Lectures and Exercises
     Lectures
     • Researcher Stein L. Tomassen
     • Additional lecturers:
       - PhD student Geir Solskinnsbakk
       - PhD student Wei Wei
       - PhD student Nattiya Kanhabua
     • Guest lectures:
       - PhD George Tsatsaronis from Athens University of Economics and Business
       - PhD student Simon Jonassen
     • Thursdays 08.15-11.00 in S6 (that’s right, three hours!)
     Exercises
     • Researcher Stein L. Tomassen
     • Fridays 14.15-16.00 in F3
     All relevant information is continuously published at http://www.idi.ntnu.no/emner/tdt4215/

  2. Slide 3: Text Materials
     • Baeza-Yates & Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, 1999. (selected chapters)
     • Manning, Raghavan and Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. (selected chapters, available for download)
     • Compendium from IDI (selected book chapters and papers)
     • Details are published at the homepage of the course

     Slide 4: Assessment
     • Group project: 25% of grade
       - Groups of 3-5 people
       - Discuss a particular theoretical topic
       - Develop an information retrieval / text mining application
       - Evaluate the application
       - To be carried out in the first half of the term (25th Feb to 7th Apr)
       - Stein L. Tomassen is responsible for the group project
     • Individual written examination: 75% of grade
       - 20th of May
       - 4-hour written examination (discussions, calculations, no programming)
       - Based on everything we will learn in the course

  3. Slide 5: Course Characteristics
     • Experimental science:
       - No clear answers or theories
       - Lots of formulas (that are hard to justify)
     • Relevance:
       - Concerns real-world problems
       - A basis for knowledge management applications: search engines, document management systems, publication systems, digital libraries, enterprise business applications, business/web intelligence systems, semantic interoperation/integration software, etc.
     • Multi-disciplinary:
       - Combines techniques from several other sciences: statistics, linguistics, conceptual modeling, artificial intelligence, databases, etc.

     Slide 6: Projects and Exercises Important
     • One mandatory project:
       - Practice in setting up an application
       - How to evaluate the quality of IR/TM applications?
       - How to extract knowledge from specific types of text? Which techniques for which types of text?
     • Exercises:
       - Examples from lectures
       - Understand how formulas are used in practice
       - Be comfortable with “unproven theories”
       - Representative of examination questions
     • Exercises are important!
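
The slide asks how to evaluate the quality of IR/TM applications but does not name a measure. As a minimal sketch, assuming set-based precision, recall and F1 are the measures of interest (a common choice in information retrieval, not one stated on the slides), the calculation could look like this. The function name and the document identifiers are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based evaluation of one query (assumed measures, not from the slides).

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision penalizes irrelevant results and recall penalizes missed ones, so reporting both (or their harmonic mean, F1) gives a more balanced picture than either alone.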

  4. Slide 7: Lecture Plan (1)

     Slide 8: Lecture Plan (2)

  5. Slide 9: Lecture Plan (3)

     Slide 10: From Documents to Knowledge
     • Document collections
     • Knowledge and documents
     • Document retrieval
     • Text mining
     • Ontologies
     80% of organizational data is textual with no proper structure!

  6. Slide 11: Overall approach
     [Diagram. Labels: Information Retrieval, Text Mining, Ontology; Retrieve document, Discover knowledge; Knowledge elicitation, Knowledge representation; Morpho-syntax, Semantics; Text; Existing, New]

     Slide 12: Document Collections
     • Domain-dependent or domain-independent
     • Structured or non-structured text
     • Formatted or non-formatted documents
     • Textual or multimedia documents
     • Monolingual and multilingual document collections
     • Centralized or non-centralized document management
     • Confidential or non-confidential
     • Controlled or free addition of documents
     • Stable or non-stable collections
     [Diagram. Labels: User, Information system, Document collection]

  7. Slide 13: Case 1: SAP at STATOIL
     • SAP used for major internal business processes
     • Named user accounts: 29,000
     • Concurrent users: 3,200
     • System complexities:
       - 894,000 customers
       - 18,000 vendors
       - 382,000 materials
     • Work orders created each month: 11,000
     • Sales orders created each month: 245,000 (11,600 per day)
     • Documents produced each month: 2.25 million
     • Growth of database: 35 GB per month (Aug 2001)
     • Document characteristics: highly structured, textual and tabular, formatted, controlled addition, high growth, non-centralized, possibly multilingual

     Slide 14: Case 2: Reengineering project at Hydro Agri
     • Objective: Reengineer organization and implement SAP R/3 to support business processes
     • Project duration: July 1995 to March 1999
     • Costs: USD 126 million
     • Staffing: 500+ (140 external consultants)
     • Document management: Specialized Lotus Notes databases
     • Document production:
       - SHARE Training: 1061 docs, 868 MB
       - SHARE Test: 1632 docs, 218 MB
       - SHARE Development: 12859 docs, 218 MB
       - HAE User document.: 1312 docs, 133 MB
       - TOTAL: 16864 docs, 1437 MB (359 per month, 12 per day)

  8. Slide 15: Text is Difficult
     • Most organizational knowledge encoded in textual documents
     • Unstructured or semi-structured text difficult to retrieve, interpret or analyze
     • Particular problems:
       - Inconsistent documents
       - Incomplete descriptions
       - Duplicates
       - Different terminologies/languages/abbreviations/perspectives

     Slide 16: Knowledge and Documents
     • One particular document is needed
       - E.g.: What textbook is used in TDT4215?
     • Several documents provide partial answers
       - E.g.: What is the definition of “text mining”?
     • All documents contribute to answer
       - E.g.: Who writes about Rosenborg?
     • Words versus concepts
     • Manual inspection versus automatic reasoning

  9. Slide 17: Document Retrieval
     • Information retrieval = information access
     • Retrieve documents that satisfy a user’s information need from a document collection
       - Document indexing
       - Query interpretation
       - Ranking of retrieved documents
       - Linguistics and statistics
     [Diagram. Labels: Documents, Document representations, Query formulation, Identify relevant information, Display documents to user]

     Slide 18: Document Retrieval Example
     • AllTheWeb from Fast Search & Transfer (2002)
     • Index: 2.1 billion documents
     • Languages supported: 52
     • Linguistics used: lemmatization, language identification, phrasing, anti-phrasing, text categorization, clustering, offensive content reduction, finite-state automata
     • 30 million queries a day
     • www.alltheweb.com is today part of Yahoo and uses the Inktomi search engine
     • The old AllTheWeb search engine used Yahoo’s verticals
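
The three retrieval steps listed under Document Retrieval (document indexing, query interpretation, ranking of retrieved documents) can be illustrated with a small sketch. It assumes an inverted index and a plain tf × idf score; the slides do not specify either, and the function names and the three toy documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Query/document interpretation: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Document indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, n_docs):
    """Ranking: score each document by the sum of tf * idf over query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy collection.
docs = {
    "d1": "text mining extracts knowledge from text",
    "d2": "information retrieval ranks documents",
    "d3": "ontologies represent knowledge",
}
index = build_index(docs)
results = search("text mining", index, len(docs))
print(results)
```

A production engine would add the linguistic processing named on the AllTheWeb slide (for example lemmatization and phrasing) before indexing; this sketch only splits on word boundaries.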
