
Data and information services: Scientific data directories/



  1. Digital Libraries: From Information retrieval to search engines, e-books, e-libraries & related topics
     DL - 2004 Introduction – Beeri/Feitelson

     General background
     Driving forces – rapid evolution of:
     hardware
     • Computing power
     • Memory
     • Networking (internet)
     software
     • DB systems
     • IR systems
     • Hypertext (WWW)
     • GUI/ presentation tools (html)

     Consequences:
     Transformation of existing applications
     Generation of new applications related to data collection, organization, classification, access
     Some examples:

     (Classical) Libraries:
     • Automation of catalogs (old stuff)
     • On-line e-journals
     • Collections of born-digital materials
     New:
     • Digitized collections (images, maps,..)
     • on-line archives
     • On-line, virtual museums

     Digital libraries & bibliographic services:
     • ACM Digital Library (link: acm-diglib): collection of all (full) papers from ACM journals
     • SIGMOD digital anthology (link: anthology)
     • DBLP (link: dblp): collection of bibliographic information
     • Citeseer (link: citeseer): citation and impact factor data

     Portals, directories, search engines
     • Yahoo – a directory (manual labor)
     • Google – a search engine (fully automatic): IR technology & hypertext structure
     • Amazon (& similar on-line sales companies): book/ dvd/ .. portal/ directory

  2. Data and information services:
     • Medline & US National Library of Medicine (link: nlmed)
     • On-line encyclopedias: Wikipedia, art encyclopedia
     • Lexis-Nexis (and many like it)

     Scientific data directories/ repositories/ portals:
     • Bioinformatics: over 500 DBs/ data sources (link: bio-source)
     • Astronomy – the world-wide telescope project (link: wwt)
     • Classical humanities – the Perseus digital library (link: perseus)

     Issues:
     • Heterogeneity -> data transformation/ integration
     • Dependence on experiments -> scientific experiment meta-data
     • Huge volumes -> fast, approximate, on-line stream processing

     New kinds of services:
     • Query subscription on XML/ data streams (links: niagara, niagaracq)
       – news
       – Stocks
       – satellite data
     Issues:
     Fast streams
     Millions and more queries & subscribers
     Needs ultra-fast
     • stream processing
     • Query evaluation/ data routing

     A few more relevant buzzwords:
     • Customer profiling: targeted advertisement
     • Knowledge management: collecting, managing, exploiting the know-how in large organizations
     • E-learning
     • E-publication
     End of examples

     Library/IR basic concepts
     Classification:
     Axiom: unique position for a book
     -> Unique call number (+ secondary subjects)
     -> hierarchical & universal classification system: Dewey (1876), Library of Congress (LC, 1900)
     + subjects -> books
     x Single main catalog
     x Claim to universality
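The query-subscription service listed above (many standing queries, fast streams, items routed to matching subscribers) can be illustrated with a small sketch. It is not the Niagara/NiagaraCQ design; it only assumes that each subscription is a set of keywords and that an item matches when it contains all of them. The class name `SubscriptionRouter` and its methods are hypothetical.

```python
from collections import defaultdict


class SubscriptionRouter:
    """Routes stream items to subscriptions whose keywords all occur in the item.

    Assumption: a subscription is a plain set of keywords (no XML paths, no ranking).
    """

    def __init__(self):
        # term -> ids of subscriptions that require this term
        self._by_term = defaultdict(set)
        # subscription id -> number of distinct terms it requires
        self._needed = {}

    def subscribe(self, sub_id, terms):
        terms = {t.lower() for t in terms}
        self._needed[sub_id] = len(terms)
        for term in terms:
            self._by_term[term].add(sub_id)

    def route(self, item_text):
        """Return the sorted ids of subscriptions fully satisfied by the item."""
        hits = defaultdict(int)
        # Only words present in the item are looked up, so the cost depends on
        # the item length and the matching subscriptions, not on all subscriptions.
        for word in set(item_text.lower().split()):
            for sub_id in self._by_term.get(word, ()):
                hits[sub_id] += 1
        return sorted(s for s, n in hits.items() if n == self._needed[s])


router = SubscriptionRouter()
router.subscribe("alice", ["stocks", "acme"])
router.subscribe("bob", ["satellite"])
print(router.route("ACME stocks fall"))
print(router.route("new satellite image"))
```

Indexing the subscriptions rather than the data is the design choice worth noting: with millions of standing queries, each arriving item touches only the subscriptions that share a term with it.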

  3. Metadata: data describing a collection/ item
     Example: library bibliographic record for a book (many style/ spelling … conventions used here)
     See: Dublin Core & its use at Stanford

     Operations in collection creation/ maintenance:
     Cataloging: create metadata record for an item
     Indexing: identify key terms (in all text/ some fields)
     • Controlled -- uses a fixed vocabulary
     • Uncontrolled -- terms chosen by indexer
     Abstraction: create short description (of key ideas)

     Products:
     • Bibliographic record db
     • Author/ title catalog
     • Subject (header) catalog (provides entry points in on-line catalogs)
     Currently, operations performed manually
     • Expensive, slow, time-consuming
     • Require experts
     • Results are non-uniform (even with experts)
     Automatic indexing & abstracting --- research areas

     Thesaurus (data dictionary): a vocabulary of standard terms/ concepts
     • Hierarchical organization – every term has
       – BT – broader term, NT – narrower terms
     • Additional relationships
       – Synonyms, RT – related terms
     Used for:
     • Indexing -> standardization of index terms
     • Querying -> standardization of query terms
     (supposedly solves problem of non-uniform indexing)
     Creation: complex, lengthy, community process
     See also: ontology, KWIC (google them!)

     Technology/theory background
     Structured data – dbms
     • Schema (structure) describes data precisely
     • Queries & query language (based on structure)
     • Indices, query optimization
     covered in DB course
     Big challenge: integration of multiple heterogeneous autonomous sources

     Unstructured data – (free) text: information retrieval
     Data = collection of texts
     Query = set of terms (words)
     Indices = inverted lists (words to locations)
     Results = ranked answers
     Issues/ challenges:
     • Answers are imprecise, approximate
     • Difficult to evaluate answer goodness
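The thesaurus structure described above (BT, NT, RT, synonyms, and its use for standardizing index and query terms) can be sketched as a small data structure. This is a minimal illustration, not a standard thesaurus format; the class `Thesaurus`, its method names, and the sample vocabulary are all hypothetical, and terms are assumed to be stored in lower case.

```python
from typing import Dict, List, Optional


class Thesaurus:
    """Controlled vocabulary with BT/NT hierarchy, RT links and synonyms."""

    def __init__(self) -> None:
        self.broader: Dict[str, Optional[str]] = {}   # term -> BT
        self.narrower: Dict[str, List[str]] = {}      # term -> NT list
        self.related: Dict[str, List[str]] = {}       # term -> RT list
        self.preferred: Dict[str, str] = {}           # any entry word -> standard term

    def add_term(self, term: str, broader: Optional[str] = None) -> None:
        self.broader[term] = broader
        self.narrower.setdefault(term, [])
        self.related.setdefault(term, [])
        self.preferred[term] = term
        if broader is not None:
            self.narrower.setdefault(broader, []).append(term)

    def add_synonym(self, synonym: str, term: str) -> None:
        self.preferred[synonym.lower()] = term

    def add_related(self, a: str, b: str) -> None:
        self.related.setdefault(a, []).append(b)
        self.related.setdefault(b, []).append(a)

    def normalize(self, word: str) -> Optional[str]:
        """Map an index or query word to its standard term (None if unknown)."""
        return self.preferred.get(word.lower())

    def broader_chain(self, term: str) -> List[str]:
        """Follow BT links upward; assumes the hierarchy has no cycles."""
        chain = []
        current = self.broader.get(term)
        while current is not None:
            chain.append(current)
            current = self.broader.get(current)
        return chain


t = Thesaurus()
t.add_term("animals")
t.add_term("mammals", broader="animals")
t.add_term("cats", broader="mammals")
t.add_term("dogs", broader="mammals")
t.add_synonym("felines", "cats")
t.add_related("cats", "dogs")
print(t.normalize("Felines"), t.broader_chain("cats"))
```

Running the same `normalize` step on both the indexer's terms and the user's query terms is what gives the standardization the slide refers to.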

  4. Hypertext/ www: html, http, soap, …
     A browsing model
     covered in bsdi course
     Issues/ disadvantages:
     • No notion of query, just browsing
     • No structure on data
     • no data quality guarantees
     Current move: extend to semantic web

     Semi-structured data --- XML:
     A standard for data exchange, also stored
     covered in bsdi course
     • Self-describing data
     • Optional meta-data --- DTD/ schemas
     • Validation tools
     • query language
     • Stream processing tools (under development)

     Machine learning:
     Used for automatic
     • Classification
     • Indexing
     • Abstracting
     • Clustering
     of free text/ semi-structured data
     covered in machine learning courses
     End of technology survey

     The course
     • IR -- classical to Google
       – System architectures
       – Kinds of queries
       – Auxiliary data structures (indices) & efficient query processing
       – Compression, a bit of theory, uses in IR
       – Extensions to hypertext (using link structure)
     • “Conceptual” topics
       – E-books
       – E-publishing

     End of Introduction
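As a small illustration of what "self-describing data" means for XML in the technology survey above: the element names travel with the values, so a program can recover field names without a separate schema. The record below is invented for this sketch (its element names merely echo common bibliographic fields), and the helper `to_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliographic record; no DTD or schema is supplied.
RECORD = """
<book>
  <title>An Example Title</title>
  <creator>A. Author</creator>
  <creator>B. Author</creator>
  <date>2004</date>
</book>
"""


def to_fields(xml_text):
    """Return (record type, {field name: [values]}) read from the tags alone."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    for child in root:
        # Repeated elements (e.g. several creators) accumulate in a list.
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return root.tag, fields


kind, fields = to_fields(RECORD)
print(kind, fields)
```

Because nothing forces two sources to use the same tags, this flexibility is also where the optional DTD/schema and validation tools come in.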
