

  1. Data-intensive Programming Lecture #3 Timo Aaltonen Department of Pervasive Computing

  2. Guest Lectures • I’ll try to organize two guest lectures • Oct 14, Tapio Rautonen, Gofore Ltd, Making sense out of your big data • Oct 7, ???

  3. Outline • Course Work • Apache Sqoop • SQL Recap • MapReduce Examples – Inverted Index – Finding Friends – Computing PageRank • (Hadoop) – Combiner – Other programming languages

  4. Course Work • MySportShop is a sports gear retailer. All sales happen online in their webstore. Examples of their products are game jerseys and sport watches. • The webstore has an Apache web server for the incoming HTTP requests. The web server logs all traffic to a log file. – Using these logs, one can study the browsing behavior of the users. • The sales data of MySportShop is in PostgreSQL, which is a relational database. Among other things, the database has a table order_items containing data on all sales events of the shop.

  5. Course Work: Questions • Based on the data, answer the following questions: 1. What are the top-10 best-selling products in terms of total sales? 2. What are the top-10 browsed products? 3. What anomaly is there between these two? 4. What are the most popular browsing hours?

  6. Course Work • Since the managers of the company don’t use Hadoop but an RDBMS, all the data must be transferred to PostgreSQL • In order to do that – Transfer the Apache logs (with Apache Flume) to HDFS – Compute the viewing frequencies of the different products using MapReduce (Question 2) – Compute the viewing-hour data with MapReduce (Q4) – Transfer the results (with Apache Sqoop) to PostgreSQL – Find answers to the questions in PostgreSQL using SQL (Q1–4); a rough sketch of these last two steps follows below
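For orientation only, here is a hypothetical sketch of the last two steps. Every name below (the HDFS directory, database, credentials, the product_views table, and the columns of order_items) is a placeholder assumption, not the actual course setup; the real values come from the course material.

    # Export MapReduce output from HDFS into a PostgreSQL table (Sqoop 1.4.x).
    # All connection details and names below are placeholders.
    sqoop export \
      --connect jdbc:postgresql://dbhost/mysportshop \
      --username dbuser --password dbpass \
      --table product_views \
      --export-dir /user/hadoop/output/product_views \
      --input-fields-terminated-by '\t'

A question like Q1 could then be answered with an aggregate query, assuming (hypothetically) that order_items has product_name and sales columns:

    -- Top-10 best-selling products by total sales (hypothetical schema).
    SELECT product_name, SUM(sales) AS total_sales
    FROM order_items
    GROUP BY product_name
    ORDER BY total_sales DESC
    LIMIT 10;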

  7. Environment: three options 1. You can use your own computer by installing VirtualBox 5.x – We offer you a virtual machine on which all required software and data have been installed – In the next weekly exercises, the assistants will solve VirtualBox-related problems if you encounter any 2. We offer you a virtual machine in the TUT cloud – All required software and data are installed – No graphical user interface – Guidance available in the weekly exercises 3. Your own installation or a cloud service can be used – No help from the course personnel

  8. Course Work • The work is done in groups of three – Enroll in Moodle: https://moodle2.tut.fi/course/view.php?id=9954 – opens today at 10 o’clock • The deadline is Oct 14th • Instructions for returning the work will be published later – IntelliJ IDEA project

  9. Course Work • Material – https://flume.apache.org/FlumeUserGuide.html – https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.ht ml – http://hadoop.apache.org/docs/r2.7.3/ – https://www.postgresql.org/docs/9.5/static/index.html

  10. MapReduce • Simple programming model • Map is stateless, which allows running map functions in parallel • Reduce can also be executed in parallel • The canonical example is the word count, sketched below
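A minimal sketch of word count in the Hadoop Java API, following the standard example (job configuration and the driver class are omitted for brevity):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every token. The mapper keeps no state
    // across records, so any number of mappers can run in parallel.
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the 1s of one word. Each word's group is independent,
    // so reducers can run in parallel as well.
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }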

  11. Inverted Index • Collating – Problem: there is a set of items and some function of one item. All items that have the same function value must be saved into one file, or some other computation must be performed that requires all such items to be processed as a group. The most typical example is building inverted indexes. – Solution: the mapper computes the given function for each item and emits the function value as the key and the item itself as the value. The reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are (term, document ID) occurrences, the function extracts the term, and the reducer collects for each term the IDs of the documents where it occurs (see the sketch below).
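A minimal Hadoop sketch of this pattern. Extracting the document ID from the input file name is one common approach, assumed here; class names are illustrative:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import java.util.TreeSet;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Map: emit (term, docID) for every word occurrence in the document.
    public class IndexMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the input file name as the document ID (works for
            // file-based input formats, where the split is a FileSplit).
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), new Text(docId));
            }
        }
    }

    // Reduce: collect the document IDs of one term into a sorted,
    // deduplicated posting list, as in the example on the next slide.
    class IndexReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text term, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> postings = new TreeSet<>();
            for (Text id : docIds) {
                postings.add(id.toString());
            }
            context.write(term, new Text(String.join(", ", postings)));
        }
    }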

  12. Simple Inverted Index • Reduced output: word, list of docIDs
  – Doc #1: “This doc contains text” → map output: (this, 1), (doc, 1), (contains, 1), (text, 1)
  – Doc #2: “My doc contains my text” → map output: (my, 2), (doc, 2), (contains, 2), (my, 2), (text, 2)
  – Reduced output: this: 1; doc: 1, 2; contains: 1, 2; text: 1, 2; my: 2

  13. (Normal) Inverted Index • Reduced output: word, list of (docID, frequency)
  – Doc #1: “This doc contains text” → map output: (this, (1,1)), (doc, (1,1)), (contains, (1,1)), (text, (1,1))
  – Doc #2: “My doc contains my text” → map output: (my, (2,1)), (doc, (2,1)), (contains, (2,1)), (my, (2,1)), (text, (2,1))
  – Reduced output: this: (1,1); doc: (1,1), (2,1); contains: (1,1), (2,1); text: (1,1), (2,1); my: (2,2)

  14. Using Inverted Index: Searching • Documents – D1: He likes to wink, he likes to drink. – D2: He likes to drink, and drink, and drink. – D3: The thing he likes to drink is ink. – D4: The ink he likes to drink is pink. – D5: He likes to wink and drink pink ink. • Index – he: (1,2), (2,1) , (3,1), (4,1), (5,1) – ink: (3,1), (4,1), (5,1) – pink: (4,1), (5,1) – thing: (3, 1) – wink: (1,1), (5,1)

  15. Using Inverted Index • Indexing makes search engines fast • The data is sparse, since most words appear in only a few documents – (id, val) tuples, sorted by id – compact – very fast • Queries are evaluated with a linear merge • Index: he: (1,2), (2,1), (3,1), (4,1), (5,1); ink: (3,1), (4,1), (5,1); pink: (4,1), (5,1); thing: (3,1); wink: (1,1), (5,1)

  16. Linear Merge • Find the documents matching the query {ink, wink} – Load the inverted lists of all query words – Linear merge, O(n) • n is the total number of items in the two lists • f() is a scoring function: how well a doc matches the query • Example (sketched in code below): – ink --> (3,1) (4,1) (5,1); wink --> (1,1) (5,1) – Matching set: 5: f(1,1), 1: f(0,1), 3: f(1,0), 4: f(1,0)
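A plain-Java sketch of the merge over the slide's two lists (the record type and List.of need a modern JDK; using a simple sum as f is just one possible choice of scoring function):

    import java.util.List;

    public class LinearMerge {
        // One (docID, count) entry of a posting list, sorted by docID.
        record Posting(int docId, int count) {}

        // Scoring function f: here simply the sum of the query-word counts.
        static int f(int inkCount, int winkCount) {
            return inkCount + winkCount;
        }

        // Walk both docID-sorted lists once: O(n) in the total number of items.
        static void merge(List<Posting> a, List<Posting> b) {
            int i = 0, j = 0;
            while (i < a.size() || j < b.size()) {
                boolean takeA = j == b.size()
                        || (i < a.size() && a.get(i).docId() < b.get(j).docId());
                boolean takeB = i == a.size()
                        || (j < b.size() && b.get(j).docId() < a.get(i).docId());
                if (takeA) {            // doc contains only the first query word
                    System.out.println(a.get(i).docId() + ": " + f(a.get(i).count(), 0));
                    i++;
                } else if (takeB) {     // doc contains only the second query word
                    System.out.println(b.get(j).docId() + ": " + f(0, b.get(j).count()));
                    j++;
                } else {                // doc contains both query words
                    System.out.println(a.get(i).docId() + ": "
                            + f(a.get(i).count(), b.get(j).count()));
                    i++; j++;
                }
            }
        }

        public static void main(String[] args) {
            // The slide's example query {ink, wink}.
            List<Posting> ink = List.of(new Posting(3, 1), new Posting(4, 1), new Posting(5, 1));
            List<Posting> wink = List.of(new Posting(1, 1), new Posting(5, 1));
            merge(ink, wink);   // scores docs 1, 3, 4 and 5, as on the slide
        }
    }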

  17. Scoring Function • Specify which docs are matched – in: counts of query words in a doc – out: ranking score • how well the doc matches the query • 0 if the document does not match – Example, Boolean AND: g(Q, D) = ∏_{q ∈ Q} x_q, where x_q = 1 if n_q > 0 and x_q = 0 if n_q = 0 (n_q is the count of query word q in document D) – 1 iff all query words are present

  18. Phrases and Proximity • Query “pink ink” as a phrase – D4: The ink he likes to drink is pink. – D5: He likes to wink and drink pink ink. • Using a regular index: – match #and(pink, ink) -> both D4 and D5 – scan the matched documents for the query string (slow) • Idea: index all bi-grams as words – pink_ink -> (5,1); drink_pink -> (5,1) – can approximate “drink pink ink” – fast, but the index size explodes – inflexible: can’t query #5(pink, ink) • Better: construct a proximity index

  19. Proximity Index • Embed position information into the inverted lists – called a positional/proximity index (prox-list) – handles arbitrary phrases and windows – key to “rich” indexing: structure, fields, tags, …

  20. Proximity Index • Reduced output: word, list of (docID, location)
  – Doc #1: “This doc contains text” → map output: (this, (1,1)), (doc, (1,2)), (contains, (1,3)), (text, (1,4))
  – Doc #2: “My doc contains my text” → map output: (my, (2,1)), (doc, (2,2)), (contains, (2,3)), (my, (2,4)), (text, (2,5))
  – Reduced output: this: (1,1); doc: (1,2), (2,2); contains: (1,3), (2,3); text: (1,4), (2,5); my: (2,1), (2,4)

  21. Proximity Index • Documents – D1: He likes to wink, he likes to drink. – D2: He likes to drink, and drink, and drink. – D3: The thing he likes to drink is ink. – D4: The ink he likes to drink is pink. – D5: He likes to wink and drink pink ink. • Index – he: (1,1), (1,5), (2,1), (3,3), (4,3), (5,1) – ink: (3,8), (4,2), (5,8) – pink: (4,8), (5,7) – thing: (3,2) – wink: (1,4), (5,4)

  22. Using Proximity Index • Query: “pink ink” • Linear merge – compare the docIDs under the pointers – if they match, check pos(ink) - pos(pink) = 1 – the same idea gives a near operator • Prox-lists (sketched in code below): ink --> (3,8) (4,2) (5,8); pink --> (4,8) (5,7)
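A plain-Java sketch of this check over the slide's prox-lists. For brevity it assumes each word occurs at most once per document, so a posting is a single (docID, position) pair; real prox-lists hold a position list per document:

    import java.util.List;

    public class PhraseMatch {
        // One (docID, position) entry of a prox-list, sorted by docID.
        record Posting(int docId, int pos) {}

        // Linear merge for the phrase "pink ink": report documents where
        // ink occurs at the position directly after pink.
        static void phrase(List<Posting> pink, List<Posting> ink) {
            int i = 0, j = 0;
            while (i < pink.size() && j < ink.size()) {
                if (pink.get(i).docId() < ink.get(j).docId()) {
                    i++;
                } else if (ink.get(j).docId() < pink.get(i).docId()) {
                    j++;
                } else {
                    // Same document: check adjacency, pos(ink) - pos(pink) = 1.
                    // (Replacing "= 1" with "<= k" gives a near operator.)
                    if (ink.get(j).pos() - pink.get(i).pos() == 1) {
                        System.out.println("phrase match in doc " + pink.get(i).docId());
                    }
                    i++; j++;
                }
            }
        }

        public static void main(String[] args) {
            // Prox-lists from the slides: ink -> (3,8) (4,2) (5,8), pink -> (4,8) (5,7)
            List<Posting> ink = List.of(new Posting(3, 8), new Posting(4, 2), new Posting(5, 8));
            List<Posting> pink = List.of(new Posting(4, 8), new Posting(5, 7));
            phrase(pink, ink);  // only doc 5 matches: pink at 7, ink at 8
        }
    }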

  23. Structure and Tags • Documents are not always flat – meta-data: title, author, date – structure: part, chapter, section, paragraph – tags: named entity, link, translation • Options for dealing with structure – create a separate index for each field (like in SQL) – push the structure into the index values – construct an extent index

  24. Extent Index • A special “term” for each element, field or tag – spans a region of text • the words in the span belong to the field – allows multiple overlapping spans – similar to stand-off annotation formats

  25. Extent Index • Documents – D1: He likes to wink, he likes to drink. – D2: He likes to drink, and drink, and drink. – D3: The thing he likes to drink is ink. – D4: The ink he likes to drink is pink. – D5: He likes to wink and drink pink ink. • Index – he: (1,1), (1,5), (2,1), (3,3), (4,3), (5,1) – ink: (3,8), (4,2), (5,8) – pink: (4,8), (5,7) – thing: (3,2) – wink: (1,4), (5,4) – link: (3, 1:2), (4, 1:2), (5, 7:8)

  26. Using Extent Index • Query: find an ink-related hyperlink • Same approach as with proximity – the linear merge now reports a match when the word’s position falls into the tag’s extent – amenable to the same optimizations • Lists (sketched in code below): ink --> (3,8) (4,2) (5,8); link --> (3, 1:2) (4, 1:2) (5, 7:8)
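A minimal sketch of the containment test at the heart of that merge, using the ink and link lists above (the surrounding linear merge is the same as before and is omitted):

    public class ExtentMatch {
        // Does a word occurrence (wordDoc, wordPos) fall inside an extent
        // (extDoc, extStart..extEnd) of a tag such as "link"?
        static boolean inExtent(int wordDoc, int wordPos,
                                int extDoc, int extStart, int extEnd) {
            return wordDoc == extDoc && wordPos >= extStart && wordPos <= extEnd;
        }

        public static void main(String[] args) {
            // From the slides: ink -> (4,2), link -> (4, 1:2)
            System.out.println(inExtent(4, 2, 4, 1, 2));  // true: an ink-related link
            // ink -> (3,8) vs link -> (3, 1:2): the word lies outside the extent
            System.out.println(inExtent(3, 8, 3, 1, 2));  // false
        }
    }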

  27. Overview of Inverted Indices • Normal • Positional – phrases, near operator • Extent – metadata, structure

  28. MR Example: Finding Friends • http://stevekrenzel.com/finding-friends-with-mapreduce • Facebook could use MapReduce in the following way

  29. MR Example: Finding Friends • Facebook has a list of friends for each user – the relation is bidirectional • FB has lots of disk space and serves millions of requests per day • Certain results are pre-computed to reduce the processing time of requests – e.g. “You and Joe have 230 mutual friends” – The list of common friends is quite stable, so recalculating it on every request would be wasteful (see the sketch below)
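A Hadoop sketch of the algorithm described in the post linked above. The input format assumed here, one line per person in the form "A -> B,C,D", is an illustration, as are the class names:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for a line "A -> B,C,D", emit for each friend B the pair (A,B)
    // (sorted, so that both directions of the friendship land in the same
    // reduce group) together with A's whole friend list.
    public class FriendsMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(" -> ");
            String person = parts[0];
            for (String friend : parts[1].split(",")) {
                String pair = person.compareTo(friend) < 0
                        ? person + "," + friend
                        : friend + "," + person;
                context.write(new Text(pair), new Text(parts[1]));
            }
        }
    }

    // Reduce: because the relation is bidirectional, each pair receives
    // exactly two friend lists; their intersection is the pre-computed
    // list of mutual friends.
    class FriendsReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text pair, Iterable<Text> lists, Context context)
                throws IOException, InterruptedException {
            List<List<String>> collected = new ArrayList<>();
            for (Text list : lists) {
                collected.add(Arrays.asList(list.toString().split(",")));
            }
            List<String> mutual = new ArrayList<>(collected.get(0));
            mutual.retainAll(collected.get(1));  // keep friends present in both lists
            context.write(pair, new Text(String.join(",", mutual)));
        }
    }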
