Towards Scalable Multimedia Analytics
Björn Þór Jónsson
Data Systems Group, Computer Science Department, IT University of Copenhagen
Today’s Media Collections
• Massive and growing
  – Europeana > 50 million items
  – DeviantArt > 250 million items (160K/day)
  – Facebook > 1,000 billion items (200M/day)
• Variety of users and applications
  – Novices → enthusiasts → scholars → experts
  – Current systems aimed at helping experts
• Need for understanding and insights
Media Tasks
(figure: the spectrum of media tasks, from exploration to search)
Media Tasks [Zahálka and Worring, 2014]
Multimedia Analytics
(figure: multimedia analytics as the intersection of multimedia analysis and visual analytics)
[Zahálka and Worring, 2014]
From Data to Insight [Zahálka and Worring, 2014; Keim et al., 2010]
The Two Gaps
• Semantic Gap [Smeulders et al., 2000]: generic data annotation based on objective understanding vs. specific context- and task-driven subjective understanding
• Pragmatic Gap [Zahálka and Worring, 2014]: predefined, fixed annotation based on understanding of the collection vs. dynamically evolving and interaction-driven understanding of collections
Multimedia Analytics: State of the Art
• Theory is developing
• Early systems have appeared
• No real-life applications (?)
• Small collections only
Scalable Multimedia Analytics
(figure: scalable multimedia analytics as the intersection of multimedia analysis, visual analytics, and data management)
[Jónsson et al., 2016]
The Three Gaps
• Semantic Gap [Smeulders et al., 2000]: generic data annotation based on objective understanding vs. specific context- and task-driven subjective understanding
• Pragmatic Gap [Zahálka and Worring, 2014]: predefined, fixed annotation based on understanding of the collection vs. dynamically evolving and interaction-driven understanding of collections
• Scale Gap [Jónsson et al., 2016]: pre-computed indices and bulk processing of large datasets vs. serendipitous and highly interactive sessions on small data subsets
Ten Research Questions for Scalable Multimedia Analytics [Jónsson et al., MMM 2016]
(figure: research questions organised along the dimensions volume, velocity, variety, visual, and interaction)
Building Systems?
Big Data Framework: Lambda Architecture
(figure: speed layer, service layer, batch layer, and storage layer)
[Marz and Warren, 2015]
Outline
• Motivation: Scalable multimedia analytics
• Batch Layer: Spark and 43 billion high-dim features
• Service Layer: Blackthorn and 100 million images
• Conclusion: Importance and challenges of scale!
Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin
Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark
Proceedings of the ACM Multimedia Systems Conference (MMSys), Taipei, Taiwan, June 2017
Spark Case Study: Motivation
• How can multimedia tasks harness the power of cloud computing?
  – Multimedia collections are growing
  – Computing power is abundant
• ADCFs = Hadoop || Spark
  – Automatically Distributed Computing Frameworks
  – Designed for high-throughput processing
Design Choices: ADCF = Spark
• Hadoop is not suitable (more later)
• Resilient Distributed Datasets (RDDs)
  – Transform one RDD into another via operators
  – Lazy execution
  – Master and Workers paradigm
• Supports deep pipelines
• Supports memory sharing among workers
• Lazy execution allows for optimizations
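To make the RDD model concrete, here is a minimal PySpark sketch (not from the paper; the data and names are hypothetical) showing how transformations only build a lineage graph until an action forces execution:

from pyspark import SparkContext

# Minimal, hypothetical sketch of lazy RDD transformations (illustration only).
sc = SparkContext(appName="rdd-laziness-sketch")

features = sc.parallelize([(1, [0.1, 0.2]), (2, [0.4, 0.5]), (3, [0.7, 0.8])])

# Transformations are lazy: these lines only record the lineage, nothing runs yet.
assigned = features.map(lambda kv: (kv[0] % 2, kv[1]))   # one-to-one transformation
grouped = assigned.groupByKey()                          # shuffle, still lazy

# The action below triggers the whole pipeline on the workers and
# collects the (small) result back to the master.
result = grouped.mapValues(list).collectAsMap()
print(result)   # e.g. {0: [[0.4, 0.5]], 1: [[0.1, 0.2], [0.7, 0.8]]}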
Design Choices: Application Domain
• Content-Based Image Retrieval (CBIR)
  – Well-known application
  – Two phases: Off-line & “On-line”
(figure: CBIR system takes a query image and returns search results)
Design Choices: DeCP Algorithm
Properties:
• Clustering-based
• Deep hierarchical index
• Approximate k-NN search
• Trades response time for throughput by batching
Why?
• Very simple
• Prototypical of many CBIR algorithms
• Previous Hadoop implementation facilitates comparison
DeCP as a CBIR System
• Off-line
  – Build the index hierarchy; index resides in RAM
  – Cluster the data collection; clusters reside on disk
• On-line (searching a single feature)
  – Approximate k-NN search: identify, retrieve, scan
  – Vote aggregation
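The on-line step can be illustrated with a small, self-contained Python sketch (purely hypothetical data structures; a two-level index and in-memory clusters stand in for DeCP's deeper hierarchy and on-disk clusters):

import numpy as np

def nearest(centroids, q):
    # Index of the centroid closest to the query vector q (Euclidean distance).
    return int(np.argmin(np.linalg.norm(centroids - q, axis=1)))

def decp_search(top_centroids, leaf_centroids, leaves_of_top, clusters, q, k=10):
    # 1) Identify: descend the index (here only two levels) to the nearest leaf.
    top = nearest(top_centroids, q)
    candidates = leaves_of_top[top]                       # leaf ids under that node
    leaf = candidates[nearest(leaf_centroids[candidates], q)]
    # 2) Retrieve: fetch that cluster (kept on disk in the real system).
    ids, vectors = clusters[leaf]
    # 3) Scan: exact k-NN inside the single cluster -> approximate overall result.
    order = np.argsort(np.linalg.norm(vectors - q, axis=1))[:k]
    return [ids[i] for i in order]

# Hypothetical usage with random data:
rng = np.random.default_rng(0)
top_centroids = rng.random((4, 128))
leaf_centroids = rng.random((16, 128))
leaves_of_top = {t: list(range(t * 4, t * 4 + 4)) for t in range(4)}
clusters = {l: (list(range(l * 100, l * 100 + 100)), rng.random((100, 128)))
            for l in range(16)}
print(decp_search(top_centroids, leaf_centroids, leaves_of_top, clusters,
                  rng.random(128), k=5))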
Multi-Core Parallelism: Scaling Out
(figure: indexing time on 2-24 virtual cores, relative to using 2 cores; measured wall-clock time compared with the optimal trend and a trend corrected for hyper-threaded cores)
Batch Processing: Throughput
(figure: scan wall time and CPU time per image for batches of 10 to 100,000 images, with an index of 1.6M clusters at depth L=4 over 8.1B descriptors; wall time drops to 81 ms per image at large batch sizes)
Design Choices: Feature Collection
• YLI feature corpus from YFCC100M
  – Various feature sets (visual, semantic, …)
  – 99.2M images and 0.8M videos
  – Largest dataset publicly available
• Use all 42.9 billion SIFT features!
  – Goal is to test at a very large scale
  – No feature aggregation or compression
  – Largest feature collection reported!
Research Questions
• What is the complexity of the Spark pipeline for typical multimedia-related tasks?
• How well does background processing scale as collection size and resources grow?
• How does batch size impact throughput of an offline service?
Requirements for the ADCF
• R1 Scalability: ability to scale out with additional computing power
• R2 Computational flexibility: ability to carefully balance system resources as needed
• R3 Capacity: ability to gracefully handle data that vastly exceeds main memory capacity
• R4 Updates: ability to gracefully update the data structures for dynamic workloads
• R5 Flexible pipeline: ability to easily implement variations of the indexing and/or retrieval process
• R6 Simplicity: how efficiently the programmer’s time is spent
DeCP on Hadoop
• Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines
• Conclusion = limited success
  – Scalability limited due to RAM per core
  – Two-step Map-Reduce pipeline is too rigid
    • Ex: single data source only
    • Ex: could not search multiple clusters
  – R1, R2, R3 = partially; R4 = no; R5, R6 = no
DeCP on Spark
• A very different ADCF from Hadoop
• Several advantages (see the sketch after this list)
  – Arbitrarily deep pipelines
    • Easily implement all features and functionality
  – Broadcast variables
    • Solve the RAM-per-core limitation
  – Multiple data sources
    • Ex: allows join operations for maintenance (R4)
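A minimal PySpark sketch (hypothetical, not the DeCP code) of the broadcast-variable point: the index is shipped to each worker once, instead of being serialized into every task, so a single copy can serve all the cores of a machine:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-sketch")

# Hypothetical stand-in for the cluster index that must be visible to all tasks.
index = np.random.rand(1000, 128)
index_bc = sc.broadcast(index)          # shipped once per worker, not per task

features = sc.parallelize([np.random.rand(128) for _ in range(10000)])

# Every task reads the shared index through the broadcast handle.
assignments = features.map(
    lambda v: int(np.argmin(np.linalg.norm(index_bc.value - v, axis=1))))
print(assignments.take(5))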
Spark Pipeline Symbols
• .map = one-to-one transformation
• .flatMap = one-to-any transformation
• .groupByKey = Hadoop’s Shuffle
• .reduceByKey = Hadoop’s Reduce
• .collectAsMap = collect to Master
Indexing Pipeline
(figure: the RDD operator pipeline that assigns features to clusters and builds the clustered collection)
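A condensed PySpark sketch of what such an indexing pipeline could look like (random data and a flat set of centroids stand in for the real SIFT features and hierarchical index; an assumption for illustration, not the paper's implementation):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="decp-indexing-sketch")

centroids = np.random.rand(100, 128)                  # stand-in cluster centroids
centroids_bc = sc.broadcast(centroids)                # index shared with all workers

def assign(kv):
    # Map each (feature_id, vector) to (cluster_id, (feature_id, vector)).
    fid, vec = kv
    cid = int(np.argmin(np.linalg.norm(centroids_bc.value - vec, axis=1)))
    return (cid, (fid, vec))

features = sc.parallelize([(i, np.random.rand(128)) for i in range(10000)])

# .map assigns every feature to its nearest cluster; .groupByKey gathers each
# cluster's features so the clustered collection can be written to storage.
clustered = features.map(assign).groupByKey().mapValues(list)
clustered.saveAsPickleFile("clustered_collection")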
Search Pipeline
(figure: the search pipeline shown alongside the indexing pipeline)
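And a matching sketch of the search side, continuing the hypothetical setup from the indexing sketch above (reusing sc, np, assign and clustered): each query feature is routed to its cluster, the cluster is scanned for nearest neighbours, and votes are aggregated with reduceByKey before collecting the scores at the master:

def scan(pair, k=10):
    # pair = (cluster_id, ((query_id, query_vector), [(feature_id, vector), ...]))
    _, ((qid, qvec), contents) = pair
    dists = sorted((np.linalg.norm(vec - qvec), fid) for fid, vec in contents)
    # One vote per retrieved neighbour; here the feature id simply stands in
    # for the image id that the real system would vote for.
    return [((qid, fid), 1) for _, fid in dists[:k]]

queries = sc.parallelize([(j, np.random.rand(128)) for j in range(100)])
routed = queries.map(lambda kv: (assign(kv)[0], kv))   # query -> its cluster id

results = (routed.join(clustered)     # pair each query with its cluster's contents
           .flatMap(scan)             # approximate k-NN scan inside that cluster
           .reduceByKey(lambda a, b: a + b)   # vote aggregation per (query, image)
           .collectAsMap())           # collect the final scores at the master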
Evaluation: Off-line Indexing
• Hardware: 51 AWS c3.8xl nodes
  – 800 real cores + 800 virtual cores
  – 2.8 TB of RAM and 30 TB of SSD space
• Indexing time as collection grows:

  Features (billions) | Indexing time (seconds) | Scaling (relative)
  8.5                 | 3,287                   | –
  17.2                | 5,030                   | 1.53
  26.0                | 11,943                  | 3.63
  34.5                | 14,192                  | 4.31
  42.9                | 19,749                  | 6.00
Evaluation: “On-line” Search
• Throughput with batching
(figure: search throughput as a function of batch size, with the Hadoop implementation’s throughput limit marked for comparison)
Summary

                                Spark                       Hadoop
  R1 Scalability                Yes                         Partial (RAM per core)
  R2 Computational flexibility  Yes                         Partial
  R3 Capacity                   Yes                         Partial
  R4 Updates                    Partial (full re-shuffle)   No
  R5 Flexible pipeline          Yes                         No
  R6 Simplicity                 Yes                         No

• Demonstration web site under development
Largest Collections in the Literature...
1. Guðmundsson, Amsaleg, Jónsson, Franklin (2017) – 42.9 billion SIFTs – 51 servers
2. Moise, Shestakov, Guðmundsson, Amsaleg (2013) – 30.2 billion SIFTs – 108 servers
3. Lejsek, Jónsson, Amsaleg (2017?) – 28.5 billion SIFTs – 1 server
4. Lejsek, Jónsson, Amsaleg (2011) – 2.5 billion SIFTs – 1 server
5. Jégou et al. (2011) – 2 billion ... – 1 server
6. Sun et al. (2011) – 1.5 billion ... – 10 servers
Role of Spark
(Lambda Architecture figure: speed layer, service layer, batch layer, storage layer; Spark fills the batch layer)
Outline
• Motivation: Scalable multimedia analytics
• Batch Layer: Spark and 43 billion high-dim features
• Service Layer: Blackthorn and 100 million images
• Conclusion: Importance and challenges of scale!
Framework: Lambda Architecture
(figure: speed layer, service layer, batch layer, and storage layer)
Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Blackthorn: Large-Scale Interactive Multimodal Learning
Accepted to IEEE Transactions on Multimedia

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Interactive Multimodal Learning on 100 Million Images
Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June 2016
Multimedia Analytics Process [Zahálka and Worring, 2014; Keim et al., 2010]