Towards Scalable Multimedia Analytics
Björn Þór Jónsson
Data Systems Group, Computer Science Department, IT University of Copenhagen
Today’s Media Collections
• Massive and growing
  – Europeana > 50 million items
  – DeviantArt > 250 million items (160K/day)
  – Facebook > 1,000 billion items (200M/day)
• Variety of users and applications
  – Novices → enthusiasts → scholars → experts
  – Current systems aimed at helping experts
• Need for understanding and insights
Media Tasks
(figure: the spectrum of media tasks, from exploration to search)
Media Tasks [Zahálka and Worring, 2014]
Multimedia Analytics
(figure: multimedia analytics as the intersection of multimedia analysis and visual analytics)
[Zahálka and Worring, 2014]
From Data to Insight [Zahálka and Worring, 2014; Keim et al., 2010]
The Two Gaps
• Semantic Gap [Smeulders et al., 2000]: generic data annotation based on objective understanding vs. specific context- and task-driven subjective understanding
• Pragmatic Gap [Zahálka and Worring, 2014]: predefined, fixed annotation based on understanding of the collection vs. dynamically evolving and interaction-driven understanding of collections
Multimedia Analytics: State of the Art
• Theory is developing
• Early systems have appeared
• No real-life applications (?)
• Small collections only
Scalable Multimedia Analytics
(figure: scalable multimedia analytics as the intersection of multimedia analysis, visual analytics, and data management)
[Jónsson et al., 2016]
The Three Gaps
• Semantic Gap [Smeulders et al., 2000]: generic data annotation based on objective understanding vs. specific context- and task-driven subjective understanding
• Pragmatic Gap [Zahálka and Worring, 2014]: predefined, fixed annotation based on understanding of the collection vs. dynamically evolving and interaction-driven understanding of collections
• Scale Gap [Jónsson et al., 2016]: pre-computed indices and bulk processing of large datasets vs. serendipitous and highly interactive sessions on small data subsets
Ten Research Questions for Scalable Multimedia Analytics [Jónsson et al., MMM 2016]
(figure: research questions organised along the dimensions volume, velocity, variety, visual, and interaction)
Building Systems?
Big Data Framework: Lambda Architecture
(figure: speed layer, service layer, batch layer, and storage layer)
[Marz and Warren, 2015]
Outline
• Motivation: Scalable multimedia analytics
• Batch Layer: Spark and 43 billion high-dim features
• Service Layer: Blackthorn and 100 million images
• Conclusion: Importance and challenges of scale!
Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin
Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark
Proceedings of the ACM Multimedia Systems Conference (MMSys), Taipei, Taiwan, June 2017
Spark Case Study: Motivation
• How can multimedia tasks harness the power of cloud computing?
  – Multimedia collections are growing
  – Computing power is abundant
• ADCFs = Hadoop || Spark
  – Automatically Distributed Computing Frameworks
  – Designed for high-throughput processing
Design Choices: ADCF = Spark
• Hadoop is not suitable (more later)
• Resilient Distributed Datasets (RDDs)
  – Transform one RDD into another via operators
  – Lazy execution
  – Master and Workers paradigm
• Supports deep pipelines
• Supports memory sharing among workers
• Lazy execution allows for optimizations
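To make the RDD model concrete, here is a minimal PySpark sketch (not from the paper; the data and names are hypothetical) showing how transformations only build a lineage graph until an action forces execution:

from pyspark import SparkContext

# Minimal, hypothetical sketch of lazy RDD transformations (illustration only).
sc = SparkContext(appName="rdd-laziness-sketch")

features = sc.parallelize([(1, [0.1, 0.2]), (2, [0.4, 0.5]), (3, [0.7, 0.8])])

# Transformations are lazy: these lines only record the lineage, nothing runs yet.
assigned = features.map(lambda kv: (kv[0] % 2, kv[1]))   # one-to-one transformation
grouped = assigned.groupByKey()                          # shuffle, still lazy

# The action below triggers the whole pipeline on the workers and
# collects the (small) result back to the master.
result = grouped.mapValues(list).collectAsMap()
print(result)   # e.g. {0: [[0.4, 0.5]], 1: [[0.1, 0.2], [0.7, 0.8]]}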
Design Choices: Application Domain
• Content-Based Image Retrieval (CBIR)
  – Well-known application
  – Two phases: Off-line & “On-line”
(figure: CBIR system takes a query image and returns search results)
Design Choices: DeCP Algorithm
Properties:
• Clustering-based
• Deep hierarchical index
• Approximate k-NN search
• Trades response time for throughput by batching
Why?
• Very simple
• Prototypical of many CBIR algorithms
• Previous Hadoop implementation facilitates comparison
DeCP as a CBIR System
• Off-line
  – Build the index hierarchy; index resides in RAM
  – Cluster the data collection; clusters reside on disk
• On-line (searching a single feature)
  – Approximate k-NN search: identify, retrieve, scan
  – Vote aggregation
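The on-line step can be illustrated with a small, self-contained Python sketch (purely hypothetical data structures; a two-level index and in-memory clusters stand in for DeCP's deeper hierarchy and on-disk clusters):

import numpy as np

def nearest(centroids, q):
    # Index of the centroid closest to the query vector q (Euclidean distance).
    return int(np.argmin(np.linalg.norm(centroids - q, axis=1)))

def decp_search(top_centroids, leaf_centroids, leaves_of_top, clusters, q, k=10):
    # 1) Identify: descend the index (here only two levels) to the nearest leaf.
    top = nearest(top_centroids, q)
    candidates = leaves_of_top[top]                       # leaf ids under that node
    leaf = candidates[nearest(leaf_centroids[candidates], q)]
    # 2) Retrieve: fetch that cluster (kept on disk in the real system).
    ids, vectors = clusters[leaf]
    # 3) Scan: exact k-NN inside the single cluster -> approximate overall result.
    order = np.argsort(np.linalg.norm(vectors - q, axis=1))[:k]
    return [ids[i] for i in order]

# Hypothetical usage with random data:
rng = np.random.default_rng(0)
top_centroids = rng.random((4, 128))
leaf_centroids = rng.random((16, 128))
leaves_of_top = {t: list(range(t * 4, t * 4 + 4)) for t in range(4)}
clusters = {l: (list(range(l * 100, l * 100 + 100)), rng.random((100, 128)))
            for l in range(16)}
print(decp_search(top_centroids, leaf_centroids, leaves_of_top, clusters,
                  rng.random(128), k=5))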
Multi-Core Parallelism: Scaling Out
(figure: indexing time on 2-24 virtual cores, relative to using 2 cores; measured wall-clock time compared with the optimal trend and a trend corrected for hyper-threaded cores)
Batch Processing: Throughput
(figure: scan wall time and CPU time per image for batches of 10 to 100,000 images, with an index of 1.6M clusters at depth L=4 over 8.1B descriptors; wall time drops to 81 ms per image at large batch sizes)
Design Choices: Feature Collection
• YLI feature corpus from YFCC100M
  – Various feature sets (visual, semantic, …)
  – 99.2M images and 0.8M videos
  – Largest dataset publicly available
• Use all 42.9 billion SIFT features!
  – Goal is to test at a very large scale
  – No feature aggregation or compression
  – Largest feature collection reported!
Research Questions
• What is the complexity of the Spark pipeline for typical multimedia-related tasks?
• How well does background processing scale as collection size and resources grow?
• How does batch size impact throughput of an offline service?
Requirements for the ADCF
• R1 Scalability: ability to scale out with additional computing power
• R2 Computational flexibility: ability to carefully balance system resources as needed
• R3 Capacity: ability to gracefully handle data that vastly exceeds main memory capacity
• R4 Updates: ability to gracefully update the data structures for dynamic workloads
• R5 Flexible pipeline: ability to easily implement variations of the indexing and/or retrieval process
• R6 Simplicity: how efficiently the programmer’s time is spent
DeCP on Hadoop
• Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines
• Conclusion = limited success
  – Scalability limited due to RAM per core
  – Two-step Map-Reduce pipeline is too rigid
    • Ex: single data source only
    • Ex: could not search multiple clusters
  – R1, R2, R3 = partially; R4 = no; R5, R6 = no
DeCP on Spark
• A very different ADCF from Hadoop
• Several advantages (see the sketch after this list)
  – Arbitrarily deep pipelines
    • Easily implement all features and functionality
  – Broadcast variables
    • Solve the RAM-per-core limitation
  – Multiple data sources
    • Ex: allows join operations for maintenance (R4)
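A minimal PySpark sketch (hypothetical, not the DeCP code) of the broadcast-variable point: the index is shipped to each worker once, instead of being serialized into every task, so a single copy can serve all the cores of a machine:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-sketch")

# Hypothetical stand-in for the cluster index that must be visible to all tasks.
index = np.random.rand(1000, 128)
index_bc = sc.broadcast(index)          # shipped once per worker, not per task

features = sc.parallelize([np.random.rand(128) for _ in range(10000)])

# Every task reads the shared index through the broadcast handle.
assignments = features.map(
    lambda v: int(np.argmin(np.linalg.norm(index_bc.value - v, axis=1))))
print(assignments.take(5))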
Spark Pipeline Symbols
• .map = one-to-one transformation
• .flatMap = one-to-any transformation
• .groupByKey = Hadoop’s Shuffle
• .reduceByKey = Hadoop’s Reduce
• .collectAsMap = collect to Master
Indexing Pipeline
(figure: the RDD operator pipeline that assigns features to clusters and builds the clustered collection)
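A condensed PySpark sketch of what such an indexing pipeline could look like (random data and a flat set of centroids stand in for the real SIFT features and hierarchical index; an assumption for illustration, not the paper's implementation):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="decp-indexing-sketch")

centroids = np.random.rand(100, 128)                  # stand-in cluster centroids
centroids_bc = sc.broadcast(centroids)                # index shared with all workers

def assign(kv):
    # Map each (feature_id, vector) to (cluster_id, (feature_id, vector)).
    fid, vec = kv
    cid = int(np.argmin(np.linalg.norm(centroids_bc.value - vec, axis=1)))
    return (cid, (fid, vec))

features = sc.parallelize([(i, np.random.rand(128)) for i in range(10000)])

# .map assigns every feature to its nearest cluster; .groupByKey gathers each
# cluster's features so the clustered collection can be written to storage.
clustered = features.map(assign).groupByKey().mapValues(list)
clustered.saveAsPickleFile("clustered_collection")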
Search Pipeline
(figure: the search pipeline shown alongside the indexing pipeline)
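And a matching sketch of the search side, continuing the hypothetical setup from the indexing sketch above (reusing sc, np, assign and clustered): each query feature is routed to its cluster, the cluster is scanned for nearest neighbours, and votes are aggregated with reduceByKey before collecting the scores at the master:

def scan(pair, k=10):
    # pair = (cluster_id, ((query_id, query_vector), [(feature_id, vector), ...]))
    _, ((qid, qvec), contents) = pair
    dists = sorted((np.linalg.norm(vec - qvec), fid) for fid, vec in contents)
    # One vote per retrieved neighbour; here the feature id simply stands in
    # for the image id that the real system would vote for.
    return [((qid, fid), 1) for _, fid in dists[:k]]

queries = sc.parallelize([(j, np.random.rand(128)) for j in range(100)])
routed = queries.map(lambda kv: (assign(kv)[0], kv))   # query -> its cluster id

results = (routed.join(clustered)     # pair each query with its cluster's contents
           .flatMap(scan)             # approximate k-NN scan inside that cluster
           .reduceByKey(lambda a, b: a + b)   # vote aggregation per (query, image)
           .collectAsMap())           # collect the final scores at the master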
Evaluation: Off-line Indexing
• Hardware: 51 AWS c3.8xl nodes
  – 800 real cores + 800 virtual cores
  – 2.8 TB of RAM and 30 TB of SSD space
• Indexing time as collection grows:

  Features (billions) | Indexing time (seconds) | Scaling (relative)
  8.5                 | 3,287                   | –
  17.2                | 5,030                   | 1.53
  26.0                | 11,943                  | 3.63
  34.5                | 14,192                  | 4.31
  42.9                | 19,749                  | 6.00
Evaluation: “On-line” Search
• Throughput with batching
(figure: search throughput as a function of batch size, with the Hadoop implementation’s throughput limit marked for comparison)
Summary

                                Spark                       Hadoop
  R1 Scalability                Yes                         Partial (RAM per core)
  R2 Computational flexibility  Yes                         Partial
  R3 Capacity                   Yes                         Partial
  R4 Updates                    Partial (full re-shuffle)   No
  R5 Flexible pipeline          Yes                         No
  R6 Simplicity                 Yes                         No

• Demonstration web site under development
Largest Collections in the Literature...
1. Guðmundsson, Amsaleg, Jónsson, Franklin (2017) – 42.9 billion SIFTs – 51 servers
2. Moise, Shestakov, Guðmundsson, Amsaleg (2013) – 30.2 billion SIFTs – 108 servers
3. Lejsek, Jónsson, Amsaleg (2017?) – 28.5 billion SIFTs – 1 server
4. Lejsek, Jónsson, Amsaleg (2011) – 2.5 billion SIFTs – 1 server
5. Jégou et al. (2011) – 2 billion ... – 1 server
6. Sun et al. (2011) – 1.5 billion ... – 10 servers
Role of Spark
(Lambda Architecture figure: speed layer, service layer, batch layer, storage layer; Spark fills the batch layer)
Outline
• Motivation: Scalable multimedia analytics
• Batch Layer: Spark and 43 billion high-dim features
• Service Layer: Blackthorn and 100 million images
• Conclusion: Importance and challenges of scale!
Framework: Lambda Architecture
(figure: speed layer, service layer, batch layer, and storage layer)
Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Blackthorn: Large-Scale Interactive Multimodal Learning
Accepted to IEEE Transactions on Multimedia

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Interactive Multimodal Learning on 100 Million Images
Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June 2016
Multimedia Analytics Process [Zahálka and Worring, 2014; Keim et al., 2010]