CS535 Big Data 1/22/2020 Sangmi Lee Pallickara
CS535 Big Data | Computer Science Department | Colorado State University
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University

CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
1. INTRODUCTION TO BIG DATA
What is Big Data?
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

Big Data
• Things one can do at a large scale that cannot be done at a smaller one
  • To extract new insights
  • To create new forms of value
• Big Data is about analytics of huge quantities of data in order to infer probabilities
• Big Data is NOT about trying to “teach” a computer to “think” like humans
• Providing a quantitative dimension it never had before

The three (or four) Vs in Big Data
• Volume
  • Voluminous
  • It does not have to be a certain number of petabytes or any fixed quantity.
• Velocity
  • How fast is the data coming in?
  • How fast do you need to be able to analyze and utilize it?
• Variety
  • Number of sources or incoming vectors
• Veracity
  • Can you trust the data itself, the source of the data, or the process?
  • User entry errors, redundancy, corruption of the values
  • Data cleaning

Who is using Big Data?
Photo Credit: https://datafloq.com/read/car-manufacturers-are-using-big-data/1204

Connected cars
• A single plug-in hybrid car generates up to 25 gigabytes of data per hour
• Connected cars
  • $130 billion
• Traffic problems: re-routing based on the volume of traffic
• Alerts the driver when road conditions are hazardous by automatically activating the anti-lock brakes
  • This information is shared by the vehicles that are nearby

The Artemis project: Saving “preemies” using Big Data
• The Artemis project
  • Dr. Carolyn McGregor
  • Toronto’s Hospital for Sick Children, University of Ontario Institute of Technology and IBM
• Captures and processes the patients’ data in real time
  • 16 different data streams
  • Heart rate, respiration rate, temperature, blood pressure and blood oxygen level
  • Around 1,260 data points per second
• The system detects subtle changes that may signal the onset of infection 24 hours before overt symptoms appear
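To put the rates quoted for the two examples above in perspective, here is a minimal back-of-the-envelope sketch in Python. It only multiplies the figures from the slides (around 1,260 data points per second for Artemis, up to 25 gigabytes per hour for a connected car). The function names and the 1.5-hour driving time are hypothetical, chosen for illustration, and the car figure should be read as an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def artemis_points_per_day(points_per_second: int = 1260) -> int:
    """Data points collected per patient per day at the quoted rate.

    Assumes the rate of about 1,260 points per second is sustained all day.
    """
    return points_per_second * SECONDS_PER_DAY


def car_gigabytes(hours_driven: float, gb_per_hour: float = 25.0) -> float:
    """Upper bound on the data generated by one connected car.

    The slides quote "up to 25 gigabytes per hour", so this is a ceiling,
    not an average. hours_driven is a hypothetical input.
    """
    return hours_driven * gb_per_hour


if __name__ == "__main__":
    print(artemis_points_per_day())  # 108864000
    print(car_gigabytes(1.5))        # 37.5
```

At the quoted rate, one patient produces roughly 109 million data points per day, which shows why velocity matters as much as volume in these use cases.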
Look Who’s Peeking at Your Paycheck
• Experian’s Income Insight
  • Estimates people’s income level
  • Based on their credit history
  • Trains the estimation model using selected credit history and tax information from the IRS
KAREN BLUMENTHAL, “Look Who’s Peeking at Your Paycheck”, The Wall Street Journal, Jan. 13, 2010, http://www.wsj.com/articles/SB10001424052748703672104574654211904801106

Related research areas
• Storage systems
  • How can we efficiently resolve queries on massive amounts of input data?
  • The input dataset may be presented in the form of a distributed data stream
• Machine learning
  • How can we efficiently solve large-scale machine learning problems?
  • The input data may be massive and stored in a distributed cluster of machines
• Distributed computing
  • How can we efficiently solve large-scale optimization problems in distributed computing environments?
  • For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large-scale graphs?

CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
2. COURSE INTRODUCTION
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

Big Data Lab at Colorado State University
• Director: Sangmi Pallickara
• Algorithmic and systems design
• Scalable analytics over voluminous datasets on complex distributed architectures
• Research has been deployed in the following domains
  • Precision agriculture, atmospheric science, environmental biology, ecology, civil engineering, bioinformatics, and public health

People
• Sangmi Pallickara, Saptashwa Mitra, Walid Budgaga, Dan Rammer
• Sam Armstrong
• Laksheen Mendis
• Undergraduate researchers at CURC
• Ryan Becwar
• Kevin Brewwiler
• Caleb Carlson, Kartik Khurana, Aaron Pereira, Paahuni Khandelwal

Big Data Lab at Colorado State University
• Awards
  • Cochran Family Professorship 2018-2021
  • IEEE TCSC Award for Excellence in Scalable Computing (Mid-Career Researcher) 2018
  • National Science Foundation CAREER Award 2016
• Funded by
  • The National Science Foundation
  • The Advanced Research Projects Agency-Energy (Department of Energy)
  • Department of Homeland Security
  • The Environmental Defense Fund
  • Google, Amazon, and Hewlett Packard
Goal of this course
• Understanding fundamental concepts in Big Data Analytics
  • Computing Systems + Scalable Algorithms and Models
• Learn about existing technologies and how to apply them
[Figure: two groups. Computing systems: specialized modeling tools, computing frameworks, storage systems and middleware. Algorithms and models: graph models, predictive models, analytics.]

Communications [1/2]
• Course Website
  • http://www.cs.colostate.edu/~cs535
  • Announcements: Check the course website at least twice a week.
  • Schedule (course materials, readings, assignments)
  • Policies
• Canvas
  • Assignment submission
  • Grades
• Piazza
  • Discussion board

Communications [2/2]
• Contact Me
  • sangmi@colostate.edu
  • Office hour: Friday 10:00AM ~ 11:00AM and by appointment
  • Office: CSB456
  • URL: http://www.cs.colostate.edu/~sangmi
• Research Group Meeting
  • When: 1:30-2:30pm Fridays
  • Where: CSB305
• Contact GTAs
  • Paahuni Khandelwal
  • Mohamed Chaabane
  • Office hours (in CSB120 and online office hours TBA)

Course Structure
• GEAR V: Algorithmic Techniques for Big Data: Week 13, 14
• GEAR IV: Large Scale Recommendation Systems and Social Media: Week 11, 12
• GEAR III: Big Graph Analysis: Week 9, 10
• GEAR II: Machine Learning for Big Data: Week 7, 8
• GEAR I: Peta-scale Storage Systems: Week 5, 6
• Big Data Technology: Week 1 ~ Week 4

Course Structure | Part A: Big Data Technology
• Week 1 ~ Week 4
• Big Data Technology
• Goals
  • Understand concepts of Big Data computing environment
  • Hands-on experience
• Topics
  • Introduction to Big Data
  • Lambda Model
  • Quick view of MapReduce
  • Introduction to Apache Spark
  • Analytics with Apache Storm

Course Structure | Part B: GEAR Sessions
• What is the GEAR Session?
  • Guided Exploration for Big Data Analytics Research
• Purposes
  • Guided learning environment for advanced research topics in Big Data
  • Understanding different aspects of Big Data research with lectures and discussions
• Sessions
  • Session I. Peta-scale Storage Systems
  • Session II. Machine Learning for Big Data
  • Session III. Big Graph Analysis
  • Session IV. Large Scale Recommendation Systems and Social Media
  • Session V. Algorithmic Techniques for Big Data
• Duration: 2 weeks/session
  • Up to 3 lectures
  • 1 student-led research discussion