Big Data Management and NoSQL Databases Lecture 12 PD Dr. Andreas Behrend behrend@cs.uni-bonn.de Acknowledgements I am indebted to Prof. Dr.-Ing. Sebastian Michel, Prof. Johan Gamper, and Dr. Holubova for providing me slides.
What is Big Data? buzzword? bubble? gold rush? revolution? “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” Dan Ariely
What is Big Data? • No standard definition • First occurrence of the term: High Performance Computing (HPC) • The 3 (4, 5) Vs: Volume, Variety, Velocity • Gartner: “Big Data” is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.
http://www.ibmbigdatahub.com/ What is Big Data? • Mobile devices (tracking all objects all the time) • Sensor technology and networks (measuring all kinds of data) • Social media and networks (all of us are generating data) • Scientific instruments (collecting all sorts of data) • IBM: Depending on the industry and organization, Big Data encompasses information from internal and external sources such as transactions, social media, enterprise content, sensors, and mobile devices. Companies can leverage data to adapt their products and services to better meet customer needs, optimize operations and infrastructure, and find new sources of revenue.
http://www.ibmbigdatahub.com/ Big Data Characteristics: Volume (Scale) • Data volume is increasing exponentially, not linearly (figure scale labels: 10^9, 10^12, 10^18, 10^21)
http://www.ibmbigdatahub.com/ Big Data Characteristics: Variety (Complexity) • Various formats, types, and structures (from semi-structured XML to unstructured multimedia) • Static data vs. streaming data
http://www.ibmbigdatahub.com/ Big Data Characteristics: Velocity (Speed) • Data is being generated fast and needs to be processed fast • Online Data Analytics
http://www.ibmbigdatahub.com/ Big Data Characteristics: Veracity (Uncertainty) • Uncertainty due to inconsistency, incompleteness, latency, ambiguities, or approximations.
Some Numbers as of 2015 • Units: MB = 10^6 Bytes, GB = 10^9 Bytes, TB (Terabyte) = 10^12 Bytes, PB (Petabyte) = 10^15 Bytes, EB (Exabyte) = 10^18 Bytes • Estimated Size of Data: – Google: 15 000 PB (= 15 Exabytes) – Facebook: 300 PB – eBay: 90 PB – Spotify: 10 PB • Data Processed per Day: – Google: 100 PB – eBay: 100 PB – NSA: 29 PB – Facebook: 600 TB – Twitter: 100 TB – Spotify: 2.2 TB
What Does Data Look Like? • Not necessarily what you are used to from database lectures: usually not nicely structured (BCNF or 3NF) relations with known schema information. • But: – Twitter Tweets – Server Access Logs – Web Pages – Web Graph – Huge CSV files in general (e.g., holding a “relation”)
How to store or analyse such Data? {"created_at":"Wed Jan 21 15:21:04 +0000 2015","id":557920823764586496,"id_str":"557920823764586496","text":"#TulsaAirport #Oklahoma Jan 21 08:53 Temperature 37\u00b0F clouds Wind NW 7 km\/h Humidity 85% .. http:\/\/t.co\/SnC8ST3gQC","source":"\u003ca href=\"http:\/\/www.woweather.com\/USA\/TulsaIAP.htm\" rel=\"nofollow\"\u003eupdate weather tulsa\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":255167921,"id_str":"255167921","name":"Weather Tulsa","screen_name":"wo_tulsa","location":"Tulsa","url":"http:\/\/itunes.apple.com\/app\/weatheronline\/id299504833?mt=8","description":"Weather Tulsa\n\nhttp:\/\/www.woweather.com\/USA\/Tulsa.htm","protected":false,"verified":false,"followers_count":111,"friends_count":60,"listed_count":5,"favourites_count":0,"statuses_count":33805,"created_at":"Sun Feb 20 20:31:42 +0000 2011","utc_offset":7200,"time_zone":"Athens","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/1249942071\/WO-20px-linien_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/1249942071\/WO-20px-linien_normal.png","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"TulsaAirport","indices":[0,13]},{"text":"Oklahoma","indices":[14,23]}],"trends":[],"urls":[{"url":"http:\/\/t.co\/SnC8ST3gQC","expanded_url":"http:\/\/bit.ly\/188eNcw","display_url":"bit.ly\/188eNcw","indices":[93,115]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1421853664710"}
{"created_at":"Wed Jan 21 15:21:04 +0000 2015","id":557920823877464064,"id_str":"557920823877464064","text":"Anime episode updated: Kyoukai no
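As one possible first step towards analysing such records, each line can be parsed as JSON and reduced to the few fields of interest. The sketch below is only an illustration: `raw_line` is a hand-trimmed copy of the first tweet above (most fields dropped), and `summarize_tweet` is a hypothetical helper name, not part of any Twitter library.

```python
import json

# Hand-trimmed version of the first tweet above; only a few of its fields are kept.
raw_line = (
    '{"id": 557920823764586496, '
    '"text": "#TulsaAirport #Oklahoma Jan 21 08:53 Temperature 37\\u00b0F", '
    '"user": {"screen_name": "wo_tulsa", "followers_count": 111}, '
    '"entities": {"hashtags": [{"text": "TulsaAirport"}, {"text": "Oklahoma"}]}}'
)


def summarize_tweet(line):
    """Parse one JSON record and keep a small, flat subset of its fields."""
    tweet = json.loads(line)
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return {
        "id": tweet["id"],
        "user": tweet["user"]["screen_name"],
        "hashtags": [h["text"] for h in hashtags],
        "text": tweet["text"],
    }


summary = summarize_tweet(raw_line)
print(summary["user"], summary["hashtags"])
```

In a file with one tweet per line, the same function could be applied line by line, so the whole file never has to fit in memory at once.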
Processing Big Data • OLTP: Online Transaction Processing (DBMSs) – Database applications – Storing, querying, multiuser access • OLAP: Online Analytical Processing (Data Warehousing) – Answer multi-dimensional analytical queries – Financial/marketing reporting, budgeting, forecasting, … • RTAP: Real-Time Analytic Processing (Big Data Architecture & Technology) – Data gathered and processed in a real-time streaming fashion – Real-time data queried and presented in an online fashion – Real-time and historical data combined and mined interactively
http://e-theses.imtlucca.it/34/ Key Big Data-Related Technologies • Distributed file systems • NoSQL databases • Grid computing, cloud computing • MapReduce and other new paradigms • Large scale machine learning
Relational Database Management Systems (RDBMSs) • Predominant technology for storing structured data • Established query languages, e.g. SQL, RA • Often thought of as the only option for data storage • Persistence, concurrency control, consistency control, … • Alternatives: Object databases or XML stores – Never gained the same adoption and market share
Why Distributed File Systems? • Assume you have 10 TB of data on disk • Now, do some analysis of it • With a 100 MB/s disk, reading alone takes – 100 000 seconds – ≈ 1667 minutes – ≈ 27.8 hours
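The read-time estimate above is simple arithmetic and can be checked with a short sketch. The function names are illustrative. The parallel variant is an idealized assumption that is not on the slide: the data is split evenly across disks that are read at the same time, with no network or coordination overhead.

```python
TB = 10**12  # bytes, decimal units as used in the lecture
MB = 10**6   # bytes


def sequential_read_seconds(data_bytes, throughput_bytes_per_s):
    """Time to read all data from a single disk at a constant throughput."""
    return data_bytes / throughput_bytes_per_s


def parallel_read_seconds(data_bytes, throughput_bytes_per_s, n_disks):
    """Idealized assumption: data split evenly over n_disks read in parallel."""
    return sequential_read_seconds(data_bytes, throughput_bytes_per_s) / n_disks


seconds = sequential_read_seconds(10 * TB, 100 * MB)
minutes = seconds / 60
hours = seconds / 3600
print(f"single disk: {seconds:.0f} s = {minutes:.0f} min = {hours:.1f} h")

# Hypothetical cluster of 100 disks under the idealized assumption above.
parallel = parallel_read_seconds(10 * TB, 100 * MB, n_disks=100)
print(f"100 disks in parallel: {parallel:.0f} s")
```

Real systems will not reach the idealized figure, but the gap between a single disk and many disks is what motivates spreading data over many machines.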
Need to do something about it http://www.google.com/about/datacenter http://flickr.com/photos/jurvetson/157722937/
Scale-up vs Scale-out • Scale-Up (vertical scaling): – More RAM – More CPU – More HDD • Scale-Out (horizontal scaling): – Same hardware, connected by network
Data Centers source: http://www.google.com/about/datacenters/inside/index.html
Hardware Failures • Lots of machines (commodity hardware): failure is not an exception but very common • P[machine fails today] = 1/365 • n machines: P[failure of at least 1 machine] = 1 - (1 - P[machine fails today])^n – for n=1: 0.0027 – for n=10: 0.02706 – for n=100: 0.239 – for n=1000: 0.9356 – for n=10 000: ~ 1.0 source: google.com
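The failure probabilities above follow directly from the formula and can be recomputed with a few lines. This is a minimal sketch that assumes, as the slide implicitly does, that machines fail independently of each other. The function name is illustrative.

```python
def p_at_least_one_failure(n_machines, p_single=1 / 365):
    """P[at least one of n machines fails today], assuming independent failures."""
    return 1 - (1 - p_single) ** n_machines


for n in (1, 10, 100, 1000, 10_000):
    print(f"n={n:>6}: {p_at_least_one_failure(n):.4f}")
```

The printed values may differ from the slide in the last digit because of rounding. The key observation is unchanged: with a thousand machines, at least one failure per day is almost certain.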
Fallacies of Distributed Computing 1. The network is reliable 2. Latency is zero 3. Bandwidth is infinite 4. The network is secure 5. Topology doesn't change 6. There is one administrator 7. Transport cost is zero 8. The network is homogeneous source: Peter Deutsch and others at Sun
Failure Handling & Recovery • Hardware failures happen virtually at any time • Algorithms/Infrastructures have to compensate • Issues in distributed computing: • Replication of data • Logging of state • Redundancy in task execution