Big Data Max Kemman University of Luxembourg October 19, 2015 - PowerPoint PPT Presentation

Big Data Max Kemman University of Luxembourg October 19, 2015 Doing Digital History: Introduction to Tools and Technology

Recap from last time What is a digital library or archive? How are sources digitised? How can we search the digital archive? Can we research the digital library or archive as a whole?

Today • Are digital libraries big data? • N=ALL • Messy data • From causality to correlation • Radical contextualisation • Next time

Are digital libraries big data? Last week we discussed digital libraries/archives Europeana contains about 32M digital objects Is this big data?

What is "data" anyway? Term has rhetorical function: "that which is given prior to argument" (Gitelman, 2014) Common description: "raw data" But creating data requires a vast amount of work (as we saw last week) Interpretive work goes into creating data

What is "big data"? Metaphors used to describe big data give different interpretations (Awati & Shum, 2014) • Food: raw or cooked • Resource: oil, gold • Liquid: ocean, tsunami

What is "big data" anyway? 'Classic' definition by V's: • Volume: size • Velocity: accumulation • Variety: heterogeneous Another definition: too much data to handle

Is this new? Andrew Prescott (2015): • Domesday book • US Census 1890

What is "big data" anyway? What is the difference between "lots of data" and "big data"? (Lagoze, 2014) • "Large" is historical: computers change • Big data makes us rethink what science is

Are digital libraries big data? Or, does History have big data? From the definitions so far: • Size: not so much (compared to CERN) • Velocity: not so much • Variety: yes! • Too much data to handle: probably • Makes us rethink what science is: maybe Some say History/Humanities do not have big data

Why is big data interesting? But why are we concerned with big data, and not with particle physics? (Wallach, 2014) Two reasons: • Social: big data are about people • Granularity: individual people and their activities Here maybe History/Humanities do have an interest in big data

Big data is a big topic Another definition of big data (Mayer-Schönberger & Cukier, 2014) • N=ALL • Messy • From causality to correlation Let's discuss these features

N=ALL "N" refers to the number of observations in a sample Sample: a group that represents the entire population So N=ALL refers to measuring everything, rather than a representative smaller group

All historical sources? A difference between "a lot of data" and "all data" Remember Rosenzweig from week 1: The injunction of traditional historians to look at “everything” cannot survive in a digital era in which “everything” has survived Rosenzweig (2003)

Is size that interesting? If big data is merely a quantitative difference, what's the interest? But a quantitative difference can lead to a qualitative difference (Mayer-Schönberger, 2014)

Longue durée Rather than focusing on a very short timespan, see development over ages

Messy data Big data has Variety A heterogeneous dataset • Different data-types • Different variables Too much data to manually check

Can we use messy data? Mayer-Schönberger & Cukier: size makes up for messiness Exactness is from the age of sparse information The noise can be smoothed out
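The claim that size makes up for messiness can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true value, the noise level and the sample sizes are invented for illustration, and the smoothing only works when the noise is random rather than systematic.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_VALUE = 100.0       # hypothetical quantity we try to measure
NOISE_SD = 5.0           # hypothetical spread of the measurement error

def noisy_mean(n):
    """Average n measurements, each distorted by random (unbiased) noise."""
    return sum(rng.gauss(TRUE_VALUE, NOISE_SD) for _ in range(n)) / n

small_sample = noisy_mean(10)
large_sample = noisy_mean(100_000)

# Print how far each average is from the true value.
print(abs(small_sample - TRUE_VALUE), abs(large_sample - TRUE_VALUE))
```

If every measurement were off in the same direction, a larger sample would not help: the average would simply settle on the wrong value.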

Crowdsourcing One way of trying to get someone to look at the data Need to trust anonymous people

Does big data reflect the world? With N=ALL, big data = reality, right? But (big) data incorporates choices of what to measure Twitter/Facebook are biased reflections of the world

How big data is 'unfair' The average person is a fiction Hitchcock: it is the exceptions we are interested in!

Looking at the exceptions Wallach agrees: use the granularity of big data to study minorities & exceptions How do we discover the minorities & exceptions of interest? To repeat: we cannot look at all cases individually Some statistical analysis is required

From causality to correlation Correlation: two variables show a statistical relation • Positive: when A increases, B increases • Negative: when A increases, B decreases Causation: one variable explains the second • Example: when it rains, more people take umbrellas with them
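A positive and a negative correlation can be made concrete with the Pearson correlation coefficient, which runs from -1 (perfectly negative) to +1 (perfectly positive). The sketch below uses invented numbers for rain, umbrellas and sunglasses; the data and variable names are assumptions for illustration only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical observations over five days.
rain_hours = [0, 1, 2, 3, 4]
umbrellas = [2, 5, 9, 12, 16]     # rises with rain
sunglasses = [20, 15, 11, 6, 1]   # falls with rain

positive = pearson(rain_hours, umbrellas)    # close to +1
negative = pearson(rain_hours, sunglasses)   # close to -1
print(positive, negative)
```

The coefficient only measures how strongly the two series move together; it says nothing about which one, if either, explains the other.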

Correlation found A nice example is Google Flu Trends: • Took flu data from national health center for number of years • Investigated which keyword searches occurred shortly before or during flu outbreaks • Use keyword searches to predict outbreak of flu

Correlation and causation Important to remember: correlation does not equal causation The keyword searches do not cause the flu! Sometimes you don't know which variable comes first Maybe a third variable explains the two measured ones
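The case of a third variable explaining the two measured ones can be shown with a toy example. Everything here is invented for illustration: temperature is assumed to drive both ice cream sales and sunburn cases, and the slopes (2 and 0.5) are assumed to be known so that the temperature effect can be subtracted.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

# Hypothetical third variable that drives both measured ones.
temperature = [10, 15, 20, 25, 30]
ice_cream = [2 * t + d for t, d in zip(temperature, [1, -1, 0, 1, -1])]
sunburn = [0.5 * t + d for t, d in zip(temperature, [1, 1, 0, -1, -1])]

# The two measured variables correlate strongly with each other...
raw_r = pearson(ice_cream, sunburn)

# ...but once the temperature effect is removed, nothing is left.
ice_rest = [i - 2 * t for i, t in zip(ice_cream, temperature)]
burn_rest = [s - 0.5 * t for s, t in zip(sunburn, temperature)]
rest_r = pearson(ice_rest, burn_rest)
print(raw_r, rest_r)
```

In real data the influence of the third variable is not known in advance and has to be estimated, which is one reason interpretation remains necessary.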

Meaningful correlation Does the correlation mean anything? Google Flu Trends was later found not to produce accurate results Spurious correlations

Spurious correlations

Spurious correlations http://www.tylervigen.com/spurious-correlations Find a correlation yourself: http://tylervigen.com/discover

Meaningful correlation We cannot only use the statistics, we need to interpret them But still we do not want to manually check all the possible correlations

Machine learning Wallach describes herself as machine learning researcher A simple introduction to machine learning (Geitgey, 2014) Rather than telling the computer what to do, it learns what to do • Supervised • Unsupervised

Supervised learning Provide enough answers to learn to give a new answer Computer figures out how to go from data to the answer

Supervised learning Or beat masters at chess
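As a minimal sketch of supervised learning, the example below uses a nearest-neighbour rule: the computer is given labelled examples (the "answers") and labels a new item by finding the most similar known one. The features (page count, number of illustrations) and labels are hypothetical, and nearest-neighbour is only one simple technique among many, not the method used in the cited examples.

```python
def nearest_neighbour(train, query):
    """Return the label of the training example closest to the query."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _features, label = min(train, key=lambda ex: sq_dist(ex[0], query))
    return label

# Hypothetical labelled sources: (page count, illustrations) -> type.
training_data = [
    ((2, 0), "letter"),
    ((3, 0), "letter"),
    ((40, 12), "newspaper"),
    ((36, 9), "newspaper"),
]

print(nearest_neighbour(training_data, (4, 1)))    # short, one illustration
print(nearest_neighbour(training_data, (30, 10)))  # long, many illustrations
```

No rule such as "letters are short" is written down anywhere; the behaviour follows entirely from the examples provided.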

Unsupervised learning No given answer Are there patterns? Outliers?
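Looking for outliers without any given answer can be sketched with a simple statistical rule: flag values that lie far from the mean, measured in standard deviations. The counts and the threshold of 2 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)
    return [v for v in values if abs(v - centre) / spread > threshold]

# Hypothetical number of surviving letters per year in an archive.
letters_per_year = [12, 14, 13, 15, 11, 13, 95]

outliers = find_outliers(letters_per_year)
print(outliers)
```

The procedure only points at the unusual year; why that year stands out is a question for the historian.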

Train without knowing the rules What do pregnant women buy? How are sentences translated to different languages?

Patterns Rens Bod: discovery of patterns with tools is Humanities 2.0 Hermeneutic interpretation of these patterns is Humanities 3.0 Fickers: context more interesting than the data

Radical contextualisation What is the context of each datapoint? Hitchcock - contextualize using the big data

Context If content is king, context is its crown Your search keywords make sense in your context

Radical context Remember from week 1: what does this tweet mean as part of 31M? Or actually: what does this tweet mean outside of Twitter?

Zooming Hitchcock describes the macroscope quoting Katy Börner Macroscopes provide a "vision of the whole," helping us "synthesize" the related elements and detect patterns, trends, and outliers while granting access to myriad details. Rather than make things larger or smaller, macroscopes let us observe what is at once too great, slow, or complex for the human eye and mind to notice and comprehend.

Zooming in on people If today we have a public dialogue that gives voice to the traditionally excluded and silenced – women, and minorities of ethnicity, belief and dis/ability – it is in no small part because we now have beautiful histories of small things. In other words, it has been the close and narrow reading of human experience that has done most to give voice to people excluded from ‘power’ by class, gender and race. Hitchcock

Close reading Hitchcock argues for interchange of close and distant reading Distant reading? That's the next lecture

For next time 19 October (double lecture) Distant Reading

Distant Reading Max Kemman University of Luxembourg October 19, 2015 Doing Digital History: Introduction to Tools and Technology

Recap from last time What is big data? Do digital libraries and historians have big data? How can big data be analyzed?

Today • What is distant reading? • Reading the distance • Biases in the chart • Hands-on • Next time • Assignment

What is distant reading? “distant reading”: understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data. (Schulz, 2011)

Aggregating Rather than analyzing a book page by page, analyze a corpus book by book Corpus (here): an aggregated set of sources

Viewing the aggregate (Moretti)

Viewing the aggregate (Aiden & Michel)

The charts The charts aim to show how one variable relates to another Vertical: y-axis Horizontal: x-axis Y-axis is often frequency per X words X-axis is often time
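The y-axis value "frequency per X words" can be computed by dividing the number of occurrences of a word by the total number of words in that period and scaling the result. The sketch below uses a tiny invented corpus, splits on whitespace only, and scales per million words; all of these are simplifying assumptions.

```python
def per_million(texts_by_year, word):
    """Relative frequency of `word` per million words, for each year."""
    result = {}
    for year, text in sorted(texts_by_year.items()):
        tokens = text.lower().split()
        result[year] = tokens.count(word) * 1_000_000 / len(tokens)
    return result

# Hypothetical corpus: one short text per year.
corpus = {
    1900: "the war ended and the peace began",
    1910: "war war and more war",
}

frequencies = per_million(corpus, "war")
for year, freq in frequencies.items():
    print(year, round(freq))
```

Using a relative rather than an absolute count keeps years with many surviving texts from dominating the chart simply because more words are available.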

X-Axis Not always the case, e.g. Gendered Language in Teacher Reviews X-axis: frequency per million words Y-axis: discipline Colour: gender

Reading the distance

Looking closer (Aiden & Michel)

Looking closer (Moretti)

Playing around with the view

Finding a correlation
