Profiling Big Data sources to assess their selectivity Piet Daas - PowerPoint PPT Presentation

Profiling Big Data sources to assess their selectivity
Piet Daas and Joep Burger
With special thanks to Marco Puts & Dong Nguyen

Big Data
– More and more organizations want to use Big Data as a new or additional source of information
– However, there are some major challenges:
– Selectivity of Big Data
– The source does not have to cover the target population completely
– What part of the population is included?

Profiling: extracting ‘features’
– Extract background characteristics (‘features’) from the ‘units’ in Big Data in an attempt to determine its selectivity
  ‐ The need for this depends on the ‘type’ of Big Data source and its foreseen use
– Important background characteristics for statistics are:
  ‐ Persons: gender, age, income, education, origin, urbanicity, household composition, ...
  ‐ Companies: number of employees, turnover, type of economic activity, legal form, ...

Social Media: Twitter as an example
– On social media, persons, companies and ‘others’ can create an account and post messages
  ‐ In the Netherlands 70% of the population is active on social media
– What kind of information about a user is available on Twitter?
  ‐ Focus on gender!
– Let’s look at a profile: @pietdaas

[Figure: annotated Twitter profile showing 1) Name, 2) Short bio, 3) Messages content, 4) Picture]

Studied a Twitter sample
– From a list of Dutch Twitter users (~330,000)
– A random sample of 1000 unique ids was drawn
– Of the sample:
  ‐ 844 profiles still existed
    • 844 had a name
    • 583 provided a short bio
    • 473 created ‘tweets’
    • 804 had a ‘non-default’ picture
    [Image: default Twitter picture]
    • 409 Men (49%)
    • 282 Women (33%)
    • 153 ‘Others’ (18%): companies, organizations, dogs, cats, ‘bots’, ...

Gender findings: 1) First name
– Used the Dutch ‘Voornamenbank’ website (first name database)
– Score between 0 and 1 (female – male); 676 of 844 (80%) names were registered
– Unknown names scored -1 (usually companies/organizations)
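The scoring rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `NAME_SCORES` table is a hypothetical stand-in for a first-name database lookup (its values are invented, not Voornamenbank data), taking the first token of the display name as the first name is an assumption, and the 0.5 threshold in `classify` is also an assumption.

```python
# Hypothetical stand-in for a first-name database lookup. The scores are
# invented for illustration; they are NOT taken from the Voornamenbank.
NAME_SCORES = {"piet": 1.0, "joep": 1.0, "anna": 0.0, "kim": 0.4}

UNKNOWN = -1.0  # the slide assigns -1 to names that are not registered


def first_name_score(display_name: str) -> float:
    """Return a score in [0, 1] (0 = female, 1 = male), or -1 if unknown.

    Assumption: the first whitespace-separated token is the first name.
    """
    tokens = display_name.strip().lower().split()
    if not tokens:
        return UNKNOWN
    return NAME_SCORES.get(tokens[0], UNKNOWN)


def classify(score: float, threshold: float = 0.5) -> str:
    """Map a score to a label; the 0.5 threshold is an assumption."""
    if score < 0:
        return "unknown"
    return "male" if score >= threshold else "female"


if __name__ == "__main__":
    for name in ["Piet Daas", "Anna", "Statistics Netherlands"]:
        score = first_name_score(name)
        print(name, score, classify(score))
```

Names that fall through to -1 are candidates for the ‘Others’ group (companies, organizations), so they are kept separate rather than forced into a gender.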

Gender findings: 2) Short bio
– If a short bio is provided
  ‐ Quite a number of people mention their ‘position’ in the family
    • Mother, father, papa, mama, ‘son of’, etc.
  ‐ Sometimes occupations that reflect gender are also mentioned (‘studente’)
  ‐ 155 of 583 (27%) indicated their gender in the short bio
  ‐ Need to check both English and Dutch texts
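A keyword check of this kind might look like the sketch below. The word lists are illustrative assumptions that combine the slide's examples (mother, father, papa, mama, son, studente) with a few Dutch and English equivalents; they are not the lists used in the study. Returning no label when both groups match is also an assumed design choice.

```python
import re
from typing import Optional

# Illustrative keyword lists (assumptions, not the study's lists).
# Both English and Dutch terms are included, as the slide recommends.
FEMALE_WORDS = {"mother", "mom", "mama", "moeder", "daughter", "dochter",
                "wife", "studente"}
MALE_WORDS = {"father", "dad", "papa", "vader", "son", "zoon", "husband"}


def bio_gender(bio: str) -> Optional[str]:
    """Return 'female', 'male', or None when the bio gives no clear hint."""
    words = set(re.findall(r"[a-z]+", bio.lower()))
    has_female = bool(words & FEMALE_WORDS)
    has_male = bool(words & MALE_WORDS)
    if has_female == has_male:
        # No hint at all, or conflicting hints (e.g. "mama en papa").
        return None
    return "female" if has_female else "male"


if __name__ == "__main__":
    print(bio_gender("Proud father of two, runner"))
    print(bio_gender("Studente en moeder van drie"))
    print(bio_gender("Official account of a company"))
```

Matching on whole words rather than substrings avoids false hits such as "son" inside "person".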

Gender findings: 3) Tweets content
– In cooperation with University of Twente (Dong Nguyen)
– Machine learning approach that determines gender-specific writing style
  ‐ Language-specific: messages need to be in Dutch!
  ‐ 437 of 473 (92%) persons who created tweets could be classified

Gender findings: 4) Profile picture
– Use OpenCV to process pictures
  1) Face recognition
  2) Standardisation of faces (resize & rotate)
  3) Classify faces according to gender
– 603 of 804 (75%) profile pictures had 1 or more faces on them

Gender findings: overall results
Diagnostic Odds Ratio = (TP/FN) / (FP/TN)

Feature            log(DOR)
First name         6.41
Short bio          3.50
Tweet content      2.36
Picture (faces)    0.72
(random guessing: log(DOR) = 0)

‐ Multi-agent findings
  • Need clever ways to combine these
  • Take processing efficiency of the ‘agent’ into consideration
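The Diagnostic Odds Ratio formula above can be computed from a confusion matrix as in this sketch. The counts in the example are invented for illustration and are not the study's data. Using the natural logarithm is an assumption, since the slide does not state the base. Raising an error on zero cells is a simplification; a continuity correction would be another option.

```python
import math


def diagnostic_odds_ratio(tp: int, fn: int, fp: int, tn: int) -> float:
    """DOR = (TP/FN) / (FP/TN), as defined on the slide."""
    if min(tp, fn, fp, tn) <= 0:
        # Simplification: refuse empty cells instead of applying a correction.
        raise ValueError("all four confusion-matrix cells must be positive")
    return (tp / fn) / (fp / tn)


def log_dor(tp: int, fn: int, fp: int, tn: int) -> float:
    """Natural log of the DOR (the log base is an assumption)."""
    return math.log(diagnostic_odds_ratio(tp, fn, fp, tn))


if __name__ == "__main__":
    # Invented counts: a classifier that is right most of the time.
    print(diagnostic_odds_ratio(tp=90, fn=10, fp=20, tn=80))
    print(round(log_dor(tp=90, fn=10, fp=20, tn=80), 2))
    # Random guessing: equal odds in both rows, so log(DOR) is 0.
    print(log_dor(tp=50, fn=50, fp=50, tn=50))
```

A classifier that performs no better than chance has equal odds in both rows, which gives DOR = 1 and log(DOR) = 0, matching the ‘random guessing’ reference line on the slide.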

Thank you for your attention!
