Listening to big data



  1. Listening to big data
     Or, philately will get you everywhere
     Mike Godfrey
     Software Architecture Group
     University of Waterloo

     Overview
     • Is clone analysis / empirical SE a Big Data problem?
       – … and should we care?
     • Looking hard for the Big Picture
       – And why sometimes that can be a bad idea
     • Let's go swimming with the data!
       – Some experiences and some advice

     "Big data"
     • Three Vs
       – Volume, Velocity, Variety
     • Why?
       – Enhanced decision making, insight discovery, and process optimization
     • Common problems:
       – Capture, curation, storage, search, sharing, transfer, analysis, and visualization

     (More data + simple algorithms) >> (complex algorithms)
     • Fantastic talk by Peter Norvig of Google: "The unreasonable effectiveness of data"
       http://www.youtube.com/watch?v=yvDCzhbjYWs
     • "Every time I fire a linguist, my scores get better."
       – [Fred Jelinek, paraphrased]
     • But does that work for clone detection / ESE too?
       – Should we all use N-gram algorithms?
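The question of whether clone detection could lean on N-gram algorithms can be made concrete with a small sketch. This is a hypothetical illustration, not a method presented in the talk: it splits two code fragments into tokens with a naive regular expression, collects token N-grams, and reports their Jaccard overlap. The function names, the tokenizer, and the default n = 3 are all assumptions chosen for brevity.

```python
import re


def tokenize(code):
    # Naive tokenizer (assumption): identifiers, integer literals,
    # and single punctuation characters; whitespace is discarded.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def ngrams(tokens, n=3):
    # Set of all contiguous token sequences of length n.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(a, b, n=3):
    # Jaccard overlap of the two fragments' token n-gram sets.
    ga, gb = ngrams(tokenize(a), n), ngrams(tokenize(b), n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    print(ngram_similarity("a = b + c;", "a = b + d;"))
```

A measure like this ignores the structure of the code entirely, which is the trade-off the slide is asking about.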

  2. Data quality

     (Big data + simple algorithms)?
     • NLP, for example, analyzes unstructured prose
       – Much variation: intent, word ordering, relationships, …
       – NLP often does some pre-processing, e.g., stemming
     • ESE examines development artifacts with lots of internal structure + external linkage, implicit and explicit
       – Source code text, including comments
       – Version control meta-data
       – Bug reports
       – …
     • When you have reliable structure, exploit it!
       – Yes?
       – So maybe big ESE data isn't really big data …

     Looking for the Big Picture
     A selective attention test
     http://www.youtube.com/watch?v=vJG698U2Mvo
     "I used to think that the brain was the most wonderful organ in my body. Then I realized who was telling me this."
       – Emo Philips

     Looking for the Big Picture
     Trials and Errors: Why Science is Failing Us
     by Jonah Lehrer
     Wired Magazine, December 2011

  3. Tim Minchin
     http://www.upworthy.com/this-is-the-most-inspiring-yet-depressing-yet-hilarious-yet-horrifying-yet-heartwarming-grad-speech

     "Physics is the only real science. The rest are just stamp collecting."
     Ernest Rutherford (1871-1937)
     Father of atomic physics
     Nobel prize for … chemistry

  4. The "S" curve of successful growth
     [Figure: S-shaped curve; axes: time (horizontal), size (vertical)]

     Linux kernel: Growth of kernel src tree (# of files)
     [Figure: # of source code files (*.[ch]) over time, Jan 1993 to Apr 2001; series: development releases (1.1, 1.3, 2.1, 2.3) and stable releases (1.0, 1.2, 2.0, 2.2)]

     Linux kernel: Average / median .h file size
     [Figure: uncommented LOC over time, Jan 1993 to Apr 2001; series: average and median .h file size, each for development and stable releases]

     Fitted curve annotated on the slides: y = .21*x^2 + 252*x + 90,055 (r^2 = .997)

  5. "Cloning considered harmful"
     "Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them."
     [Beck/Fowler, "Bad Smells", in Refactoring]

     Source code cloning considered harmful
     1. Forking
       – Hardware variation, e.g., Linux SCSI drivers
       – Platform variation
       – Experimental variation
     2. Templating
       – Boilerplating
       – API / library protocols
       – Generalized programming idioms
       – Parameterized code
     3. Post-hoc customizing
       – Bug workarounds
       – Replicate + specialize

     What to do?
     • Swim with the data
     • Be the gorilla in the mist
     • Look for lumps under the carpet & ask "Why?"

     Cloning harmfulness: Two open source case studies

                                          Apache          Gnumeric
     Group        Pattern                  Good  Harmful   Good  Harmful
     Forking      Hardware variation          0        0      0        0
     Forking      Platform variation         10        0      0        0
     Forking      Experimental variation      4        0      0        0
     Templating   Boiler-plating              5        0      6        7
     Templating   API                         0        0      0        9
     Templating   Idioms                      0       12      1        1
     Templating   Parameterized code          5       12     10       34
     Customizing  Replicate + specialize     12        4     15       16
     Customizing  Bug workarounds             0        0      0        0
     Total                                   36       28     32       67

     Apache httpd 2.2.4 - 60 Tokens; Gnumeric 1.6.3 - 60 Tokens
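The counts in the case-study table can be checked and summarized with a few lines of arithmetic. The sketch below transcribes the (good, harmful) pairs from the slide, recomputes the column totals, and derives the share of clones judged harmful in each system. The dictionary layout and function names are hypothetical, and the harmful share is a derived figure that does not appear on the slide.

```python
# (good, harmful) clone counts per pattern, transcribed from the slide's table.
CLONES = {
    "Apache": {
        "Hardware variation": (0, 0),
        "Platform variation": (10, 0),
        "Experimental variation": (4, 0),
        "Boiler-plating": (5, 0),
        "API": (0, 0),
        "Idioms": (0, 12),
        "Parameterized code": (5, 12),
        "Replicate + specialize": (12, 4),
        "Bug workarounds": (0, 0),
    },
    "Gnumeric": {
        "Hardware variation": (0, 0),
        "Platform variation": (0, 0),
        "Experimental variation": (0, 0),
        "Boiler-plating": (6, 7),
        "API": (0, 9),
        "Idioms": (1, 1),
        "Parameterized code": (10, 34),
        "Replicate + specialize": (15, 16),
        "Bug workarounds": (0, 0),
    },
}


def totals(system):
    # Sum the good and harmful counts over all patterns for one system.
    good = sum(g for g, _ in CLONES[system].values())
    harmful = sum(h for _, h in CLONES[system].values())
    return good, harmful


def harmful_share(system):
    # Fraction of all classified clones that were judged harmful.
    good, harmful = totals(system)
    return harmful / (good + harmful)


if __name__ == "__main__":
    for name in CLONES:
        print(name, totals(name), round(harmful_share(name), 3))
```

The recomputed totals match the slide's Total row (36 / 28 for Apache, 32 / 67 for Gnumeric), giving a harmful share of 28/64, about 44%, for Apache and 67/99, about 68%, for Gnumeric.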
