DATA ANALYSIS WITH VECTOR FUNCTIONAL PROGRAMMING
A tour of the Q programming language
HISTORY OF VECTOR LANGUAGES
➤ Vectors (arrays), not scalars, are the principal data type
➤ Not a new idea (APL, 1965)
➤ OK… maybe new compared to functional programming (λ-calculus, 1930s)
➤ Ken Iverson's Iverson Notation
  ➤ Notation as a tool of thought
  ➤ Notation for people first, computers later
➤ Influenced: Mathematica, Matlab, R, Julia
➤ Descendants: Iverson Notation → APL, J, A+, K, Q
Q PRIMER
The basic concepts
FUNCTION APPLICATION
➤ Monadic functions have a word name and take their argument on the right
    abs -1              → 1
    til 10              → 0 1 2 3 4 5 6 7 8 9
➤ Dyadic verbs appear between their arguments
    1 + 2               → 3
    9 mod 3             → 0
➤ Function application is itself a verb
    abs @ -1            → 1
    (-) . 1 2           → -1
ATOMIC FUNCTIONS
➤ Primitive functions (and verbs) are atomic (they apply to atoms, however deeply nested)
    -1 * 0 1 2 3 4                      → 0 -1 -2 -3 -4
    5 * 10 + til 5                      → 50 55 60 65 70
    5 * (1; 2 3; (4; 5 6); 7 8; 9)      → (5; 10 15; (20; 25 30); 35 40; 45)
➤ Evaluation is always right-to-left
➤ Typically read top-down (left-to-right)
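A minimal sketch of what right-to-left evaluation means in practice, reusing the slide's `5 * 10 + til 5`. The names `a` and `b` are illustrative only and do not come from the talk.

```q
/ There is no operator precedence: each verb takes everything to its right as its argument.
a:5 * 10 + til 5;        / evaluated as 5 * (10 + 0 1 2 3 4)
b:(5 * 10) + til 5;      / parentheses force the multiplication first
show a;                  / 50 55 60 65 70
show b;                  / 50 51 52 53 54
```

Parentheses are the only way to change the order, which is why idiomatic Q tends to be written so that it rarely needs them.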
LIST VERBS
➤ List primitives (we have them too, they just use fewer characters):
  take ( # )
    2#til 10            → 0 1
    -2#til 10           → 8 9
  join ( , )
    (til 4) , til 4     → 0 1 2 3 0 1 2 3
  split ( _ )
    0 3 6 _ til 9       → (0 1 2; 3 4 5; 6 7 8)
MAPPING A LIST - FP 101
    0 3 6 _ til 9               → (0 1 2; 3 4 5; 6 7 8)
    count each 0 3 6 _ til 9    → 3 3 3
But wait! There's more!
➤ If the verb is dyadic, combine it with an adverb (a pairing operator)
➤ e.g., each-both ( ' ): take ( # ) + each-both ( ' ) = take-each-both ( #' )
    3 3 3#'0 1 2                → (3#0; 3#1; 3#2) = (0 0 0; 1 1 1; 2 2 2)
ADVERBS
    noun     verb    adverb    noun
    3 3 3    #       '         0 1 2
FOLD AND SCAN ARE ADVERBS … MORE FP 101
➤ Fold ( / ) is an adverb, we call it over
    0 +/ til 5          → 10            (a plus reduction over 0 1 2 3 4)
➤ Scan ( \ ) returns the incremental values of over (left-to-right)
    0 +\ til 5          → 0 1 3 6 10    (partial sums of 0 1 2 3 4)
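Over and scan are not tied to `+`. The sketch below applies them to two other built-in verbs; the names `fact`, `facts`, and `runmax` are made up for illustration.

```q
/ over folds a dyadic verb through a list; scan keeps every intermediate result
fact:(*/) 1+til 5;          / 1*2*3*4*5
facts:(*\) 1+til 5;         / running products
runmax:(|\) 3 1 4 1 5;      / running maximum; | is max on numbers
show fact;                  / 120
show facts;                 / 1 2 6 24 120
show runmax;                / 3 3 4 4 5
```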
FLEXIBLE MAPPING WITH ADVERBS
➤ Only 6 adverbs, but they come up all the time
  each-right ( /: )
    max @/: 0 3 6 _ til 9               → 2 5 8
  each-left ( \: )
    (floor;ceiling) @\: 5.5             → 5 6
  each-prior ( ': )
    0 -': til 5                         → 0 1 1 1 1
  compose: each-left-each-right ( \:/: )
    (min;max) @\:/: 0 3 6 _ til 9       → (0 2; 3 5; 6 8)
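Each-left and each-right are easy to mix up, so here is a small sketch using join ( , ), where the two give visibly different shapes, followed by a composed adverb. The variable names are illustrative only.

```q
l:1 2 ,\: 10 20;             / each-left:  (1,10 20; 2,10 20)
r:1 2 ,/: 10 20;             / each-right: (1 2,10; 1 2,20)
t:(1+til 3) */:\: 1+til 3;   / each-right inside each-left: a multiplication table
show l;                      / (1 10 20; 2 10 20)
show r;                      / (1 2 10; 1 2 20)
show t;                      / (1 2 3; 2 4 6; 3 6 9)
```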
THINKING IN ARRAYS
Prime Numbers
THINKING IN ARRAYS - NO STINKING LOOPS*
    function isPrime (n) {
      if (n < 2) return false;
      var q = Math.floor(Math.sqrt(n));
      for (var i = 2; i <= q; i++) {
        if (n % i == 0) {
          return false;
        }
      }
      return true;
    }
* Steve Apter, nsl.com
THINKING IN ARRAYS
[Figures: x mod y for 1 .. 100; x mod y = 0; y = x and y = 1; primes]
THE RESULT
    p : {n where 2=sum 0=n mod/: n:1+til x}
    rle : {(count;first)@\:/:(where not =':[x])_x}
    expand : {(),/(#).'x}
➤ Extremely concise, 111 bytes
➤ 29 characters left for emojis when tweeting it!
"Only short programs have any hope of being correct" ~ Arthur Whitney
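To make the one-liner `p` easier to follow, the sketch below evaluates it one step at a time for the candidates 1 .. 10. The intermediate names `m`, `d`, and `c` are introduced here for illustration and are not in the original.

```q
n:1+til 10;              / candidates 1 .. 10
m:n mod/: n;             / one row per divisor y, one column per candidate x
d:0=m;                   / 1b wherever y divides x exactly
c:sum d;                 / adds the rows: number of divisors of each candidate
show n where 2=c;        / exactly two divisors (1 and itself): 2 3 5 7
```

No loop and no early exit: the whole table is computed, and the primes are simply the columns that contain exactly two zeros.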
HOW CAN WE USE Q FOR DATA ANALYSIS?
➤ Q has dictionaries (associations) and tables (flipped dictionaries)
➤ Tables are first-class and columnar; operations on columns are fast and efficient
➤ It is actually the scripting language for kdb+
➤ Has an integrated SQL-like query language called q-sql
    select avg price by sym from trades where date > .z.d - 5
➤ Has really nice temporal types, temporal arithmetic, and temporal joins
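A minimal sketch of "tables are flipped dictionaries", using made-up data. The names `d`, `t`, and `r` and the `sym`/`price` values are hypothetical; they are not the `trades` table from the slide.

```q
d:`sym`price!(`a`b`a`b;1 2 3 4f);    / dictionary: column name -> column vector
t:flip d;                            / flipping a column dictionary gives a table
r:select avg price by sym from t;    / q-sql: one row per sym
show t;
show r;                              / a -> 2, b -> 3
```

Because each column is an ordinary vector, everything from the primer (atomic verbs, adverbs) applies to table columns unchanged.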
Q FOR DATA ANALYSIS
STEP 1. GET SOME DATA
Monthly page-view information for people on Wikipedia
    // System commands start with \
    \wget .../pantheon.tsv
    \wget .../pageviews_2008-2013.tsv -O pageviews.tsv
    // ETL in Q: (column types; tab separator) 0: file name
    people : ("iSiSSSSSffsissssiffiiff"; enlist "\t") 0: `:pantheon.tsv;
    pageviews : ("iSSiSisssss",72#"i"; enlist "\t") 0: `:pageviews.tsv;
➤ Each month is a single column
➤ We have a short fat table, want a long skinny table…
STEP 2. CLEAN THE DATA!
    // Month values: all of the months
    months : "M"$ssr[;"-";"."] each string 11_cols pageviews;
    // Long skinny table with 4 columns: create a new table of the months flattened
    monthly : ungroup 2!([]
        id : pageviews`id;
        lang : pageviews`lang;
        month : (count pageviews)#enlist months;
        clicks : flip pageviews c:11_cols pageviews);
    // Left-join click information with person information
    clickinfo : monthly lj `id`lang xkey people;

    id  lang month   clicks        id  name            occupation lang
    -----------------------        -----------------------------------
    307 af   2008.01 4             307 Abraham Lincoln POLITICIAN af
    307 af   2008.02 5             307 Abraham Lincoln POLITICIAN am
    307 af   2008.03 0             307 Abraham Lincoln POLITICIAN an
    307 af   2008.04 5             307 Abraham Lincoln POLITICIAN ang
    307 af   2008.05 5             307 Abraham Lincoln POLITICIAN ar
    307 af   2008.06 1             307 Abraham Lincoln POLITICIAN arz
    …                              …
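The wide-to-long reshape is the least obvious part of this step, so here is the same pattern on a toy table. Everything in it (`wide`, `mcols`, `long`, and the values) is hypothetical; only the `ungroup 2!([] …)` shape is taken from the slide.

```q
wide:([] id:1 2; lang:`en`fr; m1:10 30; m2:20 40);   / short fat table: one column per month
mcols:2_cols wide;                                   / the month columns: `m1`m2
long:ungroup 2!([]
    id:wide`id;
    lang:wide`lang;
    month:(count wide)#enlist mcols;                 / every row carries the full list of months
    clicks:flip wide mcols);                         / every row carries its own list of clicks
show long;                                           / 4 rows: one per (id, lang, month)
```

Each row first holds two equal-length lists (months and clicks); `ungroup` then unrolls those lists into one row per element.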
STEP 3. ASK SOME QUESTIONS
    select from clickinfo where occupation like "COMPUTER SCIENTIST"
STEP 4…CLEAN THE DATA… AGAIN…
    file : {"List_of_Google_Doodles_in_",string `year$x};
    wget : {system "wget https://en.wikipedia.org/wiki/",file x};
    process : {
      values : (string `January`February`March`April`May`June`July`August`September`October`November`December)!til 12;
      doc : read0 hsym `$file x;
      pars: where doc like\: "<p>*";
      celebrated : `$first @/:/: "\"" vs/:/: (@).' flip (d; where@/: not (d : "title=\"" vs/: doc pars) like\:\: "<p>*");
      headings : {[doc;x] first pos where (doc pos : x + neg til 10) like\: "<h3>*"}[doc] each pars;
      months : x + values first @/: "_" vs/: first @' "\"" vs/: ("id=\"" vs/: doc headings)@'1;
      : raze each celebrated group months;
      };
    years : 2010.01 2011.01 2012.01 2013.01m;
    wget each years;
    results : raze process each years;
    doodles : ungroup 1!flip `month`name!(key;value)@\:results;

    <p>On <b>Tuesday, July 6, 2010</b>, the birth of <a href="/wiki/Frida_Kahlo" title="Frida Kahlo">Frida Kahlo</a> was celebrated with a gold Google logo wrapped with vines, flowers, and a painting of herself in her painting styles.<sup id="cite_ref-18" class="reference"><a href="#cite_note-18">[18]</a></sup></p>
    →
    month   name
    ----------------------------
    2010.01 Isaac Newton
    2010.01 Django Reinhard
    2010.01 Anton Chekhov
    2010.02 2010 Winter Olympics
    …
PARALLELIZATION IN Q
    file : {"List_of_Google_Doodles_in_",string `year$x};
    wget : {system "wget https://en.wikipedia.org/wiki/",file x};
    process : {
      values : (string `January`February`March`April`May`June`July`August`September`October`November`December)!til 12;
      doc : read0 hsym `$file x;
      pars: where doc like\: "<p>*";
      celebrated : `$first @/:/: "\"" vs/:/: (@).' flip (d; where@/: not (d : "title=\"" vs/: doc pars) like\:\: "<p>*");
      headings : {[doc;x] first pos where (doc pos : x + neg til 10) like\: "<h3>*"}[doc] each pars;
      months : x + values first @/: "_" vs/: first @' "\"" vs/: ("id=\"" vs/: doc headings)@'1;
      : raze each celebrated group months;
      };
    years : 2010.01 2011.01 2012.01 2013.01m;
    wget each years;
    results : raze process each years;
    doodles : ungroup 1!flip `month`name!(key;value)@\:results;
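The slide text does not say which lines were parallelized. One standard option in Q, shown here as an assumption rather than as the speaker's method, is to replace `each` with `peach` wherever a function is mapped over independent items (such as the `each years` lines). The function `slow` and the list `work` below are hypothetical stand-ins.

```q
/ hypothetical stand-in for an expensive, independent per-item computation
slow:{sum sqrt til x};
work:4#1000000;
serial:slow each work;       / one item after another
parallel:slow peach work;    / same result; spread over threads only if q was started with -s N
show serial~parallel;        / 1b
```

Without secondary threads (for example, `q -s 4`), `peach` simply behaves like `each`, so the change is safe to make first and tune later.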