Polyglot data science: the force awakens with F#, R and D3.js
Evelina Gabasova (@evelgab), Tomas Petricek (@tomaspetricek)
Part I F# with type providers
fslab.org: Doing data science using F#
The data science workflow
    Data access with type providers
    Interactive analysis with .NET and R libraries
    Visualization with HTML/PDF charts and reports
High-quality open-source libraries
LINQ before it was cool :-)
    var res = StockData.MSFT
        .Where(stock => stock.Close - stock.Open > 7.0)
        .Select(stock => stock.Date);
Looking under the cover
    Extension methods take Func<T1, T2> delegates
    Immutable: each call returns a new IEnumerable
    Functional design allows method chaining
LINQ before it was cool :-)
    StockData.MSFT
    |> Array.filter (fun stock -> stock.Close - stock.Open > 7.0)
    |> Array.map (fun stock -> stock.Date)
Looking under the cover
    Pipeline operator for composing functions
    Lambda functions written using fun
    Immutable lists, sequences, arrays, etc.
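For readers arriving from the D3.js side, the same filter-then-map shape can be sketched in plain JavaScript. The `stocks` array, its field names and its values are invented for illustration; they only stand in for `StockData.MSFT` and are not real prices.

```javascript
// Hypothetical records standing in for StockData.MSFT (values are invented).
const stocks = [
  { date: "2016-01-04", open: 50.0, close: 58.5 },
  { date: "2016-01-05", open: 58.0, close: 57.0 },
  { date: "2016-01-06", open: 57.0, close: 65.0 },
];

// Keep the days where the price rose by more than 7.0, then project the date.
// filter and map each return a new array and leave the input unchanged.
function bigGainDates(data) {
  return data
    .filter((stock) => stock.close - stock.open > 7.0)
    .map((stock) => stock.date);
}

console.log(JSON.stringify(bigGainDates(stocks)));
// prints ["2016-01-04","2016-01-06"]
```

As in the C# and F# versions, each step is a function from a collection to a new collection, which is what makes the chain read top to bottom.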
Charting libraries for F#
    XPlot - cross-platform, HTML-based (recommended)
    F# Charting - flexible but Windows-only library
    Other options: FnuPlot and R provider
For latest information
    See FsLab.org - the F# data science homepage
Charting with XPlot
Draw sin for values from 0 to 2π:
    [| 0.0 .. 0.1 .. 6.3 |]
    |> Array.map (fun x -> x, sin x)
    |> Chart.Line
Uses Google Charts behind the scenes.
[Chart: line plot of sin x for x from 0 to 2π]
What are type providers?
Type provider patterns
Providers for a specific data source
    let wb = WorldBankData.GetDataContext()
    wb.Countries.India.Indicators.``Population, total``
Parameterized provider for a data format
    type Rss = XmlProvider<"data/bbc.xml">
    Rss.Load(url).Channel.Description
TASK: Star Wars movie profits
[Chart: Star Wars rating and box office; x axis: Year, y axis: Box office]
github.com/evelinag/polyglot-data-science
Part II Visualization with D3.js
The Star Wars social network
D3.js visualizations made easier Gallery of examples
D3.js social network visualization Force-directed network layout
Part III Analyzing social networks with R
Social network analysis
Who is the most central character?
How do the movies compare with each other?
The R language
A "domain-specific" language for statistical analysis
Very quick R intro
    # assignment
    x <- 1
    x = 1
    # variable and function names
    x
    x.y
    read.csv
Very quick R intro: pipeline
|> turns into %>%
    install.packages("magrittr")
    library(magrittr)
    xs <- c(1,2,3,4,5,6,7,8,9,10)
    xs %>% mean
Network analysis with igraph
Links: igraph website, igraph documentation
    install.packages("igraph")
    library(igraph)
Creating igraph network
    library(igraph)
    g <- graph(edges)
edges = list of nodes
    n1, n2, n3, n4, n5, ...
represents
    (n1, n2), (n3, n4), ...
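The flat layout of `edges` is easy to misread, so here is a small sketch of the pairing rule in plain JavaScript (the language of the D3.js part). The helper name `toEdgePairs` and the node names are hypothetical; this only illustrates how consecutive entries are grouped and is not igraph's implementation.

```javascript
// Hypothetical helper: group a flat vector of node names into edge pairs,
// mirroring how the slide describes the `edges` argument.
function toEdgePairs(flat) {
  if (flat.length % 2 !== 0) {
    throw new Error("flat edge vector must contain an even number of nodes");
  }
  const pairs = [];
  for (let i = 0; i < flat.length; i += 2) {
    pairs.push([flat[i], flat[i + 1]]);
  }
  return pairs;
}

// Illustrative node names only, not taken from the Star Wars network.
const flat = ["n1", "n2", "n3", "n4", "n1", "n3"];
console.log(JSON.stringify(toEdgePairs(flat)));
// prints [["n1","n2"],["n3","n4"],["n1","n3"]]
```

A vector with an odd number of entries cannot be split into pairs, so the sketch rejects it rather than guessing.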
Calculating degree
    d <- degree(g)
F#
    open RProvider.igraph
    let degree = R.degree(network)
F#: export JSON into a list of edges
R: perform the network analysis
Degree
$$\mathrm{Degree}(v) = \text{number of links } v \leftrightarrow v', \quad v \neq v'$$
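The degree definition can be sketched directly on a list of edge pairs. This is a plain JavaScript illustration, not the igraph call used in the talk. It rests on two assumptions: self-links are ignored (the v ≠ v′ condition), and repeated links between the same two nodes are counted once. The function name `degrees` and the node names are hypothetical.

```javascript
// Count, for every node, the number of distinct other nodes it is linked to.
// Assumption: the network is undirected and duplicate links count once.
function degrees(edges) {
  const neighbours = new Map();
  const link = (a, b) => {
    if (!neighbours.has(a)) neighbours.set(a, new Set());
    neighbours.get(a).add(b);
  };
  for (const [a, b] of edges) {
    if (a === b) continue; // v ≠ v′: ignore self-links
    link(a, b);
    link(b, a);
  }
  const result = {};
  for (const [node, set] of neighbours) result[node] = set.size;
  return result;
}

// Illustrative network: a triangle A-B-C with D attached to C.
const edges = [["A", "B"], ["A", "C"], ["B", "C"], ["C", "D"]];
console.log(JSON.stringify(degrees(edges)));
// prints {"A":2,"B":2,"C":3,"D":1}
```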
Betweenness
$S_v$ = number of shortest paths between $a$ and $b$ through $v$
$S$ = number of shortest paths between $a$ and $b$
For a single pair $a$, $b$:
$$\mathrm{Betweenness}_{ab}(v) = \frac{S_v}{S}$$
Summed over all pairs:
$$\mathrm{Betweenness}(v) = \sum_{ab} \frac{S_v}{S}$$
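To make the sum concrete, here is a brute-force sketch in plain JavaScript that follows the definition literally: for every unordered pair a, b it counts shortest paths with breadth-first search and adds S_v / S to each intermediate node v. It assumes an undirected, unweighted network and is meant for small graphs. The talk itself relies on igraph from R, and all function and node names below are hypothetical.

```javascript
// Build an adjacency map from undirected edge pairs (self-links are ignored).
function adjacency(edges) {
  const adj = new Map();
  for (const [a, b] of edges) {
    if (!adj.has(a)) adj.set(a, new Set());
    if (!adj.has(b)) adj.set(b, new Set());
    if (a !== b) { adj.get(a).add(b); adj.get(b).add(a); }
  }
  return adj;
}

// Breadth-first search from `source`: distance to each reachable node and
// the number of distinct shortest paths leading to it.
function shortestPathCounts(adj, source) {
  const dist = new Map([[source, 0]]);
  const count = new Map([[source, 1]]);
  const queue = [source];
  for (let i = 0; i < queue.length; i++) {
    const u = queue[i];
    for (const w of adj.get(u)) {
      if (!dist.has(w)) {
        dist.set(w, dist.get(u) + 1);
        count.set(w, 0);
        queue.push(w);
      }
      if (dist.get(w) === dist.get(u) + 1) {
        count.set(w, count.get(w) + count.get(u));
      }
    }
  }
  return { dist, count };
}

// Betweenness(v) = sum over pairs (a, b) of S_v / S.
function betweenness(edges) {
  const adj = adjacency(edges);
  const nodes = [...adj.keys()];
  const bfs = new Map(nodes.map((n) => [n, shortestPathCounts(adj, n)]));
  const result = Object.fromEntries(nodes.map((n) => [n, 0]));
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const fromA = bfs.get(nodes[i]);
      const fromB = bfs.get(nodes[j]);
      if (!fromA.dist.has(nodes[j])) continue; // a and b are not connected
      const S = fromA.count.get(nodes[j]);
      for (const v of nodes) {
        if (v === nodes[i] || v === nodes[j] || !fromA.dist.has(v)) continue;
        // v lies on a shortest a-b path iff d(a,v) + d(v,b) = d(a,b)
        if (fromA.dist.get(v) + fromB.dist.get(v) === fromA.dist.get(nodes[j])) {
          result[v] += (fromA.count.get(v) * fromB.count.get(v)) / S;
        }
      }
    }
  }
  return result;
}

// Illustrative network: a triangle A-B-C with D attached to C.
console.log(JSON.stringify(betweenness([["A", "B"], ["A", "C"], ["B", "C"], ["C", "D"]])));
// prints {"A":0,"B":0,"C":2,"D":0}
```

In the example only C sits between other nodes: it carries the single shortest path from A to D and from B to D, giving 1 + 1 = 2.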
Network structure
How do the movies differ?
    Size
    Density
    Clustering coefficient
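Density
$$\mathrm{Density} = \frac{\text{Existing connections}}{\text{Potential connections}} = \frac{\text{Existing connections}}{\frac{1}{2} N (N - 1)}$$
where $N$ is the number of nodes in the network.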
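A minimal JavaScript sketch of the density formula, under stated assumptions: the network is undirected, N counts only nodes that appear in at least one edge, self-links are ignored, and repeated links count once. The function name `density` and the example network are hypothetical.

```javascript
// Density = existing connections / (N * (N - 1) / 2) for an undirected network.
function density(edges) {
  const nodes = new Set();
  const links = new Set();
  for (const [a, b] of edges) {
    nodes.add(a);
    nodes.add(b);
    if (a === b) continue; // self-links are not potential connections
    links.add(JSON.stringify([a, b].sort())); // A-B and B-A are the same link
  }
  const n = nodes.size;
  if (n < 2) return 0; // assumption: density of a trivial network is reported as 0
  return links.size / ((n * (n - 1)) / 2);
}

// Illustrative network: 4 nodes, 4 of the 6 possible links are present.
const example = [["A", "B"], ["A", "C"], ["B", "C"], ["C", "D"]];
console.log(density(example)); // 4 / 6
```

Multiplying the result by 100 gives the percentage scale used when comparing the episodes.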
Clustering coefficient
$K_v$ = number of neighbours of $v$
$E_v$ = number of links between neighbours of $v$
For a single node:
$$\mathrm{Clustering}(v) = \frac{E_v}{\frac{1}{2} K_v (K_v - 1)}$$
For the whole network:
$$\mathrm{Clustering}(\text{network}) = \frac{1}{N} \sum_v \frac{E_v}{\frac{1}{2} K_v (K_v - 1)}$$
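The two clustering formulas can be sketched in plain JavaScript as follows. The formula is undefined for nodes with fewer than two neighbours, where the denominator is zero. This sketch assumes such nodes contribute 0 to the average, which is one possible convention and may differ from what igraph reports. The function name `clustering` and the example network are hypothetical.

```javascript
// Local clustering for every node, plus the network-wide average.
function clustering(edges) {
  const adj = new Map();
  for (const [a, b] of edges) {
    if (!adj.has(a)) adj.set(a, new Set());
    if (!adj.has(b)) adj.set(b, new Set());
    if (a !== b) { adj.get(a).add(b); adj.get(b).add(a); }
  }
  const local = {};
  for (const [v, neighbours] of adj) {
    const k = neighbours.size; // K_v
    if (k < 2) { local[v] = 0; continue; } // assumption: undefined case counted as 0
    const list = [...neighbours];
    let linked = 0; // E_v: links between neighbours of v
    for (let i = 0; i < list.length; i++) {
      for (let j = i + 1; j < list.length; j++) {
        if (adj.get(list[i]).has(list[j])) linked++;
      }
    }
    local[v] = linked / ((k * (k - 1)) / 2);
  }
  const values = Object.values(local);
  const network = values.length === 0
    ? 0
    : values.reduce((sum, x) => sum + x, 0) / values.length;
  return { local, network };
}

// Illustrative network: a triangle A-B-C with D attached to C.
const result = clustering([["A", "B"], ["A", "C"], ["B", "C"], ["C", "D"]]);
console.log(result.local.C); // 1 of the 3 possible links among A, B, D exists
console.log(result.network); // (1 + 1 + 1/3 + 0) / 4
```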
Size
[Chart: number of characters per episode, Episodes 1 to 7]
Density
[Chart: network density (%) per episode, Episodes 1 to 7]
Clustering coefficient
[Chart: clustering coefficient (transitivity) per episode, Episodes 1 to 7]
CONCLUSIONS
F# Software Foundation (www.fsharp.org)
non-profit, books and tutorials, cross-platform, community, data science, commercial support, open-source, contributions, machine learning, web and cloud, consulting, user groups, research
The Learning Pyramid
Community chat and Q&A
    #fsharp on Twitter
    StackOverflow F# tag
Open source on GitHub
    Visual F# repo: github.com/Microsoft/visualfsharp
    F# Compiler and core libraries: github.com/fsharp
    F# Incubation project space: github.com/fsprojects
    FsLab Organization repository: github.com/fslaborg
More resources
    Scott Wlaschin's fsharpforfunandprofit.com
    F# Books and Resources: fsharp.org/about/learning.html
The Force Awakens
Evelina Gabasova: @evelgab, evelina@evelinag.com, www.evelinag.com
Tomas Petricek: @tomaspetricek, tomas@tomasp.net, www.tomasp.net