Python & Spark PTT18/19 Prof. Dr. Ralf Lämmel Msc. Johannes Härtel Msc. Marcel Heinz (C) 2018, SoftLang Team, University of Koblenz-Landau
The ‘Big Picture’ [Aggarwal15]
Plenty of building blocks are involved in this ‘Big Picture’.
Back to the ‘Big Picture’ [Aggarwal15]
Foundations
Technologies and APIs: Several technologies and APIs support data analysis in Python; this tutorial uses Pandas as one of the most convenient. It is inspired by the book ‘Python for Data Analysis’ [McKinney12].
What is contained in this CSV? Some imports and configuration are needed to read and print a CSV file with Pandas.
What is contained in this CSV? Reading and printing CSV data with Pandas.
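The code shown on the slide is not part of this text, so the following is only a minimal sketch of reading and printing CSV data. The column names and the inline CSV text are hypothetical stand-ins for the real file, which would normally be passed to `pd.read_csv` as a path.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file used on the slides.
CSV_TEXT = """user_id,gender,title,genres,rating
1,F,Toy Story (1995),Animation,5
1,F,Alien (1979),Horror,3
2,M,Toy Story (1995),Animation,4
2,M,Alien (1979),Horror,5
3,F,Alien (1979),Horror,2
3,F,Heat (1995),Action,4
"""

# Configuration: limit how many rows are printed for large frames.
pd.set_option("display.max_rows", 10)

# read_csv accepts a file path or any file-like object.
ratings = pd.read_csv(io.StringIO(CSV_TEXT))
print(ratings)
```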
What are the first 5 ratings in this CSV? Selecting a range of rows returns another DataFrame.
What is the title a rating refers to? Selecting one column returns a Series. (╯°□°)╯︵ ┻━┻
What is the gender and the genre of a rating? Selecting columns by passing a list returns a DataFrame. ┬──┬◡ノ(° -°ノ)
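The three kinds of selection can be sketched as follows, assuming a small hypothetical ratings table whose column names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
ratings = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "F"],
    "title": ["Toy Story (1995)", "Alien (1979)", "Toy Story (1995)",
              "Alien (1979)", "Alien (1979)", "Heat (1995)"],
    "genres": ["Animation", "Horror", "Animation", "Horror", "Horror", "Action"],
    "rating": [5, 3, 4, 5, 2, 4],
})

first_rows = ratings[:5]                     # row range -> DataFrame
titles = ratings["title"]                    # single column -> Series
gender_genre = ratings[["gender", "genres"]]  # list of columns -> DataFrame

print(type(first_rows).__name__, type(titles).__name__, type(gender_genre).__name__)
```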
What are the ratings of female persons? First we need a condition for filtering. Such a condition can be stated as a Series of booleans.
What are the ratings of female persons? We can use this condition to select rows.
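A minimal sketch of this two-step filtering, assuming a hypothetical `gender` column with the values `"F"` and `"M"`:

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
ratings = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "F"],
    "rating": [5, 3, 4, 5, 2, 4],
})

# Step 1: the condition is a Series of booleans, one entry per row.
is_female = ratings["gender"] == "F"

# Step 2: indexing with that Series keeps only the rows marked True.
female_ratings = ratings[is_female]
print(female_ratings)
```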
What is the number of female and male ratings? Let’s try this!
What is the number of female and male ratings? We can also use dedicated Pandas functionality to create a Series that is indexed by the distinct values.
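Both approaches might look like the sketch below. The manual variant filters and counts per value, while `value_counts` is presumably the dedicated functionality the slide refers to; the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
ratings = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "F"],
    "rating": [5, 3, 4, 5, 2, 4],
})

# Manual variant: filter per value and count the remaining rows.
n_female = len(ratings[ratings["gender"] == "F"])
n_male = len(ratings[ratings["gender"] == "M"])

# Dedicated variant: a Series indexed by the distinct values.
counts = ratings["gender"].value_counts()
print(counts)
```

With Matplotlib installed, `counts.plot(kind="bar")` would draw the same numbers as a bar chart.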
What is the number of female and male ratings? … and we can make Python plot this.
What is the average rating given by a user? First we need to group the ratings by user. The following shows how to get all ratings of one user.
What is the average rating given by a user? After grouping we can select the rating column and take the mean for each group.
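Grouping and averaging can be sketched like this, assuming hypothetical `user_id` and `rating` columns:

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "rating": [5, 3, 4, 5, 2, 4],
})

by_user = ratings.groupby("user_id")

# All ratings of one user (here the user with id 1).
ratings_of_user_1 = by_user.get_group(1)

# Select the rating column and take the mean within each group.
mean_by_user = by_user["rating"].mean()
print(mean_by_user)
```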
What is the average rating given by a user? We can also summarize the result as a boxplot.
What is a gender’s average rating of a film? A pivot table specifies rows and columns and aggregates the values using a passed function.
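As a hedged illustration, the pivot table below uses film titles as rows, gender as columns, and the mean as the aggregation function. The data and column names are hypothetical; only the name `film_mean_ratings` appears later in the slides.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
ratings = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "F"],
    "title": ["Toy Story (1995)", "Alien (1979)", "Toy Story (1995)",
              "Alien (1979)", "Alien (1979)", "Heat (1995)"],
    "rating": [5, 3, 4, 5, 2, 4],
})

# Rows: titles, columns: genders, cells: mean rating of that group.
film_mean_ratings = ratings.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)
print(film_mean_ratings)
```

Cells without any rating, such as a film that only one gender rated, are filled with NaN.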
What are the top female-rated films? i) We filter out films with fewer than 250 ratings to concentrate on the important candidates. ii) We increase the maximum number of displayed rows since this is serious data! iii) We sort by column ‘F’, which contains the average female ratings.
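The three steps above could be sketched as follows. The toy data has only a handful of ratings, so the threshold is set to 2 instead of 250; the threshold constant and the column names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
ratings = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "F"],
    "title": ["Toy Story (1995)", "Alien (1979)", "Toy Story (1995)",
              "Alien (1979)", "Alien (1979)", "Heat (1995)"],
    "rating": [5, 3, 4, 5, 2, 4],
})
film_mean_ratings = ratings.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)

# i) Keep only films with enough ratings (the slides use 250).
MIN_RATINGS = 2
ratings_per_title = ratings.groupby("title").size()
active_titles = ratings_per_title.index[ratings_per_title >= MIN_RATINGS]

# ii) Allow more rows to be printed.
pd.set_option("display.max_rows", 100)

# iii) Sort by the average female rating, best first.
top_female = film_mean_ratings.loc[active_titles].sort_values(by="F", ascending=False)
print(top_female)
```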
What is the film with the biggest disagreement between female and male ratings? We add a new column to the ‘film_mean_ratings’ DataFrame that holds the difference between the female and male columns.
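A minimal sketch of the new column, assuming `film_mean_ratings` has the columns `F` and `M`. The sign convention (male minus female) and the column name `diff` are assumptions, and the mean values are made up.

```python
import pandas as pd

# Hypothetical mean ratings per film and gender.
film_mean_ratings = pd.DataFrame(
    {"F": [5.0, 2.5, 4.0], "M": [4.0, 5.0, 3.5]},
    index=pd.Index(["Toy Story (1995)", "Alien (1979)", "Heat (1995)"], name="title"),
)

# New column: positive values mean men rated the film higher.
film_mean_ratings["diff"] = film_mean_ratings["M"] - film_mean_ratings["F"]

# Sorting shows both ends of the disagreement.
sorted_by_diff = film_mean_ratings.sort_values(by="diff")

# The largest absolute difference, regardless of direction.
most_disputed = film_mean_ratings["diff"].abs().idxmax()
print(sorted_by_diff)
print(most_disputed)
```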
What is the movie with the most disagreement among all viewers? The standard deviation can be used to describe such disagreement in ratings.
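One way to compute this, sketched on hypothetical data, is to group by title and take the standard deviation of the ratings. Pandas uses the sample standard deviation by default, so a film with a single rating yields NaN.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
ratings = pd.DataFrame({
    "title": ["Toy Story (1995)", "Alien (1979)", "Toy Story (1995)",
              "Alien (1979)", "Alien (1979)", "Heat (1995)"],
    "rating": [5, 3, 4, 5, 2, 4],
})

# Standard deviation of ratings per film, most disputed first.
rating_std = (
    ratings.groupby("title")["rating"].std().sort_values(ascending=False)
)
print(rating_std)
```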
Back to the ‘Big Picture’ [Aggarwal15]
Data
Data Integration (JSON): JSON data can be loaded from a file and accessed like dictionaries. cf. [web_json]
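The sketch below uses Python's standard `json` module. The file name and the record contents are hypothetical; the file is written to a temporary directory first so that the example is self-contained.

```python
import json
import os
import tempfile

# Hypothetical JSON content.
record = {
    "name": "Some Project",
    "developers": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "project.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle)

    # Loading turns JSON objects into dicts and JSON arrays into lists.
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)

# Access works like nested dictionaries and lists.
first_developer = data["developers"][0]["name"]
print(first_developer)
```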
Data Integration (SQL): An SQLite package provides, for instance, an in-memory database. cf. [web_sql]
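A minimal sketch with the standard library's `sqlite3` module, which may or may not be the package used on the slide. The table layout and rows are hypothetical.

```python
import sqlite3

# ":memory:" creates a database that lives only in RAM.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE ratings (user_id INTEGER, title TEXT, rating INTEGER)"
)
connection.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, "Toy Story (1995)", 5), (2, "Toy Story (1995)", 4), (2, "Alien (1979)", 5)],
)

# Query the average rating per title.
rows = connection.execute(
    "SELECT title, AVG(rating) FROM ratings GROUP BY title ORDER BY title"
).fetchall()
connection.close()
print(rows)
```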
Data Integration (CSV): Some CSV data needs to be combined before being processed. cf. [McKinney12]
Data Integration (CSV): Comparable to joining tables in SQL, Pandas can merge different DataFrames. cf. [McKinney12]
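Merging might look like the following sketch. The three tables and their key columns (`user_id`, `movie_id`) are hypothetical; in practice each would come from its own CSV file.

```python
import pandas as pd

# Hypothetical tables that would normally be read from separate CSV files.
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Toy Story (1995)", "Alien (1979)"],
})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})

# Like an SQL inner join: first on user_id, then on movie_id.
data = pd.merge(pd.merge(ratings, users, on="user_id"), movies, on="movie_id")
print(data)
```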
Feature Extraction (Java): The ‘right’ features need to be extracted from artifacts for further processing. [AntoniolCCD00]
Feature Extraction (Java): The ‘javalang’ package, a Java parser written in Python, can be installed from Git. [web_jl]
Feature Extraction (Java): The Java abstract syntax tree can be created from a file using ‘javalang’.
Feature Extraction (Java): Intuitively, the most relevant feature in this artifact is the class name, here SomeClassDoingNothing.
Feature Extraction (Java): Camel case is split and strings are made lowercase. SomeClassDoingNothing becomes ‘Some Class Doing Nothing’ and then ‘some class doing nothing’.
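The splitting and lowercasing step can be sketched with a regular expression. The function names are hypothetical, and the handling of acronyms such as `HTTPServer` is an extra assumption that the slides do not cover.

```python
import re

# Matches, in order: an acronym followed by a capitalized word,
# a (capitalized) lowercase word, or a trailing run of capitals.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z0-9]+|[A-Z]+")


def split_camel_case(identifier):
    """Split a camel-case identifier into its words."""
    return _WORD.findall(identifier)


def extract_terms(identifier):
    """Split a camel-case identifier and lowercase every word."""
    return [word.lower() for word in split_camel_case(identifier)]


print(split_camel_case("SomeClassDoingNothing"))
print(extract_terms("SomeClassDoingNothing"))
```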
Back to the ‘Big Picture’ [Aggarwal15]
Analytical Processing
Classification: Support vector machines are provided by the ‘scikit-learn’ package as a supervised machine learning technique for classification. cf. [scikit_cls] [Aggarwal15]
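A minimal sketch of training and applying a support vector machine with scikit-learn's `SVC` class. The training points, the labels, and the choice of a linear kernel are assumptions made for this example.

```python
from sklearn.svm import SVC

# Hypothetical, linearly separable training data with two classes.
X_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8]]
y_train = [0, 0, 1, 1]

# Supervised learning: fit on labeled examples, then predict new points.
classifier = SVC(kernel="linear")
classifier.fit(X_train, y_train)

predictions = classifier.predict([[0.1, 0.0], [1.1, 0.9]])
print(predictions)
```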
Classification: Support vector machines in Python Spark. [spark]
Clustering: The ‘scipy’ package provides hierarchical clustering, an unsupervised machine learning technique, used here to group two-dimensional data. cf. [web_cluster]
Clustering: Hierarchical clustering outputs a linkage array that can be depicted as a dendrogram. cf. [web_cluster]
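The sketch below clusters six hypothetical two-dimensional points with SciPy. The Ward linkage method and the cut into two clusters are assumptions; `no_plot=True` computes the dendrogram layout without drawing it, so Matplotlib is not needed here.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical two-dimensional data forming two obvious groups.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# The linkage array has one row per merge: (n - 1) rows, 4 columns.
linkage_array = linkage(points, method="ward")

# Cut the hierarchy so that at most two flat clusters remain.
labels = fcluster(linkage_array, t=2, criterion="maxclust")

# Dendrogram layout as a dictionary, without plotting.
tree = dendrogram(linkage_array, no_plot=True)
print(labels)
```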
Clustering: K-means clustering in Python Spark. [spark]
Back to the ‘Big Picture’ [Aggarwal15]
Output
Plot Types (Boxplot): Gives a summary of the distribution of numeric variables. Packages: Matplotlib, Seaborn. cf. [seaborn]
Plot Types (Line chart): Depicts the evolution of one or many columns. Package: Matplotlib.
Plot Types (Bar chart): Depicts the ranking present in one column. Package: Matplotlib.
Plot Types (Scatter plot): Depicts the correlation of two columns. Packages: Matplotlib, Seaborn.
Plot Types (Pie plot): Depicts the part-whole relation. Package: Matplotlib. cf. [py_pie]
Scaling and Axis: The table shows metrics on, e.g., the code contributed by developers (column ‘DCon_PE_d’). While a few developers have very high contribution values, most developers’ contributions to a single project are very low.
Scaling and Axis: Axes can have different scales to depict the data correctly.
Scaling and Axis: Setting the axis to a log scale does not work due to the 0 entries.
Scaling and Axis: However, symlog works because it scales linearly below a given threshold.
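As a hedged illustration with Matplotlib, the sketch below plots skewed values that include zeros on a symlog axis. The numbers are invented, the non-interactive `Agg` backend is chosen only so the example runs without a display, and the default linear threshold is used.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Hypothetical contribution values: many small ones, a few very large.
contributions = [0, 0, 1, 3, 10, 250, 12000]

fig, ax = plt.subplots()
ax.plot(sorted(contributions), marker="o")

# A plain log scale cannot place the 0 entries; symlog is linear
# around zero and logarithmic beyond a threshold.
ax.set_yscale("symlog")
ax.set_ylabel("DCon_PE_d")

y_scale = ax.get_yscale()
plt.close(fig)
print(y_scale)
```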
Subplots: Subplots can be used to group multiple plots that optionally share axes.
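A minimal sketch of two subplots side by side that share the y axis. The plotted values are hypothetical, and the `Agg` backend is again used only to avoid needing a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# One row, two columns; sharey=True links the y axes of both plots.
fig, (left, right) = plt.subplots(1, 2, sharey=True)

left.bar(["F", "M"], [4, 2])
left.set_title("Ratings per gender")

right.plot([1, 2, 3], [1, 3, 2])
right.set_title("Ratings over time")

number_of_axes = len(fig.axes)
same_limits = left.get_ylim() == right.get_ylim()
plt.close(fig)
print(number_of_axes, same_limits)
```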