APACHE BIG DATA CONFERENCE How to transform data into money using Big Data technologies
INTRO THE FIRST SPARK-BASED BIG DATA PLATFORM RELEASED After almost a decade developing Big Data projects at Paradigma, we became early adopters of Spark through its R+D department, which led to the creation of Stratio.
MY PROFILE SKILLS JORGE LOPEZ-MALLA After working with traditional processing methods, I started doing R&D Big Data projects and fell in love with the Big Data world. Currently I'm doing some awesome Big Data projects at Stratio.
MY PROFILE SKILLS ALBERTO RODRÍGUEZ DE LEMA Since graduating I have been programming for more than 10 years. I've built high-performance and scalable web applications for companies such as Indra Systems, Prudential and Springer Verlag Ltd. @ardlema
STRATIO GO TO SPACE SPARK-BASED: the first Spark-based Big Data platform released. ENTERPRISE SPARK PLATFORM: on-premise & cloud, our platform is geared towards helping companies. PURE SPARK: the only pure Spark platform. OPEN-SOURCE SOLUTIONS: our enterprise solutions are the only global solution based on open source technologies.
OUR CLIENT MIDDLE EAST TELCO COMPANY: 9,500 mil. daily events processed; 9.2 mil. clients
USE CASES
USE CASES 1 MANAGEMENT & NORMALIZATION OF DATA SOURCES
USE CASES 2 NETWORK COVERAGE IMPROVEMENT
USE CASES 3 PEOPLE GATHERING
USE CASES 4 DATA MONETIZATION
TECHNICAL CHALLENGES
TECHNICAL PROBLEMS 1 Huge volume of data 2 Huge size of data 3 Distributed processing 4 Hard to read 5 Recognized patterns
1 HUGE VOLUME OF DATA SOLUTION APACHE HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
1 HUGE VOLUME OF DATA 9,500 mil. daily CSV records -> circa 16 GB. Requirements: high availability, concurrent file reads
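To make the HDFS requirement concrete, here is a minimal Scala sketch of reading one day of CSV events from HDFS with Spark. The HDFS path and date layout are illustrative assumptions, not taken from the deck; HDFS itself supplies the high availability (block replication) and the concurrent reads.

```scala
import org.apache.spark.sql.SparkSession

object DailyEventsReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-events-reader")
      .getOrCreate()

    // Illustrative path: one directory of CSV files per day. HDFS keeps
    // 3 replicas of each block by default, which provides the high
    // availability and concurrent file reads this use case requires.
    val events = spark.sparkContext
      .textFile("hdfs:///telco/events/2016-05-09/*.csv")

    println(s"Daily events: ${events.count()}")
    spark.stop()
  }
}
```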
2 HUGE SIZE OF DATA SOLUTION APACHE PARQUET
2 HUGE SIZE OF DATA 16.5 GB of daily event information stored as CSV text in HDFS; 4.3 GB of daily event information stored as Parquet files in HDFS. STORAGE IMPROVEMENT circa 70%
2 HUGE SIZE OF DATA Time to count daily CSV events -> 6.2 minutes. Time to count daily Parquet events -> 1 minute. READ PROCESS IMPROVEMENT circa 80%
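A hedged sketch of the CSV-to-Parquet conversion behind these numbers, written against the Spark 2.x DataFrame API (the deck may predate parts of this API; paths and schema inference are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .getOrCreate()

    // Illustrative paths. Parquet's columnar, compressed layout is what
    // drives the circa 70% storage saving reported above.
    val csvEvents = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///telco/events/2016-05-09/*.csv")

    csvEvents.write.parquet("hdfs:///telco/events-parquet/2016-05-09")

    // Counting Parquet touches column metadata rather than every raw
    // text byte, hence the circa 80% faster count.
    val parquetEvents = spark.read
      .parquet("hdfs:///telco/events-parquet/2016-05-09")
    println(s"Daily events: ${parquetEvents.count()}")
    spark.stop()
  }
}
```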
3 DISTRIBUTED PROCESSING SOLUTION APACHE SPARK
3 DISTRIBUTED PROCESSING - REQUIREMENTS Complex algorithms with the minimum amount of resources. Reduction of processing time so results arrive while the data is still useful.
3 DISTRIBUTED PROCESSING - REQUIREMENTS Sharing the cluster with legacy processes. Using the outputs of legacy processes without any change.
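One way these two requirements can be met, sketched under assumptions: capping executor resources so legacy jobs keep their share of the cluster, and writing plain CSV so downstream legacy consumers need no change. The cellId column and all paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object LegacyFriendlyJob {
  def main(args: Array[String]): Unit = {
    // Illustrative resource caps: limiting executors lets the job share
    // the cluster with legacy processes instead of starving them.
    val spark = SparkSession.builder()
      .appName("legacy-friendly-job")
      .config("spark.executor.instances", "4")
      .config("spark.executor.memory", "2g")
      .getOrCreate()

    val events = spark.read.parquet("hdfs:///telco/events-parquet/2016-05-09")

    // Hypothetical aggregation; writing plain CSV keeps the output in
    // the format the legacy pipeline already consumes, unchanged.
    events.groupBy("cellId").count()
      .write.option("header", "true")
      .csv("hdfs:///telco/reports/2016-05-09")

    spark.stop()
  }
}
```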
4 HARD TO READ SOLUTION SCALA + APACHE SPARK
4 HARD TO READ Reducing development time: LOCs dramatically reduced; number of classes dramatically reduced
4 HARD TO READ Tests and application readability improvements: DSLs make our lives easier; Spark makes MapReduce jobs even simpler
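To illustrate the LOC claim: a per-key count that takes a Mapper, a Reducer and a driver class in classic Hadoop MapReduce fits in a few lines of Scala with Spark. The CSV layout (client id in the first column) is an assumption.

```scala
import org.apache.spark.sql.SparkSession

object EventsPerClient {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("events-per-client")
      .getOrCreate()

    // Assumed layout: client id in the first CSV column.
    val eventsPerClient = spark.sparkContext
      .textFile("hdfs:///telco/events/2016-05-09/*.csv")
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)

    eventsPerClient.take(10).foreach { case (client, n) =>
      println(s"$client -> $n")
    }
    spark.stop()
  }
}
```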
5 RECOGNIZED PATTERNS SOLUTION APACHE SPARK MLLIB
5 RECOGNIZED PATTERNS Millions of records processed in order to obtain mathematical models. Complex mathematical algorithms applied to obtain accurate weekly behavior patterns.
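The deck does not say which MLlib algorithm was used; as one plausible sketch, k-means clustering over per-client weekly activity vectors groups clients with similar weekly behavior. The feature layout, k=8 and 20 iterations are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object WeeklyBehaviorModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("weekly-behavior-model")
      .getOrCreate()

    // Assumed layout: one line per client, seven comma-separated
    // activity counters (one per day of the week).
    val weeklyFeatures = spark.sparkContext
      .textFile("hdfs:///telco/features/weekly/*.csv")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
      .cache()

    // RDD-based MLlib k-means; k and iteration count are illustrative.
    val model = KMeans.train(weeklyFeatures, 8, 20)
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"cluster $i center: $center")
    }
    spark.stop()
  }
}
```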
THANK YOU UNITED STATES EUROPE Tel: (+1) 408 5998830 Tel: (+34) 91 828 64 73 contact@stratio.com www.stratio.com