10 Best Practices for Solution Architectures That Would Tame Big Data!!!
Big Data Best Practice-1: Use Case! Use Case! Use Case! (Frame It Tight)
The Idea in Brief
• What are the questions at the heart of the problem?
• Formulate the hypotheses/questions at the heart of the issue, and distill them into a clear set of hypotheses to be tested.
• Remember that Hadoop and its associated technology components are only a means.
• Isolate the $-denting analytical use case.
Real-Life Example: Curating the Use Case in Telecom Security Intelligence
Business context: What new signals should we listen to in order to prevent adverse events from happening?
4 data pools:
• Netsweeper logs
• RADIUS logs
• Switch CDR
• MMS logs
2 use cases:
• Watch list analysis + network link analysis
• MMS video virality
• Hold an intensive half-day cross-functional workshop with the business to boil down the game-changing use case.
• Is it a “nice to have” use case or a “$-impacting” use case?
• Who is the consumer of the use case? How does it help them optimize cost, reduce risk, or increase revenue?
• Work business backwards, NOT technology forward.
Big Data Best Practice-2: Impact “Aha” Moment in 60-90 Days. Start with a Skeletal Working Solution (MVP)
The Idea in Brief: Deliver the First Big Data “Aha” Moment in 60-90 Days
• Skeletal MVP: an end-to-end implementation that links all architectural components together
• Could be the answer to a previously unanswered question
• Propels the momentum of the Big Data project
A Real-Life Example
• Industry = OTA
• Context: important to improve look-to-book
• Question: Is there a correlation between the response time of a web page and the look-to-book ratio?
• Hadoop cluster + Infobright + Hive jobs ready in 3 weeks
• Scaled data and improved the dashboard experience for another 3 weeks
• Business readout in 6 weeks
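The correlation question above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis from the engagement: the sample figures are made up, and look-to-book is assumed here to mean looks (searches) per booking.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Illustrative (made-up) hourly aggregates:
# (average page response time in seconds, looks, bookings)
samples = [
    (0.5, 1000, 50),
    (1.0, 1000, 40),
    (2.0, 1000, 25),
    (3.0, 1000, 20),
]

response_times = [s[0] for s in samples]
look_to_book = [s[1] / s[2] for s in samples]  # assumed definition: looks per booking

r = pearson(response_times, look_to_book)
print(f"correlation between response time and look-to-book: {r:.3f}")
```

In a real cluster the aggregates would come out of the Hive jobs; the point of the sketch is only that the first "aha" question can be a single, testable number.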
Therefore
Break it into 3 chunks: 30-day, 60-day, and 90-day milestones.
In 30 days, plan to cover functional breadth:
• Hadoop infrastructure + cluster
• Integrate disparate components: data pipeline, columnar database, machine learning process, Hadoop cluster
• Have a small file go from start to end through the process chain
In 60 days, plan to cover scalability:
• Scale for at least 12 months of data
• Tableau / Pentaho
In 90 days, plan to cover bells and whistles:
• Configurators
• Alerters
• Additional abstraction
Don’t wait for 6-9 months!
Big Data Best Practice-3: Actions, Not Insights
[Diagram labels: Action, Insights, Data]
Best Practice-3: Actions, Not Insights
Actions are executed in the frontline:
• Call centre
• Mobile
• Store channel
• Digital channel
Actions could be:
• Behaviour-based discounts
• Helping to close a digital transaction
• Serving a customized webpage
• Taking proactive actions
Insights are nice to know. Actions impact $.
Therefore
• What actions are driven as a result of these insights?
• How are we disseminating insights to frontline channels?
• Ask “so what?” 5 times!!!
Big Data Best Practice-4: Listen to Unstructured Intelligence for Strong Signals
Real-Life Example
Keyword frequency: “Leaks”, “Leakage”, “Noise”, “Sound”, “Vibrations”
Noise/leakage keyword frequency is a better predictor of repeat sales than any other indicator, including marketing spend!!!
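A keyword frequency count of this kind can be sketched as follows. The keyword list comes from the slide; the sample comments are invented for illustration, and the matching is exact-token only (no stemming), which is a simplifying assumption.

```python
import re
from collections import Counter

# Keywords taken from the slide, lower-cased for matching.
KEYWORDS = ("leaks", "leakage", "noise", "sound", "vibrations")

def keyword_frequency(texts, keywords=KEYWORDS):
    """Count how often each keyword appears as a whole word across all texts."""
    wanted = {k.lower() for k in keywords}
    counts = Counter({k: 0 for k in wanted})  # keep zero counts visible
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in wanted:
                counts[token] += 1
    return counts

# Invented sample comments, standing in for service notes or reviews.
comments = [
    "The pipe leaks and the noise is awful",
    "Noise again, plus vibrations at high speed",
    "No issues so far",
]

freq = keyword_frequency(comments)
for word in KEYWORDS:
    print(word, freq[word])
```

Tracking these counts per period is what allows them to be compared against repeat sales as a candidate predictor.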
A Real-Life Example: XYZ Online Buzz Analysis
Business question: How can we create a strategy to respond to what we are hearing about XYZ's buzz online?
Raw data:
• www.yelp.com
• www.twitter.com
Statistical technique:
• Text mining
• Visual data exploration
• Hypothesis testing
• Affinity analysis
Insights derived:
• Sentiment trends: +/-
• Sentiment benchmark with McDonalds
• Top keywords for XYZ
• Top keywords for McDonalds
• Keyword affinities
Business action:
• Theme-specific campaigns
• NPD process
• In-store experience
• Reverse impact of negative buzz
Where Do Customers Express Themselves?
[Figure: post counts per source. Counts shown: 2854, 136, 552, 1500, 500 posts. Sources shown: Yelp.com, Epinions.com, planetfeedback.com, Twitter.com, Facebook.com]
Universe of XYZ sentiment data = 5 sources, 5556 posts, 3 years of data
Phase-1 analysis = www.yelp.com, 136 posts, 2 years of data
Source = Twitter.com
Source = Yelp.com
Source = Facebook.com
Step-by-Step Sentiment Text Mining Process
Flow: Input → Process → Output
Input:
• Blogs
• Customer review sites
• Customers' forums
• Customer/vendor emails
• Unstructured data from applications
Output:
• Inferences
• Online consumer sentiments
Overall Sentiments Dashboard
Therefore
• R text mining algorithm
• RHadoop
Big Data Best Practice-5: Columnar & In-Memory Architectures to Speed Up Chain of Thought
Which devices are infected by a malicious attack?
How to Handle “Needle in a Haystack” Workloads?
• What happened on firewall-3 between 3:17 and 3:21 am?
• How many payment gateway drops happened between 9:47 am and 9:52 am on 15-Nov-2012?
Data forensic queries support the chain of thought.
Columnar DB: Concept in Brief

Id   Name      Designation          Tenure
S1   Prem      Founder              8
S2   Simon     Security Architect   5
S3   Bhavana   Sales Head           6
S4   Ram       CEO                  3
S5   Shyam     Developer            1

Row-oriented storage: S1PremFounder8 S2SimonSecurityArchitect5 S3BhavanaSalesHead6 S4RamCEO3 S5ShyamDeveloper1
Column-oriented storage: S1S2S3S4S5 PremSimonBhavanaRamShyam FounderSecurityArchitectSalesHeadCEODeveloper 85631
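The difference between the two layouts can be sketched in Python using the table from the slide. This is only a conceptual illustration; real columnar engines add compression, encoding, and indexing on top of the basic idea.

```python
# The table from the slide, as a list of rows (row-oriented view).
header = ("Id", "Name", "Designation", "Tenure")
rows = [
    ("S1", "Prem", "Founder", 8),
    ("S2", "Simon", "Security Architect", 5),
    ("S3", "Bhavana", "Sales Head", 6),
    ("S4", "Ram", "CEO", 3),
    ("S5", "Shyam", "Developer", 1),
]

# Column-oriented view: one contiguous list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}

def flatten(values):
    """Concatenate values the way the slide does, dropping spaces."""
    return "".join(str(v).replace(" ", "") for v in values)

# Row layout stores each record's fields next to each other.
row_layout = "".join(flatten(row) for row in rows)
# Column layout stores all values of one column next to each other.
column_layout = "".join(flatten(columns[name]) for name in header)

# An aggregate over one column touches only that column's values.
average_tenure = sum(columns["Tenure"]) / len(columns["Tenure"])

print(row_layout)
print(column_layout)
print(average_tenure)
```

A query such as the average tenure reads a single list in the columnar view, whereas the row view has to step over every field of every record to reach the same five numbers.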
In-Memory Databases!
• Interactive or real-time query for large datasets is key to analyst productivity (it supports chain-of-thought analysis).
• Chain-of-thought analysis = exploring the data torrent by quickly running off a series of iterative queries, each informed by the last.
• Most solutions aren’t fast enough, and analytical effectiveness drops when the user’s chain of thought is interrupted.
• In-memory DB tools: Dremel at Google, Druid at Metamarkets, Sting at Netflix, Cloudera’s Impala, UC Berkeley AMPLab’s Spark, SAP HANA, Platfora.
Therefore
• Examine columnar databases and in-memory databases to speed up important query workloads.
• Download evaluation versions of Actian and Infobright and do a POC.
Big Data Best Practice-6: Think 100x Scalability!!!
How to plan for 100x scalability?
Real-Life Example
• Industry = Telecom
• Business context: national content filtering solution
• Events generated per day: 1 billion
• New URLs classified per day: 1 million
• Daily log volume: 400 GB average
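A back-of-the-envelope calculation shows what these figures imply today and at 100x. It assumes the daily log volume is 400 decimal gigabytes and that load is spread evenly over the day, so real peaks would be higher than the averages computed here.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide.
events_per_day = 1_000_000_000
log_bytes_per_day = 400 * 10**9  # assumption: 400 decimal gigabytes

# Current averages.
events_per_second = events_per_day / SECONDS_PER_DAY
bytes_per_event = log_bytes_per_day / events_per_day

# The same workload at 100x.
SCALE = 100
scaled_events_per_second = events_per_second * SCALE
scaled_log_tb_per_day = log_bytes_per_day * SCALE / 10**12

print(f"average events/second today: {events_per_second:,.0f}")
print(f"average bytes/event:         {bytes_per_event:,.0f}")
print(f"average events/second @100x: {scaled_events_per_second:,.0f}")
print(f"log volume @100x:            {scaled_log_tb_per_day:,.0f} TB/day")
```

Running the numbers this way before choosing components makes it clear whether ingestion, storage, and query layers have two orders of magnitude of headroom.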
Big Data Best Practice-7: Detect Data Patterns in Real Time!!!
[Diagram: the organisation, the data torrent, real-time sense making. Event streams shown: price-sensitive search, ratings-based ordering, store search, basket add events, comparator events, payment gateway events]
The Context
• Velocity is high
• Decision-making window is small
• Cost of not intervening is high
Real-Time Example
Decision window = 8 mins
If a high-value customer (decile = 1 on last 36 months' revenue) and intra-book interval > threshold and recency of search < 70, then route to the call center channel.
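The routing rule above can be written as a small predicate. The field names, the `Session` type, and the "default" channel label are hypothetical; the slide does not give units for the intra-book interval or the search recency, so they are left as plain numbers here.

```python
from dataclasses import dataclass

@dataclass
class Session:
    """Hypothetical per-customer state available inside the decision window."""
    revenue_decile: int         # 1 = top decile on last 36 months' revenue
    intra_book_interval: float  # time since the previous booking (units not given in the source)
    search_recency: float       # recency of the last search (units not given in the source)

SEARCH_RECENCY_LIMIT = 70  # from the slide

def route(session: Session, interval_threshold: float) -> str:
    """Return the channel for this session according to the slide's rule."""
    if (
        session.revenue_decile == 1
        and session.intra_book_interval > interval_threshold
        and session.search_recency < SEARCH_RECENCY_LIMIT
    ):
        return "call_center"
    return "default"

# Usage with made-up values and a made-up threshold of 30.
print(route(Session(revenue_decile=1, intra_book_interval=45, search_recency=12), 30))
print(route(Session(revenue_decile=4, intra_book_interval=45, search_recency=12), 30))
```

In a streaming engine the same predicate would be evaluated per event, which is why it has to stay cheap enough to fit inside the 8-minute decision window.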
Therefore
Include S4 and other real-time analytics in your Big Data reference architecture.
Big Data Best Practice-8: Captology = Persuasion Through Technology
The Basics
Captology = persuasion through technology. Design for behavioural change.
Persuasion examples:
• Persuade users to change channel behaviour (move from the desktop to the mobile channel)
• Persuade users to advocate to friends
Captology in Action
• Captology in insurance: reduce rates each time a person reports his or her exercise behaviour to a group of peers online
• Captology in social
There are too many good products hidden behind bad user interfaces.
Product = Interface
For the business user, what lies under the hood does not matter.
Big Data Best Practice-9: Stretch Key Big Data Components to See What Breaks!
Best Practice-9: Intersections of Moving Parts Are the Weak Links
Big Data moving parts:
• Hadoop clusters
• Columnar databases
• Advanced visualisation layer
• Real-time components
• Data pipelines
• APIs / scrapers to syndicate info
• Bridge to existing DW
The intersections can give way as data / user volumes increase.
A real-life Big Data architecture:
• Event loggers
• HBase/Cassandra for high-velocity event absorption
• Sqoop/Flume for data ingestion
• Hadoop cluster for massive data crunching
• R for extracting patterns
• Columnar database for 10x lightning retrieval
• Tableau for advanced visualisation
• S4 for real-time analytics
• Channel integration components
[Diagram labels: Infobright, R, Columnar DB, Predictor ranking]
Therefore… Watch the Following 4 Weak Links
1. Link between operational event streams and the Hadoop cluster
2. Link between the Hadoop cluster and the columnar database
3. Link between the columnar database and the visualisation tool
4. Time it takes for the machine learning algorithm to run
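One way to watch these four links is to time each hand-off separately while growing the data volume. The sketch below is a generic timing harness, not a real integration: the four lambda steps are hypothetical stand-ins for the actual transfers and the model run.

```python
import time

def profile_links(links, payload):
    """Run each link in order, feeding its output to the next.

    Returns the final result and a dict of elapsed seconds per link.
    """
    timings = {}
    for name, step in links:
        start = time.perf_counter()
        payload = step(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings

# Hypothetical stand-ins for the four weak links named on the slide.
links = [
    ("event streams -> Hadoop cluster", lambda events: list(events)),
    ("Hadoop cluster -> columnar database", lambda rows: sorted(rows)),
    ("columnar database -> visualisation tool", lambda rows: rows[:10]),
    ("machine learning run", lambda rows: sum(rows) / len(rows)),
]

# Stretch the volume and see which link slows down first.
for volume in (1_000, 100_000):
    result, timings = profile_links(links, range(volume, 0, -1))
    slowest = max(timings, key=timings.get)
    print(f"{volume} events: slowest link = {slowest} ({timings[slowest]:.6f}s)")
```

Replacing each stand-in with the real transfer or job and repeating the run at increasing volumes gives an early view of which intersection gives way first.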