Building the Brickhouse
Jerome Banks
Confidential

Overview of Brickhouse
● Custom UDFs for Hadoop Hive
● Provides “missing pieces”
● Tools to build scalable, robust data pipelines
● Increases data engineers' productivity
● Supports MapReduce (MR) design patterns and best practices

History
Spring 2012 - Maxwell Project - New Klout Score
● Needed to generate a large number of features
● Lots of exploratory data science needed
● Needed to move to production fairly quickly
● Legacy score was traditional Hadoop Mappers/Reducers
  ○ Hard to develop
  ○ Hard to re-use code

History
Spring 2012 - Maxwell Project - New Klout Score
Solution: Implement in Hive!
● Proven technology
● Able to prototype quickly
● Semantics fairly well-understood
But:
● Functionality was missing in Hive
● Naive queries were inefficient
  ○ Generated too many MR steps
  ○ Made multiple passes over the same data
  ○ Attempted to “sort the world”

History
Spring 2012 - Maxwell Project - New Klout Score
Warehouse developed internally at Klout
● Maxwell Score
● Klout for Business
● Topic Thunder
Early 2013 - Open-sourced as “Brickhouse”
● Spread to other Hadoop Hive shops
● Expanded functionality and code quality
● 2014 - Sponsorship by Tagged

Functionality across broad areas
● collect
● json
● sketch_set
● distributed_cache
● hbase
● timeseries
● bloom
● hll

Array/Map operations
● collect
● collect_max
● cast_array
● map_key_values
● map_filter_keys
● join_array
● map_union
● truncate_array

collect
Opposite of a UDTF
Avoids the “self-join” anti-pattern

Self-join:

    select a.id,
      a.value as a_val,
      b.value as b_val
    from (
      select * from mytable
      where type='A') a
    join (
      select * from mytable
      where type='B') b
    on ( a.id = b.id );

With collect:

    select col_map['A'] as a_val,
      col_map['B'] as b_val
    from (
      select id,
        collect( type, value )
          as col_map
      from mytable
      group by id ) cm;
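
To make the semantics concrete, here is a minimal Python sketch of what a collect-style aggregation does conceptually: a single pass over the rows, grouping by id and building a map from type to value. It is an illustration only, not Brickhouse's implementation; the sample rows and the name collect_by_id are hypothetical.

```python
from collections import defaultdict

def collect_by_id(rows):
    """Group (id, type, value) rows into {id: {type: value}} in one pass."""
    maps = defaultdict(dict)
    for row_id, row_type, value in rows:
        # In this sketch a later duplicate (id, type) overwrites an earlier one.
        maps[row_id][row_type] = value
    return dict(maps)

# Hypothetical sample data standing in for mytable.
rows = [(1, "A", 10), (1, "B", 20), (2, "A", 30), (2, "B", 40)]
col_maps = collect_by_id(rows)

# Equivalent of selecting col_map['A'] and col_map['B'] per id.
result = [(m["A"], m["B"]) for m in col_maps.values()]
print(result)
```

The point of the pattern is that the table is scanned once and grouped once, instead of being read twice and joined to itself.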

collect_max
Similar to collect, but returns a map with the top 20 values

Global sort:

    select ks_uid,
      combined_score
    from maxwell_score
    order by combined_score desc
    limit 20;

With collect_max:

    select collect_max(
      ks_uid,
      combined_score,
      20 )
    from maxwell_score;
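
One common way to keep the top N values without sorting everything is a bounded min-heap. The Python sketch below illustrates that idea under the assumption that each key appears once; it is not taken from the Brickhouse source, and the sample scores are hypothetical.

```python
import heapq

def collect_max(pairs, n=20):
    """Return a dict of the n keys with the largest values, scanning pairs once."""
    heap = []  # min-heap of (value, key); the smallest retained value is heap[0]
    for key, value in pairs:
        if len(heap) < n:
            heapq.heappush(heap, (value, key))
        elif value > heap[0][0]:
            # New value beats the smallest retained one: swap it in.
            heapq.heapreplace(heap, (value, key))
    # Emit in descending order of value.
    return {key: value for value, key in sorted(heap, reverse=True)}

# Hypothetical sample data standing in for maxwell_score.
scores = [("u1", 55.0), ("u2", 91.5), ("u3", 12.0), ("u4", 78.2)]
top2 = collect_max(scores, n=2)
print(top2)
```

Memory stays proportional to n rather than to the number of rows, which is why this avoids the “sort the world” problem.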

to_json, from_json
Serialize to and from JSON
Avoid ugly, error-prone string concats
Guaranteed valid JSON output

String concatenation:

    select
      concat("{\"kscore\":",
        kscore, ",\"moving_avg\":",
        avg, ",\"start_date\":", start,
        ",\"end_date\":", end, "}")
    from mytable;

With to_json:

    select to_json(
      named_struct("kscore", kscore,
        "moving_avg", avg,
        "start_date", start,
        "end_date", end) )
    from mytable;
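
The failure mode of hand-built JSON is easy to reproduce outside Hive. In this Python sketch (sample values are hypothetical), the concatenated string forgets to quote the date fields, so it does not parse, while a serializer produces valid output from the same record.

```python
import json

# Hypothetical record standing in for one row of mytable.
record = {"kscore": 61.3, "moving_avg": 60.8,
          "start_date": "2014-03-01", "end_date": "2014-03-23"}

# Hand-built string: the date values are emitted without quotes.
concatenated = ('{"kscore":' + str(record["kscore"]) +
                ',"moving_avg":' + str(record["moving_avg"]) +
                ',"start_date":' + record["start_date"] +
                ',"end_date":' + record["end_date"] + '}')

def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

serialized = json.dumps(record)
print(is_valid_json(concatenated))
print(is_valid_json(serialized))
```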

to_json, from_json
Serialize to and from JSON
Parse arbitrarily complex schemas

    create view parse_json as
    select ks_uid,
      from_json( json,
        named_struct(
          "kscore", 0.0,
          "moving_avg", array(0.0),
          "start_date", "",
          "end_date", "") )
    from json_table;

sketch_set
KMV (K-minimum values) sketch implementation
Estimates the number of unique values in large sets using a fixed amount of space

Exact count:

    select
      count(distinct ks_uid)
    from
      actor_action
    where
      some_condition() = true;

With sketch_set:

    select estimated_reach(
      sketch_set(ks_uid)) as reach
    from
      actor_action
    where
      some_condition() = true;
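
The KMV idea can be sketched in a few lines of Python: hash every item to a pseudo-uniform number in [0, 1), retain only the k smallest distinct hashes, and estimate the cardinality as (k - 1) divided by the k-th smallest hash. This is a textbook version written for illustration; the MD5 hash, the value of k, and the function names are assumptions, and Brickhouse's own hash function, default sketch size, and estimator details may differ.

```python
import hashlib
import heapq

def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1) (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_sketch(items, k=1024):
    """Stream over items, retaining at most the k smallest distinct hash values."""
    heap = []     # max-heap via negation; heap[0] is the largest retained hash
    kept = set()  # the same hashes, used to skip duplicates
    for item in items:
        h = hash01(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heapreplace(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    return sorted(kept)

def estimate(sketch, k=1024):
    """Estimate the number of distinct items represented by a KMV sketch."""
    if len(sketch) < k:
        return len(sketch)  # fewer than k distinct hashes: the count is exact
    return (k - 1) / sketch[k - 1]

# Hypothetical input: 50,000 distinct ids, each seen twice.
ids = [i % 50000 for i in range(100000)]
reach = estimate(kmv_sketch(ids))
print(round(reach))
```

The sketch never holds more than k hashes, however many rows stream through it, and the relative error shrinks roughly as one over the square root of k.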

sketch_set
Easy to do set unions
Can aggregate incremental results

Store a daily sketch:

    insert overwrite table
      daily_sketch
      partition(dt='20140323')
    select
      sketch_set(ks_uid) ss
    from
      actor_action;

Union the last 30 days:

    select estimated_reach(
      union_sketch(ss))
    from
      daily_sketch
    where
      dt>=days_add(today(),-30 );
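
Why incremental sketches can be combined: the k smallest hashes of a union are always contained in the k smallest hashes of each part, so merging per-day sketches gives exactly the sketch of all days together. The Python sketch below demonstrates this with a simplified, non-streaming KMV builder; the function names are Python stand-ins chosen to mirror the UDF names, and the hash, k, and sample data are assumptions rather than Brickhouse internals.

```python
import hashlib

def hash01(item):
    """Map an item to a pseudo-uniform float in [0, 1) (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def sketch_set(items, k):
    """Build a KMV sketch: the k smallest distinct hashes (non-streaming for brevity)."""
    return sorted({hash01(x) for x in items})[:k]

def union_sketch(sketches, k):
    """Merge sketches by pooling their hashes and keeping the k smallest."""
    merged = set()
    for s in sketches:
        merged.update(s)
    return sorted(merged)[:k]

def estimated_reach(sketch, k):
    """Estimate distinct count from a KMV sketch."""
    if len(sketch) < k:
        return len(sketch)
    return (k - 1) / sketch[k - 1]

K = 512
# Hypothetical overlapping daily user-id ranges.
days = [range(0, 3000), range(2000, 5000), range(4000, 6000)]
daily = [sketch_set(day, K) for day in days]

combined = union_sketch(daily, K)
direct = sketch_set(range(0, 6000), K)
print(combined == direct)
```

Because only the small daily sketches need to be stored, a 30-day reach can be computed without rescanning 30 days of raw actions.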

distributed_map
Uses the distributed cache to access values in memory
Avoids a join/re-sort of large datasets

Join:

    select bt.ks_uid,
      bt.my_value
    from big_table bt
    join
    ( select *
      from celeb
      where is_celeb=true) cb
    on ( bt.ks_uid = cb.ks_uid);

With distributed_map:

    add file 'celeb_map';

    select *
    from big_table bt
    where distributed_map(
      ks_uid, 'celeb_map')
      is not null;
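
The underlying pattern is a map-side lookup: the small dataset is held in memory as a hash map and the large dataset is filtered in a single streaming pass, with no shuffle or sort. A minimal Python sketch of that pattern follows; the in-memory dict stands in for the cached file, and all names and sample rows are hypothetical.

```python
def filter_with_map(big_rows, lookup):
    """Yield rows whose ks_uid has a non-null entry in the in-memory lookup map."""
    for row in big_rows:
        if lookup.get(row["ks_uid"]) is not None:
            yield row

# Hypothetical small lookup table, standing in for the contents of 'celeb_map'.
celeb_map = {"u2": True, "u5": True}

# Hypothetical rows standing in for big_table.
big_table = [
    {"ks_uid": "u1", "my_value": 3},
    {"ks_uid": "u2", "my_value": 7},
    {"ks_uid": "u5", "my_value": 9},
]

matches = list(filter_with_map(big_table, celeb_map))
print(matches)
```

This only pays off when the lookup table is small enough to fit in each task's memory.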

Future Roadmap
● Continued support/maintenance/cleanup
● More streaming UDFs
  ○ Top K
  ○ Representative sample
● More “Big Data Science-ey” UDFs
  ○ Machine Learning
  ○ Bag-O’-Words UDFs
  ○ Text Analysis, NLP
● Ideas? Contributions?

Thank you!
http://github.com/klout/brickhouse
http://brickhouseconfessions.wordpress.com