
PIGFARM - LAS Sponsored Computer Science Senior Design Class Project - PowerPoint PPT Presentation



  1. PIGFARM - LAS Sponsored Computer Science Senior Design Class Project
     Spring 2017
     Carson Cumbee - LAS

  2. What is Big Data?
     Big Data is data that is too large to fit into a single server. It requires an extra layer of software to coordinate among servers to analyze the data. Naturally, the size that counts as "too large" changes over time.

  3. What is Hadoop/MapReduce?
     Hadoop is the de facto open-source Big Data platform:
     - A fault-tolerant distributed file system, based on [1], a 2003 paper from Google about their internal file system.
     - Map/Reduce, a parallel computing paradigm that stresses low memory usage: a Map step is executed on local nodes and the results are sent over the network to Reducers, which complete the task. [2] Another famous Google paper.
     If you want to query data, use a database. If you want to make a database, use Map/Reduce.
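The Map/Reduce flow above can be sketched in a single process. This is a minimal word-count illustration, not Hadoop API code: the class name and toy input are invented for the example, and the "shuffle" grouping that Hadoop does across the network is simulated with an in-memory map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy single-process illustration of the Map/Reduce paradigm:
// a map step emits (key, value) pairs from each input record,
// and a reduce step combines all values that share a key.
public class WordCountSketch {

    // Map: one input line -> a list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reduce: (word, [1, 1, ...]) -> count for that word
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] input = { "big data big pig", "pig data" };

        // Shuffle phase (simulated): group mapped values by key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }

        // Reduce phase
        Map<String, Integer> result = new HashMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(counts)));

        System.out.println(result.get("big"));  // 2
        System.out.println(result.get("pig"));  // 2
    }
}
```

On a real cluster the map runs where the data lives and only the (key, value) pairs cross the network, which is why the paradigm keeps memory use low.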

  4. What is Pig? Instead of all this (Java)…

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
/** … */
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class L2 extends Configured implements Tool {

    /** MAPPER */
    public static class Join extends Mapper<LongWritable, Text, Text, Text> {
        private Set<String> hash;

        @Override
        public void setup(Context context) {
            try {
                Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                if (paths == null || paths.length < 1) {
                    throw new RuntimeException("DistributedCache no work.");
                }
                // Open the small table
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(paths[0].toString())));
                String line;
                hash = new HashSet<String>(500);
                while ((line = reader.readLine()) != null) {
                    if (line.length() < 1)
                        continue;
                    // Ctrl-A field delimiter; the non-printing character was lost in the slide text
                    String[] fields = line.split("\u0001");
                    if (fields[0].length() != 0)
                        hash.add(fields[0]);
                }
            } catch (IOException ioe) {
                throw new RuntimeException(ioe);
            }
        }

        @Override
        public void map(LongWritable k, Text val, Context context)
                throws IOException, InterruptedException {
            List<Text> fields = Library.splitLine(val, '\u0001'); // same lost delimiter
            if (hash.contains(fields.get(0).toString())) {
                context.write(fields.get(0), fields.get(6));
            }
        }
    }

    /** RUN */
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: wordcount <input_dir> <output_dir> <reducers>");
            return -1;
        }
        Job job = new Job(getConf(), "PigMix L2");
        job.setJarByClass(L2.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Join.class);

        Properties props = System.getProperties();
        Configuration conf = job.getConfiguration();
        for (Map.Entry<Object, Object> entry : props.entrySet()) {
            conf.set((String) entry.getKey(), (String) entry.getValue());
        }
        DistributedCache.addCacheFile(new URI(args[0] + "/pigmix_power_users"), conf);
        FileInputFormat.addInputPath(job, new Path(args[0] + "/pigmix_page_views"));
        FileOutputFormat.setOutputPath(job, new Path(args[1] + "/L2out"));
        job.setNumReduceTasks(0);
        return job.waitForCompletion(true) ? 0 : -1;
    }

    /** @param args */
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new L2(), args);
        System.exit(res);
    }
}

  5. This (Pig Latin)*!

rmf /PIGFARM/pigmixout/l2out
register /proj/PIGFARM/PIGMIX/pigperf.jar;
A = LOAD '/PIGFARM/pigmix/pigmix_page_views'
    USING org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    AS (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = FOREACH A GENERATE user, estimated_revenue;
alpha = LOAD '/PIGFARM/pigmix/pigmix_users' USING PigStorage('\u0001')
    AS (name, phone, address, city, state, zip);
beta = FOREACH alpha GENERATE name;
C = JOIN B BY user, beta BY name;
STORE C INTO '/PIGFARM/pigmixout/l2out';

* This is PIGMIX benchmark script L2.pig

  6. PIGFARM
     Multiple Query Optimization (MQO): the idea that several queries against a single database can be made more efficient if they are combined and issued at the same time.
     When large firms have data scientists throughout their business units writing Pig scripts against common data sets in an uncoordinated manner, there is an opportunity to use MQO to improve the analytical bandwidth of these systems.
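The core MQO intuition can be sketched as a shared scan: rather than each query reading the common data set on its own, one pass evaluates every query's predicate and routes matching records to per-query results. This is an invented illustration (the class name, query names, and toy records are not from the project), not PIGFARM's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of the MQO idea: one scan of the shared data set
// serves every registered query at once.
public class SharedScanSketch {

    static Map<String, List<String>> sharedScan(List<String> records,
                                                Map<String, Predicate<String>> queries) {
        Map<String, List<String>> results = new HashMap<>();
        queries.keySet().forEach(q -> results.put(q, new ArrayList<>()));
        for (String record : records) {              // ONE pass over the data...
            for (Map.Entry<String, Predicate<String>> e : queries.entrySet()) {
                if (e.getValue().test(record)) {     // ...evaluating every query
                    results.get(e.getKey()).add(record);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> feed = List.of("math:algebra", "geo:North Carolina", "math:calculus");
        Map<String, Predicate<String>> queries = Map.of(
                "mathQuery", r -> r.contains("math"),
                "ncQuery",   r -> r.contains("North Carolina"));
        Map<String, List<String>> out = sharedScan(feed, queries);
        System.out.println(out.get("mathQuery").size()); // 2
        System.out.println(out.get("ncQuery"));          // [geo:North Carolina]
    }
}
```

With N uncoordinated scripts the feed is read N times; with the shared scan it is read once, which is exactly the saving the slides below illustrate.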

  7. The Real Idea
     [Diagram: a farmer ("I only like yellow data") runs PIGSCRIPT 1 on a CPU against the Big Data "feed"; the unwanted records become NOOPS sent to /dev/null]

  8. The Real Idea
     [Diagram: the same picture for a second farmer ("I only like blue data") running PIGSCRIPT 2]

  9. The Real Idea
     [Diagram: and again for an Nth farmer ("I only like red data") running PIGSCRIPT N]

  10. The Real Idea – Instead of this
      [Diagram: scripts 1, 2, …, N each making its own full pass over the data]

  11. this – fuse the initial map
      [Diagram: the same scripts 1, 2, …, N sharing a single fused initial map pass over the data]

  12. fuse the LOAD statement
      At first we thought this would just mean fusing the LOAD statements together, consistently renaming the variables, and letting Apache Pig work its magic.

--Script determines the number of distinct pred/obj pairs that have math in them
rmf /PIGFARM/Merged/test001.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
filt1 = filter table by (obj matches '.*math.*') or (pred matches '.*math.*');
unduped = DISTINCT filt1;
store unduped into '/PIGFARM/Merged/test001.gz' using PigStorage('\t');

--Script computes the average height of people for each subject
rmf /PIGFARM/Merged/test002.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
filt1 = filter table BY (pred matches '.*"people.person.height_meters".*');
removeQuotes = FOREACH filt1 GENERATE sub, REGEX_EXTRACT(obj, '"(.*)"', 1) as num;
casted = FOREACH removeQuotes GENERATE sub, (double)num;
grouped = GROUP casted BY sub;
avged = FOREACH grouped GENERATE casted.sub, AVG(casted.num);
store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');

--Script determines the number of unique objects with North Carolina
rmf /PIGFARM/Merged/test003.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj, period);
filt = filter table by (obj matches '.*"North Carolina".*');
objs = foreach filt generate obj;
uniq_objs = distinct objs;
grouped_users = group uniq_objs all;
count = foreach grouped_users generate COUNT(uniq_objs);
joined = union count, grouped_users;
store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');

  13. fuse the LOAD statement
      At first we thought this would just mean fusing the LOAD statements together, consistently renaming the variables, and letting Apache Pig work its magic.

-- An LAS PIGFARM Compiled Pig Script
-- Compiled on: 23/02/17-07:09
-- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
--   1: table from /proj/PIGFARM/Script_Farm/ToMerge/test002.pig
--   2: table from /proj/PIGFARM/Script_Farm/ToMerge/test001.pig
--   3: table from /proj/PIGFARM/Script_Farm/ToMerge/test003.pig
boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t')
    AS (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);

-- Below is the remainder of: /proj/PIGFARM/Script_Farm/ToMerge/test002.pig
--Script computes the average height of people for each subject
rmf /PIGFARM/Merged/test002.gz
filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
removeQuotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"', 1) as num;
casted = FOREACH removeQuotes GENERATE laughing_wing, (double)num;
grouped = GROUP casted BY laughing_wing;
avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');

-- Below is the remainder of: /proj/PIGFARM/Script_Farm/ToMerge/test003.pig
--Script determines the number of unique objects with North Carolina
rmf /PIGFARM/Merged/test003.gz
filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
objs = foreach filt generate jovial_golick;
uniq_objs = distinct objs;
grouped_users = group uniq_objs all;
count = foreach grouped_users generate COUNT(uniq_objs);
joined = union count, grouped_users;
store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');
…..

  14. fuse the LOAD statement
      But this didn't work: Pig simply submitted the job as if it were the 3 sequential Pig jobs. (Although it might still work with Tez.)
      We then decided to move the STORE statements to the end. This caused very large temporary files to be created, a performance killer.
      Finally, we decided to identify the initial Map portions of the scripts, STORE them compressed, and then read them back in; essentially explicit temporary files. This seems to work.
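The "explicit temporary file" approach above can be sketched as: run the shared map-side filter once, write the much smaller result compressed, and let each downstream script read it back. This is an invented single-process illustration (the class name, file naming, and toy records are assumptions, and real PIGFARM output would live on HDFS rather than the local filesystem).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: STORE the shared map-side result compressed, then read it back,
// instead of letting Pig materialize huge uncompressed temporaries.
public class ExplicitTempFileSketch {

    static Path storeFiltered(String[] records, String keep) throws IOException {
        Path tmp = Files.createTempFile("pigfarm-intermediate", ".gz");
        try (BufferedWriter w = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(tmp))))) {
            for (String r : records) {
                if (r.contains(keep)) {   // the fused map-side filter
                    w.write(r);
                    w.newLine();
                }
            }
        }
        return tmp;
    }

    static long readBack(Path tmp) throws IOException {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(tmp))))) {
            return r.lines().count();     // downstream scripts re-read the small file
        }
    }

    public static void main(String[] args) throws IOException {
        String[] feed = { "math:algebra", "geo:Texas", "math:calculus" };
        Path tmp = storeFiltered(feed, "math");
        System.out.println(readBack(tmp)); // 2
        Files.deleteIfExists(tmp);
    }
}
```

Compression matters here because the intermediate is written and re-read once per downstream script; keeping it small is what avoided the "performance killer" the slide describes.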
