An Introduction to DryadLINQ
Christophe Poulain, Microsoft Research
Virtual School of Computational Science and Engineering
Big Data for Science Course, July 28, 2010
The Fourth Paradigm: Data-Intensive Science
http://research.microsoft.com/fourthparadigm
Scientific discovery is increasingly driven by the exploration of large amounts of data from many sources. Scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.
Data-intensive computing is increasingly prevalent
• Powered by powerful multi-core workstations, readily available commodity clusters and cloud services platforms
• 112 containers × 2,000 servers = 224,000 servers
• Programming data analyses that scale from the desktop to a large number of compute nodes remains challenging
Dryad and DryadLINQ
Research programming models for writing distributed data-parallel applications that scale from a small cluster to a large data center.
A DryadLINQ programmer can use thousands of machines, each with multiple processors or cores, without prior knowledge of parallel programming.
Availability
Dryad/DryadLINQ on Windows HPC 2008 (SP1) is available as a free download from: http://research.microsoft.com/collaboration/tools/dryad.aspx
– DryadLINQ (in source) and Dryad (in binary)
– With tutorials, programming guides, sample code, libraries, and a community site: http://connect.microsoft.com/dryad
– Windows HPC Server licenses are freely available through your department's subscription to MSDN Academic Alliance
Outline
• DryadLINQ programming model
• Dryad and DryadLINQ overview
• Applications
DryadLINQ Experience
Use a cluster as if it were a single computer
• Sequential, single-machine programming abstraction
• Same program runs on single-core, multi-core, or cluster
• Familiar programming languages
  • C#, VB, F#, IronPython…
• Familiar development environment
  • .NET, Visual Studio or other IDE
LINQ
• Microsoft's Language INtegrated Query
  – Released with .NET Framework 3.5, Visual Studio optional
• A set of operators to manipulate datasets in .NET
  – Supports traditional relational operators
    • Select, Join, GroupBy, Aggregate, etc.
  – Integrated into .NET programming languages
    • Programs can call operators
    • Operators can invoke arbitrary .NET functions
• Data model
  – Data elements are strongly typed .NET objects
  – Much more expressive than SQL tables
• Extremely extensible
  – Add new custom operators
  – Add new execution providers
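The extensibility point "add new custom operators" can be illustrated with a small sketch. It assumes plain LINQ to Objects (not DryadLINQ), and the operator name `EveryNth` is hypothetical, chosen only for illustration. It shows a built-in operator invoking an arbitrary .NET function and a user-defined operator composing with it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CustomOperators
{
    // Hypothetical custom operator (not part of LINQ or DryadLINQ):
    // keeps the elements at positions 0, n, 2n, ... of the source sequence.
    public static IEnumerable<T> EveryNth<T>(this IEnumerable<T> source, int n)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));
        return EveryNthIterator(source, n);
    }

    // Separate iterator so argument checks above run eagerly, not on first enumeration.
    private static IEnumerable<T> EveryNthIterator<T>(IEnumerable<T> source, int n)
    {
        int index = 0;
        foreach (T item in source)
        {
            if (index % n == 0) yield return item;
            index++;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var picked = Enumerable.Range(1, 10)
            .Select(x => x * x)   // built-in operator invoking an arbitrary .NET function
            .EveryNth(3);         // custom operator composes like a built-in one

        Console.WriteLine(string.Join(", ", picked)); // prints "1, 16, 49, 100"
    }
}
```

Because the operator is an extension method on `IEnumerable<T>`, it chains with the standard operators without any change to the query syntax.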
Example of a LINQ Query

    IEnumerable<string> logs = GetLogLines();

    // Go through logs and keep only lines that are not comments.
    // Parse each line into a LogEntry object.
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

    // Go through logentries and keep only entries that are accesses by ulfar.
    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;

    // Group ulfar's accesses according to what page they correspond to.
    // For each page, count the occurrences.
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());

    // Sort the pages ulfar has accessed according to access frequency.
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
DryadLINQ Data Model
[Figure: a PartitionedTable<T> divided into partitions, each holding .NET objects.]
• PartitionedTable<T> implements IQueryable<T> and IEnumerable<T>
• PartitionedTable exposes metadata information:
  • type, partition, compression scheme, etc.
A complete DryadLINQ program

    public class LogEntry {
        public string user;
        public string ip;
        public string page;

        public LogEntry(string line) {
            string[] fields = line.Split(' ');
            this.user = fields[8];
            this.ip = fields[9];
            this.page = fields[5];
        }
    }

    public class UserPageCount {
        public string user;
        public string page;
        public int count;

        public UserPageCount(string user, string page, int count) {
            this.user = user;
            this.page = page;
            this.count = count;
        }
    }

    PartitionedTable<string> logs = PartitionedTable.Get<string>(
        @"file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\logfile.pt"
    );

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;

    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());

    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;

    htmAccesses.ToPartitionedTable(
        @"file:\\MSR-SCR-DRYAD01\DryadData\cpoulain\results.pt"
    );
Executing the log query
DryadLINQ for Dryad on Windows Server 2008 HPC Cluster
MapReduce in DryadLINQ

    MapReduce(source,        // sequence of Ts
              mapper,        // T -> Ms
              keySelector,   // M -> K
              reducer)       // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;       // sequence of Rs
    }
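The three-operator pattern above can be tried on a single machine. The following is a minimal sketch that assumes LINQ to Objects over `IEnumerable<T>` rather than a DryadLINQ `PartitionedTable<T>`; the concrete generic signature, the class name `MapReduceSketch`, and the word-count input are assumptions added for illustration and are not taken from the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceSketch
{
    // Same shape as the slide: SelectMany (map), GroupBy (shuffle), SelectMany (reduce).
    // The generic signature is an assumption; the slide leaves the types implicit.
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        IEnumerable<T> source,                          // sequence of Ts
        Func<T, IEnumerable<M>> mapper,                 // T -> Ms
        Func<M, K> keySelector,                         // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer)  // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                  // sequence of Rs
    }

    public static void Main()
    {
        string[] lines = { "a b a", "b c" };

        // Word count: map each line to its words, group by word, reduce to (word, count).
        var counts = MapReduce<string, string, string, KeyValuePair<string, int>>(
            lines,
            line => line.Split(' '),
            word => word,
            g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) });

        foreach (var kv in counts)
        {
            Console.WriteLine("{0}: {1}", kv.Key, kv.Value);
        }
        // prints:
        // a: 2
        // b: 2
        // c: 1
    }
}
```

The reducer returns a sequence rather than a single value, which matches the slide's `(K, Ms) -> Rs` shape and lets a reducer emit zero or many results per key.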
Outline
• DryadLINQ programming model
• Dryad and DryadLINQ overview
• Applications
Software Stack
[Figure: layered software stack, top to bottom]
• Applications: Machine Learning, Image Processing, Graph Analysis, Data Mining, Other Applications, …
• DryadLINQ, Other Languages
• Dryad
• Storage: CIFS/NTFS, SQL Servers, Azure Storage, Cosmos DFS
• Cluster Services (Azure, HPC, or Cosmos)
• Windows Server (on each machine)
Dryad
• Provides a general, flexible execution layer
  – Dataflow graph as the computation model
  – Higher language layer supplies graph, vertex code, channel types, hints for data locality, …
• Automatically handles execution
  – Distributes code, routes data
  – Schedules processes on machines near data
  – Masks failures in cluster and network
  – Fair scheduling of concurrent jobs
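One common way for an execution layer to mask failures is to re-execute a failed vertex on its stored input. The sketch below illustrates that idea only; it is not Dryad's implementation. It assumes the vertex is deterministic and its input can be re-read, and the names `RunVertex` and `maxAttempts` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VertexRetrySketch
{
    // Hypothetical helper: runs a vertex function over a re-readable input and
    // re-executes it from scratch when an attempt throws.
    public static List<TOut> RunVertex<TIn, TOut>(
        IEnumerable<TIn> input,
        Func<IEnumerable<TIn>, IEnumerable<TOut>> vertex,
        int maxAttempts)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                // Materialize the output so a failed attempt never leaks partial results.
                return vertex(input).ToList();
            }
            catch (Exception ex)
            {
                last = ex; // remember the failure and try again
            }
        }
        throw new InvalidOperationException(
            "vertex failed after " + maxAttempts + " attempts", last);
    }

    public static void Main()
    {
        int calls = 0;
        List<string> output = RunVertex<string, string>(
            new[] { "x", "#comment", "y" },
            lines =>
            {
                calls++;
                if (calls == 1) throw new Exception("simulated machine failure");
                return lines.Where(l => !l.StartsWith("#"));
            },
            3);

        Console.WriteLine("{0} after {1} attempts", string.Join(",", output), calls);
        // prints "x,y after 2 attempts"
    }
}
```

Re-execution is only safe here because the vertex has no side effects outside its output; a vertex that wrote to shared state would need a different recovery strategy.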
Dryad Job Structure
[Figure: a Dryad job drawn as a dataflow graph. Input files feed stages of vertices (processes such as grep, sed, awk, perl, sort) that are connected by channels and end in output files.]
A channel is a finite stream of items:
• NTFS files (temporary)
• TCP pipes (inter-machine)
• Memory FIFOs (intra-machine)
Dryad System Architecture
[Figure: Dryad system architecture. Labels: job manager, cluster scheduler, control plane, data plane (Files, TCP, FIFO), vertices (V), PD nodes. New jobs arrive at the cluster scheduler, whose queue holds Job 1: v11, v12, …; Job 2: v21, v22, …; Job 3: …]
Fault Tolerance
DryadLINQ for Dryad on Windows Server 2008 HPC Cluster
Consider an embarrassingly parallel problem

    public static Pair<int, string> DoWork(int index) {
        System.Threading.Thread.Sleep(200);
        return new Pair<int, string>(index, System.Environment.MachineName);
    }

    public static void Main(string[] args) {
        int count = 50;
        var seeds = Enumerable.Range(1, count);
        var pairs = from seed in seeds
                    select DoWork(seed);
        foreach (Pair<int, string> pair in pairs) {
            Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
        }
    }
An embarrassingly parallel problem
Many cores, one machine with PLINQ

    public static Pair<int, string> DoWork(int index) {
        System.Threading.Thread.Sleep(200);
        return new Pair<int, string>(index, System.Environment.MachineName);
    }

    public static void Main(string[] args) {
        int count = 50;
        var seeds = Enumerable.Range(1, count);
        var pairs = from seed in seeds.AsParallel()
                    select DoWork(seed);
        foreach (Pair<int, string> pair in pairs) {
            Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
        }
    }
An embarrassingly parallel problem
Many cores, many machines with DryadLINQ (& PLINQ)

    public static Pair<int, string> DoWork(int index) {
        System.Threading.Thread.Sleep(2000);
        return new Pair<int, string>(index, System.Environment.MachineName);
    }

    public static void Main(string[] args) {
        int count = 50;
        var seeds = Enumerable.Range(1, count);
        int[] ranges = seeds.Take(count - 1).ToArray();
        var pairs = from seed in seeds.ToPartitionedTable("tmp.pt")
                                      .RangePartition(i => i, ranges)
                    select DoWork(seed);
        foreach (Pair<int, string> pair in pairs) {
            Console.WriteLine("{0} => {1}", pair.Key, pair.Value.ToString());
        }
    }