Pavan Poluri, Siddharth Deokar, Varun Sudhakar

Generic View of System
[Figure: flowchart of the system. Stages named in the diagram:]
- I/O processing & file size management
- Workload distribution
- Get the article count and word count
- Redistribute files for calculating suffix arrays
- Create one suffix array per P
- Perform a final binomial reduction
- Retrieve the top R interesting ngrams
- Done!!!
Legend: each stage is attributed to Siddharth, Varun, Pavan, or All.

FOSTER’S DESIGN IN OUR PROJECT
- Partitioning: Domain Decomposition
- Communication: Broadcasting, Point-to-Point Communication and Customized Communication
- Agglomeration: Gathering of suffix arrays
- Mapping: Cyclic Mapping Strategy

Data Structure
- Customized suffix array 1 holds the following data
  - Position of ngram in the file
  - File index to identify the file
  - Term Frequency
  - Document Frequency
- Customized suffix array 2 holds the following data
  - Position of ngram in the file
  - File index to identify the file
  - Term Frequency
  - TF*IDF value
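The two record layouts above could be sketched as follows. This is only an illustration: the class and field names are hypothetical, and the project itself presumably used C structs with MPI rather than Python.

```python
from dataclasses import dataclass


@dataclass
class SuffixEntry1:
    """Hypothetical record for suffix array 1."""
    position: int    # offset of the ngram within its file
    file_index: int  # identifies the file the ngram came from
    tf: int          # term frequency
    df: int          # document frequency


@dataclass
class SuffixEntry2:
    """Hypothetical record for suffix array 2 (used for ranking)."""
    position: int
    file_index: int
    tf: int
    tfidf: float     # TF*IDF value


entry = SuffixEntry1(position=42, file_index=3, tf=1, df=1)
```

Storing a position and file index instead of the ngram text keeps each record a fixed size, which would also make it simpler to send between processes.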

Algorithm
- I/O processing
  - Reading directory and storing file information
- File size management
  - Partitioning files
- Communication
- Workload distribution
  - Interleaved allocation
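Interleaved (cyclic) allocation can be illustrated with a minimal sketch, assuming file i is handed to process i mod P. The function name and the file list are hypothetical; the slides do not say whether files were ordered (for example by size) before allocation.

```python
def interleaved_allocation(files, num_procs):
    """Assign file i to process (i % num_procs); returns rank -> list of files."""
    return {rank: files[rank::num_procs] for rank in range(num_procs)}


files = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
assignment = interleaved_allocation(files, 2)
# rank 0 gets a.txt, c.txt, e.txt; rank 1 gets b.txt, d.txt
```

If the file list were sorted by size first, this kind of mapping would tend to spread large and small files evenly across processes.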

Contd…
- Alpha Requirement:
  - Calculating the number of words and articles
- Reduction
  - MPI_Reduce()

Contd…
- Suffix Array Calculation
  - Every word has a suffix array associated with it
  - Allocating memory to the suffix array based on the alpha output
  - Filling in the details of the suffix arrays of all words
    - Getting the position of the word in the file
    - Getting the file index of the file the word is in
    - Assigning term frequency
    - Assigning document frequency

Contd…
- Sorting the suffix arrays
  - Based on the quicksort algorithm
  - Time complexity of quicksort: O(N log N) (average case)
  - Memory requirement: O(N log N)
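A minimal sketch of filling and sorting entries for unigrams, assuming whitespace tokenization and entries of the form (word, file index, position). The function name is hypothetical, and Python's built-in `sorted` (Timsort) stands in here for the quicksort mentioned on the slide.

```python
def build_entries(texts):
    """texts: list of article strings. Returns (word, file_index, position) tuples."""
    entries = []
    for file_index, text in enumerate(texts):
        pos = 0
        for word in text.split():
            pos = text.index(word, pos)   # character offset of this occurrence
            entries.append((word, file_index, pos))
            pos += len(word)
    return entries


entries = build_entries(["the cat", "a cat"])
# Sort by word first, then by file index and position.
sorted_entries = sorted(entries)
```

After sorting, all occurrences of the same word sit next to each other, which is what makes the later frequency counting and merging steps straightforward.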

Contd…
- Finding distinct terms in the same article
  - Before: Cat(TF=1, DF=1), House(TF=1, DF=1), House(TF=2, DF=1), House(TF=3, DF=1)
  - After: Cat(TF=1, DF=1), House(TF=6, DF=1)

Contd…
- Finding distinct terms in different articles
  - Before: Cat(TF=1, DF=1), House(TF=1, DF=1), House(TF=2, DF=1)
  - After: Cat(TF=1, DF=1), House(TF=3, DF=2)
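The two examples suggest that term frequencies are summed when equal terms are combined, while document frequency counts the distinct articles a term appears in. A minimal sketch under that reading follows; the function name and the (term, file index, TF) tuple layout are assumptions made for illustration.

```python
from itertools import groupby


def collapse(entries):
    """entries: (term, file_index, tf) tuples, sorted by term.

    Returns term -> (total TF, DF), where DF is the number of distinct files.
    """
    result = {}
    for term, group in groupby(entries, key=lambda e: e[0]):
        group = list(group)
        total_tf = sum(e[2] for e in group)
        df = len({e[1] for e in group})
        result[term] = (total_tf, df)
    return result


# Same article (file 0): TFs add up, DF stays 1.
same = collapse([("Cat", 0, 1), ("House", 0, 1), ("House", 0, 2), ("House", 0, 3)])
# Different articles (files 0 and 1): TFs add up, DF becomes 2.
diff = collapse([("Cat", 0, 1), ("House", 0, 1), ("House", 1, 2)])
```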

Contd…
- Merging suffix arrays
  - Input: Two sorted suffix arrays
  - Reading ngrams from file
  - Output: One sorted suffix array

MERGE EXAMPLE
[Figure: step-by-step animation of merging two sorted sequences of letters (C, F, H, I, R, S, Z) into one sorted sequence]
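The merge step can be sketched as a standard two-pointer merge of two sorted lists. The input split below is illustrative only: the letters are borrowed from the slide, but the original figure's exact inputs are not recoverable.

```python
def merge_sorted(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; append whatever remains of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


merged = merge_sorted(["C", "H", "Z"], ["F", "I", "R", "S"])
```

Each element is inspected once, so merging arrays of lengths m and n takes O(m + n) comparisons.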

Contd…
- Communication Strategies
  - Reading and writing files (Strategy 1, deprecated)
    - Binomial tree reduction and nomenclature
    - Use of MPI_Barrier
    - Single file corresponding to suffix array
  - Communicating structures (Strategy 2)
    - Binomial tree reduction
    - Use of MPI_Pack(), MPI_Unpack()
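Strategy 2 sends suffix array records between processes as packed buffers. As a rough analogy only, the sketch below uses Python's `struct` module to flatten records into bytes and recover them; it does not call MPI, and the four-integer record layout (position, file index, TF, DF) and function names are assumptions.

```python
import struct

# Hypothetical fixed-size record: position, file_index, tf, df (4 little-endian ints).
ENTRY = struct.Struct("<iiii")


def pack_entries(entries):
    """Serialize a list of 4-int tuples, prefixed with the record count."""
    buf = struct.pack("<i", len(entries))
    for entry in entries:
        buf += ENTRY.pack(*entry)
    return buf


def unpack_entries(buf):
    """Inverse of pack_entries: read the count, then each record."""
    (count,) = struct.unpack_from("<i", buf, 0)
    return [ENTRY.unpack_from(buf, 4 + i * ENTRY.size) for i in range(count)]


entries = [(0, 1, 2, 1), (10, 3, 1, 1)]
round_trip = unpack_entries(pack_entries(entries))
```

Sending one contiguous buffer per exchange avoids the intermediate files and barriers that Strategy 1 relied on.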

Contd…
- Binomial Tree Reduction
[Figure: binomial tree reduction diagram]
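A binomial tree reduction combines P partial results in about log2(P) rounds: in each round, half of the remaining processes send their data to a partner, which merges it. The sketch below simulates that schedule in a single process; the function names are hypothetical and the rank pairing shown is one common convention, not necessarily the one used in the project.

```python
import heapq


def binomial_schedule(num_procs):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs."""
    rounds = []
    step = 1
    while step < num_procs:
        pairs = [(r + step, r)
                 for r in range(0, num_procs, 2 * step)
                 if r + step < num_procs]
        rounds.append(pairs)
        step *= 2
    return rounds


def reduce_sorted(arrays):
    """Simulate the reduction: each receiver merges the sender's sorted array."""
    data = [list(a) for a in arrays]
    for pairs in binomial_schedule(len(data)):
        for sender, receiver in pairs:
            data[receiver] = list(heapq.merge(data[receiver], data[sender]))
    return data[0]


schedule = binomial_schedule(8)
# Round 1: (1,0) (3,2) (5,4) (7,6); round 2: (2,0) (6,4); round 3: (4,0)
result = reduce_sorted([[3, 9], [1], [2, 8], [0]])
```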

Contd…
- Finding top R interesting terms
  - Calculation and storage
    - New suffix array structure with TF*IDF measure
  - Sorting
  - Merging
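Ranking by TF*IDF can be sketched as below. The slides do not give the exact IDF formula, so this assumes the common form IDF = log(N / DF), where N is the total number of articles; the function name and sample counts are invented for illustration.

```python
import heapq
import math


def top_r(stats, num_articles, r):
    """stats: term -> (tf, df). Returns the r terms with the highest TF*IDF."""
    scored = ((tf * math.log(num_articles / df), term)
              for term, (tf, df) in stats.items())
    return [term for _score, term in heapq.nlargest(r, scored)]


# Hypothetical counts over 10 articles.
stats = {"the": (100, 10), "suffix": (8, 2), "cat": (3, 1)}
best = top_r(stats, num_articles=10, r=2)
```

A term that occurs in every article gets an IDF of zero under this formula, so very common words drop out of the ranking even when their raw frequency is high.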

Analysis: Alpha
[Chart: Alpha stage, time in seconds (0 to 20) versus number of processors (16, 32, 64, 128, 256)]

[Chart: time in seconds (0 to 10000) versus number of processors (2, 4, 8, 16, 32, 64) for data sizes 0.6 MB, 80 MB and 120 MB]

[Chart: ngram = 1, data in MB (0 to 1200) versus number of processors (2, 4, 8, 16, 32, 64)]

Formula
- Amdahl’s Law
  - Ψ ≤ 1 / (f + (1 − f)/p)
  - where f is the serial component and p is the number of processors
  - Ψ is the speedup
- Gustafson’s Law
  - Ψ ≤ p + (1 − p)s
  - Ψ is the scaled speedup
  - s is the serial component and p is the number of processors

Contd…
- Using our results for data of size 120 MB
  - Speedup = 7680/3156 ≈ 2.4
  - The 4-processor run is treated as the serial baseline and the 16-processor run as the parallel run (the values below correspond to p = 16/4 = 4)
  - Using the formula for Amdahl’s Law and substituting Ψ = 2.4, we get f = 0.22
  - According to Gustafson’s Law, using s = 0.22, Ψ (scaled speedup) = 3.34
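The arithmetic above can be checked with a short sketch. It assumes p = 4 (16 processors relative to a 4-processor baseline), which is the value that reproduces the reported f = 0.22 and Ψ = 3.34; the function names are illustrative.

```python
def amdahl_serial_fraction(speedup, p):
    """Solve speedup = 1 / (f + (1 - f) / p) for the serial fraction f."""
    return (p / speedup - 1) / (p - 1)


def gustafson_scaled_speedup(s, p):
    """Scaled speedup: p + (1 - p) * s."""
    return p + (1 - p) * s


p = 4            # assumed: 16 processors relative to a 4-processor baseline
speedup = 2.4    # 7680 s / 3156 s, rounded as on the slide
f = amdahl_serial_fraction(speedup, p)
scaled = gustafson_scaled_speedup(round(f, 2), p)
```

Rearranging Amdahl's Law gives 1/Ψ = f(1 − 1/p) + 1/p, hence f = (p/Ψ − 1)/(p − 1) = (4/2.4 − 1)/3 ≈ 0.22, and Gustafson's Law then gives 4 − 3 × 0.22 = 3.34.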

Contact Info
- Project web page: giga word corpus
- Email
  - Pavan Poluri: polur007@d.umn.edu
  - Siddharth Deokar: deoka001@d.umn.edu
  - Varun Sudhakar: sudha002@d.umn.edu
