Black-Box Problem Diagnosis in Parallel File Systems

Michael P. Kasick (1), Jiaqi Tan (2), Rajeev Gandhi (1), Priya Narasimhan (1)
(1) Carnegie Mellon University   (2) DSO National Labs, Singapore

February 24, 2010
Problem Diagnosis Goals

- To diagnose problems in off-the-shelf parallel file systems
  - Environmental performance problems: disk & network faults
  - Target file systems: PVFS & Lustre
- To develop methods applicable to existing deployments
  - Application transparency: avoid code-level instrumentation
  - Minimal overhead, training, and configuration
  - Support for arbitrary workloads: avoid models, SLOs, etc.
Motivation: Real Problem Anecdotes

- Problems motivated by PVFS developers' experiences
  - From Argonne's Blue Gene/P PVFS cluster
- "Limping-but-alive" server problems
  - No errors reported; can't identify the faulty node from logs
  - A single faulty server impacts overall system performance
- Storage-related problems:
  - Accidental launch of rogue processes decreases throughput
  - Buggy RAID controller issues patrol reads when not at idle
- Network-related problems:
  - Faulty switch ports corrupt packets, which fail CRC checks
  - Overloaded switches drop packets but pass diagnostic tests
Outline

1. Introduction
2. Experimental Methods
3. Diagnostic Algorithm
4. Results
5. Conclusion
Target Parallel File Systems

- Aim to support I/O-intensive applications
- Provide high-bandwidth, concurrent access
Parallel File System Architecture

[Figure: clients connect over the network to I/O servers (ios0..iosN) and metadata servers (mds0..mdsM)]

- One or more I/O and metadata servers
- Clients communicate with every server
- No server-server communication
Parallel File System Data Striping

  Logical file:   0 1 2 3 4 5 ...

  Physical files:
    Server 1:  0 3 6 ...
    Server 2:  1 4 7 ...
    Server 3:  2 5 8 ...

- Client stripes a file into 64 kB–1 MB chunks
- Writes to each I/O server in round-robin order (sketched below)
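A minimal sketch of the round-robin striping described above, matching the slide's three-server, 64 kB-chunk example; the function names are illustrative, not PVFS or Lustre APIs.

```python
STRIPE_SIZE = 64 * 1024   # chunk size; real deployments use 64 kB-1 MB
NUM_SERVERS = 3

def server_for_chunk(chunk_index):
    """Round-robin placement: chunk i lives on server i mod N."""
    return chunk_index % NUM_SERVERS

def servers_for_request(offset, length):
    """Return the (server, chunk) pairs touched by a byte-range request."""
    first = offset // STRIPE_SIZE
    last = (offset + length - 1) // STRIPE_SIZE
    return [(server_for_chunk(i), i) for i in range(first, last + 1)]

# A 1 MB request starting at offset 0 touches chunks 0-15, spread
# evenly across servers 0, 1, and 2.
print(servers_for_request(0, 1 << 20))
```

This even spread is why large requests load all I/O servers nearly identically, which underpins the peer-similarity hypothesis on the next slide.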
Parallel File Systems: Empirical Insights (I)

- Server behavior is similar for most requests
  - Large requests are striped across all servers
  - Small requests, in aggregate, equally load all servers
- Hypothesis: peer-similarity
  - Fault-free servers exhibit similar performance metrics
  - Faulty servers exhibit dissimilarities in certain metrics
  - Peer-comparison of metrics identifies the faulty node
Example: Disk-Hog Fault

[Figure: sectors read/s vs. elapsed time; the faulty server's read rate stands far above the non-faulty servers' (peer-asymmetry)]

- Strongly motivates the peer-comparison approach
Parallel File Systems: Empirical Insights (II)

- Faults manifest asymmetrically only on some metrics
- Ex: a disk-busy fault manifests:
  - Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)
    [Figure: I/O wait time (ms) vs. elapsed time; the faulty server's latency diverges upward from the non-faulty servers' (peer-asymmetry)]
  - Symmetrically on throughput metrics (↓ on all nodes)
    [Figure: sectors read/s vs. elapsed time; faulty and non-faulty servers track each other (no asymmetry)]
- Faults are distinguishable by which metrics are peer-divergent
System Model

- Fault model: non-fail-stop problems
  - "Limping-but-alive" performance problems
  - Problems affecting storage & network resources
- Assumptions:
  - Hardware is homogeneous and identically configured
  - Workloads are non-pathological (balanced requests)
  - A majority of servers exhibit fault-free behavior
Instrumentation

- Sampling of storage & network performance metrics
  - Sampled from /proc once every second (sketched below)
  - Gathered from all server nodes
- Storage-related metrics of interest:
  - Throughput: bytes read/s, bytes written/s
  - Latency: I/O wait time
- Network-related metrics of interest:
  - Throughput: bytes received/s, bytes transmitted/s
  - Congestion: TCP sending congestion window
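A minimal sketch of this 1 Hz sampling loop for the storage metrics, assuming the Linux /proc/diskstats layout (sectors read, sectors written, and weighted I/O time as a latency proxy); the device name sda is illustrative.

```python
import time

DEVICE = "sda"  # illustrative; pick the data disk on each server

def sample_diskstats(device=DEVICE):
    """Read cumulative disk counters for one device from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return {
                    "sectors_read": int(fields[5]),
                    "sectors_written": int(fields[9]),
                    "io_time_weighted_ms": int(fields[13]),
                }
    raise ValueError(f"device {device} not found")

prev = sample_diskstats()
while True:
    time.sleep(1)  # 1 Hz sampling, as on the monitored servers
    cur = sample_diskstats()
    # Counters are cumulative; per-second rates are successive differences.
    print({k: cur[k] - prev[k] for k in cur})
    prev = cur
```

The network-side metrics could be gathered the same way from /proc/net interfaces and per-connection TCP state.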
Workloads

- ddw & ddr (dd write & read)
  - Use dd to write/read many GB to/from a file
  - Large (order-MB) I/O requests; saturating workload
- iozonew & iozoner (IOzone write & read)
  - Run in either write/rewrite or read/reread mode
  - Large I/O requests, workload transitions, fsync
- postmark (PostMark)
  - Metadata-heavy; small reads/writes (single server)
  - Simulates email/news servers
Fault Types

- Susceptible resources:
  - Storage: access contention
  - Network: congestion, packet loss (faulty hardware)
- Manifestation mechanism (a hedged disk-hog emulation is sketched below):
  - Hog: introduces a new, visible workload (server-monitored)
  - Busy/Loss: alters the existing workload (unmonitored)

              | Storage   | Network
    Hog       | disk-hog  | write-network-hog, read-network-hog
    Busy/Loss | disk-busy | receive-packet-loss, send-packet-loss
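As one hedged illustration of the "hog" mechanism, the sketch below emulates a disk-hog by sequentially reading large extents from a server's data disk. The device path is illustrative, the loop needs root privileges, and this is a plausible injection harness under those assumptions, not necessarily the one used in the experiments.

```python
import os
import time

DEVICE = "/dev/sdb"        # illustrative data disk; reading it requires root
CHUNK = 8 * 1024 * 1024    # 8 MB sequential reads to keep the disk busy

def disk_hog(duration_s=300):
    """Contend for disk bandwidth by reading sequentially for duration_s."""
    fd = os.open(DEVICE, os.O_RDONLY)
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if not os.read(fd, CHUNK):          # end of device:
                os.lseek(fd, 0, os.SEEK_SET)    # wrap and keep reading
    finally:
        os.close(fd)

if __name__ == "__main__":
    disk_hog()  # 300 s matches the fault-injection window below
```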
Experiment Setup

- PVFS cluster configurations:
  - 10 clients, 10 combined I/O & metadata servers
  - 6 clients, 12 combined I/O & metadata servers
- Lustre cluster configurations:
  - 10 clients, 10 I/O servers, 1 metadata server
  - 6 clients, 12 I/O servers, 1 metadata server
- Each client runs the same workload for ≈ 600 s
- Faults injected on a single server for 300 s
- All workload & fault combinations run 10 times
Diagnostic Algorithm

- Phase I: node indictment
  - Histogram-based approach (for most metrics)
  - Time-series-based approach (congestion window)
  - Both use peer-comparison to indict the faulty node
- Phase II: root-cause analysis
  - Ascribes a root cause based on which metrics are affected (sketched below)
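A minimal sketch of Phase II's idea, assuming Phase I outputs the set of metrics found peer-divergent on the indicted server; the metric names and table entries here are illustrative stand-ins, not the paper's exact rules.

```python
# Illustrative mapping from peer-divergent metric sets to root causes.
ROOT_CAUSES = {
    frozenset({"sectors_read"}):         "disk-hog",
    frozenset({"io_wait_ms"}):           "disk-busy",
    frozenset({"bytes_rx", "bytes_tx"}): "network-hog",
    frozenset({"tcp_cwnd"}):             "packet-loss",
}

def diagnose(divergent_metrics):
    """Ascribe a root cause from Phase I's divergent-metric set."""
    return ROOT_CAUSES.get(frozenset(divergent_metrics), "unknown")

print(diagnose({"io_wait_ms"}))  # -> "disk-busy"
```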
Phase I: Node Indictment (Histogram-Based)

- Peer-compare metric PDFs (histograms) across servers (see the sketch below)
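A minimal sketch of one way to realize this comparison, binning each server's metric window into a histogram and flagging servers whose distribution diverges from their peers' under a symmetric KL divergence; the bin count and threshold are illustrative placeholders, not the tuned values from the paper.

```python
import numpy as np

BINS = 20        # illustrative bin count
THRESHOLD = 0.5  # illustrative divergence threshold

def pdf(samples, lo, hi):
    """Histogram a metric window into a smoothed probability density."""
    hist, _ = np.histogram(samples, bins=BINS, range=(lo, hi))
    hist = hist + 1e-9                 # avoid zero bins in the log
    return hist / hist.sum()

def sym_kl(p, q):
    """Symmetric KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def indict(metric_windows):
    """metric_windows: one array of samples per server, same metric/window.
    A server is indicted if its median divergence to its peers is high."""
    lo = min(map(min, metric_windows))
    hi = max(map(max, metric_windows))
    hi = hi if hi > lo else lo + 1.0   # guard degenerate (constant) windows
    pdfs = [pdf(np.asarray(w), lo, hi) for w in metric_windows]
    indicted = []
    for i, p in enumerate(pdfs):
        divs = [sym_kl(p, q) for j, q in enumerate(pdfs) if j != i]
        if np.median(divs) > THRESHOLD:
            indicted.append(i)
    return indicted
```

Using the median of pairwise divergences means a single faulty server stands out while the fault-free majority, which by assumption behaves similarly, does not.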