Introduction to PLINK


Introduction to PLINK. Scott Hazelhurst, Sydney Brenner Institute for Molecular Bioscience and School of Electrical & Information Engineering, University of the Witwatersrand, Johannesburg, 2014.


  1. Introduction to PLINK. Scott Hazelhurst, Sydney Brenner Institute for Molecular Bioscience and School of Electrical & Information Engineering, University of the Witwatersrand, Johannesburg, 2014.

  2. Data format. Standard tools for manipulating genotype data: vcftools, PLINK/PSEQ. PLINK has multiple data formats; other tools convert to and from other formats. See pngu.mgh.harvard.edu/~purcell/plink/ and https://www.cog-genomics.org/plink2/

  3. PLINK in transition to PLINK 2. Current version of PLINK: 1.90b2; previous version: 1.07. The new version is much faster, has more features, is missing some features, and is data compatible.

  4. PLINK is primarily aimed at genotype data: SNPs, "short" indels, and some support for CNVs. It is a leading tool for GWAS and structure analysis, and many other tools support its format. It is not appropriate for many SVs, or when there is great variability.

  5. PED format. A PED file holds the individuals' information; a MAP file holds the SNP information. The PED file has one row per individual. Columns are: Family ID; Individual ID; Paternal ID; Maternal ID; Sex (1=male; 2=female; other=unknown); Phenotype (missing: -9 or 0; control: 1; case: 2; or a quantitative trait); then a pair of columns per SNP (different encodings are possible).

  6. Example PED file:
     HCB181 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2
     HCB182 1 0 0 1 1 2 2 1 2 2 2 1 2 1 2 2 2
     HCB183 1 0 0 1 2 2 2 1 2 2 2 1 2 1 1 2 2
     HCB184 1 0 0 1 1 2 2 1 2 2 2 1 1 2 2 2 2
     HCB185 1 0 0 1 1 2 2 1 2 2 2 2 2 2 2 2 2
     HCB186 1 0 0 1 1 2 2 2 2 2 2 1 1 2 2 2 2
     HCB187 1 0 0 1 1 2 2 2 2 2 2 1 2 1 2 2 2
     HCB188 1 0 0 1 1 2 2 1 2 2 2 1 1 2 2 2 2
     HCB189 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
     HCB190 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
     HCB191 1 0 0 1 2 1 2 2 2 2 2 1 2 1 2 2 2
     HCB192 1 0 0 1 1 2 2 2 2 2 2 1 1 2 2 2 2
     HCB193 1 0 0 1 1 1 2 2 2 2 2 2 2 2 2 2 2
     HCB194 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2
     HCB195 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
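To make the column layout concrete, here is a minimal Python sketch that splits one PED row into its six identification fields and its per-SNP allele pairs. The names `PedRecord` and `parse_ped_line` are illustrative only and are not part of PLINK; the sketch assumes whitespace-delimited rows with two allele columns per SNP.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    """One PED row: six identification fields plus one allele pair per SNP."""
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    phenotype: str
    genotypes: List[Tuple[str, str]]


def parse_ped_line(line: str) -> PedRecord:
    """Split a whitespace-delimited PED row (hypothetical helper, not a PLINK API)."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 fields followed by an even number of alleles")
    alleles = fields[6:]
    # Pair up consecutive allele columns: (col 7, col 8), (col 9, col 10), ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*fields[:6], genotypes)


record = parse_ped_line("HCB181 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2")
print(record.family_id, record.sex, record.phenotype, record.genotypes)
```

Each row in the example file above carries 12 allele columns, so the sketch yields six genotype pairs per individual.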

  7. Can be used to model family studies:
     AFAM 1 0 0 . . .
     AFAM 2 0 0 . . .
     AFAM 3 1 2 . . .
     AFAM 4 1 2 . . .
     AFAM 5 0 0 . . .
     AFAM 6 1 5 . . .
     AFAM 7 0 0 . . .
     AFAM 8 3 0 . . .
     AFAM 9 0 4 . . .
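As a rough illustration of how the paternal and maternal ID columns encode a pedigree, the Python sketch below separates individuals whose parents are both recorded as 0 from those with at least one recorded parent. The function name `split_founders` is hypothetical, and treating "both parents are 0" as the definition of a founder is an assumption made for this sketch.

```python
def split_founders(rows):
    """Return (founders, non_founders) as lists of individual IDs.

    rows: iterable of (family_id, individual_id, paternal_id, maternal_id).
    Assumption: a parent ID of "0" means the parent is not recorded.
    """
    founders, non_founders = [], []
    for _family_id, individual_id, paternal_id, maternal_id in rows:
        if paternal_id == "0" and maternal_id == "0":
            founders.append(individual_id)
        else:
            non_founders.append(individual_id)
    return founders, non_founders


# First four columns of the AFAM family shown in the slide.
afam = [
    ("AFAM", "1", "0", "0"), ("AFAM", "2", "0", "0"), ("AFAM", "3", "1", "2"),
    ("AFAM", "4", "1", "2"), ("AFAM", "5", "0", "0"), ("AFAM", "6", "1", "5"),
    ("AFAM", "7", "0", "0"), ("AFAM", "8", "3", "0"), ("AFAM", "9", "0", "4"),
]
founders, non_founders = split_founders(afam)
print("founders:", founders)
print("with at least one recorded parent:", non_founders)
```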

  8. NB: Some commands/tools expect sex information by default (see --allow-no-sex / --must-have-sex) and want phenotype data.

  9. MAP file. The MAP file has one row per SNP. Columns are: chromosome (1..26; codes 23 to 26 stand for X, Y, XY and MT); SNP ID; genetic distance (morgans); base-pair position (which build!). Newer versions of PLINK have support for some non-human genomes.

  10. Example MAP file:
      1 rs3094315 0 742429
      1 rs3131972 0 742584
      1 rs12562034 0 758311
      1 rs12124819 0 766409
      1 rs11240777 0 788822
      1 rs6681049 0 789870
      1 rs4970383 0 828418
      1 rs4475691 0 836671
      1 rs7537756 0 844113
      1 rs13302982 0 851671
      1 rs1110052 0 863421
      1 rs2272756 0 871896
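A MAP row can be read with a few lines of Python. This is only a sketch under the assumption of four whitespace-delimited columns as described above; `parse_map_line` and the dictionary keys are names chosen for illustration.

```python
def parse_map_line(line):
    """Parse one 4-column MAP row into a dict (hypothetical helper)."""
    chromosome, snp_id, genetic_distance, position = line.split()
    return {
        "chromosome": chromosome,                      # kept as text: may be a code or a name
        "snp_id": snp_id,
        "genetic_distance": float(genetic_distance),   # morgans
        "position": int(position),                     # base pairs, build-dependent
    }


map_lines = [
    "1 rs3094315 0 742429",
    "1 rs3131972 0 742584",
    "1 rs12562034 0 758311",
]
snps = [parse_map_line(line) for line in map_lines]
# Check that positions are in increasing order within this small excerpt.
positions = [snp["position"] for snp in snps]
print(snps[0], positions == sorted(positions))
```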

  11. Binary PED format: faster and more compact. FAM file: one row per individual, with the identification information (the first 6 columns of the PED file); human readable. BIM file: one row per SNP, holding the MAP file columns plus the two alleles for that SNP; human readable. BED file: the genotype information (the remaining columns of the PED file) packed in binary; not human readable. Don't confuse it with the UCSC BED format for genomic data; a study can have both.
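To show why the BED file is compact but not human readable, the following Python sketch decodes a tiny in-memory BED image. It assumes the SNP-major layout that PLINK 1.9 writes: three header bytes (0x6C, 0x1B, 0x01), then one block per SNP in which each byte packs four genotypes, two bits each, lowest-order bits first. The function name and the text labels are illustrative, not PLINK identifiers.

```python
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # two magic bytes + SNP-major mode byte
CODE_MEANING = {0b00: "hom_a1", 0b01: "missing", 0b10: "het", 0b11: "hom_a2"}


def decode_bed(data, n_samples, n_variants):
    """Decode SNP-major BED bytes into one list of genotype labels per SNP."""
    if data[:3] != BED_MAGIC:
        raise ValueError("not a SNP-major PLINK BED image")
    bytes_per_variant = (n_samples + 3) // 4  # 4 genotypes per byte, rounded up
    if len(data) != 3 + bytes_per_variant * n_variants:
        raise ValueError("size does not match the sample and variant counts")
    decoded = []
    for variant in range(n_variants):
        start = 3 + variant * bytes_per_variant
        block = data[start:start + bytes_per_variant]
        calls = []
        for sample in range(n_samples):
            byte = block[sample // 4]
            code = (byte >> (2 * (sample % 4))) & 0b11  # low bits = first sample
            calls.append(CODE_MEANING[code])
        decoded.append(calls)
    return decoded


# One SNP, four samples, packed into the single byte 0b11100100.
image = BED_MAGIC + bytes([0b11100100])
calls = decode_bed(image, n_samples=4, n_variants=1)
print(calls)
```

The sample and variant counts are not stored in the BED file itself, which is why the FAM and BIM files must accompany it.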

  12. BIM file:
      1 rs2185539 0 566875 C C
      1 rs11510103 0 567753 A A
      1 rs11240767 0 728951 C C
      1 rs3131972 0 752721 G G
      1 rs3131969 0 754182 G G
      1 rs1048488 0 760912 T T
      1 rs12562034 0 768448 A G
      1 rs12124819 0 776546 A A
      1 rs4040617 0 779322 A A
      1 rs2905036 0 792480 T T
      1 rs4245756 0 799463 C C
      1 rs12086311 0 808769 G G

  13. Other formats: transposed and long format. Not commonly used; typically needed when you import from another format. It may be easy to write a script that does the conversion.

  14. Transposed data: tped/tfam files. tped: one row per SNP, with the SNP information followed by the genotype of each individual. tfam: information about the individuals.
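The relationship between PED/MAP and tped/tfam is a plain transposition, which the Python sketch below illustrates. It assumes a tped row is the four MAP columns followed by each individual's allele pair, and a tfam row is the first six PED columns; the function name and the toy SNP and family IDs are invented for the example.

```python
def ped_to_tped(map_rows, ped_rows):
    """Transpose PED/MAP rows into (tped_lines, tfam_lines).

    map_rows: list of 4-field lists, one per SNP.
    ped_rows: list of field lists, 6 ID fields then 2 alleles per SNP.
    """
    tped_lines = []
    for snp_index, map_fields in enumerate(map_rows):
        genotype_fields = []
        first = 6 + 2 * snp_index  # column where this SNP's alleles start
        for ped_fields in ped_rows:
            genotype_fields.extend(ped_fields[first:first + 2])
        tped_lines.append(" ".join(list(map_fields) + genotype_fields))
    tfam_lines = [" ".join(ped_fields[:6]) for ped_fields in ped_rows]
    return tped_lines, tfam_lines


# Toy data: two hypothetical SNPs and two hypothetical individuals.
map_rows = [line.split() for line in ["1 snp1 0 1000", "1 snp2 0 2000"]]
ped_rows = [line.split() for line in ["F1 1 0 0 1 1 A A G T", "F2 1 0 0 2 2 A C T T"]]
tped_lines, tfam_lines = ped_to_tped(map_rows, ped_rows)
print("\n".join(tped_lines))
print("\n".join(tfam_lines))
```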

  15. Long format. Very inefficient, but may be useful in conversion. It uses a MAP file, a FAM file, and an LGEN file containing the genotypes. LGEN columns: family ID, individual ID, SNP ID, two alleles.

  16. Example LGEN file:
      A 1 rs123 A C
      A 2 rs28782 C G
      A 3 rs919878 T T
      A 2 rs123 A C
      B 7 rs123 A C
      B 8 rs123 A C
      B 9 rs28782 C T
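Because each LGEN row holds a single genotype, rows for one individual can be scattered through the file. A small Python sketch, assuming five whitespace-delimited columns per row, groups them by individual; `load_lgen` is a name made up for this illustration.

```python
from collections import defaultdict


def load_lgen(lines):
    """Group LGEN rows into {(family_id, individual_id): {snp_id: (a1, a2)}}."""
    genotypes = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        family_id, individual_id, snp_id, allele1, allele2 = line.split()
        genotypes[(family_id, individual_id)][snp_id] = (allele1, allele2)
    return dict(genotypes)


lgen_lines = [
    "A 1 rs123 A C",
    "A 2 rs28782 C G",
    "A 3 rs919878 T T",
    "A 2 rs123 A C",
    "B 7 rs123 A C",
    "B 8 rs123 A C",
    "B 9 rs28782 C T",
]
by_individual = load_lgen(lgen_lines)
print(by_individual[("A", "2")])
```

Individuals that lack a row for some SNP simply have no entry for it here; how a real conversion should treat such gaps is left open in this sketch.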

  17. Phenotype/Cluster file, one row per individual:
      FID IID PHE
      FID IID PHE
      FID IID PHE
      FID IID PHE

  18. Can have multiple phenotypes:
      FID IID PHE1 PHE2 PHE3
      FID IID PHE1 PHE2 PHE3
      FID IID PHE1 PHE2 PHE3
      FID IID PHE1 PHE2 PHE3
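A phenotype file of this shape is easy to read into a lookup table. The Python sketch below assumes whitespace-delimited rows with no header line, a family ID and individual ID first, and one or more phenotype values after them; the function name and the sample values are hypothetical.

```python
def load_phenotypes(lines):
    """Map (family_id, individual_id) to the list of phenotype values on that row."""
    table = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError("expected FID, IID and at least one phenotype")
        family_id, individual_id, *values = fields
        table[(family_id, individual_id)] = values
    return table


# Hypothetical rows with three phenotypes each (-9 marks a missing value).
pheno_lines = [
    "FAM1 1 2 1 -9",
    "FAM1 2 1 1 2",
    "FAM2 1 2 2 1",
]
phenotypes = load_phenotypes(pheno_lines)
second_phenotype = {key: values[1] for key, values in phenotypes.items()}
print(second_phenotype)
```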

  19. Tri-allelic SNPs. PLINK can represent tri-allelic SNPs, but has only very limited ability to analyse them. The same SNP may appear several times in the MAP or BIM file. Tri-allelic SNPs are usually filtered out. They are often an issue when merging data sets.
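One simple way to spot SNPs that appear several times in a MAP or BIM file is to count the IDs in the second column. The Python sketch below does only that; `repeated_snp_ids` and the toy rows are made up for illustration, and repeated IDs are just a hint that a SNP may need filtering.

```python
from collections import Counter


def repeated_snp_ids(lines):
    """Return the sorted SNP IDs (second column) that occur more than once."""
    counts = Counter(line.split()[1] for line in lines if line.strip())
    return sorted(snp_id for snp_id, count in counts.items() if count > 1)


# Hypothetical BIM rows: snpB is listed twice with different allele pairs.
bim_lines = [
    "1 snpA 0 1000 A G",
    "1 snpB 0 2000 A C",
    "1 snpB 0 2000 A T",
    "1 snpC 0 3000 C T",
]
repeats = repeated_snp_ids(bim_lines)
print(repeats)
```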

  20. Strandedness. Different chips or experiments may record a SNP using different strands (for example, A/C on one strand is T/G on the other). You see this when merging data: the SNP appears to be multi-allelic, and PLINK reports the apparently multi-allelic SNPs. You can flip them, creating a new data set, and try the merge again; if the mismatch was only a strand difference, the merge should now work. Filter out the remaining SNPs. This may incorrectly flip a few.
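The strand check can be sketched in Python as follows. This is not PLINK's own merge logic, only an illustration of the idea: if two data sets together show more than two alleles for a SNP, complement one side and test again. The names `flip` and `reconcile` and the three result labels are assumptions of this sketch.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Complement each allele; unknown codes (e.g. a missing allele) are kept as-is."""
    return {COMPLEMENT.get(allele, allele) for allele in alleles}


def reconcile(alleles_a, alleles_b):
    """Classify a SNP seen in two data sets as 'match', 'flip' or 'mismatch'."""
    a, b = set(alleles_a), set(alleles_b)
    if len(a | b) <= 2:
        return "match"      # already consistent with a bi-allelic SNP
    if len(a | flip(b)) <= 2:
        return "flip"       # consistent once data set B is strand-flipped
    return "mismatch"       # still looks multi-allelic: candidate for filtering


print(reconcile(("A", "C"), ("T", "G")))
print(reconcile(("A", "C"), ("A", "C")))
print(reconcile(("A", "C"), ("A", "T")))
```

In this sketch a SNP labelled "mismatch" corresponds to the ones the slide suggests filtering out after the second merge attempt.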
