Introduction to PLINK
Scott Hazelhurst
Sydney Brenner Institute for Molecular Bioscience and School of Electrical & Information Engineering
University of the Witwatersrand, Johannesburg
2014

Data format
- Standard tools for manipulating genotype data: vcftools, PLINK/PSEQ
- PLINK has multiple data formats
- Other tools exist for converting to/from other formats
- pngu.mgh.harvard.edu/~purcell/plink/
- https://www.cog-genomics.org/plink2/

PLINK in transition to PLINK 2
- Current version of PLINK: 1.90b2
- Previous version: 1.07
- New version:
  - much faster
  - has more features
  - is missing some features
  - is data compatible

PLINK is primarily aimed at genotype data
- SNPs
- “short” indels
- Some support for CNVs
- A leading tool for GWAS and structure analysis; many other tools support its formats.
- Not appropriate for many structural variants (SVs), or where there is great variability.

PED format
PED file with information on individuals, MAP file with SNP information.

PED file: one row per individual. Columns are:
- Family ID, Individual ID
- Paternal ID, Maternal ID
- Sex (1 = male; 2 = female; other = unknown)
- Phenotype (missing: -9 or 0; control: 1; case: 2; or a quantitative trait)
- A pair of columns per SNP: different encodings are possible

Example PED file:
  HCB181 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2
  HCB182 1 0 0 1 1 2 2 1 2 2 2 1 2 1 2 2 2
  HCB183 1 0 0 1 2 2 2 1 2 2 2 1 2 1 1 2 2
  HCB184 1 0 0 1 1 2 2 1 2 2 2 1 1 2 2 2 2
  HCB185 1 0 0 1 1 2 2 1 2 2 2 2 2 2 2 2 2
  HCB186 1 0 0 1 1 2 2 2 2 2 2 1 1 2 2 2 2
  HCB187 1 0 0 1 1 2 2 2 2 2 2 1 2 1 2 2 2
  HCB188 1 0 0 1 1 2 2 1 2 2 2 1 1 2 2 2 2
  HCB189 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
  HCB190 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
  HCB191 1 0 0 1 2 1 2 2 2 2 2 1 2 1 2 2 2
  HCB192 1 0 0 1 1 2 2 2 2 2 2 1 1 2 2 2 2
  HCB193 1 0 0 1 1 1 2 2 2 2 2 2 2 2 2 2 2
  HCB194 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2
  HCB195 1 0 0 1 1 2 2 2 2 2 2 2 2 2 2 2 2
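
A minimal sketch of reading one PED row, assuming whitespace-separated fields and the six leading columns listed for the PED format. The names `PedRecord` and `parse_ped_line` are hypothetical, chosen for illustration; they are not part of PLINK.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PedRecord:
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str        # 1 = male, 2 = female, anything else = unknown
    phenotype: str  # -9 or 0 = missing, 1 = control, 2 = case, or a quantitative trait
    genotypes: List[Tuple[str, str]]  # one allele pair per SNP


def parse_ped_line(line: str) -> PedRecord:
    """Split a PED row into its six leading columns and per-SNP allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus an even number of allele columns")
    alleles = fields[6:]
    pairs = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], genotypes=pairs)


# First row of the example PED file.
record = parse_ped_line("HCB181 1 0 0 1 1 2 2 2 2 2 2 1 2 2 2 2 2")
print(record.family_id, record.individual_id, len(record.genotypes))
```

The twelve allele columns in the example row correspond to six SNPs, so the MAP file for this data would be expected to have six rows.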

Can be used to model family studies
  AFAM 1 0 0 ...
  AFAM 2 0 0 ...
  AFAM 3 1 2 ...
  AFAM 4 1 2 ...
  AFAM 5 0 0 ...
  AFAM 6 1 5 ...
  AFAM 7 0 0 ...
  AFAM 8 3 0 ...
  AFAM 9 0 4 ...
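
As an illustration of how the paternal and maternal ID columns encode a pedigree, the sketch below lists founders (both parents recorded as 0) and the children of each parent for the AFAM example. It assumes, as in the example, that 0 means "parent not in the data set"; the variable names are illustrative only.

```python
from collections import defaultdict

# (family ID, individual ID, paternal ID, maternal ID) from the AFAM example.
rows = [
    ("AFAM", "1", "0", "0"),
    ("AFAM", "2", "0", "0"),
    ("AFAM", "3", "1", "2"),
    ("AFAM", "4", "1", "2"),
    ("AFAM", "5", "0", "0"),
    ("AFAM", "6", "1", "5"),
    ("AFAM", "7", "0", "0"),
    ("AFAM", "8", "3", "0"),
    ("AFAM", "9", "0", "4"),
]

# Founders have no recorded parent.
founders = [iid for _, iid, pat, mat in rows if pat == "0" and mat == "0"]

# Map each recorded parent to the list of their children.
children = defaultdict(list)
for _, iid, pat, mat in rows:
    for parent in (pat, mat):
        if parent != "0":
            children[parent].append(iid)

print("founders:", founders)
for parent, kids in sorted(children.items()):
    print(parent, "->", kids)
```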

NB: some commands/tools
- expect sex information by default (--allow-no-sex / --must-have-sex)
- want phenotype data

MAP file
The MAP file has one row per SNP. Columns are:
- Chromosome: 1..26 (23 to 26 code for X, Y, XY, MT)
- SNP ID
- Genetic distance (morgans)
- Base-pair position (note which genome build!)
Newer versions of PLINK have support for some non-human genomes.

Example MAP file:
  1 rs3094315  0 742429
  1 rs3131972  0 742584
  1 rs12562034 0 758311
  1 rs12124819 0 766409
  1 rs11240777 0 788822
  1 rs6681049  0 789870
  1 rs4970383  0 828418
  1 rs4475691  0 836671
  1 rs7537756  0 844113
  1 rs13302982 0 851671
  1 rs1110052  0 863421
  1 rs2272756  0 871896
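
A short sketch of reading MAP rows into typed records, assuming exactly four whitespace-separated columns per row. `MapEntry` and `parse_map_line` are hypothetical helper names, not PLINK functions; only the first three example rows are used.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    chromosome: str          # kept as text: may be a number or a chromosome code
    snp_id: str
    genetic_distance: float  # morgans
    position: int            # base-pair position, relative to some genome build


def parse_map_line(line: str) -> MapEntry:
    chrom, snp_id, distance, position = line.split()
    return MapEntry(chrom, snp_id, float(distance), int(position))


map_text = """\
1 rs3094315 0 742429
1 rs3131972 0 742584
1 rs12562034 0 758311
"""

entries = [parse_map_line(line) for line in map_text.splitlines() if line.strip()]

# Simple sanity check: positions should be increasing within a chromosome.
positions = [e.position for e in entries]
in_order = positions == sorted(positions)
print(len(entries), "SNPs; positions in order:", in_order)
```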

Binary PED format
Faster, more compact.
- FAM file: one row per individual, with identification information (the first 6 columns of the PED file). Human readable.
- BIM file: one row per SNP. MAP file columns plus the two alleles for that SNP. Human readable.
- BED file: genotype information (the rest of the columns of the PED file) in a packed binary encoding. Not human readable.
Don't confuse this with the UCSC BED format for genomic data: a study can have both.

BIM file
  1 rs2185539  0 566875 C C
  1 rs11510103 0 567753 A A
  1 rs11240767 0 728951 C C
  1 rs3131972  0 752721 G G
  1 rs3131969  0 754182 G G
  1 rs1048488  0 760912 T T
  1 rs12562034 0 768448 A G
  1 rs12124819 0 776546 A A
  1 rs4040617  0 779322 A A
  1 rs2905036  0 792480 T T
  1 rs4245756  0 799463 C C
  1 rs12086311 0 808769 G G

Other formats: transposed, long format
- Not commonly used; typically needed when you import from another format.
- It may be easy to write a script that does the conversion.

Transposed data
tped/tfam files.
- tped: one row per SNP, with SNP information followed by the genotype of each individual
- tfam: information about individuals
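
The relationship between the PED/MAP and tped/tfam layouts can be sketched as a transposition. This is an illustrative script under simple assumptions (whitespace-separated text, every PED row covering every MAP row), not how PLINK itself converts; the function name and the two-individual, two-SNP data are made up for the example.

```python
def ped_to_tped(ped_lines, map_lines):
    """Return (tped_lines, tfam_lines) for the given PED and MAP lines."""
    ped = [line.split() for line in ped_lines]
    # tfam: the six identification columns of each individual.
    tfam = [" ".join(fields[:6]) for fields in ped]
    tped = []
    for snp_index, map_line in enumerate(map_lines):
        row = map_line.split()  # chromosome, SNP ID, distance, position
        start = 6 + 2 * snp_index
        for fields in ped:
            row.extend(fields[start:start + 2])  # this individual's allele pair
        tped.append(" ".join(row))
    return tped, tfam


# Hypothetical data: two individuals, two SNPs.
ped_lines = ["F1 1 0 0 1 1 A A G T", "F1 2 0 0 2 2 A C T T"]
map_lines = ["1 snpA 0 1000", "1 snpB 0 2000"]

tped, tfam = ped_to_tped(ped_lines, map_lines)
for line in tped:
    print(line)
```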

Long format
Very inefficient, but may be useful in conversion.
- MAP file
- FAM file
- LGEN file containing genotypes
LGEN columns:
- family ID, individual ID
- SNP ID
- two alleles

Example LGEN file:
  A 1 rs123    A C
  A 2 rs28782  C G
  A 3 rs919878 T T
  A 2 rs123    A C
  B 7 rs123    A C
  B 8 rs123    A C
  B 9 rs28782  C T
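
Because each LGEN row holds a single genotype, a conversion script typically groups rows by individual first. A minimal sketch, assuming five whitespace-separated columns per row as in the example; the variable names are illustrative.

```python
from collections import defaultdict

lgen_text = """\
A 1 rs123 A C
A 2 rs28782 C G
A 3 rs919878 T T
A 2 rs123 A C
B 7 rs123 A C
B 8 rs123 A C
B 9 rs28782 C T
"""

# (family ID, individual ID) -> {SNP ID: (allele 1, allele 2)}
genotypes = defaultdict(dict)
for line in lgen_text.splitlines():
    fid, iid, snp, a1, a2 = line.split()
    genotypes[(fid, iid)][snp] = (a1, a2)

for person, calls in genotypes.items():
    print(person, calls)
```

SNPs that never appear for an individual are simply absent from that individual's dictionary; a converter to PED would need to write them out as missing genotypes.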

Phenotype/Cluster file
  FID IID PHE
  FID IID PHE
  FID IID PHE
  FID IID PHE

Can have multiple phenotypes
  FID IID PHE1 PHE2 PHE3
  FID IID PHE1 PHE2 PHE3
  FID IID PHE1 PHE2 PHE3
  FID IID PHE1 PHE2 PHE3
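
A sketch of reading a multi-phenotype file into a lookup table. It assumes the file starts with a header row naming the columns, which is an assumption made for this example; the IDs, column names and values below are invented.

```python
def read_phenotypes(lines):
    """Map (FID, IID) to a {phenotype name: value} dictionary.

    Assumes the first line is a header: FID IID followed by phenotype names.
    """
    header = lines[0].split()
    names = header[2:]
    table = {}
    for line in lines[1:]:
        fields = line.split()
        table[(fields[0], fields[1])] = dict(zip(names, fields[2:]))
    return table


# Hypothetical file contents; -9 marks a missing value.
pheno_lines = [
    "FID IID PHE1 PHE2 PHE3",
    "F1 1 2 1 -9",
    "F1 2 1 1 2",
]

phenotypes = read_phenotypes(pheno_lines)
print(phenotypes[("F1", "1")]["PHE3"])
```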

Tri-allelic SNPs
- PLINK can represent tri-allelic SNPs
- Only very limited ability to analyse them
- The same SNP may appear several times in the MAP or BIM file
- Usually filter out tri-allelic SNPs
- Often an issue when merging data sets
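
One way to spot candidate tri-allelic SNPs before merging is to count the distinct alleles observed at each SNP. The sketch below assumes genotypes are already available as allele pairs per SNP and that "0" marks a missing allele; the SNP names and data are invented.

```python
def distinct_alleles(genotypes, missing="0"):
    """Return the set of non-missing alleles seen in a list of allele pairs."""
    return {allele for pair in genotypes for allele in pair if allele != missing}


# Hypothetical genotype calls, one list of allele pairs per SNP.
snps = {
    "snpA": [("A", "C"), ("A", "A"), ("0", "0")],
    "snpB": [("A", "C"), ("G", "G")],
}

# SNPs with more than two observed alleles cannot be treated as bi-allelic.
multi_allelic = [name for name, calls in snps.items() if len(distinct_alleles(calls)) > 2]
print(multi_allelic)
```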

Strandedness
- Different chips or experiments may record a SNP using different strands (for example, A/C on one strand reads as T/G on the other)
- You see this when you merge data: the SNP appears to be multi-allelic
- PLINK reports apparently multi-allelic SNPs
- You can flip them, creating a new data set
- Try the merge again: if they were not really multi-allelic, it should work
- Filter out the remaining ones
- May incorrectly flip a few
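
The flip-and-retry idea can be sketched for a single SNP as follows. This is an illustration of the logic only, not PLINK's implementation; `flip` and `reconcile` are hypothetical names, and alleles are assumed to be coded as A, C, G, T.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the alleles as they would read on the opposite strand."""
    return frozenset(COMPLEMENT[a] for a in alleles)


def reconcile(alleles_a, alleles_b):
    """Compare the allele sets of one SNP in two data sets.

    Returns "compatible" if together they are at most bi-allelic,
    "flip" if they are after flipping data set B, else "multi-allelic".
    """
    a, b = frozenset(alleles_a), frozenset(alleles_b)
    if len(a | b) <= 2:
        return "compatible"
    if len(a | flip(b)) <= 2:
        return "flip"
    return "multi-allelic"


print(reconcile({"A", "C"}, {"T", "G"}))
print(reconcile({"A", "C"}, {"A", "C"}))
print(reconcile({"A", "C"}, {"A", "G"}))
```

A/T and C/G SNPs read the same on both strands, so a check of this kind reports them as compatible whichever strand was used; a strand mismatch at such a SNP goes unnoticed.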