PostgreSQL and Omics Data: How omics data can be stored in a Postgres database
PostgreSQL SF Day, Jan 2020
Anson Abraham, Data Architect at Envisagenics Inc.
Omics predicted to be the “biggest of big data” by 2025
[Chart: data volumes for omics, astronomy, video, and Twitter; values shown: 17 PB, 1 EB, 1-2 EB, 40 EB]
Source: Challenges For Genomics In The Age of Big Data, July 2015, Forbes
The “Big Leap” of Biology: from molecule to computerized DNA
[Figure: the DNA molecule; the computerized DNA sequence]
Omics data made from human biopsies is used for therapeutics development
▪ Omics data can be generated from any human tissue
▪ Tissue-specific omics is used to compare across individuals (e.g., cancer patients vs. controls)
▪ Omics data can be stored, then analyzed by different algorithms (e.g., to find mutations, to find gene-level changes)
▪ Biopsies and data can both be stored, but data last longer and take less space!
Sequencing technology to computerize omics data
▪ Sequencing facility at CSHL: 2.5 terabytes of genome data produced every week!
▪ The cost of sequencing a genome went from ~$100M in 2000 to <$1K nowadays!
Hardware for sequencers is getting smaller, but data is getting larger
Storing omics data is important for therapeutic development
▪ Data sharing through partnerships
    ▪ Pharma company A extracts value from data, then partner B extracts additional value from the same dataset
    ▪ Pharma company A makes more data, brings old data from archive to compare
▪ To be shared as open access to the scientific community
    ▪ “Wisdom of the crowds”: thousands of brains working to cure cancer and genetic diseases
    ▪ E.g., The Cancer Genome Atlas (TCGA), a public data repository, facilitates cures for cancer.
▪ For personalized medicine
    ▪ Use your genome for wellness improvement
    ▪ Use your genome to treat cancer, ALS, Alzheimer's, etc.
OMICS data are stored in various file formats
▪ FASTQ: text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.
FASTQ files are large and tightly governed raw data files
▪ Old paradigm: data uploaded to different apps for analysis
▪ New paradigm: data have grown in size & number, and governance is tighter. There are incentives to deploy apps to the data.
FASTQ file format
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Line 1: sequence identifier. Line 2: sequence read. Line 3: separator. Line 4: quality, ASCII-encoded.
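The four-line record layout above can be read with a few lines of code. The following is a minimal sketch, not part of the original slides: the names `FastqRecord`, `parse_fastq`, and `phred_scores` are hypothetical, and the quality decoding assumes the common Phred+33 (Sanger) offset, which the slide does not state.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: str


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ block."""
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string; offset 33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# The example record shown on the slide.
EXAMPLE = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

if __name__ == "__main__":
    for record in parse_fastq(EXAMPLE):
        print(record.seq_id, len(record.sequence), phred_scores(record.quality)[:5])
```

Each `FastqRecord` maps naturally onto a three-column table (identifier, sequence, quality) if the raw reads were ever loaded into a database.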
OMICS data are stored in various file formats
▪ SAM (Sequence Alignment/Map): alignment information of short reads mapped against reference sequences
▪ BAM (Binary Alignment/Map): binary, compressed version of the SAM file
SAM file format
Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL TAG1 TAG2
Example record (SAM files are tab-delimited; shown here comma-separated):
    readID43GYAX15:7:1:1202:19894/1,256,contig43,613960,1,65M,*,0,0,CCAGCGCGAACGAAATCCGCATGCGTCTGGTCGTTGCACGGAACGGCGGCGGTGTGATGCACGGC,EDDEEDEE=EE?DE??DDDBADEBEFFFDBEFFEBCBC=?BEEEE@=:?::?7?:8-6?7?@??#,AS:i:0,AA:3:4

    CREATE TABLE sam (
         qname varchar(100)
        ,flag  int
        ,rname varchar(10)
        ,pos   int
        ,mapq  int
        ,cigar text         -- CIGAR strings vary in length
        ,rnext varchar(10)  -- '*', '=', or a reference sequence name
        ,pnext int
        ,tlen  int
        ,seq   text
        ,qual  text
        ,tag   jsonb        -- optional TAG:TYPE:VALUE fields
    );
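To load a record like this into the `sam` table, each line has to be split into the eleven mandatory fields, with the optional tags collected into one JSON document. The sketch below is one way to do that under stated assumptions: `parse_sam_line`, `SAM_COLUMNS`, and `INT_COLUMNS` are hypothetical names, the input is assumed to be tab-delimited, and tag types other than `i` and `f` (including the `3` in the slide's `AA:3:4`) are kept as plain strings.

```python
import json
from typing import Any, Dict

# Column names follow the CREATE TABLE statement on the slide.
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]
INT_COLUMNS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one tab-delimited alignment line into a row for the sam table."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    row: Dict[str, Any] = {}
    for name, value in zip(SAM_COLUMNS, fields):
        row[name] = int(value) if name in INT_COLUMNS else value
    tags: Dict[str, Any] = {}
    for raw in fields[len(SAM_COLUMNS):]:
        tag, type_code, value = raw.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # unrecognized types are stored as text
    row["tag"] = json.dumps(tags)  # ready for a jsonb column
    return row


# The slide's example record, rebuilt as a tab-delimited line.
EXAMPLE_LINE = "\t".join([
    "readID43GYAX15:7:1:1202:19894/1", "256", "contig43", "613960", "1",
    "65M", "*", "0", "0",
    "CCAGCGCGAACGAAATCCGCATGCGTCTGGTCGTTGCACGGAACGGCGGCGGTGTGATGCACGGC",
    "EDDEEDEE=EE?DE??DDDBADEBEFFFDBEFFEBCBC=?BEEEE@=:?::?7?:8-6?7?@??#",
    "AS:i:0", "AA:3:4",
])

if __name__ == "__main__":
    print(parse_sam_line(EXAMPLE_LINE))
```

The resulting dictionary keys match the table's column names, so the row could be passed to a parameterized `INSERT` by whichever PostgreSQL driver is in use.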
SAM files can be stored in a DBMS
    Column  Description
    QNAME   Query template name
    FLAG    Bitwise flag
    RNAME   Reference sequence name
    POS     1-based leftmost mapping position
    MAPQ    Mapping quality
    CIGAR   CIGAR (Concise Idiosyncratic Gapped Alignment Report) string
    RNEXT   Reference sequence name of the primary alignment of the next read
    PNEXT   Position of the primary alignment of the next read
    TLEN    Observed template length
    SEQ     Segment sequence
    QUAL    Phred-scaled base quality
    TAG     Tag:Type:Value
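Because FLAG is a bitwise field, a stored integer such as the `256` in the example record only becomes readable once its bits are decoded. Here is a small sketch of that decoding. The bit values follow the SAM specification; the short label strings and the names `SAM_FLAG_BITS` and `decode_flag` are hypothetical choices made for this example.

```python
from typing import List

# Bit values as defined in the SAM specification; labels are illustrative.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the labels of every bit set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    print(decode_flag(256))  # the FLAG value from the slide's example record
```

The same test can be written in SQL with the bitwise AND operator, for example `flag & 256 <> 0`, when filtering rows inside the database.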
Omics data have coordinate data
▪ Chromosome, start, and end are the coordinates for omics data on the chromosome
▪ Genome Browser is a tool querying an RDBMS of SAM data
[Figure: RNA sequences as presented in the UCSC Genome Browser, the “Google Maps” of the genome]
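A coordinate-based lookup reduces to an interval-overlap test on (chromosome, start, end). The sketch below illustrates that test, assuming 1-based, inclusive coordinates on both ends; `Region` and `overlaps` are hypothetical names, and the end position in the usage example is computed from the slide's record (POS 613960, CIGAR 65M) as start + 65 - 1.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumption)
    end: int    # 1-based, inclusive (assumption)


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


if __name__ == "__main__":
    read = Region("contig43", 613960, 613960 + 65 - 1)
    window = Region("contig43", 614000, 614100)
    print(overlaps(read, window))
```

In a table, the same predicate becomes a `WHERE` clause on the chromosome and the two position columns.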
OMICS data are stored in various file formats
▪ VCF (Variant Call Format): contains information about a variant at a position in the genome
VCF file format
    #CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO                     FORMAT       NA00001         NA00002         NA00003
    20      14370  rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2  GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.

    CREATE TABLE vcf (
         chrom  varchar(10)  -- chromosome names include X, Y, and MT
        ,pos    int
        ,var_id varchar(25)
        ,ref    varchar(10)
        ,alt    varchar(10)
        ,qual   numeric      -- Phred-scaled quality may be fractional
        ,filter varchar(10)
        ,info   text         -- INFO strings vary in length
        ,format varchar(25)
        ,sample jsonb        -- one entry per sample column
    );
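The per-sample columns are the part of a VCF line that does not fit a fixed schema, which is why the table stores them as JSONB. As a rough sketch of how a data line could be turned into such a row: `parse_info` and `parse_vcf_line` are hypothetical names, the input is assumed to be tab-delimited, and sample values are kept as strings rather than typed.

```python
import json
from typing import Any, Dict, List

# Column names follow the CREATE TABLE statement on the slide.
FIXED_COLUMNS = ["chrom", "pos", "var_id", "ref", "alt",
                 "qual", "filter", "info", "format"]


def parse_info(info: str) -> Dict[str, Any]:
    """Split INFO into key/value pairs; keys without '=' become True flags."""
    out: Dict[str, Any] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Turn one tab-delimited VCF data line into a row for the vcf table."""
    fields = line.rstrip("\n").split("\t")
    row: Dict[str, Any] = dict(zip(FIXED_COLUMNS, fields[:9]))
    row["pos"] = int(row["pos"])
    keys = row["format"].split(":")
    samples = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    row["sample"] = json.dumps(samples)  # ready for a jsonb column
    return row


# Header and data line from the slide, rebuilt as tab-delimited text.
HEADER = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "NA00001", "NA00002", "NA00003"])
DATA = "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS",
                  "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
                  "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."])

if __name__ == "__main__":
    names = HEADER.split("\t")[9:]
    row = parse_vcf_line(DATA, names)
    print(row["chrom"], row["pos"], parse_info(row["info"]))
    print(row["sample"])
```

Keying the JSON by sample name keeps a genotype lookup for one sample a single JSON path expression in PostgreSQL.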
VCF file stored in database
    Column  Description
    CHROM   The chromosome
    POS     The 1-based position of the variation on the given sequence
    VAR_ID  The variation identifier
    REF     The reference base at the given position on the reference sequence
    ALT     The alternate alleles for this position
    QUAL    A quality score for the inference of the given alleles
    FILTER  A flag indicating which of a given set of filters the variation has passed
    INFO    List of key-value pairs (fields) describing the variation
    FORMAT  List of fields for describing the samples
    SAMPLE  Samples described in the file
Compact & informative VCF files are ready to use for research
▪ Open-access database with thousands of VCF datasets from cancer patients
▪ COSMIC
Applications:
▪ Study cancer inheritance
▪ Study cancer progression
▪ Develop biomarkers
▪ Develop therapeutic compounds
▪ Up and coming: CRISPR as a therapeutic to correct DNA mutations
Example workflow to convert from FASTQ to VCF
RDBMSs are useful for therapeutic AI applications
[Diagram: current clinical trial (low response rate) → data modeling with omics-based predictive features (responders to cancer treatment vs. non-responders) → new patients recruited → new clinical trial (high response rate)]
OMICS data file formats and PG…
▪ You can create Foreign Data Wrappers (FDWs). New formats always arise, and some may be unstructured
    ▪ FDW to read VCF files directly from Postgres
    ▪ https://github.com/smithijk/vcf_fdw_postgresql
▪ There is no Foreign Data Wrapper for FASTQ files.
    ▪ Should there be one?
▪ PostBIS: Michael Schneider
▪ TileDB and Snowflake are examples of systems that can directly query VCF files stored in S3, at scale.
Questions or Thoughts? Anson Abraham Sr. Data/Cloud Architect at Envisagenics, Inc. anson.abraham@gmail.com @therealansonism