All about Bioinformatics: Variant discovery and analysis pipeline for whole genome/exome/targeted genome sequencing

SNPs (Single Nucleotide Polymorphisms) are genetic markers of choice for both linkage and association mapping and for population structure and evolution analysis. They are virtually unlimited, evenly distributed along the genome, bi-allelic and co-dominant.
Single nucleotide polymorphism (SNP) detection technologies are used to scan for new polymorphisms and to determine the allele(s) of a known polymorphism in target sequences.Identification of SNPs is helpful in identifying the actual cause of genetic diseases..Basic pipeline for identification of SNPs is given below and I hope it will help you in your projects and assignments.

1) To convert SRA file downloaded from NCBI into fastQ format

fastq-dump.2.3.5 --origfmt --split-3 SRR1915421.sra

where

· fastq-dump.2.3.5 is the command to convert sra into fastq

· SRR1915421.sra is file downloaded from NCBI SRA

2) Making index files of reference sequence

bwa index -a bwtsw -p hg19.fa

· where hg19.fa is the human reference sequence downloaded from UCSC browser

· bwa index command is used to make indexes of reference

3) Alignment is done using BWA aligner

bwa aln -t 30 -f SRR.sai ../../indexes/hg19/hg19 SRR1915421.fastq

where

· bwa aln is command to align the reas on reference genome

· SRR.sai is the name of file that will be obtained as a result of alignment

· SRR1915421.fastq is the file that is to be aligned on reference genome

4) Formation of SAM file

samse -f ERR.sam -r '@RG\tID:D3NH4HQ1\tPL:illumina\tPU:D3NH4HQ1:139:C153UACXX\tLB:6\tSM:SRR1915421\tCN=nad' ../../indexes/hg19/hg19 SRR.sai SRR1915421.fastq

where

· samse is used when we have single end data sampe is used when paired end data

· ERR.sam is the name of sam file that will be generated using the above mentioned command

· '@RG\tID:D3NH4HQ1\tPL:illumina\tPU:D3NH4HQ1:139:C153UACXX\tLB:6\tSM:SRR1915421\tCN=nad' is the format of the NGS read file

· SRR.sai is the file obtained as a result of alignment

· SRR1915421.fastq is the read data

5) Validation of SAM file and SAM to BAM conversion

java -jar picard.jar ValidateSamFile I=ERR.sam MODE=SUMMARY

· where I = input and ERR.sam is the sam file

java -Xmx4g -Djava.io.tmpdir=/tmp -jar ../../../softwares/picard- tools-2.1.1/picard.jar SortSam SO=coordinate INPUT=ERR.sam OUTPUT=SRR.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

where

· CREATE_INDEX argument to true we automatically create an index file for the generated bam file.

· SRR.bam is the bam file obtained as a result of conversion of sam

· By setting the validation stringency to lenient, picard ignores some validation errors which frequently occur at the alignment step.

6) Marking duplicates

java -Xmx4g -Djava.io.tmpdir=/tmp -jar ../../../softwares/picard- tools- 2.1.1/picard.jar MarkDuplicates INPUT=SRR.bam OUTPUT=SRR.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT

where

· SRR.marked.bam is the file that contains the duplicates

· Again the validation stringency is set to lenient but no index is created as we will perform more changes on the output bam file.

7) Local realignment around indels

Indels within reads often lead to false positive SNPs at the end of sequence reads. To prevent this artifact, local realignment around indels is done. It is done in two steps

java -Xmx4g -jar ../../../softwares/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../../indexes/hg19/hg19.fa -o SRR.bam.list -I SRR.marked.bam

This step creates the file in the form of table format which is SRR.bam.list

Where

· SRR.bam.list is the file in table format that can be used further for realigning step

· SRR.marked.bam is the marked duplicate file obtained in the previous step

2^nd step

java -Xmx4g -Djava.io.tmpdir=/tmp -jar ../../../softwares/GenomeAnalysisTK.jar -I SRR.marked.bam -R ../../indexes/hg19/hg19.fa -T IndelRealigner -targetIntervals SRR.bam.list -o SRR.marked.realigned.bam

This step will generate SRR.marked.realigned.bam file

While working on paired end data

Java -Djava.io.tmpdir=/tmp/flx-auswerter\ -jar picard/FixMateInformation.jar \

INPUT=input_bam.marked.realigned.bam \

OUTPUT=input_bam.marked.realigned.fixed.bam \

SO=coordinate \

VALIDATION_STRINGRNCY=LENIENT \

CREATE_INDEX=true

8) Quality score recalibration

Quality data generated from the sequencer isn't always very accurate and for obtaining good SNP calls (which rely on base quality scores), recalibration of these scores is necessary and is performed in two steps

Java -Xmx4g -jar GenomeAnalysisTK.jar \ -l INFO \ -R chr15.fa \ --DBSNP dbsnp132.txt \ -I input.marked.realigned.fixed.bam \ -T CountCovariates \ -cov ReadGroupCovariate \ -cov CycleCovariate \ -cov DinucCovariate \ -recalFile input.recal_data.csv

This step creates a .csv file which is needed for the next step and requires a dbSNP file, which can be downloaded at the UCSC Genome browser homepage(http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/).

Download the dbsnp132.txt.gz file and unzip it using gunzip

gunzip dbsnp132.txt.gz

2^nd step

Java -Xmx4g -jar GenomeAnalysisTK.jar \ -l INFO \ -R chr15.fa \ -I input.marked.realigned.fixed.bam \ -T TableRecalibration \ --out input.marked.realigned.fixed.recal.bam \ -recalFile input.recal_data.csv

9) SNP calling

java -Xmx4g -jar ../../../softwares/GenomeAnalysisTK.jar -glm BOTH -R ../../indexes/hg19/chr15.fa -T UnifiedGenotyper -I SRR.marked.realigned.bam -D ../bwa/snp144.vcf -o SRR.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A Coverage -A AlleleBalance -L target_intervals.bed

where

· The “glm both” argument enables SNP and Indel calling.

· chr15.fa is the reference genome file

· SRR.marked.realigned.bam is the realigned bam file obtained in previous steps

· snp144.vcf is the raw snp file obtained as a result of the above mentioned command

· The metrics file (snps.metrics ) specified by the metrics command contains statistics for the SNP calling step

· The -L command contains a list of the targeted intervals we wanted to sequence. To get a bed file for specific gene locus:

http://genome.ucsc.edu/cgi-bin/hgTables

While downloading. bed file Choose the hg19 assembly of the human genome and set the track to RefSeq genes and the table to refGene. Use BED format as output format and assign the file an appropriate name (I use target_intervals.bed).

10) SNP filtering

Convert the VCF (variant call format file) into excel format by just opening vcf file and copy it into excel sheets.

Apply filters on specific columns by selecting filter option in the excel sheets.

Some of the filters are

· QUAL<30 à REMOVE THESE SNPs

· AF>1 à remove these as well as frequency greater than 1 are not SNPs they are known as polymorphisms

· QD (Quality by depth) <6 à remove these as well

To apply more filters to remove false positive results check the link given below.

http://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set

11) Annotation using SNPnexus

In order to perform functional annotation of SNPs go to the link http://snp-nexus.org/

The SNPnexus is useful for the functional annotation of SNPs and predicts whether the SNP is damaging or tolerating. For this purpose, it gives polyphen as well as SIFT scores. Apart from these scores this online webserver predicts the change in amino acids at specific positions, disease linked with the snp and predicts whether it is known or Novel snp.

To do the annotation simply input the SNP details manually

Chromosome 1 190641277 A G 1

Chromosome 1 190641388 A C 1

Chromosome 1 190641422 A C 1

This format is the easiest way to do annotation. Many other formats are also available but this is the simple one.

Chromosome 1 à chromosome number

1980875 à chr position

A à Ref allele

G àALT allele

1 à chr band

12) Further validation of annotated SNPs

To validate further the annotated SNPs PROVEAN webserver is used. It gives SIFT as well as PROVEAN scores. These scores predict that whether these SNPs are damaging, tolerating or benign. If the score is less than 0.05 in case of SIFT the SNPs are damaging. In case of PROVEAN if cutoff value is below -2.5 then the SNPs are declared damaging.

On the basis of these Scores get the appropriate SNPs that are damaging and linked with the disease onset.

Here is the link for PROVEAN

http://provean.jcvi.org/genome_submit_2.php

format for input:

2,230633386,G,C

1,100382265,C,A

This is the simplest format and at a time you can upload number of SNPs.

Here 2 is the chromosome number, 234566 is the position of SNP G is the reference allele and C is the alt allele.

All about Bioinformatics

Pages

Thursday, 17 August 2017

Variant discovery and analysis pipeline for whole genome/exome/targeted genome sequencing

No comments:

Post a Comment