Thursday, 17 August 2017

Variant discovery and analysis pipeline for whole genome/exome/targeted genome sequencing

SNPs (Single Nucleotide Polymorphisms) are genetic markers of choice for both linkage and association mapping and for population structure and evolution analysis. They are virtually unlimited, evenly distributed along the genome, bi-allelic and co-dominant.
Single nucleotide polymorphism (SNPdetection technologies are used to scan for new polymorphisms and to determine the allele(s) of a known polymorphism in target sequences.Identification of SNPs is helpful in identifying the actual cause of genetic diseases..Basic pipeline for identification of SNPs is given below and I hope it will help you in your projects and assignments.

1)     To convert SRA file downloaded from NCBI into fastQ format

fastq-dump.2.3.5 --origfmt --split-3 SRR1915421.sra
·         fastq-dump.2.3.5 is the command to convert sra into fastq
·         SRR1915421.sra is file downloaded from NCBI SRA
2)     Making index files of reference sequence
bwa index -a bwtsw -p hg19.fa
·         where hg19.fa is the human reference sequence downloaded from UCSC browser
·         bwa index command is used to make indexes of reference

3)     Alignment is done using BWA aligner
bwa aln -t 30 -f SRR.sai ../../indexes/hg19/hg19 SRR1915421.fastq
·         bwa aln is command to align the reas on reference genome
·         SRR.sai is the name of file that will be obtained as a result of alignment
·         SRR1915421.fastq is the file that is to be aligned on reference genome
4)     Formation of SAM file
samse -f ERR.sam -r '@RG\tID:D3NH4HQ1\tPL:illumina\tPU:D3NH4HQ1:139:C153UACXX\tLB:6\tSM:SRR1915421\tCN=nad' ../../indexes/hg19/hg19 SRR.sai SRR1915421.fastq
·         samse is used when we have single end data sampe is used when paired end data
·         ERR.sam is the name of sam file that will be generated using the above mentioned command
·         '@RG\tID:D3NH4HQ1\tPL:illumina\tPU:D3NH4HQ1:139:C153UACXX\tLB:6\tSM:SRR1915421\tCN=nad' is the format of the NGS read file
·         SRR.sai is the file obtained as a result of alignment
·         SRR1915421.fastq is the read data
5)     Validation of SAM file and SAM to BAM conversion
java -jar picard.jar ValidateSamFile I=ERR.sam MODE=SUMMARY
·        where I = input and ERR.sam is the sam file
java -Xmx4g -jar ../../../softwares/picard-  tools-2.1.1/picard.jar SortSam SO=coordinate INPUT=ERR.sam OUTPUT=SRR.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
·         CREATE_INDEX argument to true we automatically create an index file for the generated bam file.
·         SRR.bam is the bam file obtained as a result of conversion of sam
·         By setting the validation stringency to lenient, picard ignores some validation errors which frequently occur at the alignment step.
6)     Marking duplicates
         java -Xmx4g -jar ../../../softwares/picard-  tools-    2.1.1/picard.jar MarkDuplicates INPUT=SRR.bam OUTPUT=SRR.marked.bam METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT
·         SRR.marked.bam is the file that contains the duplicates
·         Again the validation stringency is set to lenient but no index is created as we will perform more changes on the output bam file.
7)     Local realignment around indels
Indels within reads often lead to false positive SNPs at the end of sequence reads. To prevent this artifact, local realignment around indels is done. It is done in two steps
java -Xmx4g -jar ../../../softwares/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../../indexes/hg19/hg19.fa -o SRR.bam.list -I SRR.marked.bam
This step creates the file in the form of table format which is SRR.bam.list
·         SRR.bam.list is the file in table format that can be used further for realigning step
·         SRR.marked.bam is the marked duplicate file obtained in the previous step

2nd step

java -Xmx4g -jar ../../../softwares/GenomeAnalysisTK.jar -I SRR.marked.bam -R ../../indexes/hg19/hg19.fa -T IndelRealigner -targetIntervals SRR.bam.list -o SRR.marked.realigned.bam

This step will generate SRR.marked.realigned.bam file
While working on paired end data
Java\ -jar picard/FixMateInformation.jar \
INPUT=input_bam.marked.realigned.bam \
OUTPUT=input_bam.marked.realigned.fixed.bam \
SO=coordinate \

8)     Quality score recalibration
Quality data generated from the sequencer isn't always very accurate and for obtaining good SNP calls (which rely on base quality scores), recalibration of these scores is necessary and is performed in two steps
Java -Xmx4g -jar GenomeAnalysisTK.jar \ -l INFO \ -R chr15.fa \ --DBSNP dbsnp132.txt \ -I input.marked.realigned.fixed.bam \ -T CountCovariates \ -cov ReadGroupCovariate \ -cov CycleCovariate \ -cov DinucCovariate \ -recalFile input.recal_data.csv

This step creates a .csv file which is needed for the next step and requires a dbSNP file, which can be downloaded at the UCSC Genome browser homepage(
Download the dbsnp132.txt.gz file and unzip it using gunzip
gunzip dbsnp132.txt.gz

2nd step
Java -Xmx4g -jar GenomeAnalysisTK.jar \  -l INFO \ -R chr15.fa \ -I input.marked.realigned.fixed.bam \ -T TableRecalibration \ --out input.marked.realigned.fixed.recal.bam \ -recalFile input.recal_data.csv

9)     SNP calling
java -Xmx4g -jar ../../../softwares/GenomeAnalysisTK.jar -glm BOTH -R ../../indexes/hg19/chr15.fa -T UnifiedGenotyper -I SRR.marked.realigned.bam -D ../bwa/snp144.vcf -o SRR.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A Coverage -A AlleleBalance -L target_intervals.bed
·         The “glm both” argument enables SNP and Indel calling.
·         chr15.fa is the reference genome file
·         SRR.marked.realigned.bam is the realigned bam file obtained in previous steps
·         snp144.vcf is the raw snp file obtained as a result of the above mentioned command
·         The metrics file (snps.metrics ) specified by the metrics command contains statistics for the SNP calling step
·         The -L command contains a list of the targeted intervals we wanted to sequence. To get a bed file for specific gene locus:
While downloading. bed file Choose the hg19 assembly of the human genome and set the track to RefSeq genes and the table to refGene. Use BED format as output format and assign the file an appropriate name (I use target_intervals.bed).
10)     SNP filtering
Convert the VCF (variant call format file) into excel format by just opening vcf file and copy it into excel sheets.
Apply filters on specific columns by selecting filter option in the excel sheets.
Some of the filters are
·         QUAL<30 à REMOVE THESE SNPs
·         AF>1 à remove these as well as frequency greater than 1 are not SNPs they are known as polymorphisms
·         QD (Quality by depth) <6 à remove these as well
To apply more filters to remove false positive results check the link given below.

11)                        Annotation using SNPnexus
In order to perform functional annotation of SNPs go to the link
The SNPnexus is useful for the functional annotation of SNPs and predicts whether the SNP is damaging or tolerating. For this purpose, it gives polyphen as well as SIFT scores. Apart from these scores this online webserver predicts the change in amino acids at specific positions, disease linked with the snp and predicts whether it is known or Novel snp.
To do the annotation simply input the SNP details manually
Chromosome  1          190641277      A          G          1
Chromosome  1          190641388      A          C          1
Chromosome  1          190641422      A          C          1
This format is the easiest way to do annotation. Many other formats are also available but this is the simple one.
Chromosome 1 à chromosome number
1980875 à chr position
A à Ref allele
G àALT allele
1 à chr band
12)                        Further validation of annotated SNPs
To validate further the annotated SNPs PROVEAN webserver is used. It gives SIFT as well as PROVEAN scores. These scores predict that whether these SNPs are damaging, tolerating or benign. If the score is less than 0.05 in case of SIFT the SNPs are damaging. In case of PROVEAN if cutoff value is below -2.5 then the SNPs are declared damaging.
On the basis of these Scores get the appropriate SNPs that are damaging and linked with the disease onset.
Here is the link for PROVEAN
format for input:
This is the simplest format and at a time you can upload number of SNPs.

Here 2 is the chromosome number, 234566 is the position of SNP G is the reference allele and C is the alt allele.