SNPs (Single Nucleotide Polymorphisms) are genetic markers of choice for both linkage and association mapping and for population structure and evolution analysis. They are virtually unlimited, evenly distributed along the genome, bi-allelic and co-dominant.
Single nucleotide polymorphism (SNP) detection technologies are used to scan for new polymorphisms and to determine the allele(s) of a known polymorphism in target sequences.Identification of SNPs is helpful in identifying the actual cause of genetic diseases..Basic pipeline for identification of SNPs is given below and I hope it will help you in your projects and assignments.
Single nucleotide polymorphism (SNP) detection technologies are used to scan for new polymorphisms and to determine the allele(s) of a known polymorphism in target sequences.Identification of SNPs is helpful in identifying the actual cause of genetic diseases..Basic pipeline for identification of SNPs is given below and I hope it will help you in your projects and assignments.
1) To convert SRA file
downloaded from NCBI into fastQ format
fastq-dump.2.3.5
--origfmt --split-3 SRR1915421.sra
where
·
fastq-dump.2.3.5 is the command to convert
sra into fastq
·
SRR1915421.sra is file downloaded from NCBI
SRA
2) Making index files of
reference sequence
bwa
index -a bwtsw -p hg19.fa
·
where hg19.fa is the human reference sequence
downloaded from UCSC browser
·
bwa index command is used to make indexes of
reference
3) Alignment is done using BWA
aligner
bwa
aln -t 30 -f SRR.sai ../../indexes/hg19/hg19 SRR1915421.fastq
where
·
bwa aln is command to align the reas on
reference genome
·
SRR.sai is the name of file that will be
obtained as a result of alignment
·
SRR1915421.fastq is the file that is to be
aligned on reference genome
4) Formation of SAM file
samse
-f ERR.sam -r
'@RG\tID:D3NH4HQ1\tPL:illumina\tPU:D3NH4HQ1:139:C153UACXX\tLB:6\tSM:SRR1915421\tCN=nad'
../../indexes/hg19/hg19 SRR.sai SRR1915421.fastq
where
·
samse is used when we have single end data
sampe is used when paired end data
·
ERR.sam is the name of sam file that will be
generated using the above mentioned command
·
'@RG\tID:D3NH4HQ1\tPL:illumina\tPU:D3NH4HQ1:139:C153UACXX\tLB:6\tSM:SRR1915421\tCN=nad'
is the format of the NGS read file
·
SRR.sai is the file obtained as a result of
alignment
·
SRR1915421.fastq is the read data
5) Validation of SAM file and SAM to BAM
conversion
java -jar picard.jar ValidateSamFile
I=ERR.sam MODE=SUMMARY
·
where I =
input and ERR.sam is the sam file
java -Xmx4g -Djava.io.tmpdir=/tmp -jar
../../../softwares/picard-
tools-2.1.1/picard.jar SortSam SO=coordinate INPUT=ERR.sam
OUTPUT=SRR.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
where
·
CREATE_INDEX argument to true we automatically create an
index file for the generated bam file.
·
SRR.bam is the bam file obtained as a result of conversion of
sam
·
By setting the validation stringency to lenient, picard
ignores some validation errors which frequently occur at the alignment step.
6) Marking
duplicates
java -Xmx4g
-Djava.io.tmpdir=/tmp -jar ../../../softwares/picard- tools-
2.1.1/picard.jar MarkDuplicates INPUT=SRR.bam OUTPUT=SRR.marked.bam
METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT
where
·
SRR.marked.bam is the file that contains
the duplicates
·
Again the validation
stringency is set to lenient but no index is created as we will perform more
changes on the output bam file.
7)
Local realignment around indels
Indels within reads often lead to false positive SNPs at the
end of sequence reads. To prevent this artifact, local realignment around
indels is done. It is done in two steps
java -Xmx4g -jar ../../../softwares/GenomeAnalysisTK.jar -T
RealignerTargetCreator -R ../../indexes/hg19/hg19.fa -o SRR.bam.list -I
SRR.marked.bam
This step creates the file in the form of
table format which is SRR.bam.list
Where
·
SRR.bam.list is the file in table format
that can be used further for realigning step
·
SRR.marked.bam is the marked duplicate file
obtained in the previous step
2nd step
java -Xmx4g -Djava.io.tmpdir=/tmp -jar
../../../softwares/GenomeAnalysisTK.jar -I SRR.marked.bam -R
../../indexes/hg19/hg19.fa -T IndelRealigner -targetIntervals SRR.bam.list -o
SRR.marked.realigned.bam
This step will generate
SRR.marked.realigned.bam file
While working on paired end data
Java -Djava.io.tmpdir=/tmp/flx-auswerter\ -jar
picard/FixMateInformation.jar \
INPUT=input_bam.marked.realigned.bam \
OUTPUT=input_bam.marked.realigned.fixed.bam \
SO=coordinate \
VALIDATION_STRINGRNCY=LENIENT \
CREATE_INDEX=true
8) Quality
score recalibration
Quality data generated from the sequencer isn't always very
accurate and for obtaining good SNP calls (which rely on base quality scores),
recalibration of these scores is necessary and is performed in two steps
Java -Xmx4g -jar GenomeAnalysisTK.jar \ -l
INFO \ -R chr15.fa \ --DBSNP dbsnp132.txt \ -I input.marked.realigned.fixed.bam
\ -T CountCovariates \ -cov ReadGroupCovariate \ -cov CycleCovariate \ -cov
DinucCovariate \ -recalFile input.recal_data.csv
This step creates a .csv file which is needed for the next
step and requires a dbSNP file, which can be downloaded at the UCSC Genome
browser homepage(http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/).
Download the dbsnp132.txt.gz file and unzip it using gunzip
gunzip dbsnp132.txt.gz
2nd step
Java -Xmx4g -jar
GenomeAnalysisTK.jar \ -l
INFO \ -R chr15.fa \ -I input.marked.realigned.fixed.bam \ -T
TableRecalibration \ --out input.marked.realigned.fixed.recal.bam \ -recalFile
input.recal_data.csv
9) SNP calling
java -Xmx4g -jar ../../../softwares/GenomeAnalysisTK.jar
-glm BOTH -R ../../indexes/hg19/chr15.fa -T UnifiedGenotyper -I
SRR.marked.realigned.bam -D ../bwa/snp144.vcf -o SRR.vcf -metrics snps.metrics
-stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A Coverage -A AlleleBalance
-L target_intervals.bed
where
·
The “glm both”
argument enables SNP and Indel calling.
·
chr15.fa is the reference genome file
·
SRR.marked.realigned.bam is the realigned
bam file obtained in previous steps
·
snp144.vcf is the raw snp file obtained as
a result of the above mentioned command
·
The metrics file (snps.metrics ) specified
by the metrics command contains statistics for the SNP calling step
·
The -L command
contains a list of the targeted intervals we wanted to sequence. To get a bed file for specific gene locus:
http://genome.ucsc.edu/cgi-bin/hgTables
While downloading. bed
file Choose the hg19 assembly of the
human genome and set the track to RefSeq genes and the table to refGene. Use
BED format as output format and assign the file an appropriate name (I use
target_intervals.bed).
10) SNP filtering
Convert the VCF (variant call format file) into excel format
by just opening vcf file and copy it into excel sheets.
Apply filters on specific columns by selecting filter option
in the excel sheets.
Some of the filters are
·
QUAL<30 à REMOVE THESE SNPs
·
AF>1 à remove these as well as frequency greater
than 1 are not SNPs they are known as polymorphisms
·
QD (Quality by depth)
<6 à remove these as well
To apply more filters
to remove false positive results check the link given below.
http://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set
11)
Annotation using SNPnexus
In order
to perform functional annotation of SNPs go to the link http://snp-nexus.org/
The SNPnexus is useful for the functional
annotation of SNPs and predicts whether the SNP is damaging or tolerating. For
this purpose, it gives polyphen as well as SIFT scores. Apart from these scores
this online webserver predicts the change in amino acids at specific positions,
disease linked with the snp and predicts whether it is known or Novel snp.
To do the annotation simply input the SNP
details manually
Chromosome 1 190641277 A G 1
Chromosome 1 190641388 A C 1
Chromosome 1 190641422 A C 1
This format is the easiest way to do
annotation. Many other formats are also available but this is the simple one.
Chromosome 1 à chromosome number
1980875 à chr position
A à Ref allele
G àALT allele
1 à chr band
12)
Further validation of annotated SNPs
To
validate further the annotated SNPs PROVEAN webserver is used. It gives SIFT as
well as PROVEAN scores. These scores predict that whether these SNPs are
damaging, tolerating or benign. If the score is less than 0.05 in case of SIFT
the SNPs are damaging. In case of PROVEAN if cutoff value is below -2.5 then
the SNPs are declared damaging.
On the
basis of these Scores get the appropriate SNPs that are damaging and linked
with the disease onset.
Here is
the link for PROVEAN
format for input:
2,230633386,G,C
1,100382265,C,A
This is
the simplest format and at a time you can upload number of SNPs.
Here 2 is
the chromosome number, 234566 is the position of SNP G is the reference allele
and C is the alt allele.
No comments:
Post a Comment