NGS檢測SNP

時間 2019-12-10

標籤 ngs 檢測 snp 简体版

原文原文鏈接

1，Fastq數據質控html

2,Fastq轉化成bam，包含頭文件java

bwa aln ref.fa test_1.fq > test_1.sai
bwa aln ref.fa test_2.fq > test_2.sai
bwa sampe ref.fa -r "@RG\tID:<ID>\tLB:<LIBRARY_NAME>\tSM:<SAMPLE_NAME>\tPL:ILLUMINA" test_1.sai test_2.sai test_1.fq test_2.fq > test.sam

3,sam 轉化成bam，若是SAM文件中有header @SQ lines。app

samtools view -b -S test.sam > test.bam
##若是沒有header時： samtools faidx ref.fa   
##                 samtools view -bt ref.fa.fai test.sam > test.bam

4,sort bam spa

samtools sort test.bam > test.sorted.bam

或者：

java -jar picard.jar SortSam I=test.bam O=test.sorted.bam SORT_ORDER=coordinate

5, 標記重複code

java -jar picard.jar MarkDuplicate ....

6, index 一下orm

samtools index test.sorted.repeatmark.bam

7,Base Quality Score Recalibrationhtm

....

8, 使用GATK檢測SNPblog

java -jar GenomeAnalysisTK.jar glm SNP -R ref.fa -T UnifiedGenotyper -I test.sorted.repeatmark.bam -o test.raw.vcf

使用samtools和bcftools檢測SNPip

samtools faidx ref.fa
samtools mpileup -d 1000 -DSugf test.sorted.repeatmark.bam > test.raw.vcf    ##（samtools mpileup -vf 。。。）

bcftools view -Nvcg -d 1000 test.raw.vcf > test.snp.vcf                     ##(個人軟件運行這步會出錯,用下面兩行代碼代替)
bcftools call -mv test.raw.vcf > test.raw_varient.vcf
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' test.raw_varient.vcf > test.filt_varient.vcf

##也有直接用perl 腳本實現。在使用bcftools 獲得variant calling變異後的結果後。須要對結果再次進行過濾，主要依據對比結果中的第8列消息，其中的DP4最爲重要，對應的提供了四列：1, 比對結果和正鏈一致的reads數；2, 比對結果和負鏈一致的reads數；3, 比對結果在正鏈的variant 上的reads數；4, 比對結果在負鏈的variant上的reads數。當設定（value3+value4）大於某一閾值，纔算是variant。字符串

bcftools檢測生成的vcf格式有10列。1，參考序列名。2，variant所在的left-most位置。3，variant的ID，（默認未設置，用「.」表示）。4，參考序列的allele。5，variant的allele（有多個alleles，則用「，」分隔）。6,variant/reference Quality。7，FILTers applied。8，varient的信息，使用分號隔開。9，Format of the genotype fields， seperated by colon （optional）。10，Sample genotypes and per-sample information（optional）。

bcftools 的第8列中顯示了對variants的信息描述，其中Tag的描述以下：

Tad　　Format　　Description

AF1　　double　　Max-likelihood estimate of the site allele Frequency （AF）of the first ALT allele

DP　　　int　　　　Raw read depth (without quality filtering)

DP4　　int[4]　　　　high-quality reference forward base, ref reverse, alternate for and alt rev bases

FQ　　int　　　　　consensus quality. Positive: sample genotypes different; negative: otherwise

MQ	int	Root-Mean-Square mapping quality of covering reads
PC2	int[2]	Phred probability of AF in group1 samples being larger (,smaller) than in group2
PCHI2	double	Posterior weighted chi^2 P-value between group1 and group2 samples
PV4	double[4]	P-value for strand bias, baseQ bias, mapQ bias and tail distance bias
QCHI2	int	Phred-scaled PCHI2
RP	int	# permutations yielding a smaller PCHI2
CLR	int	Phred log ratio of genotype likelihoods with and without the trio/pair constraint
UGT	string	Most probable genotype configuration without the trio constraint
CGT	string	Most probable configuration with the trio constraint

使用bcftools過濾掉不可靠的位點：

bcftools filter的參數：

-e -exclude 主要用於表達式方式去除匹配上的位點，這個參數很關鍵，過濾須要此表達式

-g -SnpGap 過濾INDEL附近的snp位點，好比-SnpGap 5 則過濾INDEL附近5個鹼基距離內的SNP

-G -IndelGap 過濾INDEL附近的INDEL位點

-o -output 輸出文件的名稱

-O -output-type 輸出的格式，通常z和v都行

-s -soft-filter 將過濾掉的位點用字符串註釋

-S -set-GTs setgenotypes of failed samples to missing value (.) or reference allele (0) （將不符合要求的個體基因改成"."）

eg:過濾QUAL小於10，DP值小於5，INDEL附近的位點

bcftools filter -O v  -o test.filter_variant.vcf -s LOWQUAL -e 'QUAL<10 || FMT/DP < 5' --SnpGap 5 --set-GTs .  test.vcf.gz

提取過濾後的SNP位點

bcftools view -v snps test.filter_variance.vcf > test.snp_filter.vcf

或者在vcf文件中的INFO列裏，若是是INDEL的話，會標註出INDEL，所以提取SNP也能夠：

grep -v 'INDEL'  test.filter_variance.vcf > test.snp_filter.vcf

注意：

|| 與 | 區別：都表示「或」運算，可是 || 運算符第一個表達式成立的話，後面的表達式不運算，直接返回。而 | 對全部表達式都判斷。 && 與 & 的區別同理

參考：https://www.cnblogs.com/xiaofeiIDO/p/6857745.html

　　　http://www.bioinfo-scrounger.com/archives/248

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。