Variant filtering | filtering and prioritizing genetic variants

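WGS and WES sequencing and analysis produce a large amount of variant data.

Analyzing every variant directly is clearly not practical.

For disease studies, there are a few commonly used filtering routines.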

 

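Variants act on gene expression in two broad ways:

1. coding: these can directly affect the RNA that is produced, and the later folding and assembly of the protein;

2. non-coding: the most popular mechanism studied at the moment is action through enhancers, and there are already fairly good results.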

 

Why filtering is necessary

First, GWAS has already been done, so we need to understand what results GWAS produced and where its limitations lie.

Our previous meta-analysis of genome-wide association studies estimated that common variants together account for a small proportion of heritability estimated from family studies.4 Rare variants might therefore contribute significantly to the missing heritability.

Most of these variants (77.5%) were novel or rare (MAF < 1%).

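Common variants are easy to find with GWAS: because they occur at high frequency, relatively few samples give enough power to detect them. However, common variants usually lie in non-coding regions and influence disease through very complex regulation, and the proportion of variance they explain is low, so they are not the dominant factor in disease. For this reason the field has turned to rare variants. Rare variants are more often in coding regions, change the protein directly, and affect disease in a more direct way, but a very large sample size is needed to have enough power to detect them.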

The analysis showed the strongest association of 328 variants with HSCR (P < 5 × 10⁻⁸), all of which mapped to the known disease susceptibility loci of RET and NRG1 (Figure 1A, upper panel).

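GWAS directly identified 328 significant variants, but they are clearly in strong LD with one another, so in the end they come down to just two genes. Both genes were already known, so at this basic level the GWAS produced no new finding of value.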

Among the 936 WGS samples, a total of 4985 protein-truncating URVs were detected. This is essentially the data I need to work with.

 

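On errors introduced during PCR amplification and errors caused by sequencing quality:

Filtering on DP (read depth) and GQ (genotype quality) removes a large share of them, and BQSR (base quality score recalibration) can also correct for them.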
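As a rough illustration of DP/GQ filtering, the sketch below keeps a genotype call only when both its read depth and its genotype quality reach a minimum. The cutoffs (DP >= 10, GQ >= 20) and the helper names `passes_dp_gq`, `parse_sample_field` and `to_int` are assumptions made for this example, not values or code from the study.

```python
from typing import Optional

# Hypothetical thresholds; tune them to the sequencing depth of the project.
MIN_DP = 10
MIN_GQ = 20


def passes_dp_gq(dp: Optional[int], gq: Optional[int],
                 min_dp: int = MIN_DP, min_gq: int = MIN_GQ) -> bool:
    """Return True if a genotype call has enough depth and genotype quality.

    Missing values (None, i.e. '.' in the VCF) are treated as failing.
    """
    if dp is None or gq is None:
        return False
    return dp >= min_dp and gq >= min_gq


def parse_sample_field(fmt: str, sample: str) -> dict:
    """Pair the VCF FORMAT keys (e.g. 'GT:AD:DP:GQ') with one sample's values."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def to_int(value: Optional[str]) -> Optional[int]:
    """Convert a VCF field to int, mapping missing values to None."""
    if value is None or value == ".":
        return None
    return int(value)


call = parse_sample_field("GT:AD:DP:GQ", "0/1:12,9:21:99")
print(passes_dp_gq(to_int(call.get("DP")), to_int(call.get("GQ"))))  # True
```

Treating missing DP or GQ as a failure is a conservative choice; a project that wants to keep such calls would need to change that branch.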

 

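Databases that may be useful:

1. 1000 Genomes: too few individuals sequenced, only a few thousand, and even fewer once you restrict to a specific population

2. gnomAD: 125,748 exome sequences and 15,708 whole-genome sequences, an impressive amount of sequencing

3. ExAC: exome sequencing, 60,706 unrelated individuals

4. Ensembl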

 

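Points to watch:

1. The population affected by the disease: we focus on East Asians

2. The incidence of the disease: highest among Asians (2.8/10,000 live births); an allele-frequency cutoff of 0.005 (5 per thousand) is generally a reasonable setting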

 

Useful variant annotation tools (the results from different tools can differ considerably; see the paper)

ANNOVAR

gene-based annotation

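# Download the refGene gene annotation for hg19
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
# Gene-based annotation of the bundled example input
annotate_variation.pl -out ex1 -buildver hg19 example/ex1.avinput humandb/
# Convert a single-sample VCF to ANNOVAR input format, then annotate it
convert2annovar.pl -format vcf4 HK152C.vcf > HK152C.avinput
annotate_variation.pl -out HK152C -buildver hg19 HK152C.avinput /home/lizhixin/softwares/annovar/humandb/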

In real use, the VCF first has to be converted to ANNOVAR's own input format.

The functional consequences reported by the annotation are: nonsynonymous SNV, synonymous SNV, frameshift insertion, frameshift deletion, nonframeshift insertion, nonframeshift deletion, frameshift block substitution, nonframeshift block substitution

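What is a nonframeshift deletion? It is a deletion whose length is a multiple of three bases, so whole codons are removed and the reading frame is not disturbed.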
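The multiple-of-three rule can be written down directly. The following is a minimal sketch, assuming VCF-style REF/ALT alleles and a variant that lies entirely inside coding sequence; `classify_indel` is a hypothetical helper, not part of ANNOVAR.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding indel by whether it preserves the reading frame.

    Assumes the variant lies entirely inside coding sequence; splice-site and
    UTR effects are ignored in this sketch.
    """
    diff = len(alt) - len(ref)
    if diff == 0:
        return "substitution"  # same length: not an indel
    kind = "insertion" if diff > 0 else "deletion"
    # A length change divisible by 3 adds or removes whole codons only.
    frame = "nonframeshift" if diff % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


print(classify_indel("ACTG", "A"))   # 3 bases deleted: nonframeshift deletion
print(classify_indel("ACT", "A"))    # 2 bases deleted: frameshift deletion
print(classify_indel("A", "ATTTT"))  # 4 bases inserted: frameshift insertion
```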

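ANNOVAR can also be used to filter variants. The commands below download the gnomAD v2.1.1 exome and genome allele-frequency databases that such filtering relies on: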

annotate_variation.pl -downdb -webfrom annovar -build hg19 gnomad211_exome humandb/
annotate_variation.pl -downdb -webfrom annovar -build hg19 gnomad211_genome humandb/

 

VEP

 

KGGSeq

java -jar /home/lizhixin/softwares/kggseqhg19/kggseq.jar --buildver hg19 --vcf-file HSCR.WGS.2_5.variants.vcf.gz --db-filter 1kgeas201305,gadexome,gadgenome --rare-allele-freq 0.005 --o-vcf

'--rare-allele-freq c' will exclude variants with alternative allele frequency EQUAL to or over c in the reference datasets
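To make the "equal to or over" boundary concrete, here is a small sketch of the same rule applied to a dictionary of reference frequencies. It illustrates the logic only and is not KGGSeq's implementation; the variant names and frequencies are invented, and treating a variant that is absent from a dataset as passing is an assumption of this sketch.

```python
from typing import Dict, Optional


def is_rare(ref_afs: Dict[str, Optional[float]], cutoff: float = 0.005) -> bool:
    """Keep a variant only if its alternative allele frequency is below
    `cutoff` in every reference dataset.

    A frequency of None means the variant is absent from that dataset, which
    is treated here as passing (an assumption of this sketch).
    """
    return all(af is None or af < cutoff for af in ref_afs.values())


# Invented frequencies, keyed by the dataset names used in --db-filter.
variants = {
    "var_common": {"1kgeas201305": 0.21, "gadexome": 0.18, "gadgenome": 0.19},
    "var_boundary": {"1kgeas201305": 0.005, "gadexome": 0.001, "gadgenome": None},
    "var_rare": {"1kgeas201305": None, "gadexome": 0.0004, "gadgenome": 0.0002},
}
kept = [name for name, afs in variants.items() if is_rare(afs)]
print(kept)  # ['var_rare']
```

Note that `var_boundary` is dropped because a frequency exactly equal to the cutoff counts as not rare.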

 

 

Filtering criteria

  • allele frequency, e.g. filter out anything above 5 per thousand
  • known gene sets
  • zygosity (heterozygous or homozygous)
  • protein-truncating (stopgain, splicing, or frameshift)
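The four criteria above can be combined into a single predicate. This is a sketch under several assumptions: the `Variant` fields, the consequence labels in `TRUNCATING`, and the choice to keep only variants inside the gene set are all illustrative, and the example records are invented.

```python
from dataclasses import dataclass
from typing import Set

# Consequence labels counted as protein-truncating in this sketch.
TRUNCATING = {"stopgain", "splicing", "frameshift insertion", "frameshift deletion"}


def zygosity(gt: str) -> str:
    """Turn a diploid VCF genotype such as '0/1' or '1|1' into a zygosity label."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles or len(alleles) != 2:
        return "missing"
    if alleles[0] == alleles[1]:
        return "hom_ref" if alleles[0] == "0" else "hom_alt"
    return "het"


@dataclass
class Variant:
    gene: str
    consequence: str
    max_ref_af: float  # highest allele frequency across reference databases
    gt: str


def keep(v: Variant, genes: Set[str], af_cutoff: float = 0.005) -> bool:
    """Apply the four criteria: frequency, gene set, zygosity, truncation."""
    return (
        v.max_ref_af < af_cutoff
        and v.gene in genes
        and zygosity(v.gt) in {"het", "hom_alt"}  # sample carries the alt allele
        and v.consequence in TRUNCATING
    )


candidates = {"RET", "NRG1"}  # illustrative gene set
print(keep(Variant("RET", "stopgain", 0.0001, "0/1"), candidates))        # True
print(keep(Variant("RET", "synonymous SNV", 0.0001, "0/1"), candidates))  # False
print(keep(Variant("RET", "stopgain", 0.02, "1/1"), candidates))          # False
```

Depending on the study design, the zygosity step could instead require `hom_alt` only (for a recessive model), and the gene-set step could be used to exclude rather than include known genes.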

 

example: rs2435357

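Looking it up in gnomAD: can allele frequency still be used to filter this one? It is a common variant in a non-coding region, and the effect sizes of such variants are usually very small.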

 

Variant Annotation (see the paper)

Annotation was done using KGGseq for protein function against the RefGene, pathogenicity, and population frequencies.

We defined protein-truncating variants as those that lead to (1) gain of the stop codon, (2) frameshift and (3) alteration of the essential splice sites.

Damaging variants include all protein-truncating variants and missense or in-frame variants predicted to be deleterious by KGGseq. Benign variants are missense variants or in-frame variants predicted benign by KGGseq.

Finally, protein-altering variants comprise both damaging and benign variants. Rare variants are those whose minor allele frequency (MAF) is <0.01 in public databases. Ultra-rare variants (URVs) are defined as a singleton variant, that is, one that appeared only once in our whole data set, not present in dbSNP138 or public databases.
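The nested definitions quoted above are easier to follow as code. The sketch below encodes them literally; the `Annotated` record, its field names, and the simplified effect labels are hypothetical, and `predicted_deleterious` merely stands in for whatever prediction KGGSeq reports. Treating a variant with no public frequency as rare is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

TRUNCATING = {"stopgain", "frameshift", "splice_site"}
MISSENSE_OR_INFRAME = {"missense", "inframe"}


@dataclass
class Annotated:
    effect: str                   # simplified consequence label
    predicted_deleterious: bool   # stand-in for the KGGSeq prediction
    public_maf: Optional[float]   # None if absent from public databases
    in_dbsnp: bool
    cohort_allele_count: int      # copies observed in the whole data set


def functional_class(v: Annotated) -> str:
    """Map a variant to protein-truncating, damaging, benign, or other."""
    if v.effect in TRUNCATING:
        return "protein-truncating"
    if v.effect in MISSENSE_OR_INFRAME:
        return "damaging" if v.predicted_deleterious else "benign"
    return "other"


def is_damaging(v: Annotated) -> bool:
    # Damaging includes every protein-truncating variant.
    return functional_class(v) in {"protein-truncating", "damaging"}


def is_protein_altering(v: Annotated) -> bool:
    return is_damaging(v) or functional_class(v) == "benign"


def is_rare(v: Annotated) -> bool:
    return v.public_maf is None or v.public_maf < 0.01


def is_ultra_rare(v: Annotated) -> bool:
    # Singleton in the cohort and unseen in dbSNP or public databases.
    return v.cohort_allele_count == 1 and not v.in_dbsnp and v.public_maf is None


v = Annotated("stopgain", False, None, False, 1)
print(functional_class(v), is_rare(v), is_ultra_rare(v))
# protein-truncating True True
```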

See this KGGSeq option: Gene feature filtering

 

Types of variants:

  • Putative LoF variants
  • Nonsynonymous and missense variants
  • Synonymous variants
  • Exonic variants

A frameshift mutation is a genetic mutation caused by a deletion or insertion in a DNA sequence that shifts the way the sequence is read.

a transcript is defined by its exons, introns and UTRs and their locations

 

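It is very important to keep the classic model of gene structure firmly in mind:

To summarize:

On the genome there are promoters and enhancers, which start transcription under the action of transcription factors. Then comes the gene itself: it has a UTR at each end, that is, a region that is transcribed but not translated, and between them lies the core region of alternating exons and introns. Introns often contain many regulatory elements, such as enhancers.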
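The same model can be expressed as a small data structure, which is also roughly how annotation tools decide which region a variant falls in. This is a simplified sketch for a plus-strand transcript; the `Transcript` class and the coordinates are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


@dataclass
class Transcript:
    exons: List[Interval]  # sorted by start, non-overlapping
    cds_start: int         # first coding base (genomic coordinate)
    cds_end: int           # last coding base

    def introns(self) -> List[Interval]:
        """Introns are the gaps between consecutive exons."""
        return [(a_end + 1, b_start - 1)
                for (_, a_end), (b_start, _) in zip(self.exons, self.exons[1:])]

    def region(self, pos: int) -> str:
        """Label a position on a plus-strand transcript."""
        for start, end in self.exons:
            if start <= pos <= end:
                # Exonic but outside the coding range means UTR.
                if pos < self.cds_start:
                    return "5'UTR"
                if pos > self.cds_end:
                    return "3'UTR"
                return "CDS"
        for start, end in self.introns():
            if start <= pos <= end:
                return "intron"
        return "outside transcript"


tx = Transcript(exons=[(100, 200), (300, 400), (500, 650)], cds_start=150, cds_end=600)
print(tx.introns())    # [(201, 299), (401, 499)]
print(tx.region(120))  # 5'UTR
print(tx.region(350))  # CDS
print(tx.region(450))  # intron
print(tx.region(620))  # 3'UTR
```

Because the UTRs are part of the exons here, the sketch reflects that they are transcribed and only excluded from translation.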

 

 

References:

KGGSeq: A biological Knowledge-based mining platform for Genomic and Genetic studies using Sequence data

A practical guide to filtering and prioritizing genetic variants

Choice of transcripts and software has a large effect on variant annotation

Gene Structure - how mRNA and protein are produced

Regulation of Gene Expression: Operons, Epigenetics, and Transcription Factors - how regulation works

Eukaryotic Gene Regulation part 1

 

Practical details:

Downloading and installing vcftools

Extract subset of samples from multigenome vcf file

Split the samples and annotate each one separately:

for i in HK152C HK154C HK162C HK175C HK180C; do
    echo "$i"
    # -c selects the sample column; -e excludes rows with no variant in that sample
    vcf-subset -e -c "$i" hscr2zxl.sel.vcf.gz > "${i}.vcf"  # optionally: | bgzip -c
done

  

Nonsense-mediated mRNA decay (NMD)

Nonsense-mediated RNA decay in the brain: emerging modulator of neural development and disease
