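Questions:
What is the essential nature of a SNP? Broadly, it is a variant: the most common type of genetic variation, on the same level as indels, CNVs and SVs. Each SNP represents a difference in a single DNA building block, called a nucleotide. Narrowly, it is a marker (a biological marker): because a SNP involves a single base, a SNP is also a locus that marks one position on a chromosome. About 99.9% of the genome is identical across most people; SNP sites are among the positions that vary within the population. These differences/markers can be used to study disease: statistical tests pick out the sites most strongly associated with a disease and so identify its risk alleles.
How does a SNP array work? A SNP array does not read out single bases; it calls alleles. So the genotype at each SNP in GWAS data takes one of three values (1 - AA; 2 - AB; 3 - BB), which may also be coded as 0, 1, 2.
What is the difference between linkage disequilibrium (LD) and pairwise correlation?
How are somatic and germline mutations told apart? In multicellular organisms, mutations can be classed as either somatic or germ-line. Telling them apart usually requires sequencing trios or matched healthy tissue. The most obvious case is cancer, where most of the variants are somatic.
What is the difference between SNP, variant and mutation? In common usage a SNP is assumed to be neutral while a mutation is assumed to be disease-related; the other difference is frequency: a SNP is frequent in the population, a mutation is rare. "Variant" and "variation" are synonyms; strictly they are broader than SNP (they also cover indels, CNVs and SVs), although in GWAS contexts they are often used interchangeably with SNP.
Why do we still need haplotypes? What motivated the HapMap project? The HapMap is valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to roughly 500,000 tag SNPs.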
On what basis are common and rare variants distinguished? (paper) How should "common" and "rare" be understood here? A variant is a SNP, so this reads "a common variant"; but a SNP is a site, and how can a site be common or uncommon? This is a little counterintuitive. "Common" here refers to the minor allele, i.e. the second most common allele. Take a SNP such as rs78601809: its position is known, its allele frequency in different populations is known, and its overall MAF is 0.39 (T). A SNP with MAF < 1% is a rare variant; intuitively, the base at that site rarely differs between people. Thresholds vary between papers, for example: rare variants (MAF < 0.05) appeared more frequently in coding regions than common variants (MAF > 0.05) in this population
Genetic variants that are outside the reach of the most statistically powered association studies [13] are thought to contribute to the missing heritability of many human traits, including common variants (here denoted by minor allele frequency [MAF] >5%) of very weak effect, low-frequency (MAF 1–5%) and rare variants (MAF <1%) of small to modest effect, or a combination of both, with several possible scenarios all deemed plausible in simulation studies [14].
common variants together account for a small proportion of heritability estimated from family studies. Common variants usually lie in non-coding regions, make up only a small fraction of all variants, and tend to have small effect sizes.
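To make the MAF thresholds concrete, here is a minimal Python sketch that counts alleles at one biallelic SNP and bins the result with the cut-offs from the passage quoted above (>5% common, 1-5% low-frequency, <1% rare). The function names and the toy genotypes are made up for illustration; in practice these numbers would come from PLINK or a similar tool.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, MAF) for one biallelic SNP.

    genotypes: list of two-character strings such as "CT";
    "00" marks a missing call and is skipped.
    """
    counts = Counter()
    for g in genotypes:
        if g == "00":
            continue
        counts.update(g)  # each genotype contributes two alleles
    if len(counts) > 2:
        raise ValueError("expected a biallelic site")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no called genotypes")
    if len(counts) == 1:
        return None, 0.0  # monomorphic: no minor allele observed
    allele, n = min(counts.items(), key=lambda kv: kv[1])
    return allele, n / total


def frequency_class(maf):
    """Bin a MAF using the >5% / 1-5% / <1% convention quoted above."""
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"


# Toy data (hypothetical): 6 CC, 3 CT and 1 TT individuals.
toy = ["CC"] * 6 + ["CT"] * 3 + ["TT"]
allele, maf = minor_allele_frequency(toy)
print(allele, maf, frequency_class(maf))
```

With these ten individuals there are 20 alleles, 5 of them T, so T is the minor allele at a frequency of 0.25.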
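What do small effect and large effect mean for a SNP? See effect size.
Terms that are extremely easy to confuse: SNP, mutation, variant, allele, genotype; allele frequency, genotype frequency, alternative allele frequency, MAF. You must be able to tell these apart quickly, otherwise what you are doing is not real statistical genetics.
What are gene-based rare-variant burden tests for? See: Increased Burden of Rare Variants Among S-HSCR.
What are epistatic effects?
Why is L-HSCR said to be autosomal dominant? It is hard to call the relationship fully linear; dominance relationships are very complex, with incomplete dominance and dosage effects.
How should alleles and dominance be viewed at the level of DNA sequence? On understanding alleles: the classical allele, as a combination of variants within a gene, is a very abstract concept. Dominant vs. recessive: we are diploid, so we carry two alleles of each gene. In a heterozygote the two gene sequences differ, so the proteins they express can differ too, and the two alleles have a complex dominance relationship. That is why conventional gene expression analysis is actually quite coarse; ideally expression would be measured at the isoform level, since there is still some distance between gene and protein. The main reason this has not yet reached the isoform level is that our knowledge of proteins is still insufficient.
A new research question: how did human populations worldwide gradually diverge into what we see today, and which core genetic factors determine the phenotypic differences between them? Also, why do some diseases differ markedly in frequency between populations, why is the incidence of HSCR higher in Asians, and what role do genetic factors play?
Genetic effects: Additive genetic effects occur when two or more genes source a single contribution to the final phenotype, or when alleles of a single gene (in heterozygotes) combine so that their combined effects equal the sum of their individual effects. Non-additive genetic effects involve dominance (of alleles at a single locus) or epistasis (of alleles at different loci). In the additive case, disease risk is proportional to the number of risk alleles.
How many variants/SNPs are there in the human genome? The 1000 Genomes figure is 84.4 million, which is conservative because it covers only 2,504 people, roughly 100 per population; that is reasonably representative, but the true number must be higher. A rough guess of 300 million would mean about 10% of the roughly 3 billion bases, i.e. one variant per 10 bases across the population. For one individual the figure is around 3 million, i.e. about one base in a thousand.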
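A quick check of the arithmetic, taking the genome as roughly 3 billion bases (an assumed round figure; the variant counts are the ones given in the paragraph above). The helper name is made up for illustration.

```python
GENOME_SIZE = 3.0e9   # approximate haploid genome length in bp (assumption)
CATALOGUED = 84.4e6   # variants reported by the 1000 Genomes Project (from the text)
GUESS_TOTAL = 300e6   # the rough guess for the true total (from the text)
PER_PERSON = 3e6      # variants carried by one individual (from the text)


def bases_per_variant(n_variants, genome_size=GENOME_SIZE):
    """Average spacing: one variant every this many bases."""
    return genome_size / n_variants


for label, n in [("catalogued", CATALOGUED),
                 ("guessed total", GUESS_TOTAL),
                 ("per individual", PER_PERSON)]:
    print(f"{label}: one variant per {bases_per_variant(n):.0f} bases")
```

So the catalogued variants come to about one per 36 bases, the 300 million guess to one per 10 bases, and a single genome to about one per 1,000 bases.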
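First, an intuitive view of how GWAS works:
The core is the association between SNPs and phenotype. For each position in the genome, if a particular SNP allele keeps turning up together with a disease, that is, SNP and phenotype vary together, we can infer that this SNP is very likely associated with the phenotype (disease).
More formally, we test whether the allele frequency of a SNP differs significantly between the case and control populations.
In practice, sample size is limited, cases and controls are sometimes unbalanced, samples differ by sex and ancestry, and a statistical test may have to be run at potentially every one of the roughly 3 billion base positions.
Which metrics should we design to assess the association between a SNP and a phenotype?
Think: what if one site has several SNPs and only one of them is associated with the disease? This is a misconception: a genomic site can hold only one SNP, though that SNP can have several alleles.
Remember: a point in a Manhattan plot is a SNP, not a sample.
Think: in a Manhattan plot the significant SNPs do not stick out alone; they seem to be lifted up by their neighbours, rising gradually from the base like a skyscraper. Such a tower is a set of linked SNPs in strong LD with one another.
Think: although each point in a Manhattan plot is a SNP, the most significant SNP is usually assigned to a gene, because what people care about most is the causal basis of the SNP; but doing this only finds SNPs in coding regions.
Note: the top SNP is very likely not the causal SNP; it is only near the causal SNP. So how do we find the causal SNP? Fine mapping.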
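Basic background
What is a SNP? A single-base mutation that arose at random during evolution and is stably inherited in the population.
What is allele frequency in a population? Every polymorphic site in the genome has two or more alleles, and the alleles differ clearly in frequency. Think of it as the frequencies of A and a, except that here they refer to a base position rather than a trait gene.
Prerequisites for a GWAS
Sufficient sample size: anyone who has studied statistics knows that sample size affects power, and without enough power you cannot reach a correct conclusion. GWAS usually needs many samples; a few thousand is standard, a few hundred is too few, and some studies now reach tens or hundreds of thousands.
A big misconception is that GWAS uses whole-genome sequencing (WGS). It does not, that is too expensive; most studies use a DNA chip (properly, a SNP array) covering only about 10^6 common SNPs. With a bit more money you do WES and get all the SNPs in coding regions; with the most money you do WGS and detect everything, coding and non-coding, common and rare, which is what made the 1000 Genomes Project so powerful.
That covers the rough principle. There is also the statistical side, skipped for now; hands-on work first.
How do you run a GWAS with PLINK? YouTube video: GWAS in Plink. It comes with the paper, example data and code to download, so you can run it to get familiar.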
References:
Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays
Discriminating somatic and germline mutations in tumor DNA samples without matching normals
The impact of rare and low-frequency genetic variants in common disease
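A published GWAS pipeline: A tutorial on conducting genome-wide association studies: Quality control and statistical analysis.
The operational details of this pipeline are covered below:
There are four main kinds of analysis:
First, PLINK's text file formats:
ped: rows are individuals; columns are the phenotype and the SNP genotype data;
map: per-SNP information;
The binary fileset consists of three files:
Essentially, ped is split into fam and bed, and map becomes bim.
Covariates are usually included in the analysis, so there is also a covariate file.
QC: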
| Step | Command | Function |
|---|---|---|
| 1: Missingness of SNPs and individuals | --geno | Excludes SNPs that are missing in a large proportion of the subjects. In this step, SNPs with low genotype calls are removed. |
| | --mind | Excludes individuals who have high rates of genotype missingness. In this step, individuals with low genotype calls are removed. |
| 2: Sex discrepancy | --check-sex | Checks for discrepancies between sex of the individuals recorded in the dataset and their sex based on X chromosome heterozygosity/homozygosity rates. |
| 3: Minor allele frequency (MAF) | --maf | Includes only SNPs above the set MAF threshold. |
| 4: Hardy–Weinberg equilibrium (HWE) | --hwe | Excludes markers which deviate from Hardy–Weinberg equilibrium. |
| 5: Heterozygosity | For an example script see https://github.com/MareesAT/GWA_tutorial/ | Excludes individuals with high or low heterozygosity rates. |
| 6: Relatedness | --genome | Calculates identity by descent (IBD) of all sample pairs. |
| | --min | Sets threshold and creates a list of individuals with relatedness above the chosen threshold. Meaning that subjects who are related at, for example, pi-hat >0.2 (i.e., second degree relatives) can be detected. |
| 7: Population stratification | --genome | Calculates identity by descent (IBD) of all sample pairs. |
| | --cluster --mds-plot k | Produces a k-dimensional representation of any substructure in the data, based on IBS. |
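As an illustration of what the HWE step in the table is checking, here is a small Python sketch of a chi-square goodness-of-fit test on genotype counts. This is a textbook approximation written for these notes, not PLINK's own code (PLINK's --hwe is based on an exact test), and the function name and example counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions.

    Takes observed counts of the three genotypes and returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


print(hwe_chi_square(25, 50, 25))   # counts exactly in HWE proportions
print(hwe_chi_square(50, 0, 50))    # no heterozygotes at all: strong deviation
```

A marker would be dropped when its p-value falls below the chosen cut-off; a complete lack of heterozygotes, as in the second call, is the kind of pattern that often points to a genotyping problem.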
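For context: GWAS only took off around 2007, which is why the well-known review on ten years of GWAS came out in 2017. Fine mapping came after GWAS.
The lab started working on fine mapping early: 2009 - Fine mapping of the 9q31 Hirschsprung’s disease locus
From its introduction: what is fine mapping?
The aim is simple: most variants found by GWAS are not the causal variants, and fine mapping is meant to fill this gap.
Once GWAS has produced candidate SNPs, two further analyses are needed:
The first step assigns each SNP a probability of being causal, which is fine-mapping; the second uses functional annotation to connect the SNP to the gene it acts on.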
The first is to assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping. The second step is to try to connect these variants to likely genes whose perturbation leads to altered disease risk by functional annotation.
Basic principles:
Strategies for fine-mapping complex traits
Although eQTLs are increasingly used to provide mechanistic interpretations for human disease associations, the cell type specificity of eQTLs presents a problem. Because the cell type from which a given physiological phenotype arises may not be known, and because eQTL data exist for a limited number of cell types, it is critical to quantify and understand the mechanisms generating cell type specific eQTLs. For example, if a GWAS identifies a set of SNPs associated with risk of type II diabetes, the researcher must choose a target cell type to develop a mechanistic model of the molecular phenotype that causes the gross physiological change. One can imagine that the relevant cell type might be adipose tissue, liver, pancreas, or another hormone-regulating tissue. Furthermore, if the GWAS SNP produces a molecular phenotype (i.e., is an eQTL) in lymphoblastoid cell lines (LCLs), it is not necessarily the case that the SNP will generate a similar molecular phenotype in the cell type of interest. Furthermore, there are many examples of cell types with particular relevance to common diseases, for example dopaminergic neurons and Parkinson's disease, that lack comprehensive eQTL data or catalogs of CREs. The utility of eQTLs for complex trait interpretation will therefore be improved by a more thorough annotation of their cell type specificity.
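The biggest problem with eQTLs is still insufficient cell-type specificity; the key is for cell types to be defined precisely enough!
GWAS is by now a fairly old technique and has hit a serious bottleneck: association between SNPs and phenotype alone is no longer enough. What is needed is a concrete biological explanation of how these SNPs lead to disease.
Moreover, for most diseases what is found is not a handful of significant SNPs but a great many, and most of them fall in non-coding regions, which makes them very hard to interpret.
For a classic example of interpretation, see this New England Journal of Medicine article: FTO Obesity Variant Circuitry and Adipocyte Browning in Humans
The core outputs of a GWAS are just two, the Manhattan plot and the QQ plot; being able to read them is enough.
Simply being able to run a GWAS pipeline has little value now; the emphasis is on downstream analysis, with several hot topics: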
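The International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/; Gibbs et al., 2003) described the patterns of common SNPs within the human DNA sequence whereas the 1000 Genomes (1KG) project (http://www.1000genomes.org/; Altshuler et al., 2012) provided a map of both common and rare SNPs.
Common and rare are defined by allele frequency, but there seems to be no fixed cut-off.
HapMap used arrays, so everything it measured was a hand-picked set of sites, hence common SNPs; 1000 Genomes used WGS, so it covers all sites, common and rare together.
The core of GWAS is LD. Most GWAS today are genotyped on arrays because they are cheap.
A GWAS misses many sites, which is why fine-mapping exists, with imputation based on haplotypes.
Linkage disequilibrium (LD): alleles at different loci occur together in the population at a certain frequency. LD is the phenomenon where, in a given population, two alleles at different loci are found on the same chromosome more often than expected by chance. (In other words, Mendelian independent assortment does not hold for nearby loci: the closer two alleles lie on a chromosome, the more they tend to be inherited together, which is a physical constraint.)
For example, take two neighbouring genes A and B, with alternative alleles a and b. If A and B were inherited independently, the expected frequency of the haplotype AB in the offspring population would be P(A) * P(B). The frequency actually observed in the population is P(AB). The disequilibrium is measured as: D = P(AB) - P(A) * P(B).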
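The formula for D can be tried out directly. The Python sketch below computes D from a haplotype frequency and two allele frequencies; it also returns r², a commonly used normalised form of D (the squared correlation between the two loci) that is not defined in the text above and is included here only as an addition. The function name and the example frequencies are made up.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: observed frequency of the haplotype carrying both A and B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


print(ld_stats(0.25, 0.5, 0.5))  # independent loci: D = 0
print(ld_stats(0.5, 0.5, 0.5))   # A always travels with B: D = 0.25, r2 = 1
```

Because r² is a squared correlation coefficient between the alleles at the two loci, this is also one way to see how LD relates to pairwise correlation.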
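In practice, one can genotype a large number of SNP markers spread across the genome, or markers near candidate genes, and look for sites that show association with the disease because they lie close enough to the causal locus. This is the basic idea of allelic association analysis, or gene mapping by linkage disequilibrium.
Paper to read: Strategies for fine-mapping complex traits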
assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping.
Some other very important concepts:
effect size
power: statistical power, power analyses
Underestimated Effect Sizes in GWAS: Fundamental Limitations of Single SNP Analysis for Dichotomous Phenotypes
In context: One explanation of the missing heritability is that complex diseases are caused by a large number of causal variants with small effect sizes.
PRS combines the effect sizes of multiple SNPs into a single aggregated score that can be used to predict disease risk
haplotype phasing
Positions with 00 and 11 are called homozygous positions. Positions with 10 or 01 are called heterozygous positions. We note that the reference genome is neither the paternal nor the maternal genome but the genome of an un-related human (or more precisely the mixture of genomes of a few individuals). An individual’s haplotype is the set of variations in that individual’s chromosomes. We note that as any two human haplotypes are 99.9% similar, the mapping problem can be solved quite easily.
Haplotype phasing is the problem of inferring information about an individual’s haplotype. To solve this problem, there are many methods.
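To see why phasing is a real problem, the following Python sketch lists every pair of haplotypes compatible with one unphased genotype. It uses the 0/1 coding from the passage above (0 = reference allele, 1 = variant) with genotypes given as variant-allele counts per site. It is a brute-force illustration written for these notes, not one of the actual phasing methods, and the function name is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    genotype: sequence of variant-allele counts per site (0, 1 or 2).
    Returns a sorted list of (hap1, hap2) tuples of 0/1 strings.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> both haplotypes 0, 2 -> both 1.
    base = [g // 2 for g in genotype]
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        h1, h2 = base[:], base[:]
        for site, allele in zip(het_sites, choice):
            h1[site], h2[site] = allele, 1 - allele
        a = "".join(map(str, h1))
        b = "".join(map(str, h2))
        pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# Sites 1 and 3 are heterozygous, site 2 is homozygous variant.
print(compatible_haplotype_pairs([1, 2, 1, 0]))
```

With k heterozygous sites there are 2^(k-1) candidate pairs, so the genotype alone quickly stops being informative and extra evidence (reads, relatives or population haplotypes) is needed to pick one.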
Lecture 10: Haplotype Phasing - Community Recovery
Reference: PLINK | File format reference
vcftools
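Main functions of plink: data management, basic statistics for quality control, population stratification analysis, basic single-locus association tests, transmission disequilibrium tests for family data, multipoint linkage analysis, haplotype association analysis, copy number variation analysis, meta-analysis, and so on.
First you need to know plink's three binary formats: bed, fam and bim. (Note: this bed is completely different from the BED file used for genomic regions.)
The formats plink needs can generally be converted from a VCF file (it is also worth getting to know the ped and map formats):
PED: Original standard text format for sample pedigree information and genotype calls. Normally must be accompanied by a .map file. Pedigree and genotype information; each line is one person.
MAP: Variant information file accompanying a .ped text pedigree + genotype table. Variant information; each line is one variant (SNP).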
# PED
1 1 0 0 1 0 G G 2 2 C C
1 2 0 0 1 0 A A 0 0 A C
1 3 1 2 1 2 0 0 1 2 A C
2 1 0 0 1 0 A A 2 2 0 0
2 2 0 0 1 2 A A 2 2 0 0
2 3 1 2 1 2 A A 2 2 A A
# MAP
1 snp1 0 1
1 snp2 0 2
1 snp3 0 3
# Convert VCF to ped and map
plink --vcf file.vcf --recode --out file
# Convert ped and map to bed, bim and fam
plink --file test --make-bed --out test
bed file (the real bed file is binary and hard to read)
bed: Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files. Loaded with --bfile; generated in many situations, most notably when the --make-bed command is used. Do not confuse this with the UCSC Genome Browser's BED format, which is totally different. Genotype information. After conversion it can be viewed as a matrix in which each row is an individual and each column a variant, with 0, 1 and 2 standing for aa, Aa (or aA) and AA. The actual bases are not tracked, because the identity of A, T, G or C does not matter here.
fam: Sample information file accompanying a .bed binary genotype table. Sample information; each line is one sample.
bim: Extended variant information file accompanying a .bed binary genotype table. Each line is one variant together with its annotation.
             rs4970383 rs3748592 rs9442373 rs1571150 rs6687029
2431:NA19916 2 0 0 0 1
2424:NA19835 1 0 1 2 0
2469:NA20282 1 0 1 0 1
2368:NA19703 0 0 0 2 0
2425:NA19901 1 0 1 2 2
OR
# xxd -b test.bed
00000000: 01101100 00011011 00000001 11011100 00001111 11100111 l.....
00000006: 00001111 01101011 00000001 .k.
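The binary dump above can be decoded by hand. The Python sketch below follows the PLINK 1 .bed layout as I understand it from the file format reference: three header bytes (0x6c 0x1b 0x01 for variant-major data), then for each variant a block of bytes holding four samples each, two bits per sample read from the low-order end, with 00 = homozygous for allele 1 in the .bim file, 01 = missing, 10 = heterozygous and 11 = homozygous for allele 2. The function name is made up, and the sample and variant counts (6 and 3) are taken from the six-line fam and three-line bim examples in these notes.

```python
def decode_bed(data, n_samples, n_variants):
    """Decode a variant-major PLINK 1 .bed byte string.

    Returns one list per variant giving, for each sample, the number of
    copies of allele 1 from the .bim file (None = missing call).
    """
    if data[:3] != b"\x6c\x1b\x01":
        raise ValueError("not a variant-major PLINK 1 .bed file")
    codes = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}
    bytes_per_variant = (n_samples + 3) // 4   # 4 samples per byte, rounded up
    rows = []
    for v in range(n_variants):
        start = 3 + v * bytes_per_variant
        block = data[start:start + bytes_per_variant]
        row = []
        for s in range(n_samples):
            byte = block[s // 4]
            # Sample s occupies bits 2*(s % 4) and 2*(s % 4) + 1 of its byte.
            row.append(codes[(byte >> (2 * (s % 4))) & 0b11])
        rows.append(row)
    return rows


# The nine bytes shown in the xxd output above.
example = bytes([0b01101100, 0b00011011, 0b00000001,
                 0b11011100, 0b00001111, 0b11100111,
                 0b00001111, 0b01101011, 0b00000001])

for row in decode_bed(example, n_samples=6, n_variants=3):
    print(row)
```

The first decoded row comes out as two copies of allele 1 for the first sample, a missing call for the third and zero copies elsewhere, which lines up with the snp1 column of the PED example (G G, A A, 0 0, A A, A A, A A) when allele 1 is G.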
fam file
1 2431 NA19916 0 0 1
2 2424 NA19835 0 0 2
3 2469 NA20282 0 0 2
4 2368 NA19703 0 0 1
5 2425 NA19901 0 0 2
OR
1 1 0 0 1 0
1 2 0 0 1 0
1 3 1 2 1 2
2 1 0 0 1 0
2 2 0 0 1 2
2 3 1 2 1 2
bim file
1 1 rs4970383 0 828418 A
2 1 rs3748592 0 870101 A
3 1 rs9442373 0 1052501 C
4 1 rs1571150 0 1464167 A
5 1 rs6687029 0 1508931 C
OR
1 snp1 0 1 G A
1 snp2 0 2 1 2
1 snp3 0 3 A C
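Association analysis: the "AS" in GWAS (genome-wide association study). It uses millions of single nucleotide polymorphisms (SNPs) across the genome as molecular genetic markers for genome-wide case-control or correlation analysis, a strategy for finding, by comparison, the genetic variants that influence complex traits. Variants are selected across the whole genome and genotyped; the frequency of each variant is compared between the affected and control groups; the strength of association between each variant and the trait is assessed statistically; the most strongly associated variants are taken forward for validation; and the association with the trait is finally confirmed from the validation results.
Linkage disequilibrium (LD): if there is no linkage disequilibrium, the two loci are independent and combine at random, so P(AB) = P(A) * P(B). In general, the observed frequency of the haplotype carrying both A and B is P(AB) = D + P(A) * P(B), where D measures the degree of LD between the two loci.
Manhattan plot: in biology and statistics, when doing frequency statistics, mutation distributions or GWAS association analysis, we often see very attractive Manhattan plots, which show the distribution and values of candidate sites at a glance. A plot needs site coordinates and p-values: the map file holds at least three columns (chromosome, SNP name, SNP physical position) and the assoc file holds the SNP name and p-value. Haploview can draw it.
CMplot: an R package for drawing Manhattan plots.
Q-Q plot: quantile-quantile plot, used to assess if a set of data plausibly came from some theoretical distribution such as a Normal or exponential.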
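For GWAS the theoretical distribution in a Q-Q plot is the uniform distribution of p-values under the null hypothesis. Here is a minimal Python sketch of how the plotted points are usually obtained, assuming the common plotting position (i - 0.5) / n for the i-th smallest of n p-values; the function name is made up and every p-value is assumed to be greater than zero.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot.

    Under the null, p-values are uniform on (0, 1), so the i-th smallest
    of n is compared with the plotting position (i - 0.5) / n.
    """
    observed = sorted(p_values)
    n = len(observed)
    return [(-math.log10((i - 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed, start=1)]


for expected, observed in qq_points([0.001, 0.5, 0.9]):
    print(round(expected, 3), round(observed, 3))
```

Points lying on the diagonal mean the test statistics behave as expected under the null; a curve that lifts off the diagonal early suggests inflation (for example from population stratification), while a lift-off only in the tail is what true associations look like.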
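GCTA (Genome-wide Complex Trait Analysis): computes LD and PCA and runs association analysis.
BLUP: Best Linear Unbiased Prediction. This method is widely used in GWAS for analysing phenotype data collected over several years and locations; the lme4 package in R can be used for it.
Common knowledge: human populations around the world differ enormously in phenotype, yet differ very little in their genomes, and most of that difference takes the form of SNPs (single nucleotide polymorphisms).
IBS: among two or more individuals, a DNA segment that has the same nucleotide sequence is said to be IBS (identical by state).
IBD: if an IBS segment is inherited from the same ancestor with no recombination event along the way, the segment is said to be IBD (identical by descent).
Data representation model: 2n values made up of 1s and 2s, two values for each SNP genotype. Each individual's SNP genotype data are processed (ignoring which of A, C, G, T is involved) so that, for example, 22, 21, 12 and 11 correspond to the SNP genotypes aa, aA, Aa and AA. These are then converted into a SNP sequence of length n made up of 0, 1 and 2, written as:
The SNP genotype of the i-th individual is:
The IBS state of the K-th SNP between the two individuals is:
A region in which the IBS state between individual i and individual j is non-zero, and which meets a given threshold, is taken as a candidate IBD segment, written as:
The N individuals are split into case and control groups, with l individuals in the case group and m in the control group. The two groups are evaluated separately, giving an S value for each SNP in each group. The SNPs with the largest difference between groups are the candidate sites.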
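The formulas for these quantities did not survive in the text above. Under the standard definition, the IBS state at a SNP is the number of alleles two individuals share, which for 0/1/2-coded genotypes is 2 - |g_i - g_j|. The Python sketch below uses that definition and reads the "threshold" as a minimum run length of consecutive SNPs with non-zero IBS; both the function names and that reading of the threshold are assumptions for illustration, not the original method.

```python
def ibs_states(g_i, g_j):
    """Per-SNP IBS state (0, 1 or 2 shared alleles) for two individuals.

    g_i, g_j: equal-length lists of genotypes coded 0/1/2 (None = missing).
    """
    if len(g_i) != len(g_j):
        raise ValueError("genotype vectors must have the same length")
    return [None if a is None or b is None else 2 - abs(a - b)
            for a, b in zip(g_i, g_j)]


def candidate_segments(states, min_len):
    """Runs of consecutive SNPs with non-zero IBS, at least min_len long.

    Returns half-open (start, end) index ranges. A missing call (None)
    breaks a run, just like an IBS state of 0.
    """
    segments, start = [], None
    for k, s in enumerate(list(states) + [0]):   # sentinel closes a trailing run
        if s:                                     # IBS state 1 or 2
            if start is None:
                start = k
        else:
            if start is not None and k - start >= min_len:
                segments.append((start, k))
            start = None
    return segments


states = ibs_states([0, 1, 2, 2, 0, 1], [0, 1, 1, 0, 0, 2])
print(states)
print(candidate_segments(states, min_len=3))
```

Real IBD detection works over thousands of SNPs and accounts for genotyping error and allele frequencies; the sketch only shows the bookkeeping.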
What do the 0, 1 and 2 in these files mean?
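Going by the description of the bed matrix earlier in these notes, the 0/1/2 values count how many copies of one particular allele a genotype carries. A small Python sketch of that coding follows; the function name is hypothetical, and which allele gets counted depends on the tool and its options, so the counted allele is passed in explicitly.

```python
def allele_count(genotype, counted_allele):
    """Copies (0, 1 or 2) of counted_allele in a two-allele genotype.

    genotype: pair of allele codes as in a .ped file, e.g. ("A", "C");
    "0" is the missing-call code, in which case None is returned.
    """
    a1, a2 = genotype
    if "0" in (a1, a2):
        return None
    return int(a1 == counted_allele) + int(a2 == counted_allele)


# snp3 column of the PED example: C C, A C, A C, 0 0, 0 0, A A
snp3 = [("C", "C"), ("A", "C"), ("A", "C"), ("0", "0"), ("0", "0"), ("A", "A")]
print([allele_count(g, "A") for g in snp3])
```

Coding genotypes this way is what lets an additive model treat the genotype as a number: each extra copy of the counted allele shifts the value by one.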
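plink --bfile --pheno --pheno-name t16 --linear hide-covar --covar --covar-name AGE,SEX,PC1,PC2,PC3,PC4 --ci 0.95 --out
--bfile: load the SNP data as a binary fileset (bed/bim/fam)
--pheno: load the phenotype file prepared earlier
--pheno-name t16: the phenotype to analyse is named t16
--linear hide-covar: fit a linear model; hide-covar leaves the covariate rows out of the report
--covar, --covar-name AGE,SEX,PC1,PC2,PC3,PC4: add the chosen covariates to the linear regression model; here they are AGE, SEX, PC1, PC2, PC3 and PC4
--ci 0.95: report 95% confidence intervals
Filtering with vcftools (--recode makes vcftools write out the filtered VCF):
1. Remove sites with MAF < 0.05
vcftools --vcf test.vcf --maf 0.05 --recode --out XX
2. Keep sites with completeness above 90%
vcftools --vcf test.vcf --max-missing 0.9 --recode --out XX
3. Keep sites with mean depth above 5
vcftools --vcf test.vcf --min-meanDP 5 --recode --out xx
Note: --gzvcf, which reads a gzip-compressed VCF directly, is more convenient.
Filtering with plink:
1. Convert the VCF to plink format
vcftools --vcf test.vcf --plink --out xxx
2. Filter
plink --noweb --file xxx --geno 0.05 --maf 0.05 --hwe 0.0001 --make-bed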
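Follow one of the tutorials on the official site; no coding is needed. Teaching material: Resources available for download. It is very accessible and easy to get started with.
ped file: pedigree information and genotypes;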
Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file.
The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.
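The first 6 columns are the same as in the fam file: family ID, within-family ID, father ID, mother ID, sex and phenotype.
After that the fields come in pairs; for example, the 7th and 8th are the two alleles of the first SNP in the map file (a person has two copies of each chromosome; each DNA molecule is double-stranded, but the second strand is ignored because it is complementary).
fam file: sample information;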
- Family ID ('FID')
- Within-family ID ('IID'; cannot be '0')
- Within-family ID of father ('0' if father isn't in dataset)
- Within-family ID of mother ('0' if mother isn't in dataset)
- Sex code ('1' = male, '2' = female, '0' = unknown)
- Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
map file: variant information;
- Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
- Variant identifier
- Position in morgans or centimorgans (optional; also safe to use dummy value of '0')
- Base-pair coordinate
bim file: extended variant information;
- Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
- Variant identifier
- Position in morgans or centimorgans (safe to use dummy value of '0')
- Base-pair coordinate (normally 1-based, but 0 ok; limited to 2^31-2)
- Allele 1 (corresponding to clear bits in .bed; usually minor)
- Allele 2 (corresponding to set bits in .bed; usually major)
MAF, minor allele frequency: SNPs with a minor allele frequency of 0.05 or greater were targeted by the HapMap project.
The SNPs are currently coded according to NCBI build 36 coordinates on the forward strand.
Data quality control in genetic case-control association studies
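plink can QC-filter SNPs on various metrics, such as MAF.
You need to understand plink's output:
1. Convert the text ped and map files into the binary bed, bim and fam files;
2. The association results: each individual is assigned a phenotype, then the association analysis is run to get the association between each SNP and the phenotype, expressed as a p-value; finally a Manhattan plot can be drawn;
CHR SNP BP A1 F_A F_U A2 CHISQ P OR
1 rs3094315 792429 G 0.1489 0.08537 A 1.684 0.1944 1.875
1 rs4040617 819185 G 0.1354 0.08537 A 1.111 0.2919 1.678
1 rs4075116 1043552 C 0.04167 0.07317 T 0.8278 0.3629 0.5507
1 rs9442385 1137258 T 0.3723 0.4268 G 0.5428 0.4613 0.7966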
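The CHISQ, P and OR columns can be reproduced from allele counts with a plain 1-df chi-square test on a 2x2 table. In the Python sketch below the counts for rs3094315 (14 G alleles out of 94 in cases, 7 out of 82 in controls) are my own inference from F_A = 0.1489 and F_U = 0.08537; they are not given in the output itself, and the function name is made up.

```python
import math


def allelic_test(case_a1, case_total, ctrl_a1, ctrl_total):
    """Chi-square test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio) for allele A1, cases vs controls.
    """
    a, b = case_a1, case_total - case_a1   # cases: A1 alleles, other alleles
    c, d = ctrl_a1, ctrl_total - ctrl_a1   # controls: A1 alleles, other alleles
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Allele counts inferred (assumption) from the frequencies in the first row.
chi2, p, odds = allelic_test(14, 94, 7, 82)
print(round(chi2, 3), round(p, 4), round(odds, 3))
```

With those inferred counts the three values agree with the first row of the table to the printed precision, which is a useful check that F_A and F_U are allele frequencies in cases and controls and that OR is the odds ratio for A1.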
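References:
The basic principles of GWAS (explained in fairly plain terms)
QQ plot: assessing whether your statistical model is reasonable (explained fairly clearly)
How to run principal component analysis (PCA) on genome-wide SNP data - GCTA
Updated 16 February 2019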