GWAS Catalog數據庫簡介

GWAS Catalogphp

The NHGRI-EBI Catalog of published genome-wide association studieshtml

EBI負責維護的一個收集已發表的GWAS研究的數據庫ios

Catalog statsgit

  • Last data release on 2019-09-24
  • 4220 publications
  • 107486 SNPs
  • 157336 associations
  • Genome assembly GRCh38.p12
  • dbSNP Build 151
  • Ensembl Build 96

基本的搜索方法github

搜索表型:如breast carcinoma,會獲得相關的很是規範的表型信息,EFO,就像GO同樣,是一套表型分類規則。而後還會獲得表型相關的基因。web

搜索SNP:如rs7329174,會獲得變異的詳細信息,和對應的基因。shell

搜索人名:Yao,會獲得相關的文獻數據庫

搜索染色體位置:如2q37.1,Cytogenetic regionapi

搜索基因:如HBS1Lbash

搜索區域:如6:16000000-25000000

 

說是數據庫,其實就是一個table,從這裏下載,不過100MB

表裏面有這些數據:

DATE ADDED TO CATALOG* +: Date a study is published in the catalog

PUBMEDID* +: PubMed identification number

FIRST AUTHOR* +: Last name and initials of first author

DATE* +: Publication date (online (epub) date if available)

JOURNAL* +: Abbreviated journal name

LINK* +: PubMed URL

STUDY* +: Title of paper

DISEASE/TRAIT* +: Disease or trait examined in study

INITIAL SAMPLE DESCRIPTION* +: Sample size and ancestry description for stage 1 of GWAS (summing across multiple Stage 1 populations, if applicable)

REPLICATION SAMPLE DESCRIPTION* +: Sample size and ancestry description for subsequent replication(s) (summing across multiple populations, if applicable)

REGION*: Cytogenetic region associated with rs number

CHR_ID*: Chromosome number associated with rs number

CHR_POS*: Chromosomal position associated with rs number

REPORTED GENE(S)*: Gene(s) reported by author

MAPPED GENE(S)*: Gene(s) mapped to the strongest SNP. If the SNP is located within a gene, that gene is listed. If the SNP is intergenic, the upstream and downstream genes are listed, separated by a hyphen.

UPSTREAM_GENE_ID*: Entrez Gene ID for nearest upstream gene to rs number, if not within gene

DOWNSTREAM_GENE_ID*: Entrez Gene ID for nearest downstream gene to rs number, if not within gene

SNP_GENE_IDS*: Entrez Gene ID, if rs number within gene; multiple genes denotes overlapping transcripts

UPSTREAM_GENE_DISTANCE*: distance in kb for nearest upstream gene to rs number, if not within gene

DOWNSTREAM_GENE_DISTANCE*: distance in kb for nearest downstream gene to rs number, if not within gene

STRONGEST SNP-RISK ALLELE*: SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype.

SNPS*: Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)

MERGED*: denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes;)

SNP_ID_CURRENT*: current rs number (will differ from strongest SNP when merged = 1)

CONTEXT*: SNP functional class

INTERGENIC*: denotes whether SNP is in intergenic region (0 = no; 1 = yes)

RISK ALLELE FREQUENCY*: Reported risk/effect allele frequency associated with strongest SNP in controls (if not available among all controls, among the control group with the largest sample size). If the associated locus is a haplotype the haplotype frequency will be extracted.

P-VALUE*: Reported p-value for strongest SNP risk allele (linked to dbGaP Association Browser). Note that p-values are rounded to 1 significant digit (for example, a published p-value of 4.8 x 10-7 is rounded to 5 x 10-7).

PVALUE_MLOG*: -log(p-value)

P-VALUE (TEXT)*: Information describing context of p-value (e.g. females, smokers).

OR or BETA*: Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR <1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are >1. Appropriate unit and increase/decrease are included for beta coefficients.

95% CI (TEXT)*: Reported 95% confidence interval associated with strongest SNP risk allele, along with unit in the case of beta-coefficients. If 95% CIs are not published, we estimate these using the standard error, where available.

PLATFORM (SNPS PASSING QC)*: Genotyping platform manufacturer used in Stage 1; also includes notation of pooled DNA study design or imputation of SNPs, where applicable

CNV*: Study of copy number variation (yes/no)

ASSOCIATION COUNT+: Number of associations identified for this study

 

一些問題:

什麼是Genotyping technology?

什麼是Experimental Factor Ontology trait?

什麼是Cytogenetic region?karyotype

什麼是trait + risk allele?這裏要分清SNP和allele的概念,SNP是位點,而allele則是該位點上鹼基。考慮一下DNA雙鏈,以及多倍體。

什麼是risk/effect allele frequency?

odds ratio在GWAS裏是個什麼指標?wiki

The odds ratio is the ratio of two odds, which in the context of GWA studies are the odds of case for individuals having a specific allele and the odds of case for individuals who do not have that same allele.

As an example, suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y).

When the allele frequency in the case group is much higher than in the control group, the odds ratio is higher than 1, and vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.[18]

什麼是MAF?the frequency of the minor allele

GWAS數據能夠有哪些註釋?phenotype annotation、population and linkage disequilibrium (LD) information

什麼是CP loci?an effective region associated with at least two phenotypes

什麼是genotype-calling?

GWAS的最基本的QC有哪些?

Quality Control Procedures for Genome Wide Association Studies

Data quality control in genetic case-control association studies

  • minor allele frequency (MAF) > 0.01; statistical power is extremely low for rare SNPs,很好理解,若是一個很是罕見的SNP,須要很是大的樣本量纔能有足夠的power
  • Hardy-Weinberg equilibrium (HWE) test p-value > 5E-05;
  • missing genotypes rate < 10%; Genotypes are classified as missing if the genotype-calling algorithm cannot infer the genotype with sufficient confidence. Can be calculated across each individual and/or SNP.

什麼是Experimental Factor Ontology?

什麼是LD information (r2 and D’ values)?

Mathematical properties of the r2 measure of linkage disequilibrium

the square of the correlation coefficient between two indicator variables – one representing the presence or absence of a particular allele at the first locus and the other representing the presence or absence of a particular allele at the second locus. the frequency dependence of r2,也就是r2是MAF的函數。

Introduction to different measures of linkage disequilibrium (LD) and their calculation 兩種常見的計算方法

 

 

NLM Catalog
The NLM Catalog provides access to NLM bibliographic data for journals, books, audiovisuals, computer software, electronic resources and other materials. Links to the library's holdings in LocatorPlus, NLM's online public access catalog, are also provided.

NCBI和本數據庫裏的期刊名字都是縮寫,如何轉化爲全名呢?

在NCBI數據庫裏下載對應的信息,NLM

用sublime處理一下格式便可獲得對應的關係

 

 

怎麼計算這些變異在特定羣裏裏面的LD score?

有現成的數據庫能夠用,LDlink

LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups

還有R包能夠直接調用,LDlinkR: Access LDlink API with R

問題:

  • 沒法一次性提取出所有數據,SNP數十萬,讀取很困難,該函數最好一次只讀100個數據
  • 版本問題,GWAS數據庫裏的SNP不必定都有LD score,寫代碼時要注意

 

 

如何學會提出問題,並用統計和simulation來檢驗問題?

一個最重要的問題:咱們觀測到的結果是否是隨機的?

這裏就須要將咱們的observe做simulation和shuffling。

這部分很是重要,也很是有意思。

 

 

如何過濾千人基因組裏的SNP

quality control (QC)

  • Variants with minor allele frequency (MAF) > 0.01;
  • Variants with Hardy-Weinberg equilibrium (HWE) test p-value > 5E-05;
  • Variants with missing genotypes rate < 10%;

1000 genome數據庫裏使用的是VCF4.1的格式

快速批量下載ftp目錄裏的文件:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/*.gz

vcf轉plink格式:

#for i in `seq 2 22`
#do
#plink --vcf ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --set-missing-var-ids @:#  --make-bed --out chr$i.phase3.v5a.plink
#done

#plink --vcf ALL.chrX.phase3_shapeit2_mvncall_integrated_v1a.20130502.genotypes.vcf.gz --missing-var-code NA  --make-bed --out chrX.phase3.v5a.plink
#plink --vcf ALL.chr17.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --missing-var-code NA --make-bed --out chr17.phase3.v5a.plink
#plink --vcf ALL.chr8.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --missing-var-code NA --make-bed --out chr8.phase3.v5a.plink

#plink --bfile ./chr1.phase3.v5a.plink  --merge-list merge.txt --exclude chr1_22_XY.v5a.plink-multiple-allele.missnp --make-bed --out chr1_22_XY.v5a.plink

for i in `seq 1 22`
do
plink --bfile  chr$i.phase3.v5a.plink --exclude chr1_22_XY.v5a.plink-multiple-allele.missnp  --make-bed --out chr$i.phase3.v5a.rmdup.plink
done

plink --bfile  chrX.phase3.v1a.plink --exclude chr1_22_XY.v5a.plink-multiple-allele.missnp  --make-bed --out chrX.phase3.v1a.rmdup.plink
plink --bfile  chrY.phase3.v1a.plink --exclude chr1_22_XY.v5a.plink-multiple-allele.missnp  --make-bed --out chrY.phase3.v1a.rmdup.plink
plink --bfile ./chr1.phase3.v5a.rmdup.plink  --merge-list merge.txt --make-bed --out chr1_22_XY.v5a.plink  

plink文檔 - Whole genome association analysis toolset

PLINK 2.00 alpha

  

##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CS,Number=1,Type=String,Description="Source call set.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate of this variant">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=MC,Number=.,Type=String,Description="Merged calls.">
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support for this call, will be NULL NULL for START and E
##INFO=<ID=MEND,Number=1,Type=Integer,Description="Mitochondrial end coordinate of inserted sequence">
##INFO=<ID=MLEN,Number=1,Type=Integer,Description="Estimated length of mitochondrial insert">
##INFO=<ID=MSTART,Number=1,Type=Integer,Description="Mitochondrial start coordinate of inserted sequence">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="SV length. It is only calculated for structural variation MEIs. For other types of SVs, one may calculate the SV length by INFO:END-START+1
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=TSD,Number=1,Type=String,Description="Precise Target Site Duplication for bases, if unknown, value will be NULL">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth; only low coverage data were counted towards the DP, exome data were not used">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alternate Allele, IndelType:Type of Indel (REF,
##INFO=<ID=VT,Number=.,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=MULTI_ALLELIC,Number=0,Type=Flag,Description="indicates whether a site is multi-allelic">
##INFO=<ID=OLD_VARIANT,Number=1,Type=String,Description="old variant location. Format chrom:position:REF_allele/ALT_allele">

其中AF(Estimated allele frequency in the range (0,1))就是成天的MAF

如何隨機讀取VCF:Introduction to vcfR

 

其實上面那三個指標沒有那麼簡單,須要本身計算:

Minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. They play a surprising role in heritability since MAF variants which occur only once, known as "singletons," drive an enormous amount of selection. Single nucleotide polymorphisms (SNPs) with a minor allele frequency of 0.05 (5%) or greater were targeted by the HapMap project.

How can I get the allele frequency of my variant? 

you can calculate your frequency by dividing AC (allele count) by AN (allele number).

 

Allele frequency calculator

perl -MCPAN -e shell

install IPC::System::Simple

這是個perl腳本,太老了,跑出來的結果不太好,因此不用,折騰了我很久,仍是用上面官方的方便。

perl calculate_allele_frq_from_vcf.pl  -vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -out_dir .  -sample_panel integrated_call_samples_v3.20130502.ALL.panel   
-pop EAS,CHB -region 1:10177-565255 -tabix /Users/surgery/LZXworkdir/bin/miniconda3/bin/tabix -vcftools_dir /Users/surgery/LZXworkdir/bin/vcftools_0.1.13

 

官網推薦的方法:這裏是它的網頁版本

grep EAS integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EAS.samples.list
grep EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > EUR.samples.list

vcf-subset -c EAS.samples.list ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | fill-an-ac |
    bgzip -c > EAS.chr1.vcf.gz

  

須要把基本的統計遺傳學知識串一下了:

參考基因組

比對

SNP

allele

Population

Allele frequency

Genotype frequency

Allele frequency和MFA聯繫和區別

 

接下來:

如何獲取SNP的HWE p-value?

A genome-wide study of Hardy–Weinberg equilibrium with next generation sequence data

Evolution and the tree of life - 不錯的遺傳學公開課

Allele Frequencies and Hardy‐Weinberg Equilibrium - 關於HWE講得比較透徹

Quality control analysis of the 1000 Genomes Project Omni2.5 genotypes - 實操借鑑

It is therefore of interest to test whether a population is in HWE at a locus.  We will discuss the two most popular ways of testing HWE

Hardy‐Weinberg Assumptions

  • infinite population
  • discrete generations
  • random mating
  • no selection
  • no migration in or out of population
  • no mutation
  • equal initial genotype frequencies in the two sexes

用HWE來過濾,是不想選到過於離譜的SNP,也就是咱們只想選出大體符合HWE假設的SNP 

 

如何獲取missing genotype rate?

non-missing genotypes (call rate), Call rates were calculated using PLINK1.90.

 

install.packages("LDlinkR")

必須是R3.5.2版本及以上才能安裝

最好用並行來作,否則這個真的是太慢了。

 

SNP的過濾一行命令搞定:

plink --file EAS.pop.founder.phase3 --maf 0.05 --geno 0.1 --hwe 0.001 
plink --file EUR.pop.founder.phase3 --maf 0.05 --geno 0.1 --hwe 0.001 

  

 

待續~

相關文章
相關標籤/搜索