VCF文件格式詳解

時間 2019-12-25

標籤 vcf 文件格式詳解简体版

原文原文鏈接

VCF文件全稱爲Variant Call Format，表示基因組的變異信息，一般爲GATK和Samtools軟件處理所獲得。
VCF文件大體能夠分爲兩個部分：express

一、以##開頭的頭文件信息

##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.5-0-g36282e4,Date="Tue Apr 03 19:35:05 CST 2018",Epoch=1522755305379,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio//son/son.recal.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=[/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio/ccds.interval.list] excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 logging_level=INFO log_to_file=null help=false version=false out=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Data_AshkenazimTrio/son/son.raw.vcf likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[RMSMappingQuality, BaseCounts] excludeAnnotation=[] group=[Standard, StandardHCAnnotation] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=NONE bamOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=10.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=false gcpHMM=10 pair_hmm_implementation=VECTOR_LOGLESS_CACHING pair_hmm_sub_implementation=ENABLE_ALL always_load_vector_logless_PairHMM_lib=false phredScaledGlobalReadMismappingRate=45 noFpga=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false maxNumHaplotypesInPopulation=128 errorCorrectKmers=false minPruning=2 debugGraphTransformations=false allowCyclesInKmerGraphToGeneratePaths=false graphOutput=null kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 includeUmappedReads=false useAllelesTrigger=false doNotRunPhysicalPhasing=true keepRG=null justDetermineActiveRegions=false dontGenotype=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false errorCorrectReads=false pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=10000 minReadsPerAlignmentStart=10 mergeVariantsViaLD=false activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine.SelectVariants=<ID=SelectVariants,Version=3.5-0-g36282e4,Date="Wed Jun 06 09:33:03 CST 2018",Epoch=1528248783862,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Hardfilter_optimize/son.raw.vcf) discordance=(RodBinding name= source=UNBOUND) concordance=(RodBinding name= source=UNBOUND) out=/opt/NfsDir/UserDir/wujh/Project/PrecisionFDA/Hardfilter_optimize/SNP/HG002_SNP.vcf sample_name=[] sample_expressions=null sample_file=null exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=false excludeFiltered=false preserveAlleles=false removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[SNP] selectTypeToExclude=[] keepIDs=null excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false forceValidOutput=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
......
##contig=<ID=chrUn_gl000248,length=39786,assembly=hg19>
##contig=<ID=chrUn_gl000249,length=38502,assembly=hg19>
##reference=file:///opt/NfsDir/PublicDir/reference/ucsc.hg19.fasta
##source=SelectVariants
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  son

頭文件信息主要包括vcf文件版本、FORMAT、INFO、參考基因組以及執行程序等信息。
表頭各列含義詳解：app

1. CHROM(chromosome):染色體
2. POS  變異位點在參考基因組中的位置
3. ID - identifier: variant的ID。好比在dbSNP中有該SNP的id，則會在此行給出；若沒有，則用’.'表示其爲一個novel variant。
4. REF - reference base(s):參考鹼基，染色體上面的鹼基，必須是ATCGN中的一個，N表示不肯定鹼基
5. ALT - alternate base(s):與參考序列比較發生突變的鹼基
6. QUAL - quality: Phred格式(Phred_scaled)的質量值，表 示在該位點存在variant的可能性；該值越高，則
           variant的可能性越大；計算方法：Phred值 = -10 * log (1-p) p爲variant存在的機率; 經過計算公式
           能夠看出值爲10的表示錯誤機率爲0.1，該位點爲variant的機率爲90%。
7. FILTER - _filter status: 使用上一個QUAL值來進行過濾的話，是不夠的。GATK能使用其它的方法來進行過濾，過濾結果中經過則該值爲」PASS」;若variant不可靠，則該項不爲」PASS」或」.」。
8. INFO - additional information:  這一行是variant的詳細信息，具體以下：
  #DP-read depth：樣本在這個位置的reads覆蓋度。是一些reads被過濾掉後的覆蓋度。DP4:高質量測序鹼基，位於REF或者ALT先後
  #QD：經過深度來評估一個變異的可信度。Variant call confidence normalized by depth of sample reads supporting a variant         
  #MQ：表示覆蓋序列質量的均方值RMS Mapping Quality
  #FQ：phred值關於全部樣本類似的可能性
  #AC，AF 和 AN：AC(Allele Count) 表示該Allele的數目；AF(Allele Frequency) 表示Allele的頻率； AN(Allele Number) 表示Allele的總數目。
      對於1個diploid sample而言：則基因型 0/1 表示sample爲雜合子，Allele數爲1(雙倍體的sample在該位點只有1個等位基因發生了突變)，
       Allele的頻率爲0.5(雙倍體的sample在該位點只有50%的等位基因發生了突變)，總的Allele爲2； 基因型 1/1 則表示sample爲純合的，Allele數爲2，Allele的頻率爲1，總的Allele爲2。
  #MLEAC：Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed
  #MLEAF：Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed
  #BaseQRankSum   比較支持變異的鹼基和支持參考基因組的鹼基的質量，負值表示支持變異的鹼基質量值不及支持參考基因組的，
       正值則相反，支持變異的質量值好於參考基因組的。0表示二者無明顯差別。
  #FS  使用F檢驗來檢驗測序是否存在鏈偏好性。鏈偏好性可能會致使變異等位基因檢測出現錯誤。輸出值Phred-scaled p-value，值越大越可能出現鏈偏好性。
  #InbreedingCoeff    使用似然法檢驗樣本間的近交係數（又或者稱爲近親關係）。值越高越多是近親繁殖。
  #MQRankSum  比較支持變異的序列和支持參考基因組的序列的質量，負值表示支持變異的鹼基質量值不及支持參考基因組的，只針對雜合。
       正值則相反，支持變異的質量值好於參考基因組的。0表示二者無明顯差別。實際應用中通常過濾掉較小的負值。
  #BaseCounts   全部樣本在變異位點ATCG的數量
  #ClippingRankSum  同前面兩個相似，負值表示支持變異的read有更的的hard-clip鹼基，正值表示支持參考基因組的的read有更多的hard-clip。0最好，不管是正值仍是負值都表示可能可能存在人爲誤差。
  #ReadPosRankSum    檢測變異位點是否有位置偏好性（是否存在於序列末端，此時每每容易出錯）。最佳值爲0，表示變異與其在序列上的位置無關。負值表示變異位點更容易在末端出現，正值表示參考基因組中的等位基因更容易在末端出現。
  #ExcessHet   檢測這些樣本的相關性，與InbreedingCoeff類似，值越大越多是錯誤。
  #LikelihoodRankSum  評價支持變異和ref的序列與best hyplotype的匹配性，0爲最佳值。負值表示支持變異的read匹配度不及支持ref的匹配度，正值則相反。值越大表示越多是出現了錯誤。
  #HaplotypeScore    分數越高越可能出現錯誤。Higher scores are indicative of regions with bad alignments, typically leading to artifactual SNP and indel calls.
  #SOR：也是一個用來評估是否存在鏈偏向性的參數，至關於FS的升級版。The StrandOddsRatio annotation is one of several methods that aims to evaluate whether there is strand bias in the data. It is an updated form of the Fisher Strand Test that is better at taking into account large amounts of data in high coverage situations. It is used to determine if there is strand bias between forward and reverse strands for the reference or alternate allele. The reported value is ln-scaled.
  #IS：插入缺失或部分插入缺失的reads容許的最大數量
  #G3：ML 評估基因型出現的頻率
  #HWE：chi^2基於HWE的測試p值和G3
  #CLR：在受到或者不受限制的狀況下基因型出現可能性log值
  #UGT：最可能不受限制的三種基因型結構
  #CGT：最可能受限制三種基因型的結構
  #PV4：四種P值得偏差，分別是（strand、baseQ、mapQ、tail distance bias）
  #INDEL：表示該位置的變異是插入缺失
  #PC2：非參考等位基因的phred（變異的可能性）值在兩個分組中大小不一樣
  #PCHI2：後加權chi^2，根據p值來測試兩組樣本之間的聯繫
  #QCHI2：Phred scaled PCHI2
  #PR：置換產生的一個較小的PCHI2
  #QBD：Quality by Depth，測序深度對質量的影響
  #RPB：序列的偏差位置（Read Position Bias）
  #MDV：樣本中高質量非參考序列的最大數目
  #VDB：Variant Distance Bias，RNA序列中過濾人工拼接序列的變異偏差範圍
  

9. FORMAT 和最後一列sample中的信息是對應的
  #AD 和 DP：AD(Allele Depth)爲sample中每一種allele的reads覆蓋度,在diploid中則是用逗號分割的兩個值，
      前者對應ref基因型，後者對應variant基因型； DP（Depth）爲sample中該位點的覆蓋度。
  #GT：樣品的基因型（genotype）。兩個數字中間用’/'分 開，這兩個數字表示雙倍體的sample的基因型。0 表示樣品中有ref的allele； 
       1表示樣品中variant的allele； 2表示有第二個variant的allele。所以： 0/0 表示sample中該位點爲純合的，和ref一致； 0/1 表示sample中該位點爲雜合的，有ref和variant兩個基因型； 1/1 表示sample中該位點爲純合的，和variant一致。
  #GQ：即第二可能的基因型的PL值，相對於最可能基因型的PL值（其PL=0）而言，大於99時，其信息量已不大，所以大於99的所有賦值99。當GQ值很小時，意味着第二可能基因型與最可能基因型差異不大。
  #GL：三種基因型（RR RA AA）出現的可能性，R表示參考鹼基，A表示變異鹼基
  #DV：高質量的非參考鹼基
  #SP：phred的p值偏差線
  #PL：指定的三種基因型的可能性(provieds the likelihoods of the given genotypes)。這三種指定的基因型爲(0/0,0/1,1/1)，這三種基因型的機率總和爲1。
       和以前不一致，該值越大，代表爲該種基因型的可能性越小。 Phred值 = -10 * log (p) p爲基因型存在的機率。

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。