Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
算法
2.1 Aligning a single query sequence 單序列比對app
2.1.1 Seeding and re-seeding
BWA-MEM follows the canonical seed-and-extend paradigm. It initially seeds an alignment with supermaximal exact matches (SMEMs) using an algorithm we found previously (Li, 2012,Algorithm 5), which essentially finds at each query position the longest exact match covering the position. However, occasionally the true alignment may not contain any SMEMs. To reduce mismappings caused by missing seeds, we introduce re-seeding. Suppose we have a SMEM of length l with k
occurrences in the reference genome. If l is too large (over 28bp by default), we re-seed with the longest exact matches that cover the middle base of the SMEM and occur at least k + 1 times in the genome. Such seeds can be found by requiring a minimum occurrence in the original SMEM algorithm.ide
BWA-MEM算法基於seed-and-extend。首先使用之前的算法,用 supermaximal exact matches (SMEMs)來seeds一個比對,找到每一個查詢位點的 longest exact match。 ui
(SMEMs: Formally, a maximal exact match (MEM)最大精確匹配 is a an exact match that cannot be extended in either direction of the match. An SMEM is a MEM that is not contained in other MEMs on the query sequence. https://academic.oup.com/bioinformatics/article/28/14/1838/218887)this
然而,有時候真實的比對不必定包含任何SMEMs,爲了減小丟失seeds引發的錯誤匹配,咱們引入了re-seeding。假設有個SMEM,長度爲l,在標準基因組中出現了k次。若是SMEM過長(超過28bp),咱們對覆蓋了SMEM中間鹼基且至少出現k+1次的longest exact matches進行re-seed。這樣的seeds能夠經過設定最小出現次數在原始的SMEM算法中找到。spa
2.1.2 Chaining and chain filteringorm
We call a group of seeds that are colinear and close to each other as a chain. We greedily chain the seeds while seeding and then filter out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both 50% and 38bp shorter than the long chain). Chain filtering aims to reduce unsuccessful seed extension at a later step. Each chain may not always correspond to a final hit. Chains detected here do not need to be accurate.排序
咱們把一組線性或相近的seeds稱做chain。seeding過程當中儘量將chain變長,同時過濾掉太短的chains,這些chains一般大量包含在長chain中。過濾chain的目的是爲了減小下一步中不成功的seed extension。每一個chain並不老是有最終的hit,這時候的chains不須要很是精確(?)ip
2.13 Seed extensionci
We rank a seed by the length of the chain it belongs to and then by the seed length. For each seed in the ranked list, from the best to the worst, we drop the seed if it is already contained in an alignment found before, or extend the seed with a banded affine-gap-penalty dynamic programming (DP) if it potentially leads to a new alignment.
咱們依據seed所從屬的chain的長度和seed的長度對seed進行排序。對於每個在排好序的list中的seed,捨棄掉已經比對過的seed,或若是該seed可能產生一個新的alignment,利用某算法(a banded affine-gap-penalty dynamic programming (DP))對其進行extend。??
BWA-MEM’s seed extension differs from the standard seed extension in two aspects.
BWA-MEM的seed extension同標準的seed extension不一樣。
Firstly, suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y. BWAMEM will stop extension if the difference between the best extension score and the score at (x, y) is larger than Z+|x−y|×pgapExt, where pgapExt is the gap extension penalty and Z is an arbitrary cutoff. This heuristic avoids extension through a poorly aligned region with good flanking alignment. It is similar to the X-dropoff heuristic in BLAST (Altschul et al., 1997), but does not penalize long gaps in one of the reference or the query sequences.
首先,設想一次可能的extension步驟,在標準基因組的位置(x,y)時達到最大擴展分數,若是最大擴展分數和(x,y)位置的分數差距大於Z+|x−y|×pgapExt時,BWAMEM中止擴展。pgapExt是gap 擴展扣分,Z是一個任意截斷(?)。這種擴展式搜索使用旁側比對能夠迴避掉很差的比對區域。這種方法同BLAST的X-dropoff擴展式搜索相似,可是在標準序列和比對序列上出現長gaps時不進行扣分。
Secondly, while extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence. If the difference between the best score reaching the end and the best local alignment score is below a threshold, the local alignment will be rejected even if it has a higher score. BWA-MEMuses this strategy to automatically choose between local and end-to-end alignments. It may align through true variations towards the end of a read and thus reduces reference bias, while avoids introducing excessive mismatches and gaps which may happen when we force a chimeric read into an end-to-end alignment.
其次,擴展seed時,BWA-MEM記錄到達序列末端的最大擴展得分,若是末端最大得分和局部比對得分之間的差距低於閾值時,將局部比對捨棄,即使局部得分更高。BWA-MEM使用這個策略來自動選擇使用局部比對仍是末端比對。算法略過真實的變異朝read末端比對,減小標準誤差,同時避免帶來更多錯誤和gaps,這些錯誤和gaps發生在咱們強制一個隨機read執行端對端比對時。
Fig. 1. Percent mapped reads as a function of the false alignment rate under different mapping quality cutoff. Alignments with mapping quality 3 or lower are excluded. An alignment is wrong if after correcting clipping, its start position is within 20bp from the simulated position. 106 pairs of
101bp reads are simulated from the human reference genome using wgsim (http://bit.ly/wgsim2) with 1.5% substitution errors and 0.2% indel variants. The insert size follows a normal distribution N(500, 502). The reads are aligned back to the genome either as single end (SE; top panel) or as paired end (PE; bottom panel). GEM is configured to allow up to 5 gaps and to output suboptimal alignments (option ‘–e5 –m5 –s1’ for SE and ‘–e5 –m5 –s1 –pb’ for PE). GEM does not compute mapping quality. Its mapping quality is estimated with a BWA-like algorithm with suboptimal alignments available. Other mappers are run with the default setting except for specifying the insert size distribution. The run time in seconds on a single CPU core is shown in the parentheses.
2.2 Paired-end mapping雙端映射
2.2.1 Rescuing missing hits 恢復丟失的hits
Like BWA (Li and Durbin, 2009), BWAMEM processes a batch of reads at a time. For each batch, it estimates the mean µ and the variance σ2 of the insert size distribution from reliable single-end hits. For the top 100 hits (by default) of either end, if the mate is unmapped in a window [µ - 4σ, µ + 4σ] from each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment (SW; Farrar 2007) for the mate within the window. The second best SW score is recorded to detect potential mismapping in a long tandem repeat. Hits found from both the single-sequence alignment and SW rescuing will be used for pairing.
同BWA相似,BWA-MEM一次處理一包reads。
2.2.2 Pairing
Given the i-th hit for the first read and j-th hit for the second, BWA-MEM computes their distance dij if the two hits are in the right orientation, or sets dij to infinity otherwise. It scores the pair (i, j) with Sij = Si + Sj - min{-a log4 P (dij), U}, where Si and Sj are the SW score of the two hits, respectively, a is the matching score and P (d) gives the probability of observing an insert size larger than d assuming a normal distribution. ‘log4’ arises when we interpret SW score as odds ratio (Durbin et al., 1998). U is a threshold that controls pairing: if dijis small enough such that -a log4 P (dij) < U, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments. BWA-MEM chooses the pair (i, j) that maximizes Sij as the final alignments for both ends. This pairing strategy jointly considers the alignment scores, the insert size and the possibility of chimeric read pairs.