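The Boyer-Moore string search algorithm is a highly efficient string searching algorithm. It was designed by Bob Boyer and J Strother Moore and published in 1977; the initial definition was given as early as 1975, and the construction algorithm and the proof of correctness followed later.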
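First, some definitions:
1. pattern is the pattern string, of length patLen;
2. Text is the string being searched, of length n;
3. j is the position in pattern of the current mismatch (0 ≤ j ≤ patLen - 1);
4. m is the number of characters already matched (0 ≤ m < patLen), so j + m = patLen - 1;
5. Δ(*) is the position in pattern of the rightmost occurrence of the mismatched text character *, where * can be any character.
Many references describe the algorithm with array positions starting at 1; here, to stay close to the code, all positions start at 0.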
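We start with the bad character rule.
1. Bad character rule: align the mismatched text character with its rightmost occurrence in pattern; if the character does not occur in pattern at all, skip past it entirely.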
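> Case 1: the mismatched character does not occur in pattern (see the accompanying figure):
Text pointer moves right by patLen, which places it at the right end of the shifted pattern;
Pattern moves right by patLen - m.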
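> Case 2: the mismatched character does occur in pattern. There are two sub-cases:
a>. The rightmost occurrence of the character in pattern is to the left of the mismatch position (see the accompanying figure). With patLen = 7, m = 2 and Δ('-') = 2, as for the pattern "AT-THAT":
Text pointer moves right by: j - Δ('-') + m = (j + m) - Δ('-') = (patLen - 1) - Δ('-') = (7 - 1) - 2 = 4
Pattern moves right by: pointer offset - m = 4 - 2 = 2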
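b>. The rightmost occurrence of the character in pattern is to the right of the mismatch position (see the accompanying figure):
Text pointer moves right by: (patLen - 1) - Δ('T') = (7 - 1) - 6 = 0
Pattern moves right by: pointer offset - m = 0 - 2 = -2
The pattern would move backwards and repeat comparisons, which must never happen. In this case simply move the pattern one position to the right.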
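Summarizing the three cases, we define the bad character function delta1() as the offset applied to the text pointer:
delta1(*) = patLen, if the mismatched character does not occur in pattern;
          = patLen - 1 - Δ(*), if it occurs in pattern and its rightmost occurrence is to the left of the mismatch position;
          = m + 1 (the pattern moves right by 1), if it occurs in pattern and its rightmost occurrence is to the right of the mismatch position.
The implementation further down does not special-case the third line: it always stores patLen - 1 - Δ(*) in the table and takes the maximum with delta2, which is never smaller than m + 1.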
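The three cases can be sketched in a few lines. This is only an illustration under stated assumptions: the names buildDelta1 and badCharPatternShift are hypothetical helpers introduced here, not part of the original listing, and the alphabet is assumed to be 8-bit characters. The table stores the pointer offset patLen - 1 - Δ(*), and the pattern shift is clamped to 1 so that the backward move from case 2b cannot happen.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Hypothetical helper: delta1[c] is the text-pointer offset for bad character c,
// i.e. patLen - 1 - (index of the rightmost occurrence of c), or patLen if c is absent.
std::array<int, 256> buildDelta1(const std::string& pattern) {
    std::array<int, 256> delta1;
    const int patLen = static_cast<int>(pattern.size());
    delta1.fill(patLen);  // case 1: character not in pattern
    for (int j = 0; j < patLen; ++j) {
        // Left-to-right scan: a later (more rightward) occurrence overwrites an earlier one.
        delta1[static_cast<unsigned char>(pattern[j])] = patLen - 1 - j;
    }
    return delta1;
}

// Hypothetical helper: pattern shift implied by the bad character rule alone
// after m characters have matched. Clamped to 1 so the pattern never moves left.
int badCharPatternShift(const std::array<int, 256>& delta1, char badChar, int m) {
    return std::max(1, delta1[static_cast<unsigned char>(badChar)] - m);
}
```

For the pattern "AT-THAT" with m = 2, this gives a pattern shift of 2 for '-' (case 2a) and 1 for 'T' (case 2b), matching the arithmetic above.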
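2. Good suffix rule: given the part of pattern that has already matched (subpat), look for another place in pattern that matches subpat fully or partially and align it with the text directly, avoiding useless shifts.
Some conventions first:
1. Let $ be a character that does not occur in pattern, and let pat[i] = $ for i < 0;
2. Two sequences [c1 ... cn] and [d1 ... dn] unify if and only if, for every position j, cj = dj or cj = $ or dj = $;
3. The rightmost plausible reoccurrence of subpat (pat[j+1 .. patLen-1]) is at position rpr(j): the greatest k such that [pat[j+1] ... pat[patLen-1]] and [pat[k] ... pat[k + patLen - j - 2]] unify and either k ≤ 0 or pat[k-1] != pat[j].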
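The definition of rpr(j) can be checked directly with a brute-force search over k. The sketch below is not the construction used in the full listing; it is a deliberately naive transcription of the three conventions, and the names unifies and rprBruteForce are hypothetical. It assumes 0-based indices, that indices left of the pattern behave like $, and that candidates k run from j downwards.

```cpp
#include <string>

// Hypothetical helper: do pat[a] and pat[b] unify? Any index < 0 stands for '$',
// which unifies with every character.
static bool unifies(const std::string& pat, int a, int b) {
    if (a < 0 || b < 0) return true;
    return pat[a] == pat[b];
}

// Brute-force rpr(j) for 0-based j: the greatest k such that
// pat[j+1 .. patLen-1] unifies with pat[k .. k+patLen-j-2]
// and (k <= 0 or pat[k-1] != pat[j]).
int rprBruteForce(const std::string& pat, int j) {
    const int patLen = static_cast<int>(pat.size());
    const int len = patLen - 1 - j;        // length of the matched suffix
    for (int k = j; k >= -len; --k) {      // try the largest candidate first
        if (k > 0 && pat[k - 1] == pat[j]) continue;  // preceding characters must differ
        bool ok = true;
        for (int t = 0; t < len && ok; ++t) {
            ok = unifies(pat, j + 1 + t, k + t);
        }
        if (ok) return k;
    }
    return -len;  // not reached: k = -len lies entirely left of the pattern and always unifies
}
```

The quadratic cost is acceptable for checking small examples by hand; a real implementation would use a faster construction.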
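The figure above gives the computed rpr() values for the pattern "ABXYCDEXY". Let us walk through them:
a>. j = 8: the matched substring pat[j+1 .. patLen-1] is empty. The empty string unifies at any position, and k = 8 already satisfies pat[k-1] != pat[j] ('X' != 'Y'), so rpr(8) = 8.
b>. j = 7: the matched substring subpat is "Y". We have pat[3 .. 3] = subpat with k = 3 > 0, but pat[k-1] == pat[j] == 'X', so the condition fails. Searching further left finds no other occurrence, so subpat can only be placed at position -1, just before the start of pattern: rpr(7) = -1.
c>. j = 6: subpat is "XY". We have pat[2 .. 3] = subpat and pat[k-1] != pat[j] ('B' != 'E'), so rpr(6) = 2.
d>. j = 5: subpat is "EXY". No substring of pattern unifies with it, so it can only be placed before the start of pattern: rpr(5) = -3.
The remaining values follow in the same way; the cases above cover every way rpr() can be determined. One regularity follows from this analysis: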
rpr[patLen-1] = patLen-1 (whenever the last two characters of pattern differ, as they do here).
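This gives the shift for the good suffix rule, which aligns pat[k] with the text position currently matched by pat[j+1]:
Pattern moves right by: j + 1 - rpr(j)
Text pointer moves right by: m + j + 1 - rpr(j) = (patLen - 1 - j) + j + 1 - rpr(j) = patLen - rpr(j)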
We therefore define the good suffix offset as:
delta2(j) = patLen - rpr(j); (0 ≤ j < patLen)
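As a quick arithmetic check, the two shift formulas can be evaluated for the rpr() values derived for "ABXYCDEXY" (patLen = 9). The helper names below are hypothetical and exist only for this illustration.

```cpp
// Hypothetical helpers mirroring the 0-based formulas in the text.

// Number of characters already matched when the mismatch is at pattern index j.
int matchedLength(int patLen, int j) { return patLen - 1 - j; }

// How far the pattern itself moves right under the good suffix rule.
int goodSuffixPatternShift(int j, int rprJ) { return j + 1 - rprJ; }

// delta2(j): how far the text pointer moves right under the good suffix rule.
int delta2PointerShift(int patLen, int rprJ) { return patLen - rprJ; }
```

With rpr(6) = 2, the pattern moves 6 + 1 - 2 = 5 and the pointer moves 9 - 2 = 7, which is the pattern shift plus the m = 2 characters already matched. With rpr(7) = -1 the pointer moves 10, and with rpr(5) = -3 it moves 12.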
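* Readers who have seen other material on the BM algorithm may have met delta2(j) = patLen + 1 - rpr(j). As noted at the start, array indices here begin at 0, so rpr(j) is 1 smaller than in a 1-based description.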
The complete implementation follows:
    #include <stdio.h>   // printf()
    #include <string.h>  // strlen()
    #include <ctype.h>   // toupper(), used by the trace output

    #define ALPHABET_SIZE (1 << (sizeof(char)*8))

    // Enable any/all to trace intermediate results
    //#define TRACE_DELTA1
    //#define TRACE_DELTA2
    //#define TRACE_BM

    void calc_delta1(const char *pat, int patlen, int delta1[])
    {
        int j = 0;
        for (j = 0; j < ALPHABET_SIZE; j++)
            delta1[j] = patlen;
        for (j = 0; j < patlen; j++) {
            // By scanning pat from left to right, the final
            // value in delta1[char] is the *rightmost* occurrence of
            // char in pat. Index through unsigned char so that
            // characters >= 0x80 do not produce a negative index.
            delta1[(unsigned char)pat[j]] = patlen - 1 - j;
        }
    #ifdef TRACE_DELTA1
        printf("Starting dump delta1[]>>>>>>>>>>>>>>>>>>>>>>>>>\n");
        for (j = 0; j < ALPHABET_SIZE; j++) {
            if (delta1[j] != patlen) {
                printf(" %c:%d\n", (char)j, delta1[j]);
            }
        }
        printf(" others:%d\n", patlen);
    #endif
    }

    void calc_delta2(const char *pat, int patlen, int *delta2)
    {
        int i = 0, s = 0, m = 0, n = 0;
        // rpr[j] : where we can find rightmost plausible recurrence of pat[j+1 .. patlen-1]
        int *rpr = new int[patlen];
        // Mark each uninitialized rpr value with a large negative index
        const int def = -2*patlen;
        for (i = 0; i != patlen; i++) {
            rpr[i] = def;
        }
        // r: number of uninitialized entries in rpr[]
        int r = patlen;
        // Scan pattern from right-to-left until all rpr[] are initialized.
        // s: scan position.
        // Examine all substrings that end at pat[s] including null string pat[s .. s]
        for (s = patlen - 1; r > 0; s--) {
            // m: length of substring pat[s-m .. s]
            for (m = 0; m <= patlen - 1 && r > 0; m++) {
                // Introduce j and k (as used in the BM paper)
                // j: index of leftmost character of suffix
                int j = patlen - m - 1;
                // k: index of leftmost character of (possible) recurrence.
                int k = s - m;
    #ifdef TRACE_DELTA2
                const int indent = patlen;
                printf("\ns:%d m:%d j:%d k:%d\n", s, m, j, k);
                printf("p :%*s%s\n", indent, "", pat);
                printf("j :%*s%*.*s\n", indent+j, "", m+1, m+1, &pat[j]);
                printf("k-1:%*s", indent+k-1, "");
                for (n = 0; n <= m; n++) {
                    printf("%c", (k-1+n < 0 ? pat[j+n] : pat[k-1+n]));
                }
                printf("\n");
    #endif
                // We have a match of pat[j+1 .. j+1+m] with pat[k .. k+m]
                // Compare pat[j] to pat[k-1].
                // Match: extend the substring to the left by increasing m
                // Mismatch: terminate the substring and check if plausible RPR
                bool mismatch = false;
                if (k > 0) {
                    if (pat[j] == pat[k-1]) // extend substring
                        continue;
                    mismatch = true;
                }
                // else preceding char, pat[k-1] lies to the left of pat[0]
                // which terminates the substring

                // We have a match of m (possibly zero) characters.
                // pat[j+1 .. j+1+m] matches pat[k .. k+m] and
                // either pat[j] != pat[k-1] or k <= 0.
                // So rpr[j] = k (unless rpr[j] is already > k)
                if (rpr[j] < k) {
    #ifdef TRACE_DELTA2
                    printf("2 :%*s %c %*.*s %*s s:%d m:%d j:%d k:%d r:%d\n",
                           indent+j, "", toupper((unsigned char)pat[j]),
                           m, m, &pat[j+1], (patlen-j-1-m), "", s, m, j, k, r);
    #endif
                    rpr[j] = k;
                    r--;
                }
    #ifdef TRACE_DELTA2
                else {
                    printf("rpr[%d]=%d already inited\n", j, rpr[j]);
                }
    #endif
                // Once we have a mismatch (pat[j] != pat[k-1]) it is fruitless
                // to examine further substrings ending at pat[s],
                // as any subpat ending with pat[s] will not be the rightmost plausible
                // recurrence of the terminal substring pat[j+1 ~ patlen-1]
                if (mismatch) {
                    break;
                }
            }
        }
        for (int j = 0; j != patlen; j++) {
            delta2[j] = patlen - rpr[j];
        }
    #ifdef TRACE_DELTA2
        printf("R:"); // trace rpr[] values
        for (int j = 0; j != patlen; j++) {
            printf(" %3d", rpr[j]);
        }
        printf("\n");
        printf("D:"); // trace delta2[] values
        for (int j = 0; j != patlen; j++) {
            printf(" %3d", delta2[j]);
        }
        printf("\n");
    #endif
        delete [] rpr;
    }

    /*
     * Boyer-Moore search algorithm.
     * Returns a pointer to the first occurrence of pat in string, or NULL.
     */
    const char *boyermoore_search(const char *string, const char *pat)
    {
        int i = 0, j = 0, stringlen = 0;
        const char *result = NULL;
        int patlen = (int)strlen(pat);
        int *delta1 = NULL;
        int *delta2 = NULL;

        if (patlen == 0)
            goto out;
        stringlen = (int)strlen(string);
        if (patlen > stringlen)
            goto out;

        delta1 = new int[ALPHABET_SIZE];
        delta2 = new int[patlen];
    #ifdef TRACE_BM
        printf("pattern: %s\n", pat);
    #endif
        calc_delta1(pat, patlen, delta1);
        calc_delta2(pat, patlen, delta2);
    #ifdef TRACE_BM
        printf("\nCalculating boyermoore_search>>>>>>>>>>>>>>>>>>>>>>>>>\n");
    #endif
        // i: index of current string character
        for (i = patlen-1;;) {
            // Stop once the pattern's last character would fall past the end of string
            if (i >= stringlen) {
                result = NULL;
                goto out;
            }
            // j: index of current pattern character
            j = patlen-1;
            for (;;) {
                if (string[i] != pat[j])
                    break;
    #ifdef TRACE_BM
                printf("p:%*s%*.*s%c%*.*s\n",
                       (i-j), "",
                       j, j, pat,
                       toupper((unsigned char)pat[j]), // mark matched char with upcase
                       patlen-j-1, patlen-j-1, &pat[j+1]);
    #endif
                // Report a match only after pat[0] itself has been compared
                if (j == 0) {
                    result = &string[i];
                    goto out;
                }
                j--;
                i--;
            }
    #ifdef TRACE_BM
            printf("p:%*s%*.*s%c%*.*s\n",
                   (i-j), "",
                   j, j, pat,
                   '?', // mark mismatch char
                   patlen-j-1, patlen-j-1, &pat[j+1]);
            // which-finally-halts.--at-that-point ...
            printf("c:%s\n", string);
    #endif
            // bc: "bad character" shift amount
            int bc = delta1[(unsigned char)string[i]];
            // gs: "good suffix" shift amount
            int gs = delta2[j];
    #ifdef TRACE_BM
            printf("j:%d bc:%d gs:%d\n\n", j, bc, gs);
    #endif
            // Advance by the larger of the two shifts
            // (replaces the MSVC-only __max macro)
            i += (bc > gs) ? bc : gs;
        }
        /* not found */
    out:
        delete [] delta1;
        delete [] delta2;
        return result;
    }

    int main(void)
    {
        char src_str[80] = "WHICH-FINALLY-HALTS.--AT-THAT-POINT";
        char pat_str[80] = "AT-THAT";
        const char *find_str = NULL;

        find_str = boyermoore_search((const char *)src_str, (const char *)pat_str);
        if (NULL != find_str) {
            printf("\n Success find string : %s\n", find_str);
        } else {
            printf("no find pattern string !\n");
        }
        return 0;
    }
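On typical inputs the Boyer-Moore algorithm is sublinear: because it skips ahead, it usually inspects fewer than n text characters, and the longer the pattern, the longer the skips and the more efficient the search. The commonly quoted bound for the search is O(patLen + n).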
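The effect of skipping can be observed by counting how many text characters are inspected. The sketch below is a simplified variant that uses only the bad character rule (with the pattern shift forced to at least 1), not the full algorithm with delta2; SearchStats and badCharOnlySearch are hypothetical names introduced for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <string>

struct SearchStats {
    int index;        // index of the first match in text, or -1 if none
    int comparisons;  // number of character comparisons performed
};

// Simplified search using only the bad character rule (no good suffix rule).
SearchStats badCharOnlySearch(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int patLen = static_cast<int>(pattern.size());
    SearchStats stats{-1, 0};
    if (patLen == 0 || patLen > n) return stats;

    std::array<int, 256> delta1;
    delta1.fill(patLen);
    for (int j = 0; j < patLen; ++j) {
        delta1[static_cast<unsigned char>(pattern[j])] = patLen - 1 - j;
    }

    int i = patLen - 1;  // text index aligned with the last pattern character
    while (i < n) {
        int j = patLen - 1;
        int t = i;       // text pointer, moving right to left
        while (j >= 0) {
            ++stats.comparisons;
            if (text[t] != pattern[j]) break;
            --j;
            --t;
        }
        if (j < 0) {     // every pattern character matched
            stats.index = t + 1;
            return stats;
        }
        const int m = patLen - 1 - j;  // characters matched before the mismatch
        // Pointer offset: the table value, but at least m + 1 so the pattern moves right.
        const int offset = std::max(delta1[static_cast<unsigned char>(text[t])], m + 1);
        i = t + offset;
    }
    return stats;
}
```

Calling badCharOnlySearch("WHICH-FINALLY-HALTS.--AT-THAT-POINT", "AT-THAT") and comparing stats.comparisons with the text length shows that the search reaches the match without touching every character of the text.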
1. A Fast String Searching Algorithm
2. http://en.wikipedia.org/wiki/User:RMcPhillip/sandbox/boyer-moore