String Matching Algorithms: Boyer-Moore

Date: 2019-11-08

Tags: string matching algorithm, boyer moore


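The Boyer-Moore string search algorithm is a very efficient string search algorithm. It was designed by Bob Boyer and J Strother Moore and dates from 1977; the original definition was already given in 1975, and the algorithm for building the tables and the proof of correctness followed later.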

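First, a few definitions:

1. pattern is the pattern string, of length patLen;

2. Text is the target string to be searched, of length n;

3. j is the position in pattern of the current mismatched character (0 ≤ j ≤ patLen - 1);

4. m is the number of characters already matched (0 ≤ m < patLen); because comparison runs from right to left, m = patLen - 1 - j;

5. Δ(*) is the position in pattern of the rightmost occurrence of the character *, where * can be any character;

Many references number array positions from 1 when explaining the principle; here everything starts from 0 so that the text lines up with the code.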

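Let's look at the bad character rule first:

1. Bad character rule: align the mismatched text character with the rightmost occurrence of that character in pattern; if pattern does not contain it, skip past it entirely.

>Case 1: the mismatched character does not occur in pattern (the shift was illustrated by a figure in the original post):

Text pointer moves right by patLen, which puts it at the right end of the shifted pattern;

Pattern moves right by patLen - m;

>Case 2: the mismatched character does occur in pattern. There are two sub-cases:

a>. The rightmost occurrence in pattern lies to the left of the current mismatch position (figure in the original post). Taking the pattern "AT-THAT" used in the code further down (patLen = 7), with m = 2 and the mismatched text character '-':

Text pointer moves right by j - Δ('-') + m = (j + m) - Δ('-') = (patLen - 1) - Δ('-') = (7 - 1) - 2 = 4

Pattern moves right by pointer offset - m = 4 - 2 = 2;

b>. The rightmost occurrence in pattern lies to the right of the current mismatch position (figure in the original post). With the same pattern, m = 2 and the mismatched text character 'T':

Text pointer moves right by (patLen - 1) - Δ('T') = (7 - 1) - 6 = 0

Pattern moves right by pointer offset - m = 0 - 2 = -2

Here the pattern would move backwards, which must never happen; in this case simply shift the pattern right by 1.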

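Summarizing the three cases, we define the bad character function delta1() as the offset of the text pointer:

delta1(*) = patLen; (the mismatched character does not occur in pattern)

= patLen - 1 - Δ(*); (the mismatched character occurs in pattern, and its rightmost occurrence lies to the left of the current mismatch position)

= m + 1; (the mismatched character occurs in pattern, and its rightmost occurrence lies to the right of the current mismatch position; a pointer offset of m + 1 is a pattern shift of 1)

The code below stores only the first two cases in its delta1 table. It does not special-case the third one: it takes the larger of delta1 and delta2, and delta2(j) is always at least m + 1, so the pattern never moves backwards.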

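2. Good suffix rule: using the part that has already matched (subpat), look in pattern for another substring that matches subpat in full or in part and align it directly, avoiding useless shifts.

A few conventions first:

1. Let $ be a character that does not occur in pattern, and let pat[i] = $ for i < 0;

2. Two sequences [c_0 ... c_{n-1}] and [d_0 ... d_{n-1}] unify if and only if, for every j with 0 ≤ j < n, c_j = d_j or c_j = $ or d_j = $;

3. rpr(j) (rightmost plausible reoccurrence) is the position of the rightmost plausible reoccurrence of subpat (pat[j+1 .. patLen-1]): the greatest k such that [pat[j+1] ... pat[patLen-1]] and [pat[k] ... pat[k + patLen - j - 2]] unify and either k ≤ 0 or pat[k-1] != pat[j].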

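A figure in the original post listed the rpr() values for the pattern "ABXYCDEXY"; let's walk through them:

a>. j = 8: the matched substring pat[j+1 .. patLen-1] is empty, and the empty string unifies at any position. The rightmost k with pat[k-1] != pat[j] is k = 8 (pat[7] = 'X', pat[8] = 'Y'), so rpr(8) = 8.

b>. j = 7: subpat is "Y". pat[3 .. 3] = subpat with k = 3 > 0, but pat[k-1] == pat[j] = 'X', so the condition fails. Continuing leftwards, subpat can only be placed just before the head of pattern, at position -1, so rpr(7) = -1.

c>. j = 6: subpat is "XY". pat[2 .. 3] = subpat and pat[k-1] != pat[j] ('B' versus 'E'), so rpr(6) = 2.

d>. j = 5: subpat is "EXY". No substring of pattern unifies with it, so it can only hang off the head of pattern, and rpr(5) = -3;

The remaining cases follow the same reasoning, and the cases above cover every way of computing rpr(). One regularity falls out of the analysis:

rpr[patLen-1] = patLen-1 (whenever pat[patLen-2] != pat[patLen-1], as in this pattern).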

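This gives the shift for the good suffix rule, which moves pat[k] (with k = rpr(j)) to the text position where pat[j+1] was matched:

Pattern moves right by j + 1 - rpr(j)

Text pointer moves right by m + j + 1 - rpr(j) = (patLen - 1 - j) + j + 1 - rpr(j) = patLen - rpr(j)

We now define the good suffix shift:

delta2(j) = patLen - rpr(j); (0 ≤ j < patLen)

*Readers who have seen other material on the BM algorithm may have met delta2(j) = patLen + 1 - rpr(j). As noted at the start, array indices here begin at 0, so rpr(j) is 1 less than its value under 1-based indexing, and the two formulas give the same shift;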

The complete implementation follows:

#include <string.h>  // strlen()
#include <stdio.h>   // printf(), used unconditionally by main()
#include <stdlib.h>  // __max() (MSVC-specific macro)

#define ALPHABET_SIZE (1 << (sizeof(char)*8))

// Enable any/all to trace intermediate results
//#define TRACE_DELTA1
//#define TRACE_DELTA2
//#define TRACE_BM

#if defined TRACE_DELTA1 || defined TRACE_DELTA2 || defined TRACE_BM
#include <stdio.h>
#include <ctype.h>
#endif

void calc_delta1(const char *pat, int patlen, int delta1[]) 
{
	int j = 0;
	for (j = 0; j < ALPHABET_SIZE; j++)
		delta1[j] = patlen;

	for (j = 0; j < patlen; j++)
	{
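		// Scanning pat from left to right means the final value stored
		// in delta1[c] comes from the *rightmost* occurrence of c in pat.
		// Index through unsigned char so that a negative char value
		// cannot land before the start of the table.
		delta1[(unsigned char)pat[j]] = patlen - 1 - j;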
	}

#ifdef TRACE_DELTA1
	printf("Starting dump delta1[]>>>>>>>>>>>>>>>>>>>>>>>>>\n");
	for (j = 0; j < ALPHABET_SIZE; j++)
	{
		if (delta1[j] != patlen)
		{
			printf("       %c:%d\n", (char)j, delta1[j]);
		}
	}
	printf("  others:%d\n", patlen);
#endif
}

void calc_delta2(const char *pat, int patlen, int * delta2)
{
	int i = 0, j = 0, s = 0, m = 0, n = 0;
	// rpr[j] : where we can find rightmost plausible recurrence of pat[j+1 .. patlen-1]
	int *rpr = new int[patlen];

	// Mark each uninitialized rpr value with a large negative index
	const int def = -2*patlen;
	for (i = 0; i != patlen; i++)
	{
		rpr[i] = def;
	}

	// r: number of uninitialized entries in rpr[]
	int r = patlen;

	// Scan pattern from right-to-left until all rpr[] are initialized.
	// s: scan position.
	// Examine all substrings that end at pat[s] including null string pat[s .. s]
	for (s = patlen - 1; r > 0; s--)
	{
		// m: length of substring  pat[s-m .. s]
		for (m = 0; m <= patlen - 1 && r > 0; m++)
		{
			// Introduce j and k (as used in the BM paper)
			// j: index of leftmost character of suffix
			int j = patlen - m - 1;
			// k: index of leftmost character of (possible) recurrence.
			int k = s - m;

		#ifdef TRACE_DELTA2
			const int indent = patlen;
			printf("\ns:%d m:%d j:%d k:%d\n", s, m, j, k);
			printf("p  :%*s%s\n", indent, "", pat);
			printf("j  :%*s%*.*s\n", indent+j, "", m+1, m+1, &pat[j] );
			printf("k-1:%*s", indent+k-1, "");
			for (n = 0; n <= m; n++)
			{
				printf("%c", (k-1+n < 0 ? pat[j+n] : pat[k-1+n]) );
			}
			printf("\n");
		#endif

			// We have a match of pat[j+1 .. j+1+m] with pat[k .. k+m]
			// Compare pat[j] to pat[k-1].
			// Match: extend the substring to the left by increasing m
			// Mismatch: terminate the substring and check if plausible RPR

			bool mismatch = false;
			if (k > 0)
			{
				if (pat[j] == pat[k-1]) // extend substring
					continue;
				mismatch = true;
			}
			// else preceding char, pat[k-1] lies to the left of pat[0]
			// which terminates the substring

			// We have a match of m (possibly zero) characters.
			// pat[j+1 .. j+1+m] matches pat[k .. k+m] and
			// either pat[j] != pat[k-1] or k <= 0.
			// So rpr[j] = k (unless rpr[j] is already > k)
			if (rpr[j] < k)
			{
			#ifdef TRACE_DELTA2
				printf("2  :%*s %c %*.*s %*s s:%d m:%d j:%d k:%d r:%d\n",
					indent+j, "",
					toupper(pat[j]),
					m, m, &pat[j+1],
					(patlen-j-1-m), "",
					s, m, j, k, r);
			#endif
				rpr[j] = k;
				r--;
			}
		#ifdef TRACE_DELTA2
			else
			{
				printf("rpr[%d]=%d already inited\n", j, rpr[j]);
			}
		#endif

			// Once we have a mismatch (pat[j] != pat[k-1]) it is fruitless 
			//to examine further substrings ending at pat[s];
			//as Any subpat end with pat[s] will not be the rightmost plausible 
			//recurrence of the terminal substring pat[j+1 ~ patlen-1]
			if (mismatch)
			{
				break;
			}
		}
	}

	for (j = 0; j != patlen; j++) 
	{
		delta2[j] = patlen - rpr[j];
	}

#ifdef TRACE_DELTA2
	printf("R:"); // trace rpr[] values
	for (j = 0; j != patlen; j++)
	{
		printf(" %3d", rpr[j] );
	}

	printf("\n");
	printf("D:"); // trace delta2[] values

	for (j = 0; j != patlen; j++)
	{
		printf(" %3d", delta2[j] );
	}
	printf("\n");
#endif

	delete [] rpr;
}

/*
* Boyer-Moore search algorithm
*/
const char *boyermoore_search(const char * string, const char *pat) 
{
	int i = 0, j = 0, stringlen = 0;
	const char *result = NULL;

	int patlen = strlen(pat);
	int *delta1 = NULL;
	int *delta2 = NULL;

	if (patlen == 0)
		goto out;

	stringlen = strlen(string);
	if (patlen > stringlen)
		goto out;

	delta1 = new int[ALPHABET_SIZE];
	delta2 = new int[patlen];

#ifdef TRACE_BM
	printf("pattern: %s\n", pat);
#endif
	calc_delta1(pat, patlen, delta1);
	calc_delta2(pat, patlen, delta2);

#ifdef TRACE_BM
	printf("\nCalculating boyermoore_search>>>>>>>>>>>>>>>>>>>>>>>>>\n");
#endif

	// i: index of current string character
	for (i = patlen-1;;) 
	{
		if (i >= stringlen) // ran past the last text character
		{
			result = NULL;
			goto out;
		}

		// j: index of current pattern character
		j = patlen-1;
		for (;;)
		{
			// Compare first: pat[0] has to be checked like every other
			// character before a match is reported.
			if (string[i] != pat[j])
				break;

		#ifdef TRACE_BM
			printf("p:%*s%*.*s%c%*.*s\n",
				(i-j), "",
				j, j, pat,
				toupper(pat[j]), // mark matched char with upcase
				patlen-j-1, patlen-j-1, &pat[j+1]);
		#endif
			if (j == 0)
			{
				// whole pattern matched; string[i] is where it starts
				result = &string[i];
				goto out;
			}
			j--;
			i--;
		}

	#ifdef TRACE_BM
		printf("p:%*s%*.*s%c%*.*s\n",
			(i-j), "",
			j, j, pat,
			'?', // mark mismatch char
			patlen-j-1, patlen-j-1, &pat[j+1]);
		printf("c:%s\n", string);
	#endif
		// bc: "bad character" shift amount
		int bc = delta1[(unsigned char)string[i]];

		// gs: "good suffix" shift amount
		int gs = delta2[j];

	#ifdef TRACE_BM
		printf("j:%d bc:%d gs:%d\n\n", j, bc, gs);
	#endif
		i += (bc > gs) ? bc : gs; // portable replacement for the MSVC-only __max()
	}

/* common exit for both "found" and "not found": release the tables */
out:
	delete [] delta1;
	delete [] delta2;
	return result;
}

int main(void)
{
	char src_str[80] = "WHICH-FINALLY-HALTS.--AT-THAT-POINT";
	char pat_str[80] = "AT-THAT";
	const char* find_str = NULL;

	find_str = boyermoore_search((const char *)src_str, (const char *)pat_str);
	if(NULL != find_str)
	{
		printf("\n Success find string : %s\n", find_str);
	}
	else
	{
		printf("no find pattern string !\n");
	}
}

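On typical inputs the Boyer-Moore search is sublinear: it inspects fewer than n characters of the text, about n / patLen in the best case. O(patLen + n) is the bound usually quoted for preprocessing plus search. The longer the pattern, the larger the shifts tend to be and the more efficient the algorithm becomes.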

References:

1. Robert S. Boyer and J Strother Moore, "A Fast String Searching Algorithm", Communications of the ACM, 1977

2. http://en.wikipedia.org/wiki/User:RMcPhillip/sandbox/boyer-moore