String Matching Algorithms: The Boyer-Moore-Horspool Algorithm

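The Boyer-Moore-Horspool algorithm, also called the Horspool algorithm, was designed by Nigel Horspool in 1980 as a simplified variant of the Boyer-Moore (BM) algorithm. BM's good-suffix rule is hard to understand, and the proofs of its efficiency and correctness had not been settled at the time, so the Horspool algorithm uses only a bad-character rule taken from BM.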


Borrowing the idiom "find a needle in a haystack", the task here is to find the string needle inside the string haystack (needle is the same thing as the pattern). Assume haystack has length n and needle has length m.


Basic principle:

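Like Boyer-Moore, the Horspool algorithm compares from right to left, but it modifies the bad-character rule. Comparing from right to left, when a mismatch character is encountered:

BM shift rule: the mismatched haystack character is aligned with the rightmost occurrence of that character in needle.

Horspool shift rule: the haystack character that lines up with the last character of needle is aligned with the rightmost occurrence of that character in needle (not counting needle's last position), no matter where the mismatch occurred.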
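The two rules can be contrasted with a minimal sketch. The helper names rightmost, bm_bad_char_shift and horspool_shift are hypothetical and exist only for illustration; the BM variant shown is the simplified bad-character rule alone, without the good-suffix rule.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only. */

/* Index of the rightmost occurrence of c in needle[0..len-1], or -1. */
long rightmost(const unsigned char *needle, size_t len, unsigned char c)
{
    size_t i;
    for (i = len; i > 0; i--)
        if (needle[i - 1] == c)
            return (long)(i - 1);
    return -1;
}

/* Simplified BM bad-character rule: j is the needle index where the
 * mismatch happened, c is the haystack byte found at that position.
 * The shift is at least 1 so the search always advances. */
size_t bm_bad_char_shift(const unsigned char *needle, size_t nlen,
                         size_t j, unsigned char c)
{
    long shift = (long)j - rightmost(needle, nlen, c);
    return shift > 0 ? (size_t)shift : 1;
}

/* Horspool rule: c is the haystack byte under the needle's LAST
 * position, wherever the mismatch was. The last needle position is
 * excluded from the lookup. Assumes nlen >= 1. */
size_t horspool_shift(const unsigned char *needle, size_t nlen,
                      unsigned char c)
{
    long r = rightmost(needle, nlen - 1, c);
    return (size_t)((long)(nlen - 1) - r);   /* r == -1 gives nlen */
}
```

For needle "abcd" against the window "azcd", the mismatch is at index 1 on 'z': the simplified BM rule shifts by 2, while the Horspool rule looks at 'd' under the last position and shifts by 4.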


The bad-character shift table is initialized in the same way as in BM. Once the principle is understood, the code is easy to follow.
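As a small worked example of the table initialization, the sketch below builds the shift table for one needle. The function name build_skip_table is an assumption made for this illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper for illustration: fill table[] (UCHAR_MAX + 1
 * entries) with the Horspool shift for every possible byte value.
 * Assumes nlen >= 1. */
void build_skip_table(const unsigned char *needle, size_t nlen,
                      size_t table[UCHAR_MAX + 1])
{
    size_t i;
    size_t last = nlen - 1;

    /* Bytes that do not occur in the needle allow a full-length shift. */
    for (i = 0; i <= UCHAR_MAX; i++)
        table[i] = nlen;

    /* Later (more rightward) occurrences overwrite earlier ones, so each
     * byte ends up with the distance from its rightmost occurrence to
     * the last position. The last byte itself is deliberately skipped. */
    for (i = 0; i < last; i++)
        table[needle[i]] = last - i;
}
```

For the needle "AT-THAT" (length 7) this gives 'A' = 1, 'H' = 2, 'T' = 3, '-' = 4, and 7 for every other byte.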

The full implementation follows:

#include <stdio.h>
#include <string.h>		//strlen
#include <limits.h>		//UCHAR_MAX
 
/* Returns a pointer to the first occurrence of "needle"
 * within "haystack", or NULL if not found. Works like
 * memmem() OR strstr().
 */
 
/* Note: In this example needle is a C string. The terminating
 * 0x00 must not be counted in nlen, so you could call this example with
 * boyermoore_horspool_memmem(haystack, hlen, "abc", sizeof("abc") - 1)
 */
const unsigned char *
boyermoore_horspool_memmem(const unsigned char* haystack, size_t hlen,
                           const unsigned char* needle,   size_t nlen)
{
    size_t scan = 0;
    size_t bad_char_skip[UCHAR_MAX + 1]; /* Officially called:
                                          * bad character shift */
 
    /* Sanity checks on the parameters */
    if (nlen == 0 || !haystack || !needle) /* nlen is unsigned, so only 0 is invalid */
        return NULL;
 
    /* ---- Preprocess ---- */
    /* Initialize the table to default value */
    /* When a character is encountered that does not occur
     * in the needle, we can safely skip ahead for the whole
     * length of the needle.
     */
    for (scan = 0; scan <= UCHAR_MAX; scan = scan + 1)
        bad_char_skip[scan] = nlen;
 
    /* C arrays have the first byte at [0], therefore:
     * [nlen - 1] is the last byte of the array. */
    size_t last = nlen - 1;
 
    /* Then populate it with the analysis of the needle */
    for (scan = 0; scan < last; scan = scan + 1)
        bad_char_skip[needle[scan]] = last - scan;
 
    /* ---- Do the matching ---- */
 
    /* Search the haystack, while the needle can still be within it. */
    while (hlen >= nlen)
    {
        /* scan from the end of the needle */
        for (scan = last; haystack[scan] == needle[scan]; scan = scan - 1)
        {
            if (scan == 0) /* If the first byte matches, we've found it. */
                return haystack;
        }
 
        /* otherwise, we need to skip some bytes and start again.
           Note that here we are getting the skip value based on the last byte
           of needle, no matter where we didn't match. So if needle is: "abcd"
           then we are skipping based on 'd' and that value will be 4, and
           for "abcdd" we again skip on 'd' but the value will be only 1.
           The alternative of pretending that the mismatched character was
           the last character is slower in the normal case (E.g. finding
           "abcd" in "...azcd..." gives 4 by using 'd' but only
           4-2==2 using 'z'. */
        hlen     -= bad_char_skip[haystack[last]];
        haystack += bad_char_skip[haystack[last]];		//this is the main difference from the bad-character rule in BM
    }
 
    return NULL;
}

int main(void)
{
	char haystack[80] = "WHICH-FINALLY-HALTS.--AT-THAT-POINT";
	char needle[80] = "AT-THAT";
	const unsigned char* find_str = NULL;

	find_str = boyermoore_horspool_memmem((const unsigned char *)haystack, strlen(haystack), (const unsigned char *)needle, strlen(needle));
	if(NULL != find_str)
	{
		printf("Found pattern at: %s\n", find_str);
	}
	else
	{
		printf("Pattern not found!\n");
	}

	return 0;
}
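To see how far the shifts actually carry the search, here is a hedged variant that returns the match offset and counts how many alignments (windows) were tried. The name horspool_find_counted and its alignments out-parameter are assumptions made for this sketch and are not part of the listing above.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical variant for illustration: returns the offset of the first
 * match, or (size_t)-1 if there is none. *alignments receives the number
 * of window positions that were examined. */
size_t horspool_find_counted(const unsigned char *hay, size_t hlen,
                             const unsigned char *needle, size_t nlen,
                             size_t *alignments)
{
    size_t skip[UCHAR_MAX + 1];
    size_t last, pos, i;

    *alignments = 0;
    if (nlen == 0 || nlen > hlen)
        return (size_t)-1;

    /* Same preprocessing as the Horspool shift table. */
    last = nlen - 1;
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = nlen;
    for (i = 0; i < last; i++)
        skip[needle[i]] = last - i;

    pos = 0;
    while (pos + nlen <= hlen) {
        ++*alignments;
        /* Compare right to left within the current window. */
        i = last;
        while (hay[pos + i] == needle[i]) {
            if (i == 0)
                return pos;
            i--;
        }
        /* Shift by the entry for the byte under the window's last slot. */
        pos += skip[hay[pos + last]];
    }
    return (size_t)-1;
}
```

With the haystack "WHICH-FINALLY-HALTS.--AT-THAT-POINT" and the needle "AT-THAT", this sketch tries the windows at offsets 0, 7, 11, 14, 18 and 22, so it reports the match at offset 22 after 6 alignments, where a naive scan would have tried 23.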