The AC (Aho-Corasick) Multi-Pattern Matching Algorithm

Date: 2019-11-07

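Introduction:

The AC multi-pattern matching algorithm was developed at Bell Labs in 1975 and was first used in a library bibliographic search program. It is built on finite state automata (FSA) and on the prefix function of the KMP algorithm. (It is often described as the multi-pattern form of KMP, realized as a finite automaton.)

Definition of the AC automaton: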

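The AC finite automaton M is a 6-tuple M = (Q, Σ, g, f, q0, F), where:

1. Q is the finite set of states (all nodes of the pattern tree).

2. Σ is the finite input alphabet (the characters on all edges of the pattern tree).

3. g is the transition (goto) function.

4. f is the failure function, which gives the state the automaton moves to when no goto transition matches.

5. q0 ∈ Q is the initial state (the root node).

6. F ⊆ Q is the set of final states (the nodes labeled with a pattern).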

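Implementing the AC finite state automaton:

Throughout, assume the pattern set {he, she, his, hers} and the input string "ushers".

The AC algorithm builds three functions: the goto function, the failure function, and the output function. The output function is built partly during the construction of goto and partly during the construction of failure.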

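1. Operating cycle of the AC finite state automaton M:

a> If g(s, a) = s', M makes a goto transition: it enters state s' and the next character of the input becomes the current character. If s' matches one or more patterns, they are output.

b> If g(s, a) = fail, M makes a failure transition to the state s' = f(s) and repeats step a with s' as the current state and a still as the current character.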

Pseudocode for M:
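The operating cycle above can be sketched in a few lines of C++. This is an illustration only, not the original pseudocode: the `Machine` struct and the functions `example_machine`, `go_to`, and `run` are names invented here, and the goto, failure, and output tables for {he, she, his, hers} are entered by hand rather than constructed.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the machine M of the running example.
struct Machine {
    std::vector<std::map<char, int>> go;        // goto edges leaving each state
    std::vector<int> fail;                      // failure function f
    std::vector<std::vector<std::string>> out;  // output function
};

// Tables for {he, she, his, hers}, entered by hand (assumed state numbering 0..9).
Machine example_machine() {
    Machine m;
    m.go.resize(10);
    m.go[0] = {{'h', 1}, {'s', 3}};
    m.go[1] = {{'e', 2}, {'i', 6}};
    m.go[2] = {{'r', 8}};
    m.go[3] = {{'h', 4}};
    m.go[4] = {{'e', 5}};
    m.go[6] = {{'s', 7}};
    m.go[8] = {{'s', 9}};
    m.fail = {0, 0, 0, 0, 1, 2, 0, 3, 0, 3};
    m.out.resize(10);
    m.out[2] = {"he"};
    m.out[5] = {"she", "he"};
    m.out[7] = {"his"};
    m.out[9] = {"hers"};
    return m;
}

// g(s, a): returns -1 for "fail"; state 0 never fails.
int go_to(const Machine& m, int s, char a) {
    auto it = m.go[s].find(a);
    if (it != m.go[s].end()) return it->second;
    return s == 0 ? 0 : -1;
}

// Runs M over text and returns (0-based start index, keyword) pairs
// in the order in which the matches are found.
std::vector<std::pair<int, std::string>> run(const Machine& m, const std::string& text) {
    std::vector<std::pair<int, std::string>> hits;
    int s = 0;
    for (int i = 0; i < (int)text.size(); ++i) {
        while (go_to(m, s, text[i]) == -1) s = m.fail[s];  // step b: failure transitions
        s = go_to(m, s, text[i]);                          // step a: goto transition
        for (const std::string& kw : m.out[s])
            hits.push_back({i + 1 - (int)kw.size(), kw});
    }
    return hits;
}
```

With these tables, `run(example_machine(), "ushers")` reports she at index 1, he at index 2, and hers at index 2.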

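2. Constructing the goto function and the output function:

goto function: maps a state and an input character either to another state or to fail (it corresponds to the transition function of an FSA).

It is defined as follows:

g(S, a), with S ∈ Q and a ∈ Σ, is the state reached from the current state S by following the edge labeled a. If the edge (U, V) is labeled a, then g(U, a) = V. If no edge leaving the root is labeled a, then g(0, a) = 0: when no pattern starts with the current character, the automaton stays in the initial state. If U is not the root and no edge leaving U is labeled a, then g(U, a) = fail.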

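Figure (a) shows the pattern matching machine built by the goto function for the pattern set {he, she, his, hers}. Its edges are:

0 -h-> 1 -e-> 2 -r-> 8 -s-> 9
1 -i-> 6 -s-> 7
0 -s-> 3 -h-> 4 -e-> 5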

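State 0 is the initial state. The directed edge from state 0 to state 1 is labeled 'h', meaning g(0, h) = 1. A missing edge means fail: for any character σ other than e or i, g(1, σ) = fail.

Note: g(0, σ) != fail for every character σ. This property of state 0 ensures that M can process any input character. For any character σ that does not label an edge leaving state 0, g(0, σ) = 0. It also means that the failure function of state 0 is never consulted and does not need to be computed.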

Pseudocode for constructing goto:
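As a rough sketch of the goto construction described above, the following C++ inserts each keyword into a trie and exposes g(s, a) with the root's special case. The names `GotoTable`, `build_goto`, `g`, `FAIL`, and `final_of` are assumptions made for this example, and states are identified by integer indices instead of the node pointers used in the full listing later on.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical representation: go[s] maps an edge label to the target state.
using GotoTable = std::vector<std::map<char, int>>;

const int FAIL = -1;

// Inserts every keyword into a trie; new states are numbered in creation order.
// final_of[s] receives the index of the keyword that ends at state s, or -1.
GotoTable build_goto(const std::vector<std::string>& keywords, std::vector<int>& final_of) {
    GotoTable go(1);  // state 0 is the root
    final_of.assign(1, -1);
    for (int k = 0; k < (int)keywords.size(); ++k) {
        int s = 0;
        for (char a : keywords[k]) {
            auto it = go[s].find(a);
            if (it != go[s].end()) {
                s = it->second;          // edge already exists: follow it
            } else {
                int t = (int)go.size();  // otherwise create a new state
                go.emplace_back();       // may reallocate, so index go again below
                final_of.push_back(-1);
                go[s][a] = t;
                s = t;
            }
        }
        final_of[s] = k;  // partial output function: keyword k ends here
    }
    return go;
}

// g(s, a): the root never fails, every other state fails on a missing edge.
int g(const GotoTable& go, int s, char a) {
    auto it = go[s].find(a);
    if (it != go[s].end()) return it->second;
    return s == 0 ? 0 : FAIL;
}
```

For the keywords he, she, his, hers inserted in that order, this numbering yields ten states with g(0, h) = 1, g(0, s) = 3, and g(4, e) = 5, which agrees with the state numbers used in the text.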

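3. Constructing the failure function and completing the output function:

Failure function: also a mapping from state to state, consulted when goto reports fail. The underlying idea is the prefix function of the KMP algorithm.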

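Figure (b) gives the failure function value f(i) of each state i:

i:    1 2 3 4 5 6 7 8 9
f(i): 0 0 0 1 2 0 3 0 3

Figure (c) gives the final output value of each state i (states not listed have an empty output):

output(2) = {he}
output(5) = {she, he}
output(7) = {his}
output(9) = {hers}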

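First, define the depth of a state s in the goto graph of figure (a) as the length of the shortest path from state 0 to s. For example, the start state has depth 0, states 1 and 3 have depth 1, states 2, 4, and 6 have depth 2, and so on.

Approach: compute the failure function for all states of depth 1, then for all states of depth 2, and so on, until it has been computed for every state except state 0, whose failure function is undefined.

Method: the algorithm for computing the failure function of a state is conceptually simple. First, set f(s) = 0 for every state s of depth 1. Then, assuming f has been computed for all states of depth less than d, the failure function of each state of depth d is computed from those values.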

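Detailed steps:

To compute the failure function for the states of depth d, consider each state r of depth d-1 and perform the following steps:

Step 1: If g(r, a) = fail for every character a, do nothing. (This simply means r has no children, so there is nothing to compute.)

Step 2: Otherwise, for each character a such that g(r, a) = s, do the following:

a) Set state = f(r);

b) Execute state = f(state) zero or more times, until a value of state is reached for which g(state, a) != fail. (Such a state is always found, because g(0, a) != fail for every character a.)

c) Set f(s) = g(state, a).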

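Note: this is a mouthful. In one sentence: check whether the failure state of the current state's parent has an outgoing edge labeled with the current character. If it does, the target of that edge is the failure state of the current state. If it does not, repeat the check on the failure state of that failure state, and so on up the chain.

Example: to find the failure state of state 4 in figure (a), note that its parent is state 3 with f(3) = 0, and that state 0 (the root) has an outgoing edge labeled 'h' leading to state 1, so f(4) = 1. This mirrors the prefix rule: f(4) corresponds to the longest proper suffix of the matched string "sh" that is also a prefix of some pattern, here "h".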

Pseudocode for the failure function:
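Steps a) to c) can be sketched as a breadth-first pass over the trie. The code below is one possible reading of the procedure, not the original pseudocode: `GotoTable`, `build_trie`, and `build_failure` are names made up for this example, states are integer indices, and the small trie builder is included only so that the sketch stands on its own.

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

using GotoTable = std::vector<std::map<char, int>>;

// Compact trie builder, included only to make the sketch self-contained.
GotoTable build_trie(const std::vector<std::string>& keywords) {
    GotoTable go(1);  // state 0 is the root
    for (const std::string& kw : keywords) {
        int s = 0;
        for (char a : kw) {
            if (!go[s].count(a)) {
                int t = (int)go.size();
                go.emplace_back();
                go[s][a] = t;
            }
            s = go[s][a];
        }
    }
    return go;
}

// Computes f for every state, one depth level at a time (breadth-first).
std::vector<int> build_failure(const GotoTable& go) {
    std::vector<int> f(go.size(), 0);
    std::queue<int> pending;
    for (const auto& edge : go[0]) pending.push(edge.second);  // depth 1: f(s) = 0
    while (!pending.empty()) {
        int r = pending.front();
        pending.pop();
        for (const auto& edge : go[r]) {
            char a = edge.first;
            int s = edge.second;
            int state = f[r];                                  // step a
            while (state != 0 && !go[state].count(a))          // step b
                state = f[state];
            auto it = go[state].find(a);
            f[s] = (it != go[state].end()) ? it->second : 0;   // step c, g(0, a) = 0 if no edge
            pending.push(s);
        }
    }
    return f;
}
```

For {he, she, his, hers} the loop in step b never runs more than zero times. A set such as {abc, bd, c} exercises it: the state for "abc" first tries f("ab") = "b", finds no 'c' edge there, falls back to the root, and ends with f("abc") = "c".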

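4. Finally, the search pass:

One special case arises during the search: a pattern in the set may be a substring of another pattern. To avoid missing the shorter pattern, its output must be added to the state on the longer pattern's path at which the shorter pattern ends. In code: if the failure state of a state is final, then the state itself is also final and must output the patterns matched at its failure state.

Implementation: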
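One way to honor this rule without storing merged output sets is to walk the failure chain when a state is entered. The sketch below assumes a hypothetical helper `output_of`, a failure table `f` indexed by state number, and a table `keyword_at` that holds the keyword ending exactly at each state (empty string if none). None of these names come from the original code.

```cpp
#include <string>
#include <vector>

// Collects the keywords recognised on entering state s by visiting
// s, f(s), f(f(s)), ... until the root (state 0) is reached.
std::vector<std::string> output_of(int s,
                                   const std::vector<int>& f,
                                   const std::vector<std::string>& keyword_at) {
    std::vector<std::string> result;
    for (int t = s; t != 0; t = f[t]) {
        if (!keyword_at[t].empty()) result.push_back(keyword_at[t]);
    }
    return result;
}
```

With f = {0, 0, 0, 0, 1, 2, 0, 3, 0, 3} and keywords he, she, his, hers stored at states 2, 5, 7, and 9, `output_of(5, f, keyword_at)` yields she followed by he, because f(5) = 2 is the final state of he. Merging the sets once at construction time gives the same result and avoids the walk at search time.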

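#include <iostream>
#include <queue>
#include <ctype.h>	/* isgraph */
#include <stdio.h>	/* printf */
#include <stdlib.h>	/* malloc, realloc, free */
#include <string.h>	/* memset, strlen */
using namespace std;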


/* reallocation step for AC_NODE_t.outgoing array */
#define REALLOC_CHUNK_OUTGOING 2

struct ac_edge;

typedef struct node{
	unsigned int id; 		/* Node ID : just for debugging purpose */
	unsigned short depth;	/* depth: distance between this node and the root */
	
	struct node *parent;		/*parent node, for compute failure function*/
	struct node *failure_node;	/* The failure node of this node */

	short int final; 		/* 0: no ; 1: yes, it is a final node */
	int patternNo;		/*Accept pattern index: just for debugging purpose */

	/* Outgoing Edges */
	struct ac_edge* outgoing_edge;/* Array of outgoing character edges */
	unsigned short outgoing_num;	/* Number of outgoing character edges */
	unsigned short outgoing_max;	/* Max capacity of allocated memory for outgoing character edges */
}AC_NODE_t;

/* The Outgoing Edge of the Node */
struct ac_edge
{
    char alpha; /* Edge alpha */
    AC_NODE_t * next;	/* Target of the edge */
};


static void node_assign_id (AC_NODE_t * thiz);
static AC_NODE_t * node_find_next(AC_NODE_t * pAc_node, char ch);


/******************************************************************************
 * Create node
******************************************************************************/
AC_NODE_t *node_create()
{
	AC_NODE_t* pNode = (AC_NODE_t*)malloc(sizeof(AC_NODE_t));

	memset(pNode, 0, sizeof(AC_NODE_t));

	pNode->failure_node = NULL;
	pNode->parent = NULL;
	pNode->final = 0;

	/*init outgoing character edges*/
	pNode->outgoing_max = REALLOC_CHUNK_OUTGOING;
	pNode->outgoing_edge = (struct ac_edge *) malloc (pNode->outgoing_max*sizeof(struct ac_edge));

	node_assign_id(pNode);

	return pNode;
}

/******************************************************************************
 * assign a unique ID to the node (used for debugging purpose).
******************************************************************************/
static void node_assign_id (AC_NODE_t * thiz)
{
	static int unique_id = 0;
	thiz->id = unique_id ++;
}

/******************************************************************************
 * Establish an new edge between two nodes
******************************************************************************/
void node_add_outgoing (AC_NODE_t * thiz, AC_NODE_t * next, char alpha)
{
	if(thiz->outgoing_num >= thiz->outgoing_max)
	{
		thiz->outgoing_max += REALLOC_CHUNK_OUTGOING;
		thiz->outgoing_edge = (struct ac_edge *)realloc(thiz->outgoing_edge, thiz->outgoing_max*sizeof(struct ac_edge));
	}

	thiz->outgoing_edge[thiz->outgoing_num].alpha = alpha;
	thiz->outgoing_edge[thiz->outgoing_num++].next = next;
}

/******************************************************************************
 * Create a next node with the given alpha.
******************************************************************************/
AC_NODE_t * node_create_next (AC_NODE_t * pCur_node, char alpha)
{
	AC_NODE_t * pNext_node = NULL;
	pNext_node = node_find_next (pCur_node, alpha);

	if (pNext_node)
	{
		/* The (labeled alpha) edge already exists */
		return NULL;
	}

	/* Otherwise add new edge (node) */
	pNext_node = node_create ();
	node_add_outgoing(pCur_node, pNext_node, alpha);

	return pNext_node;
}

/******************************************************************************
 * Find out the next node for a given Alpha to move. this function is used in
 * the pre-processing stage in which edge array is not sorted. so it uses linear search.
******************************************************************************/
static AC_NODE_t * node_find_next(AC_NODE_t * pAc_node, char ch)
{
	int i = 0;

	if(NULL == pAc_node)
	{
		return NULL;
	}

	for (i=0; i < pAc_node->outgoing_num; i++)
	{
		if(pAc_node->outgoing_edge[i].alpha == ch)
			return (pAc_node->outgoing_edge[i].next);
	}

	return NULL;
}

/******************************************************************************
* Push all child nodes of parent (the targets of its outgoing edges) onto the queue
******************************************************************************/
int  queue_add_leaf_node(AC_NODE_t *parent, queue<AC_NODE_t*> &myqueue)
{
	int i;

	for (i = 0; i < parent->outgoing_num; i++)
	{
		myqueue.push (parent->outgoing_edge[i].next);
	}

	return 0;
}

/******************************************************************************
 * Initialize automata; allocate memories and add patterns into automata
******************************************************************************/
AC_NODE_t * ac_automata_create(char pattern[][255], int patterns_num)
{
	int iPattern_index, iChar_index;
	AC_NODE_t *root = node_create();
	AC_NODE_t *pCur_node = NULL, *pNext_node = NULL;
	char alpha;

	for(iPattern_index=0; iPattern_index<patterns_num; iPattern_index++)
	{
		pCur_node = root;
		for(iChar_index=0; iChar_index<(int)strlen(pattern[iPattern_index]); iChar_index++)   /* process the pattern one character at a time */
		{
			alpha = pattern[iPattern_index][iChar_index];
			pNext_node = node_find_next(pCur_node, alpha);
			if(NULL != pNext_node)
			{
				pCur_node = pNext_node;
			}
			else
			{
				pNext_node = node_create_next(pCur_node, alpha);
				if(NULL != pNext_node)
				{
					pNext_node->parent = pCur_node;
					pNext_node->depth = pCur_node->depth + 1;

					pCur_node = pNext_node;
				}
			}
		}

		pCur_node->final = 1;
		pCur_node->patternNo = iPattern_index;
	}

	return root;
}

/******************************************************************************
 * Find the failure node of every node. The failure function maps a state to a
 * new state and is consulted whenever the goto function reports fail.
 * A node's failure node is computed from its parent's failure node, so nodes
 * are processed breadth-first (in order of increasing depth).
******************************************************************************/
int ac_automata_setfailure(AC_NODE_t * root)
{
	int i =0;
	queue<AC_NODE_t*> myqueue;

	char edge_ch = '\0';
	AC_NODE_t *pCur_node = NULL, *parent = NULL, *pNext_Node = NULL, *pState = NULL;

	for(i= 0; i< root->outgoing_num; i++)	//f(s) = 0 for all states s of depth 1
	{
		root->outgoing_edge[i].next->failure_node = root;
	}

	queue_add_leaf_node(root, myqueue);

	while(!myqueue.empty())
	{
		parent = myqueue.front();
		myqueue.pop();
		queue_add_leaf_node(parent, myqueue);

		for(i = 0; i < parent->outgoing_num; i++)
		{
			edge_ch = parent->outgoing_edge[i].alpha;
			pCur_node = parent->outgoing_edge[i].next;

			/* a) start from f(parent); b) follow failure links until a node with
			 * an outgoing edge labeled edge_ch is found or the root is reached */
			pState = parent->failure_node;
			while(pState != root && NULL == node_find_next(pState, edge_ch))
			{
				pState = pState->failure_node;
			}

			/* c) f(s) = g(state, a); g(root, a) = root when the root has no such edge */
			pNext_Node = node_find_next(pState, edge_ch);
			pCur_node->failure_node = (NULL != pNext_Node) ? pNext_Node : root;
		}
	}

	return 0;
}

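/******************************************************************************
 * Search in the input text using the given automata.
 * Prints the start position, pattern index and pattern of every match.
******************************************************************************/
int ac_automata_search(AC_NODE_t * root, char* text, int txt_len, char pattern[][255])
{
	AC_NODE_t *pCur_node = root;
	AC_NODE_t *pNext_node = NULL;
	AC_NODE_t *pOut_node = NULL;
	int position = 0;

	while(position < txt_len)
	{
		/* failure transitions: back off until goto is defined or the root is reached */
		pNext_node = node_find_next(pCur_node, text[position]);
		while(NULL == pNext_node && pCur_node != root)
		{
			pCur_node = pCur_node->failure_node;
			pNext_node = node_find_next(pCur_node, text[position]);
		}

		/* goto transition; g(root, a) = root when the root has no edge labeled a */
		if(NULL != pNext_node)
		{
			pCur_node = pNext_node;
		}
		position++;

		/* report every pattern ending here: the current node plus all final
		 * nodes on its failure chain (patterns that are suffixes of the match) */
		for(pOut_node = pCur_node; pOut_node != NULL && pOut_node != root; pOut_node = pOut_node->failure_node)
		{
			if(pOut_node->final == 1)    ///some pattern matched
			{
				cout<<position-(int)strlen(pattern[pOut_node->patternNo])<< '\t' << '\t' <<pOut_node->patternNo<< '\t' << '\t' <<pattern[pOut_node->patternNo]<<endl;
			}
		}
	}

	return 0;
}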

/******************************************************************************
 * Prints the automata to output in human readable form.
******************************************************************************/
void ac_automata_display (AC_NODE_t * root)
{
	unsigned int i;
	AC_NODE_t * pCur_node = root;
	struct ac_edge * pEdge = NULL;

	if(root == NULL)
	{
		return;
	}

	printf("---------------------------------\n");

	queue<AC_NODE_t*> myqueue;
	myqueue.push( pCur_node );

	while(!myqueue.empty())
	{
		pCur_node = myqueue.front();
		myqueue.pop();

		printf("NODE(%3u)/----fail----> NODE(%3u)\n", pCur_node->id, (pCur_node->failure_node)?pCur_node->failure_node->id:0);

		for (i = 0; i < pCur_node->outgoing_num; i++)
		{
			myqueue.push (pCur_node->outgoing_edge[i].next);

			pEdge = &pCur_node->outgoing_edge[i];
			printf("         |----(");
			if(isgraph(pEdge->alpha))
				printf("%c)---", pEdge->alpha);
			else
				printf("0x%x)", pEdge->alpha);
			printf("--> NODE(%3u)\n", pEdge->next->id);
		}
		printf("---------------------------------\n");
	}
}

/******************************************************************************
 * Release all allocated memories to the automata
******************************************************************************/
int ac_automata_release(AC_NODE_t * root)
{
	if(root == NULL)
	{
		return 0;
	}

	queue<AC_NODE_t*> myqueue;
	AC_NODE_t *pCur_node = NULL;

	myqueue.push( root );
	root = NULL;

	while(!myqueue.empty())
	{
		pCur_node = myqueue.front();
		myqueue.pop();

		for (int i = 0; i < pCur_node->outgoing_num; i++)
		{
			myqueue.push (pCur_node->outgoing_edge[i].next);
		}
		free(pCur_node->outgoing_edge);	/* the edge array is allocated separately */
		free(pCur_node);
	}

	return 0;
}

int main()
{
	unsigned int i = 0;
	char haystack[] = "ushers";
	char needle[4][255]={"he","she", "his","hers"};

	/* 1. create ac finite state automata match machine, compute goto and output func*/

	AC_NODE_t *root = ac_automata_create(needle, sizeof(needle)/sizeof(needle[0]));

	/* 2. compute failure function*/

	ac_automata_setfailure( root );

	/* 3. Display automata (if you are interested)*/

	ac_automata_display( root );

	cout << endl << "haystack : " << haystack << endl;
	cout << "needles : ";
	for(i = 0; i<sizeof(needle)/sizeof(needle[0]); i++)
	{
		cout << needle[i] << " ";
	}
	cout << endl << endl;
	cout << "match result : " << endl << "position\t" << "pattern_no\t" << "pattern" << endl;

	/* 4. search for all patterns using the automata */

	ac_automata_search(root, haystack, strlen(haystack), needle);

	/* 5. Release the automata */

	ac_automata_release ( root );

	return 0;
}

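Afterword:

The implementation varies with the design of the AC_NODE structure, but the principle is the same.

Possible improvements:

1. Sort each node's outgoing edges by character, so that looking up the edge for a given character from a state can use binary search instead of a linear scan.

2. Each AC_NODE here stores only one matched pattern (patternNo), whereas a state can in general match several patterns. Storing the full output set in each node remains to be done.

3. We know g(4, e) = 5. Suppose M is in state 4 and the next character is not 'e'. M then makes the failure transition f(4) = 1. At that point there is no need to look for an edge labeled 'e' out of state 1, because the next character is known not to be 'e'. In fact, if the pattern "his" were absent, M could go directly from state 4 to state 0. The current code still performs this redundant lookup. It can be avoided with a deterministic finite automaton, which I will discuss in detail in the next article.

Questions and corrections are welcome.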
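As a preview of the idea in item 3, the goto and failure functions can be folded into a single next-move table, so that the search performs one lookup per input character and never follows failure links. The sketch below is an assumption-laden illustration, not the author's follow-up: `Table` and `build_delta` are invented names, states are integer indices, and the `alphabet` argument is assumed to list every character that labels an edge, each exactly once.

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

using Table = std::vector<std::map<char, int>>;

// delta[s][a] = g(s, a) when the goto edge exists, otherwise delta[f(s)][a].
// States are visited breadth-first, so delta[f(s)] is complete before s is
// processed (f(s) always has a smaller depth than s).
Table build_delta(const Table& go, const std::vector<int>& f, const std::string& alphabet) {
    Table delta(go.size());
    std::queue<int> pending;
    for (char a : alphabet) {
        auto it = go[0].find(a);
        if (it != go[0].end()) {
            delta[0][a] = it->second;
            pending.push(it->second);
        } else {
            delta[0][a] = 0;  // the root stays in place on a missing edge
        }
    }
    while (!pending.empty()) {
        int r = pending.front();
        pending.pop();
        for (char a : alphabet) {
            auto it = go[r].find(a);
            if (it != go[r].end()) {
                delta[r][a] = it->second;
                pending.push(it->second);
            } else {
                delta[r][a] = delta[f[r]].at(a);  // precomputed failure transition
            }
        }
    }
    return delta;
}
```

For the running example with alphabet "ehirs", state 4 gets delta(4, e) = 5, delta(4, i) = 6, delta(4, h) = 1, delta(4, s) = 3, and delta(4, r) = 0, so the detour through state 1 disappears. Characters outside the alphabet label no edge and can be treated by the caller as a move to state 0.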
