Markov Chain Random Text Generator

Date: 2019-11-19

Description:

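There is a random text generation method based on the Markov chain algorithm. Given any existing text in some language (an English novel, say), it builds a statistical model of how the language is used in that text. Random text generated from the model has statistical properties similar to the original text, and therefore a similar writing style.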

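The basic idea is to treat the input as a sequence of overlapping phrases and to split each phrase into two parts: a prefix made of several words and a suffix made of exactly one word. When generating text, the algorithm uses the statistics of the original text (for a given prefix, the collection of all suffixes that follow it) to randomly choose one of the suffixes that follow the current prefix. Assuming a prefix length of two words, the Markov chain random text generation algorithm is as follows: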

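Set w1 and w2 to the first two words of the text

Output w1 and w2

Loop:

Randomly choose w3, one of the suffixes that follow the prefix w1 w2 in the original text

Output w3

w1 = w2

w2 = w3

Repeat the loop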

The following example illustrates how the algorithm works. Suppose the original text is:

Show your flowcharts and conceal your tables and I will be mystified. Show your tables and your flowcharts will be obvious.

Below are some of the prefixes of this text together with their suffixes (only part of the full table):

Prefix	Suffix
Show your	flowcharts tables
your flowcharts	and will
flowcharts and	conceal
flowcharts will	be
your tables	and and
will be	mystified. obvious.
be mystified.	Show
be obvious.	(end)

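Based on the text above, the Markov chain algorithm first outputs Show your and then randomly picks flowcharts or tables. If it picks flowcharts, the next prefix becomes your flowcharts and the next suffix is either and or will. If it picks tables, the next prefix becomes your tables and the next word must be and. This continues until enough output has been produced or the end marker is found as a suffix.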

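Write a program that reads an English text from a file and uses the Markov chain algorithm, based on how often fixed-length phrases occur in the text, to write a new text of at most N words to a given file. The prefix must consist of 2 words, and the maximum word count N is read from standard input.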

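Notes:

To obtain better statistical properties, punctuation and other non-letter characters (such as ' " . , ? - ( )) are treated as part of a word, so "words" and "words." are different words. A "word" is therefore defined as a string delimited by whitespace;

The suffixes of a given prefix are stored in order of appearance (whether or not the same suffix has already been stored);

When processing the text, the end of file is also recorded as a suffix of some prefix, as in the example above. (When the suffix of the last two prefix words "be obvious." is read, end of file is reached, so there is no real suffix. A special marker can stand in for it: for example, store a self-defined string such as "(end)" as the suffix to represent the end of the file);

For a given prefix, the suffix is chosen at random as follows (if the prefix has only one suffix, that suffix is chosen directly):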

n = (int)(rrand() * N);

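Here N is the total number of suffixes of the prefix (not the maximum word count N above), and n is the index of the chosen suffix in that prefix's suffix sequence (counting from 0: n = 0 selects the first suffix, n = 1 the second, and so on). The random number function rrand() is defined as follows: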

double seed = 997;

double rrand()
{
    double lambda = 3125.0;
    double m = 34359738337.0;
    double r;
    seed = fmod(lambda*seed, m); //requires #include <math.h>
    r = seed/ m;
    return r;
}

Note: to keep the results deterministic, be sure to use the random number function given here.

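Text generation ends when either of the following holds: 1) the chosen suffix is the end-of-file marker; or 2) the number of generated words reaches the specified maximum. In the implementation, when end of file is reached while reading, a special marker can be assigned to the suffix string variable suffix.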

【Input format】

The program analyzes the English text file "article.txt" and reads a positive integer from standard input as the maximum number of words to generate.

【Output format】

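Write the generated text to the file "markov.txt" in the current directory. Words are separated by a single space; a space after the last word is optional.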

【Sample input】

Suppose the file article.txt in the current directory contains:

I will give you some advice about life.

Eat more roughage;

Do more than others expect you to do and do it pains;

Remember what life tells you;

do not take to heart every thing you hear.

do not spend all that you have.

do not sleep as long as you want;

Whenever you say "I love you", please say it honestly;

Whevever you say "I am sorry", please look into the other person's eyes;

Whenever you find your wrongdoing, be quick with reparation!

Whenever you make a phone call smil when you pick up the phone, because someone feel it!

Understand rules completely and change them reasonably;

Remember, the best love is to love others unconditionally rather than make demands on them;

Comment on the success you have attained by looking in the past at the target you wanted to achieve most;

In love and cooking, you must give 100% effort - but expect little appreciation.

and the maximum number of words entered on standard input is:

1000

【Sample output】

Then the generated file markov.txt in the current directory contains:

I will give you some advice about life. Eat more roughage; Do more than others expect you to do and do it pains; Remember what life tells you; do not take to heart every thing you hear. do not take to heart every thing you hear. do not spend all that you have. do not sleep as long as you want; Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you say "I am sorry", please look into the other person's eyes; Whenever you say "I am sorry", please look into the other person's eyes; Whenever you make a phone call smil when you pick up the phone, because someone feel it! Understand rules completely and change them reasonably; Remember, the best love is to love others unconditionally rather than make demands on them; Comment on the success you have attained by looking in the past at the target you wanted to achieve most; In love and cooking, you must give 100% effort - but expect little appreciation.

【Sample explanation】

The output file is generated with the Markov chain algorithm described in this article.

【Which data structures to use?】

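As shown in the figure above, the program uses a State structure to store each state and a Suffix linked list to store its suffixes. The reason for this layout is that once the State for a prefix pref has been found through the hash table, the randomly chosen suffix can be found by walking that State's suf list. (For more speed, the suf list could also be replaced by a binary search tree.) One trick is used to speed things up: State carries a counter sufNum, so a new suffix does not have to be appended at the tail as in a conventional linked list but can be inserted directly at the head. When reading, the suffix with 0-based index n in order of appearance is reached after sufNum - 1 - n next operations.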

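It sounds simple, but the implementation is quite tricky. The author's code relies on two commonly used C library functions, memcpy and strdup.

The declaration of memcpy() is as follows.

void *memcpy(void *str1, const void *str2, size_t n)
Parameters
str1 -- pointer to the destination array where the content is to be copied, type-cast to a pointer of type void*.

str2 -- pointer to the source of the data to be copied, type-cast to a pointer of type const void*.

n -- the number of bytes to copy. The source and destination regions must not overlap; use memmove() when they may.

Return value
The function returns a pointer to the destination, str1.

See also http://www.cplusplus.com/reference/cstring/memcpy/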

Example Code:

#include <stdio.h>
#include <string.h>

int main ()
{
   const char src[50] = "test";
   char dest[50] = ""; /* initialized so that printing it before the copy is well defined */
   printf("Before:%s\n", dest);
   memcpy(dest, src, strlen(src)+1);
   printf("After: %s\n", dest);
   return(0);
}

Result:

Before:

After: test

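strdup copies a string. It allocates a new block of memory for the copy by itself, unlike strcpy, where the caller must supply the destination buffer. (strdup comes from POSIX and was only added to the C standard in C23.)

The returned pointer must be released with free() after use.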

#include <string.h>
#include <assert.h>
#include <stdlib.h>

int main(void)
{
    const char *s1 = "String";
    char *s2 = strdup(s1);
    assert(strcmp(s1, s2) == 0);
    free(s2);
}

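Some functions are also declared inline.

Advantages of inline functions:
Every ordinary function call has some overhead (passing arguments, setting up a stack frame, jumping and returning). For a small function that is called very often, this overhead can hurt performance. Marking the function inline asks the compiler to embed its body directly at the call site, although the compiler is free to ignore the request.
Disadvantages of inline functions:
If an inline function is called from too many places, the code may become bloated.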

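With these basics in place, we can build the hash table of State entries, write the functions that add suffixes and look up prefixes, and thereby generate random text with a Markov chain.

The hash used here is the BKDR algorithm with NHASH = 5000011. Many more efficient methods exist, and alternatives can be found online.

The author's code: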

#define _POSIX_C_SOURCE 200809L /* exposes strdup() under strict -std=c99/c11 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* enum constants are integer constant expressions in C, so they can size
   file-scope arrays (a `const int` cannot). */
enum { NHASH = 5000011, PREFIX_NUM = 2 };

typedef struct State State;
typedef struct Suffix Suffix;
struct State {
  char *pref[PREFIX_NUM]; /* prefix words */
  Suffix *suf;            /* suffix list, most recently added first */
  State *next;            /* next state in the same hash bucket */
  unsigned int sufNum;    /* number of suffixes in the list */
};
struct Suffix {
  char *word;
  Suffix *next;
};
State *statetab[NHASH];

/* Allocate or exit: keeps the callers free of NULL checks. */
static void *xmalloc(size_t size) {
  void *p = malloc(size);
  if (p == NULL) {
    perror("malloc");
    exit(EXIT_FAILURE);
  }
  return p;
}

/* BKDR hash over all prefix words; another hash could reduce collisions. */
unsigned int hash(char *s[PREFIX_NUM]) {
  unsigned int seed = 131;
  unsigned int h = 0;
  int i;
  for (i = 0; i < PREFIX_NUM; i++) {
    char *str = s[i];
    while (*str)
      h = h * seed + (*str++);
  }
  return h % NHASH;
}

/* Look up prefix[] in the hash table. If it is absent and isBuild is
   nonzero, create a new state for it; otherwise return NULL. */
State *lookup(char *prefix[PREFIX_NUM], int isBuild) {
  int i;
  unsigned int h = hash(prefix);
  State *sp;
  for (sp = statetab[h]; sp != NULL; sp = sp->next) {
    for (i = 0; i < PREFIX_NUM; ++i) {
      if (strcmp(prefix[i], sp->pref[i]) != 0)
        break;
    }
    if (i == PREFIX_NUM) /* every prefix word matched */
      return sp;
  }
  if (isBuild) {
    sp = xmalloc(sizeof(State));
    for (i = 0; i < PREFIX_NUM; ++i)
      sp->pref[i] = prefix[i];
    sp->suf = NULL;
    sp->sufNum = 0;
    sp->next = statetab[h]; /* insert at the head of the bucket */
    statetab[h] = sp;
  }
  return sp;
}

/* Insert the suffix at the head of the list, so insertion is O(1). */
static inline State *addsuffix(State *sp, char *suffix) {
  Suffix *suf = xmalloc(sizeof(Suffix));
  suf->word = suffix;
  suf->next = sp->suf;
  sp->suf = suf;
  sp->sufNum++;
  return sp;
}

/* Record suffix under the current prefix, then shift the prefix window. */
static inline void add(char *prefix[PREFIX_NUM], char *suffix) {
  State *sp = lookup(prefix, 1);
  addsuffix(sp, suffix);
  memmove(prefix, prefix + 1, (PREFIX_NUM - 1) * sizeof(prefix[0]));
  prefix[PREFIX_NUM - 1] = suffix;
}

double seed = 997;

/* The random number generator required by the assignment. */
double rrand(void) {
  double lambda = 3125.0;
  double m = 34359738337.0;
  double r;
  seed = fmod(lambda * seed, m); /* requires #include <math.h> */
  r = seed / m;
  return r;
}

/* Read whitespace-delimited words from f and add each one to the table. */
static void build(char *prefix[PREFIX_NUM], FILE *f) {
  char buf[40];
  while (fscanf(f, "%39s", buf) == 1) {
    char *word = strdup(buf);
    if (word == NULL) {
      perror("strdup");
      exit(EXIT_FAILURE);
    }
    add(prefix, word);
  }
}

/* Generate at most nwords words of Markov chain output. */
void generate(int nwords, FILE *out) {
  State *sp;
  Suffix *suf;
  char *prefix[PREFIX_NUM], *w;
  int i, n;
  for (i = 0; i < PREFIX_NUM; ++i)
    prefix[i] = "";

  for (i = 0; i < nwords; ++i) {
    sp = lookup(prefix, 0);
    if (sp == NULL) /* cannot happen for a table built by main() */
      break;
    suf = sp->suf;
    if (sp->sufNum > 1) {
      /* The list is in reverse order of appearance, so the suffix with
         0-based index k is sufNum - 1 - k links from the head. */
      n = (int)sp->sufNum - (int)(rrand() * sp->sufNum) - 1;
      while (n-- > 0)
        suf = suf->next;
    }
    w = suf->word;
    if (strcmp(w, "(end)") == 0)
      break;
    fprintf(out, "%s ", w);
    /* Shift by pointer size; the original used sizeof(Suffix) here, which
       copied past the end of the array. */
    memmove(prefix, prefix + 1, (PREFIX_NUM - 1) * sizeof(prefix[0]));
    prefix[PREFIX_NUM - 1] = w;
  }
}

int main(void) {
  int nwords, i;
  char *prefix[PREFIX_NUM];
  FILE *in, *out;
  if (scanf("%d", &nwords) != 1)
    return 1;
  for (i = 0; i < PREFIX_NUM; ++i)
    prefix[i] = "";
  in = fopen("article.txt", "r");
  if (in == NULL) {
    perror("article.txt");
    return 1;
  }
  build(prefix, in);
  fclose(in);
  /* Append (end) as the suffix of the final prefix. */
  add(prefix, "(end)");
  out = fopen("markov.txt", "w");
  if (out == NULL) {
    perror("markov.txt");
    return 1;
  }
  generate(nwords, out);
  fclose(out);
  return 0;
}
