大部分的代碼、思路參考了《Ansi Common Lisp》P138~P141。算法
問題:給一篇英文文本,如何讓計算機依據此文本而生成隨機但可讀的文本。如:api
|Venture|less
The National Venture Capital Association estimates that wealth associated with a deal a big spending by regulations that will spend one another's main reason these projects .
dom
這是計算機學習了Paul Graham的一些文章後生成的隨機文本。它根據Venture這個單詞向兩邊延伸成一個句子。使人驚喜的是,文本經常是可讀的。函數
算法:記錄每一個單詞後面出現的單詞以及出現的次數,如I leave在原文中出現了5次,I want出現了3次,除此以外,其它地方沒有出現過I,因此在生成隨機文章的時候,當遇到I,有5/8的機率選擇leave爲下一個單詞。假如選擇了leave的話,則看看leave後面出現過哪些單詞,重複以上過程。學習
現用lisp來解決問題。spa
lisp裏的符號類型,即symbol,能夠很好記錄各類字符串還有標點符號,因此採用它來記錄。採用內附的hashtable來創建列表:debug
(defparameter *words* (make-hash-table :size 10000))code
那如何創建列表呢?orm
(let ((prev '|.|)) (defun see (sym) (let ((pair (assoc sym (gethash prev *words*)))) (if pair (incf (cdr pair)) (push (cons sym 1) (gethash prev *words*)))) (setf prev sym)))
以當前單詞爲keyword,以assoc-list關係列表爲該keyword下的值。
如I下有( (|leave| . 5) (|want| . 3) )。沒有單詞word的話,則push入(word . 1)。
如何隨機選一個詞呢?
(defun random-word (word ht) (let* ((choices (gethash word ht)) (x (random (reduce #'+ choices :key #'cdr)))) (dolist (pair choices) (decf x (cdr pair)) (if (minusp x) (return (car pair))))))
這裏巧妙用了reduce函數。
如今再來思考,如何將給定一個詞向兩側延伸成一句話呢?
1)先將文本反向,獲得一個反向的列表,也即I leave,I want變成leave I,want I。
2)將hashtable反向,獲得另一個hashtable,之後一個單詞爲關鍵字,前面可能出現的單詞及次數構成assoc-list。
3)碰運氣,從一個標點開始延續文章,直到出現給定單詞爲止。
我用了第二個方法:
(defparameter *r-words* (make-hash-table :size 10000)) (defun push-words (w1 w2 n) (push (cons w2 n) (gethash w1 *r-words*))) (defun get-reversed-words ();a cat -> cat a (maphash #'(lambda (k lst) (dolist (pair lst) (push-words (car pair) k (cdr pair)))) *words*))
遍歷原來的hashtable,再把每一對單詞前後換個位置插入另一個hashtable。
給出雙向延伸句子的自動生成文本代碼:
(defparameter *words* (make-hash-table :size 10000)) (defconstant maxword 100) (defparameter nwords 0) (defconstant debug nil) (let ((prev '|.|)) (defun see (sym) (incf nwords) (let ((pair (assoc sym (gethash prev *words*)))) (if pair (incf (cdr pair)) (push (cons sym 1) (gethash prev *words*)))) (setf prev sym))) (defun check-punc (c);char to symbol (case c (#\. '|.|) (#\, '|,|) (#\; '|;|) (#\? '|?|) (#\: '|:|) (#\! '|!|))) (defun read-text (pathname) (with-open-file (str pathname :direction :input) (let ((buf (make-string maxword)) (pos 0)) (do ((c (read-char str nil 'eof) (read-char str nil 'eof))) ((eql c 'eof)) (if (or (alpha-char-p c) (eql c #\')) (progn (setf (char buf pos) c) (incf pos)) (progn (unless (zerop pos) (see (intern (subseq buf 0 pos))) (setf pos 0)) (let ((punc (check-punc c))) (if punc (see punc))))))))) (defun print-ht (ht) (maphash #'(lambda (k v) (format t "~A ~A~%" k v)) ht)) (defparameter *r-words* (make-hash-table :size 10000)) (defun push-words (w1 w2 n) (push (cons w2 n) (gethash w1 *r-words*))) (defun get-reversed-words ();a cat -> cat a (maphash #'(lambda (k lst) (dolist (pair lst) (push-words (car pair) k (cdr pair)))) *words*)) (defun print-a-word (word ht) (maphash #'(lambda (k lst) (if (eql k word) (format t "~A ~A~%" k lst))) ht)) (if debug (print-a-word '|leave| *r-words*)) (defun punc-p (sym);symbol to char,nil when fails. (check-punc (char (symbol-name sym) 0))) (defun random-word (word ht) (let* ((choices (gethash word ht)) (x (random (reduce #'+ choices :key #'cdr)))) (dolist (pair choices) (decf x (cdr pair)) (if (minusp x) (return (car pair)))))) (defun gen-former (word str) (let ((last (random-word word *r-words*))) (if (not (punc-p last)) (progn (gen-former last str) (format str "~A " last))))) (defun gen-latter (word str) (let ((next (random-word word *words*))) (format str "~A " next) (if (not (punc-p next)) (gen-latter next str)))) ;(gen-latter '|leave| t) (defun get-a-word (ht);get a random word (let ((x (random nwords))) (maphash #'(lambda (k v) (dolist (pair v) (decf x (cdr pair)) (if (minusp x) (return-from get-a-word (car pair))))) ht))) ;(get-a-word *words*) (defun gen-sentence (word str) (gen-former word str) (format str "~A " word) (gen-latter word str)) (defun test () (setf nwords 0) (read-text "essay.txt") (get-reversed-words) (let ((word (get-a-word *words*))) (print word) (gen-sentence word t))) (test)
文本語料庫、lisp源代碼見: Here