Extracting Keywords with TF-IDF

Date: 2019-11-06


<?php

class Document
{
    protected $words;
    protected $tf_matrix;
    protected $tfidf_matrix;
    public function __construct($string)
    {
        $this->tfidf_matrix = null;
        if (isset($string))
        {
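            // Lowercase, then split on whitespace and on leading/trailing punctuation.
            $string = strtolower($string);
            // The /u modifier makes \p{P} match Unicode punctuation in UTF-8 input.
            $words = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $string, -1, PREG_SPLIT_NO_EMPTY);
            // preg_split() returns false on failure (for example, invalid UTF-8).
            $this->words = ($words === false) ? array() : $words;
            $this->build_tf();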
        }
        else
        {
            $this->words = null;
            $this->tf_matrix = null;
        }
    }
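    // Build the term-frequency table: occurrences of a word / total number of words.
    public function build_tf()
    {
        if (isset($this->tf_matrix) && $this->tf_matrix)
            return ;
        $this->tfidf_matrix = null;
        $this->tf_matrix = array();
        if (empty($this->words))
            return ;   // nothing to count; avoids count()/arsort() on null
        $words_count = count($this->words);
        $words_occ = array_count_values($this->words);
        foreach ($words_occ as $word => $amount)
            $this->tf_matrix[$word] = $amount / $words_count;
        arsort($this->tf_matrix);   // highest frequency first
    }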
    public function build_tfidf($idf)
    {
        if (isset($this->tfidf_matrix) && $this->tfidf_matrix)
            return true;
        if (!isset($this->tf_matrix) || !$this->tf_matrix)
            return false;
        if (!isset($idf) || !$idf)
            return false;
    
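        $this->tfidf_matrix = array();
        if (is_array($idf)) {
            // $idf maps word => IDF value.
            foreach ($this->tf_matrix as $word => $word_tf) {
                // A word missing from the IDF table scores 0 (the same result
                // as before, but without an undefined-index notice).
                $word_idf = isset($idf[$word]) ? $idf[$word] : 0;
                $this->tfidf_matrix[$word] = $word_tf * $word_idf;
            }
        } else {
            // $idf is a single number applied to every word.
            foreach ($this->tf_matrix as $word => $word_tf) {
                $this->tfidf_matrix[$word] = $word_tf * $idf;
            }
        }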
        arsort($this->tfidf_matrix);
        return true;
    }
    public function getWords()
    {
        return ($this->words);
    }
    public function getTf()
    {
        return ($this->tf_matrix);
    }
    public function getTfidf()
    {
        return ($this->tfidf_matrix);
    }
}

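/*
Step 1: compute the term frequency (TF).
Articles differ in length, so the raw count is normalized to make different articles comparable:
    TF = (occurrences of the word in the article) / (total number of words in the article)

Step 2: compute the inverse document frequency (IDF).
This requires a corpus, which models how the language is actually used:
    IDF = log(total number of documents in the corpus / (number of documents containing the word + 1))
The more common a word is, the larger the denominator and the smaller the IDF, approaching 0. The 1 is added to the denominator to avoid division by zero (the case where no document contains the word). log means taking the logarithm of the resulting value.

Step 3: compute TF-IDF.
    TF-IDF = TF * IDF
TF-IDF is proportional to the number of times a word occurs in the document and inversely proportional to how often the word occurs across the whole language. Automatic keyword extraction is therefore straightforward: compute the TF-IDF value of every word in the document, sort in descending order, and take the first few words.
*/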
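
// A minimal sketch of step 2, assuming the corpus is given as an array of
// token lists (one list per document). The function name smoothed_idf() is
// hypothetical and not part of the original script. It uses
// log(N / (df + 1)), i.e. the "+1" denominator described above.
function smoothed_idf(array $corpus)
{
    $total_docs = count($corpus);
    $doc_freq = array();
    foreach ($corpus as $tokens) {
        // array_unique() ensures a word is counted at most once per document.
        foreach (array_unique($tokens) as $word) {
            $doc_freq[$word] = isset($doc_freq[$word]) ? $doc_freq[$word] + 1 : 1;
        }
    }
    $idf = array();
    foreach ($doc_freq as $word => $df) {
        // Becomes slightly negative for a word that occurs in every document.
        $idf[$word] = log($total_docs / ($df + 1));
    }
    return $idf;
}
// Possible usage (assumption): pass the returned word => IDF array to
// Document::build_tfidf(), which accepts an array as well as a single number.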
$text = 'i very good, ha , i very nice, i is good';


$obj = new Document($text);
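$obj->build_tf();   // term frequency (TF): occurrences of the word / total number of words (already built by the constructor, so this call returns immediately)

$idf = log(3 / 2);   // inverse document frequency: log(total documents / documents containing the word); this single value is applied to every word
$obj->build_tfidf($idf);

// Higher values mean a higher frequency (TF) or weight (TF-IDF); 88 and 99 only separate the three dumps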
var_dump($obj->getWords(), 88, $obj->getTf(), 99, $obj->getTfidf());
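
The script above dumps every word with its score. To get an actual keyword list, one option is to keep only the first few entries of the sorted table. The following is a minimal sketch rather than part of the original script: the helper name `top_keywords()` and the hard-coded scores are assumptions made for illustration. In the script above, the input would be the array returned by `$obj->getTfidf()`.

```php
<?php
// Hypothetical helper: return the $n words with the highest scores.
function top_keywords(array $scores, $n)
{
    // Sort by score, descending, while keeping the word => score association.
    arsort($scores);
    // preserve_keys = true so that numeric-looking words (e.g. "2019") are not reindexed.
    return array_keys(array_slice($scores, 0, $n, true));
}

// Made-up scores for illustration only; they are not computed from a real corpus.
$scores = array('i' => 0.0, 'very' => 0.0, 'good' => 0.08, 'ha' => 0.04, 'is' => 0.11);
print_r(top_keywords($scores, 2));   // prints the words "is" and "good"
```

Sorting inside the helper means it does not rely on the caller having sorted the table first, at the cost of one extra sort.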

http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
