Article Similarity Calculation

Date: 2019-12-06

Tags: article, similarity, calculation

Several ways to compute article content similarity, with their pros and cons

PHP built-in function similar_text

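similar_text is PHP's built-in string similarity function and the most convenient option to use. However, its time complexity is O(N**3), so processing time grows quickly with content length. It is not recommended for articles longer than about 5,000 characters or for comparing large numbers of articles; it is suitable only for comparing one article against another.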
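
As a minimal sketch of how the function is typically called (the sample strings and variable names are illustrative, not from the original post): similar_text returns the number of matching bytes and, through its optional third argument, a similarity percentage. Note that it compares bytes rather than characters, so multibyte text such as UTF-8 Chinese is matched byte by byte.

```php
<?php
// Illustrative strings; any two article bodies could be used instead.
$a = 'World';
$b = 'Word';

// The third argument is passed by reference and receives the similarity in percent.
$matched = similar_text($a, $b, $percent);

printf("matched bytes: %d, similarity: %.2f%%\n", $matched, $percent);
```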

Cosine similarity on segmented words

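The approach is to segment each article into words first, using a segmenter such as Jieba (結巴) or the Xunsearch (迅搜) segmentation service, and to store the segmentation results of the articles to be compared in Redis. When a new article needs to be compared, the segmentation results of all articles are read from Redis into memory and similarity is computed word by word. The results are highly accurate, but processing time is still long when the number of articles is very large: computing similarity against 5,000 articles takes nearly 30 s.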
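
For reference, the word-by-word calculation is the standard cosine similarity between the two word-count vectors. The symbols a_i and b_i (the number of times word i occurs in the first and second article) and the counts in the worked example are introduced here for illustration only.

```latex
\[
\mathrm{similarity}(A, B) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
\]
% Worked example with hypothetical counts a = (2, 1, 0) and b = (1, 1, 1):
\[
\frac{2\cdot 1 + 1\cdot 1 + 0\cdot 1}{\sqrt{2^2 + 1^2 + 0^2}\,\sqrt{1^2 + 1^2 + 1^2}}
= \frac{3}{\sqrt{5}\,\sqrt{3}} \approx 0.775
\]
```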

Main calculation code:

class TextSimilarity
{
    /**
     * Stop words excluded from the comparison.
     *
     * @var array
     */
    private $_excludeArr = array('的', '了', '和', '呢', '啊', '哦', '恩', '嗯', '吧');

    /**
     * Word distribution: word => array(count in text 1, count in text 2).
     *
     * @var array
     */
    private $_words = array();

    /**
     * Segmented words of the first text.
     *
     * @var array
     */
    private $_segList1 = array();

    /**
     * Segmented words of the second text.
     *
     * @var array
     */
    private $_segList2 = array();

    /**
     * Segment both texts; arrays are treated as already segmented.
     *
     * @param string|array $text1 raw text or pre-segmented word list
     * @param string|array $text2 raw text or pre-segmented word list
     */
    public function __construct($text1, $text2)
    {
        $this->_segList1 = is_array( $text1 ) ? $text1 : $this->segment( $text1 );
        $this->_segList2 = is_array( $text2 ) ? $text2 : $this->segment( $text2 );
    }

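    /**
     * Public entry point.
     *
     * @return float|string cosine similarity, or 'errors' if it cannot be computed
     */
    public function run()
    {
        $this->analyse();
        $rate = $this->handle();
        // A similarity of 0 is a valid result; only a missing or failed result is an error.
        return ($rate === null || $rate === false) ? 'errors' : $rate;
    }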

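    /**
     * Count word occurrences in both texts.
     */
    private function analyse()
    {
        // Start from a clean state so repeated run() calls do not double-count.
        $this->_words = array();

        // Count words from text 1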
        foreach ($this->_segList1 as $v) {
            if (!in_array($v, $this->_excludeArr)) {
                if (!array_key_exists($v, $this->_words)) {
                    $this->_words[$v] = array(1, 0);
                } else {
                    $this->_words[$v][0] += 1;
                }
            }
        }

        // Count words from text 2
        foreach ($this->_segList2 as $v) {
            if (!in_array($v, $this->_excludeArr)) {
                if (!array_key_exists($v, $this->_words)) {
                    $this->_words[$v] = array(0, 1);
                } else {
                    $this->_words[$v][1] += 1;
                }
            }
        }
    }

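    /**
     * Compute the cosine similarity of the two word-count vectors.
     *
     * @return float|null similarity in [0, 1], or null if either text has no countable words
     */
    private function handle()
    {
        $sum = $sumT1 = $sumT2 = 0;
        foreach ($this->_words as $word) {
            $sum += $word[0] * $word[1];
            $sumT1 += pow($word[0], 2);
            $sumT2 += pow($word[1], 2);
        }

        // Guard against division by zero when a text is empty or contains only stop words.
        if ($sumT1 == 0 || $sumT2 == 0) {
            return null;
        }

        $rate = $sum / (sqrt($sumT1 * $sumT2));
        return $rate;
    }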

    /**
     * Segment text with SCWS  [http://www.xunsearch.com/scws/docs.php#pscws23]
     *
     * @param string $text text to segment
     *
     * @return array list of words
     *
     * @description This segmenter is only a simple example; any word segmentation service can be used.
     */
    private function segment( $text )
    {
        $outText = array();
        $xs = new XS('demo');  // An XS instance must be created first, otherwise an exception is thrown
        $tokenizer = new XSTokenizerScws;   // Create the tokenizer instance directly
        $tokenizer->setIgnore();
        // Segment the text and keep only the words
        $outText = $tokenizer->setMulti(1)->getResult($text);
        $outText = array_column( $outText, 'word');
        $res = $xs->getScwsServer();
        $res->close();
        return $outText;
    }
}
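
The batch step described above (compare one new article against every cached segmentation result) can be sketched as follows. This is a sketch under stated assumptions, not the author's code: the plain array $cache stands in for segmentation results loaded from Redis, and the function names termFrequencies and cosine, the article ids, and the sample words are all hypothetical. Building each word-count map once, instead of once per pair, is one way to reduce repeated work when many articles are compared.

```php
<?php
// Assumption: $cache stands in for segmentation results loaded from Redis
// (article id => list of words). All names and data here are hypothetical.

/** Turn a word list into a word => count map, skipping stop words. */
function termFrequencies(array $words, array $stopWords): array
{
    $stop = array_flip($stopWords);   // constant-time stop-word lookups
    $tf = array();
    foreach ($words as $w) {
        if (!isset($stop[$w])) {
            $tf[$w] = ($tf[$w] ?? 0) + 1;
        }
    }
    return $tf;
}

/** Cosine similarity of two word => count maps; 0.0 if either map is empty. */
function cosine(array $tf1, array $tf2): float
{
    $dot = $n1 = $n2 = 0;
    foreach ($tf1 as $w => $c) {
        $dot += $c * ($tf2[$w] ?? 0);
        $n1 += $c * $c;
    }
    foreach ($tf2 as $c) {
        $n2 += $c * $c;
    }
    if ($n1 == 0 || $n2 == 0) {
        return 0.0;
    }
    return $dot / sqrt($n1 * $n2);
}

$stopWords = array('的', '了');
$cache = array(
    101 => array('similarity', 'of', 'two', 'articles'),
    102 => array('cooking', 'with', 'garlic'),
);
$newArticle = array('similarity', 'of', 'articles');

// Build the new article's word counts once, then score it against every cached article.
$newTf = termFrequencies($newArticle, $stopWords);
$scores = array();
foreach ($cache as $id => $words) {
    $scores[$id] = cosine($newTf, termFrequencies($words, $stopWords));
}
arsort($scores);   // highest similarity first
print_r($scores);
```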

SimHash

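SimHash works by reducing a long piece of text to a string of 0s and 1s, then computing the similarity of two such bit strings to estimate how similar the two articles are. Articles are again segmented first; the SimHash values of existing articles are computed and stored in Redis or MySQL, then fetched for comparison when needed. Comparison is fast: computing against 20,000 articles takes roughly 2 s or less. However, when an article is very short and contains many repeated words, articles that are different can still receive a very high similarity score.

Main calculation code: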

class SimHash
{
    protected static $length = 256;
    protected static $search = array('0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f');
    protected static $replace = array('0000','0001','0010','0011','0100','0101','0110','0111','1000','1001','1010','1011','1100','1101','1110','1111');
    /**
     * Stop words excluded from the fingerprint.
     *
     * @var array
     */
    private static $_excludeArr = array('的', '了', '和', '呢', '啊', '哦', '恩', '嗯', '吧','你','我','&nbsp;');

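    /**
     * Compute the SimHash fingerprint of a word set.
     *
     * @param array $set list of words, or a word => weight map
     *
     * @return string fingerprint as a string of '0' and '1' characters
     */
    public static function get(array &$set)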
    {
        $boxes = array_fill(0, self::$length, 0);
        if (is_int(key($set)))
            $dict = array_count_values($set);
        else
            $dict = &$set;

        foreach ($dict as $element => $weight) {
            if ( in_array($element, self::$_excludeArr )){
                continue;
            }

            $hash = hash('sha256', $element);
            $hash = str_replace(self::$search, self::$replace, $hash);
            $hash = substr($hash, 0, self::$length);
            $hash = str_pad($hash, self::$length, '0', STR_PAD_LEFT);

            for ( $i=0; $i < self::$length; $i++ ) {
                $boxes[$i] += ($hash[$i] == '1') ? $weight : -$weight;
            }
        }
        $s = '';
        foreach ($boxes as $box) {
            if ($box > 0)
                $s .= '1';
            else
                $s .= '0';
        }
        return $s;
    }

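    /**
     * Compare two fingerprints by Hamming distance.
     *
     * @return float share of matching bits, from 0 (all differ) to 1 (identical)
     */
    public static function hd($h1, $h2)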
    {
        $dist = 0;
        for ($i=0;$i<self::$length;$i++) {
            if ( $h1[$i] != $h2[$i] )
                $dist++;
        }
        return (self::$length - $dist) / self::$length;
    }
}
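
The lookup step (fetch stored fingerprints, then compare a new one against each) might look like the sketch below. It is an illustration under stated assumptions: the array $stored stands in for fingerprints previously saved to Redis or MySQL, the function names bitSimilarity and findSimilar and the 0.8 threshold are hypothetical, and 8-bit strings are used only to keep the example readable.

```php
<?php
// Assumption: $stored stands in for fingerprints read from Redis or MySQL
// (article id => bit string). Names, data, and threshold are hypothetical.

/** Fraction of positions at which two equal-length bit strings agree. */
function bitSimilarity(string $h1, string $h2): float
{
    $len = strlen($h1);
    if ($len === 0 || $len !== strlen($h2)) {
        throw new InvalidArgumentException('fingerprints must be non-empty and of equal length');
    }
    $dist = 0;   // Hamming distance: number of differing positions
    for ($i = 0; $i < $len; $i++) {
        if ($h1[$i] !== $h2[$i]) {
            $dist++;
        }
    }
    return ($len - $dist) / $len;
}

/** Return id => similarity for stored fingerprints at or above $threshold, best first. */
function findSimilar(string $fingerprint, array $stored, float $threshold): array
{
    $hits = array();
    foreach ($stored as $id => $other) {
        $score = bitSimilarity($fingerprint, $other);
        if ($score >= $threshold) {
            $hits[$id] = $score;
        }
    }
    arsort($hits);
    return $hits;
}

$stored = array(
    1 => '10110010',
    2 => '10110011',
    3 => '01001101',
);
$hits = findSimilar('10110010', $stored, 0.8);
print_r($hits);
```

Because short, repetitive texts can score high even when they differ, a threshold like this would likely need tuning against real articles.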
