Chinese Synonyms in Elasticsearch

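The synonym token filter itself is available in every Elasticsearch distribution. The hot-reload API used later in this post (_reload_search_analyzers), however, is an X-Pack feature, so it needs the default distribution (Basic license or above) rather than the OSS (pure open-source) build.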

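Environment

  • Elasticsearch 7.6.x
  • The IK analysis plugin version that matches your Elasticsearch version
  • The examples use both shell commands and Kibana. Lines starting with $ are shell commands; everything else is a Kibana console command.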

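Steps

Synonyms can be specified inline with the synonyms parameter, or they can be stored in a synonyms file that must exist on every node of the cluster. The file location is given by the synonyms_path parameter and must be either absolute or relative to the Elasticsearch config directory.

The two ways of configuring synonyms are described below.

Synonyms file approach

Create the synonyms file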

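# Run from the Elasticsearch home directory
# Create config/analysis first in case it does not exist yet
$ mkdir -p config/analysis
# Write the synonyms file
$ echo '"iPhone,蘋果手機 => iPhone,蘋果手機",
    "2233,22娘,33娘 => bilibili,B站"' > config/analysis/synonyms.txt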

Create the index

PUT /goods2
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "updateable": true,
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "analyzer": {
        "my_synonyms_analyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "my_synonyms_analyzer"
      }
    }
  }
}
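my_synonym_filter is a custom token filter and my_synonyms_analyzer is a custom analyzer; as the settings show, the analyzer references the filter.

Token filters and analyzers defined in an index's settings can only be used within that index.

updateable controls whether the filter can be reloaded at runtime. It must be true for synonyms to be updated dynamically, and a filter marked updateable can only be used in a search analyzer (which is why it is attached through search_analyzer here).

synonyms_path points to the synonyms file.

analysis.analyzer.my_synonyms_analyzer.tokenizer tells this analyzer to use the ik_smart tokenizer. The analysis chain for this analyzer is raw text => tokenizer => token filter: the raw text is tokenized first, and the resulting tokens are then passed to the token filter (which here applies the synonyms).

mappings.properties.title.search_analyzer makes the title field use my_synonyms_analyzer at search time; likewise, mappings.properties.title.analyzer sets the analyzer used at index time.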
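
The chain described above (raw text => tokenizer => token filter) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: toy_tokenize is a lowercase/whitespace stand-in and not the real ik_smart tokenizer, and SYNONYMS is a hand-written table that only handles single-token inputs, so it does not reproduce Elasticsearch's actual token positions or offsets.

```python
from typing import Dict, List

# Hypothetical single-token synonym table (not read from synonyms.txt).
SYNONYMS: Dict[str, List[str]] = {
    "2233": ["bilibili", "b站"],
    "22娘": ["bilibili", "b站"],
}


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: lowercase and split on whitespace (NOT ik_smart)."""
    return text.lower().split()


def toy_synonym_filter(tokens: List[str]) -> List[str]:
    """Replace each token that has a mapping; pass all other tokens through."""
    out: List[str] = []
    for token in tokens:
        out.extend(SYNONYMS.get(token, [token]))
    return out


def toy_analyze(text: str) -> List[str]:
    """Raw text => tokenizer => token filter, in that order."""
    return toy_synonym_filter(toy_tokenize(text))


if __name__ == "__main__":
    print(toy_analyze("2233 fans"))
```

The point of the sketch is the ordering: the filter only ever sees what the tokenizer produced, so a synonym rule can only fire on terms that survive tokenization.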

View the analysis results

Effect of the first rule
# Letter case makes no difference
GET goods2/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "iphone"
}

GET goods2/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "蘋果手機"
}

Both requests above return the same result:

{
  "tokens" : [
    {
      "token" : "iphone",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "蘋果",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "手機",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}
Effect of the second rule
GET goods2/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "2233"
}

GET goods2/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "22娘"
}

Result

{
  "tokens" : [
    {
      "token" : "bilibili",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "b",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "站",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

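Update the synonyms and reload the index

# Run from the Elasticsearch home directory to rewrite the file
# With the default expand setting, `iPhone,蘋果手機 => iPhone,蘋果手機` has the same effect as `iPhone,蘋果手機`
# The double quotes `"` and the trailing commas `,` are optional (without them, rules just need to be on separate lines); they are used here only to match the inline form
$ echo '"iPhone,蘋果手機",
    "2233,22娘,33娘 => bilibili,B站,二次元"' > config/analysis/synonyms.txt
# Make the new synonyms take effect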
POST /goods2/_reload_search_analyzers
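
If the reload has to be triggered from a script rather than from Kibana, the same request could be sent with Python's standard library. This is a minimal sketch under stated assumptions: ES_URL (http://localhost:9200) is an assumed address for an unsecured local node, and the helper names are hypothetical. Only the endpoint itself comes from the command above.

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of a local, unsecured node


def build_reload_request(index: str, base_url: str = ES_URL) -> urllib.request.Request:
    """Build (but do not send) POST /<index>/_reload_search_analyzers."""
    url = f"{base_url.rstrip('/')}/{index}/_reload_search_analyzers"
    return urllib.request.Request(url, method="POST")


def reload_search_analyzers(index: str, base_url: str = ES_URL) -> str:
    """Send the reload request and return the raw JSON response body."""
    request = build_reload_request(index, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only show what would be sent; call reload_search_analyzers("goods2")
    # against a running cluster to actually perform the reload.
    req = build_reload_request("goods2")
    print(req.get_method(), req.full_url)
```

Because the synonyms file must exist on every node, the file should be updated on all nodes before this request is sent.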
Effect of the second rule after the change
GET goods2/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "2233"
}

Result

{
  "tokens" : [
    {
      "token" : "bilibili",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "b",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "二次元",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "站",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}
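
The extra 二次元 token follows from how each rule maps input terms to output terms. The sketch below is a simplified Python model of the two rule forms used in this post, not Elasticsearch's actual parser: parse_rule is a hypothetical helper, it assumes the default expand behaviour for comma-only rules, and it ignores quoting, escaping, comments and the analysis of each term.

```python
from typing import Dict, List


def split_terms(part: str) -> List[str]:
    """Split a comma-separated list and drop surrounding whitespace."""
    return [term.strip() for term in part.split(",") if term.strip()]


def parse_rule(rule: str) -> Dict[str, List[str]]:
    """Turn one synonym rule into {input term: output terms}."""
    if "=>" in rule:
        # Explicit mapping: every term on the left is replaced by the right side.
        left, right = rule.split("=>", 1)
        inputs, outputs = split_terms(left), split_terms(right)
    else:
        # Equivalent synonyms: every term expands to the whole group.
        inputs = split_terms(rule)
        outputs = inputs
    return {term: list(outputs) for term in inputs}


if __name__ == "__main__":
    print(parse_rule("2233,22娘,33娘 => bilibili,B站,二次元"))
    print(parse_rule("iPhone,蘋果手機") == parse_rule("iPhone,蘋果手機 => iPhone,蘋果手機"))
```

In this model the comma-only rule and the explicit rule with identical sides produce the same mapping, which matches the note in the shell comments above.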

Inline approach

Create the index

The synonyms are configured directly in the synonyms property.

PUT /goods3
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "iPhone,蘋果手機 => iPhone,蘋果手機",
            "2233,22娘,33娘 => bilibili,B站"
          ]
        }
      },
      "analyzer": {
        "my_synonyms_analyzer": {
          "tokenizer": "ik_smart",
          "filter": [
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_smart",
        "search_analyzer": "my_synonyms_analyzer"
      }
    }
  }
}

View the analysis results

The results below are the same as with the synonyms file approach.

GET goods3/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "iphone"
}

GET goods3/_analyze
{
  "analyzer": "my_synonyms_analyzer",
  "text": "2233"
}

Update the synonyms and reload the index

# The index must be closed before its analysis settings can be changed
POST /goods3/_close

PUT /goods3/_settings/
{
  "analysis": {
    "filter": {
      "my_synonym_filter": {
        "type": "synonym",
        "synonyms": [
          "iPhone,蘋果手機",
          "2233,22娘,33娘 => bilibili,B站,二次元"
        ]
      }
    }
  }
}

# Reopen the index
POST /goods3/_open
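
The close, update, reopen sequence above is easy to get out of order when scripted. As a rough sketch, the three requests could be generated in one place; inline_synonym_update_steps is a hypothetical helper that only builds the method, path and JSON body for each step and leaves sending them to whatever HTTP client is in use.

```python
import json
from typing import List, Optional, Tuple

# (HTTP method, path, JSON body or None)
Step = Tuple[str, str, Optional[str]]


def inline_synonym_update_steps(index: str, filter_name: str, synonyms: List[str]) -> List[Step]:
    """Return the close / update settings / open requests, in order."""
    settings = {
        "analysis": {
            "filter": {
                filter_name: {"type": "synonym", "synonyms": synonyms}
            }
        }
    }
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", json.dumps(settings, ensure_ascii=False)),
        ("POST", f"/{index}/_open", None),
    ]


if __name__ == "__main__":
    rules = ["iPhone,蘋果手機", "2233,22娘,33娘 => bilibili,B站,二次元"]
    for method, path, body in inline_synonym_update_steps("goods3", "my_synonym_filter", rules):
        print(method, path, body or "")
```

While the index is closed it cannot serve searches, which is the main practical difference from the synonyms file approach.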

Query in practice

Using the goods2 index as an example:

# Index a document
POST /goods2/_doc/1
{
  "title":"bilibili是個好平臺"
}

# Search with the keyword `2233`
GET /goods2/_search
{
  "query": {
    "match": {
      "title": "2233"
    }
  }
}

Result

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "goods2",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "bilibili是個好平臺"
        }
      }
    ]
  }
}

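Summary

  • Elasticsearch offers two ways to configure synonyms: inline and via a synonyms file
  • The synonyms file approach lets you update synonyms dynamically without closing the index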
