Annoy Nearest-Neighbor Algorithm

Date: 2019-12-13

Tags: annoy, nearest-neighbor algorithm

Annoy

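To split a set of points, pick two points at random and use them as the initial centers for a k-means run with two clusters (k = 2); the run converges to two cluster centroids. The hyperplane equidistant from these two centroids divides the points into two subsets, and the same procedure is applied recursively to each subset, which produces a binary tree.

In this binary tree, the leaf nodes at the bottom store the original data points, while every internal node stores the parameters of its splitting hyperplane.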
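The construction described above can be sketched in plain Python. This is a minimal illustration of the idea, not Annoy's actual C++ implementation: the function names (`two_means`, `split_hyperplane`, `side`, `build_tree`), the dictionary node layout, and the `leaf_size` and `iters` parameters are all assumptions made for the example.

```python
import random

def two_means(points, iters=20, rng=random):
    """k-means with k=2, seeded with two randomly chosen points."""
    c1, c2 = rng.sample(points, 2)
    for _ in range(iters):
        g1, g2 = [], []
        for p in points:
            d1 = sum((a - b) ** 2 for a, b in zip(p, c1))
            d2 = sum((a - b) ** 2 for a, b in zip(p, c2))
            (g1 if d1 <= d2 else g2).append(p)
        if not g1 or not g2:          # one cluster is empty: stop early
            break
        n1 = [sum(col) / len(g1) for col in zip(*g1)]
        n2 = [sum(col) / len(g2) for col in zip(*g2)]
        if n1 == c1 and n2 == c2:     # centroids stopped moving
            break
        c1, c2 = n1, n2
    return c1, c2

def split_hyperplane(c1, c2):
    """Hyperplane equidistant from c1 and c2, as normal . x + offset = 0."""
    normal = [a - b for a, b in zip(c1, c2)]
    midpoint = [(a + b) / 2 for a, b in zip(c1, c2)]
    offset = -sum(n * m for n, m in zip(normal, midpoint))
    return normal, offset

def side(normal, offset, p):
    """Signed margin of p: positive on the c1 side, negative on the c2 side."""
    return sum(n * x for n, x in zip(normal, p)) + offset

def build_tree(points, ids, leaf_size=2, rng=random):
    """Recursively split ids; leaves hold point ids, inner nodes hold a hyperplane."""
    if len(ids) <= leaf_size:         # assumes leaf_size >= 1
        return {"leaf": ids}
    c1, c2 = two_means([points[i] for i in ids], rng=rng)
    normal, offset = split_hyperplane(c1, c2)
    left = [i for i in ids if side(normal, offset, points[i]) >= 0]
    right = [i for i in ids if side(normal, offset, points[i]) < 0]
    if not left or not right:         # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(points, left, leaf_size, rng),
        "right": build_tree(points, right, leaf_size, rng),
    }

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
tree = build_tree(points, list(range(len(points))), leaf_size=3)
print(tree)
```

Because the two seed points are random, each call can produce a different tree, which is what makes building several trees worthwhile.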

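However, the scheme described above has two problems:

(1) What if the leaf node that a query finally reaches contains fewer data points than the Top N neighbors we need?

(2) What if two nearby data points end up in different branches of the binary tree?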

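These problems can be addressed as follows:

(1) If the query point lies close to the splitting hyperplane, so that both sides are about equally plausible, traverse both sides.

(2) Build multiple binary trees to form a forest.

(3) Insert the neighbor candidates returned by all trees into a priority queue, take their union and remove duplicates, compute each candidate's distance to the query point, sort by distance from nearest to farthest, and return the Top N nearest neighbors.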
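The three steps above can be sketched as a single search routine. This is a hedged illustration rather than Annoy's real code: the node layout (a leaf is `{"leaf": [ids]}`, an inner node has `normal`, `offset`, `left`, `right`, with `left` holding the non-negative side), the function name `query_forest`, and the candidate budget `search_k` as used here are assumptions for the example. The priority of a branch is the smallest signed margin seen on the path to it, so branches just across a nearby hyperplane are explored soon after the preferred side.

```python
import heapq
import math

def margin(node, q):
    """Signed distance-like score of q relative to the node's hyperplane."""
    return sum(n * x for n, x in zip(node["normal"], q)) + node["offset"]

def query_forest(trees, points, q, n, search_k):
    """Return ids of up to n approximate nearest neighbors of q."""
    # heapq is a min-heap, so priorities are negated; the counter breaks ties
    # so that node dictionaries are never compared with each other.
    heap = [(-math.inf, i, root) for i, root in enumerate(trees)]
    heapq.heapify(heap)
    counter = len(trees)
    candidates = set()                      # union of leaf contents, deduplicated
    while heap and len(candidates) < search_k:
        neg_prio, _, node = heapq.heappop(heap)
        prio = -neg_prio
        if "leaf" in node:
            candidates.update(node["leaf"])
            continue
        m = margin(node, q)
        # Both children are queued; the far side simply gets a lower priority.
        heapq.heappush(heap, (-min(prio, m), counter, node["left"]))
        heapq.heappush(heap, (-min(prio, -m), counter + 1, node["right"]))
        counter += 2
    ranked = sorted(candidates, key=lambda i: math.dist(q, points[i]))
    return ranked[:n]

# Hand-built one-dimensional example: two trees that split at x=1.5 and x=0.5.
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
tree_a = {"normal": (-1.0,), "offset": 1.5,
          "left": {"leaf": [0, 1]}, "right": {"leaf": [2, 3]}}
tree_b = {"normal": (-1.0,), "offset": 0.5,
          "left": {"leaf": [0]}, "right": {"leaf": [1, 2, 3]}}

q = (1.4,)
print(query_forest([tree_a], points, q, n=2, search_k=2))          # single tree
print(query_forest([tree_a, tree_b], points, q, n=2, search_k=3))  # forest
```

In this toy data the true neighbors of 1.4 are points 1 and 2, which sit on opposite sides of the split in `tree_a`. Searching `tree_a` alone with a small budget stops at the leaf `[0, 1]`, while adding `tree_b` (or raising `search_k`) lets point 2 enter the candidate set.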

Summary of features

Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
Small memory usage
Lets you share memory between multiple processes
Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
Native Python support, tested with 2.7, 3.6, and 3.7.
Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)
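The note in the feature list that cosine distance equals the Euclidean distance of normalized vectors follows from expanding the squared norm: for unit vectors u and v, ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v = 2 - 2 cos(u, v). The short check below confirms this numerically; the helper names `normalize` and `cosine_similarity` and the sample vectors are chosen only for illustration.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    """cos of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

u, v = [3.0, 4.0], [4.0, 3.0]

# Left side: Euclidean distance between the normalized vectors.
lhs = math.dist(normalize(u), normalize(v))
# Right side: the formula from the feature list.
rhs = math.sqrt(2 - 2 * cosine_similarity(u, v))

print(lhs, rhs, math.isclose(lhs, rhs))
```

One practical consequence is that an index built for Euclidean distance over pre-normalized vectors ranks neighbors in the same order as cosine similarity would.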