Annoy Nearest-Neighbor Algorithm

Annoy

   

   

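Randomly choose two points, use them as the initial centers, and run k-means with 2 clusters; after convergence this yields two cluster centroids. The hyperplane equidistant from the two centroids then splits the data into two subsets, and the same procedure is applied recursively to each subset.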

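The leaves at the bottom of the binary tree store the original data points; every other (internal) node stores the information for its split hyperplane.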
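The split step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Annoy's actual implementation: the helper names (`two_means`, `split_hyperplane`, `side`) are hypothetical, the metric is assumed to be Euclidean, and the hyperplane is assumed to be the one equidistant from the two converged centroids.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def two_means(points, iters=20, rng=random):
    """Run k-means with k=2, seeded with two randomly chosen points."""
    c0, c1 = rng.sample(points, 2)
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:      # degenerate split, keep current centers
            break
        new_c0, new_c1 = mean(left), mean(right)
        if new_c0 == c0 and new_c1 == c1:   # converged
            break
        c0, c1 = new_c0, new_c1
    return c0, c1

def split_hyperplane(c0, c1):
    """Hyperplane equidistant from c0 and c1, as normal . x + offset = 0."""
    normal = [a - b for a, b in zip(c0, c1)]
    midpoint = [(a + b) / 2 for a, b in zip(c0, c1)]
    offset = -sum(n * m for n, m in zip(normal, midpoint))
    return normal, offset

def side(point, normal, offset):
    """True if the point lies on c0's side of the hyperplane."""
    return sum(n * x for n, x in zip(normal, point)) + offset > 0

# Two well-separated groups end up on opposite sides of the hyperplane.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
c0, c1 = two_means(points, rng=random.Random(0))
normal, offset = split_hyperplane(c0, c1)
sides = [side(p, normal, offset) for p in points]
```

An internal node would store only `normal` and `offset`; the points themselves are kept in the leaves once a subset is small enough.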

   

   

   

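A query walks down the tree from the root, picking one side of each split hyperplane, until it reaches a leaf. However, this procedure has two problems:

(1) What if the leaf the query ends up in holds fewer data points than the Top N similar neighbors we need?

(2) What if two nearby data points were split into different branches of the binary tree?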

   

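These problems can be addressed as follows:

(1) If the query point is close to a split hyperplane, so that both sides are similarly plausible, traverse both sides.

(2) Build multiple binary trees to form a forest.

(3) Insert the neighbors returned by all trees into a priority queue, take their union and remove duplicates, compute each candidate's distance to the query point, sort from nearest to farthest, and return the Top N nearest neighbors.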
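The three remedies above can be sketched together in plain Python. This is a simplified illustration, not Annoy's code: the names `build_tree` and `query` and the parameters `leaf_size` and `search_k` are assumptions made for this example, and each split uses the hyperplane between two randomly sampled items directly instead of running 2-means first, to keep the sketch short. A single priority queue is shared by all trees and ordered by how far the query is from the nearest hyperplane it had to cross, so the far side of a close split gets explored early; a set handles the union and de-duplication before the final distance sort.

```python
import heapq
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_tree(data, ids, rng, leaf_size=8):
    """Recursively split item ids with a hyperplane between two random items."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    i, j = rng.sample(ids, 2)
    normal = [a - b for a, b in zip(data[i], data[j])]
    norm = math.sqrt(dot(normal, normal))
    if norm == 0:                       # the two sampled items coincide
        return ("leaf", ids)
    normal = [x / norm for x in normal]  # unit normal: margins are distances
    midpoint = [(a + b) / 2 for a, b in zip(data[i], data[j])]
    offset = -dot(normal, midpoint)
    left = [k for k in ids if dot(normal, data[k]) + offset > 0]
    right = [k for k in ids if dot(normal, data[k]) + offset <= 0]
    if not left or not right:
        return ("leaf", ids)
    return ("node", normal, offset,
            build_tree(data, left, rng, leaf_size),
            build_tree(data, right, rng, leaf_size))

def query(data, forest, q, top_n, search_k):
    """Return up to top_n item ids nearest to q, stopping once at least search_k candidates are collected."""
    # heapq is a min-heap, so priorities are stored negated. The counter
    # breaks ties so that heapq never has to compare two tree nodes.
    heap = [(-math.inf, t, root) for t, root in enumerate(forest)]
    heapq.heapify(heap)
    counter = len(forest)
    candidates = set()                  # set union removes duplicates
    while heap and len(candidates) < search_k:
        key, _, node = heapq.heappop(heap)
        if node[0] == "leaf":
            candidates.update(node[1])
            continue
        _, normal, offset, left, right = node
        margin = dot(normal, q) + offset
        near, far = (left, right) if margin > 0 else (right, left)
        # The far side is penalised by the distance to the hyperplane.
        heapq.heappush(heap, (max(key, -abs(margin)), counter, near))
        heapq.heappush(heap, (max(key, abs(margin)), counter + 1, far))
        counter += 2
    ranked = sorted(candidates, key=lambda k: sq_dist(data[k], q))
    return ranked[:top_n]

rng = random.Random(0)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
forest = [build_tree(data, list(range(len(data))), rng) for _ in range(10)]
neighbors = query(data, forest, data[0], top_n=5, search_k=50)
```

Raising `search_k` or the number of trees trades query time for recall; with `search_k` equal to the dataset size the sketch degenerates into an exact brute-force search.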

Summary of features

  • Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
  • Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
  • Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
  • Small memory usage
  • Lets you share memory between multiple processes
  • Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
  • Native Python support, tested with 2.7, 3.6, and 3.7.
  • Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)
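The cosine-distance identity in the feature list follows from expanding the squared norm for unit vectors: |u - v|^2 = |u|^2 + |v|^2 - 2 u·v = 2 - 2*cos(u, v). The short check below confirms it numerically; the function names are hypothetical helpers written for this example and are not part of Annoy.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = u . v / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalized_euclidean(u, v):
    """Euclidean distance between u and v after scaling both to unit length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.sqrt(sum((a / norm_u - b / norm_v) ** 2 for a, b in zip(u, v)))

def angular_distance(u, v):
    """sqrt(2 - 2*cos(u, v)), clamped at 0 to absorb rounding error."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine_similarity(u, v)))

u, v = [1.0, 2.0, 3.0], [-2.0, 0.5, 4.0]
# The two printed values agree up to floating-point rounding.
print(normalized_euclidean(u, v), angular_distance(u, v))
```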

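How many trees build(-1) creates

With build(-1), the total number of nodes across all trees ends up at roughly twice the number of training items: https://github.com/spotify/annoy/issues/338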

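Building on disk (on_disk_build)

With this option, the index is written to a file on disk while it is being built, rather than being held entirely in memory.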
