Annoy
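Pick two points at random, use them as the initial centroids, and run k-means with k = 2; the result is two converged cluster centroids. The hyperplane equidistant from these two centroids splits the data into two subsets, and the same procedure is applied recursively to each subset.
The leaves at the bottom of the binary tree store the original data points; all other (internal) nodes store the information about the splitting hyperplane.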
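
The split step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Annoy's actual implementation: the function names `two_means`, `split_hyperplane`, and `side` are hypothetical, the iteration count is arbitrary, and squared Euclidean distance is assumed.

```python
import random

def two_means(points, iterations=10, seed=0):
    """Run k-means with k = 2, seeded by two randomly chosen points."""
    rng = random.Random(seed)
    c1, c2 = (list(p) for p in rng.sample(points, 2))
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            d1 = sum((a - b) ** 2 for a, b in zip(p, c1))
            d2 = sum((a - b) ** 2 for a, b in zip(p, c2))
            groups[0 if d1 <= d2 else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate assignment, keep the previous centroids
        c1 = [sum(col) / len(groups[0]) for col in zip(*groups[0])]
        c2 = [sum(col) / len(groups[1]) for col in zip(*groups[1])]
    return c1, c2

def split_hyperplane(c1, c2):
    """Hyperplane equidistant from c1 and c2, as normal . x + offset = 0."""
    normal = [a - b for a, b in zip(c1, c2)]
    midpoint = [(a + b) / 2 for a, b in zip(c1, c2)]
    offset = -sum(n * m for n, m in zip(normal, midpoint))
    return normal, offset

def side(point, normal, offset):
    """True if the point lies on the c1 side of the hyperplane."""
    return sum(n * x for n, x in zip(normal, point)) + offset > 0

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
c1, c2 = two_means(points)
normal, offset = split_hyperplane(c1, c2)
left = [p for p in points if side(p, normal, offset)]
right = [p for p in points if not side(p, normal, offset)]
```

An internal node would store only `normal` and `offset`; recursing on `left` and `right` until the subsets are small yields the leaves.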
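However, the description above leaves two problems:
(1) What if the leaf that the query finally lands in holds fewer data points than the Top N nearest neighbors we need?
(2) What if two nearby data points end up in different branches of the binary tree?
These problems can be addressed as follows:
(1) If both sides of a splitting hyperplane look about equally close to the query, traverse both sides.
(2) Build multiple binary trees to form a forest.
(3) Insert the neighbors returned by all trees into a priority queue, take the union and remove duplicates, compute each candidate's distance to the query point, sort from nearest to farthest, and return the Top N nearest neighbors.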
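
The three remedies above can be combined in one small sketch. It makes several simplifying assumptions that are not taken from the source: each split uses the hyperplane between two random points (instead of a full 2-means run), one priority queue is shared across all trees and ordered by the query's margin to the hyperplanes, and the names `build_forest`, `query`, `leaf_size`, and `search_k` are only illustrative.

```python
import heapq
import itertools
import math
import random

def margin(x, normal, offset):
    """Signed distance-like score of x relative to a hyperplane."""
    return sum(n * xi for n, xi in zip(normal, x)) + offset

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_tree(ids, vectors, rng, leaf_size=4):
    """Recursively split ids with random hyperplanes; leaves hold item ids."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    a, b = (vectors[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = -sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))
    left = [i for i in ids if margin(vectors[i], normal, offset) > 0]
    right = [i for i in ids if margin(vectors[i], normal, offset) <= 0]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return ("leaf", ids)
    return ("split", normal, offset,
            build_tree(left, vectors, rng, leaf_size),
            build_tree(right, vectors, rng, leaf_size))

def build_forest(vectors, n_trees, seed=0):
    rng = random.Random(seed)
    ids = list(range(len(vectors)))
    return [build_tree(ids, vectors, rng) for _ in range(n_trees)]

def query(forest, vectors, q, top_n, search_k):
    """Collect at least search_k candidates (if available), then rank them."""
    tie = itertools.count()  # tie-breaker so heapq never compares nodes
    heap = [(-math.inf, next(tie), tree) for tree in forest]
    heapq.heapify(heap)
    candidates = set()  # union of leaf contents, duplicates removed
    while heap and len(candidates) < search_k:
        neg_prio, _, node = heapq.heappop(heap)
        if node[0] == "leaf":
            candidates.update(node[1])
            continue
        _, normal, offset, left, right = node
        m = margin(q, normal, offset)
        prio = -neg_prio
        # The far side gets a lower priority but stays in the queue, so it
        # is still visited when the query is close to the hyperplane.
        heapq.heappush(heap, (-min(prio, m), next(tie), left))
        heapq.heappush(heap, (-min(prio, -m), next(tie), right))
    ranked = sorted(candidates, key=lambda i: dist(q, vectors[i]))
    return ranked[:top_n]

rng = random.Random(1)
vectors = [[rng.random(), rng.random()] for _ in range(50)]
forest = build_forest(vectors, n_trees=5, seed=2)
neighbors = query(forest, vectors, [0.5, 0.5], top_n=5, search_k=20)
```

A larger `search_k` inspects more leaves and therefore trades speed for recall; once it reaches the number of items, the result is the same as a brute-force search.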
Summary of features
- Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
- Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
- Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
- Small memory usage
- Lets you share memory between multiple processes
- Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
- Native Python support, tested with 2.7, 3.6, and 3.7.
- Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)
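
The identity quoted in the list, that the Euclidean distance between normalized vectors equals sqrt(2-2*cos(u, v)), can be checked numerically. The snippet below is a small standalone check with made-up example vectors; the helper names are hypothetical and it does not use the Annoy library.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

def cosine_similarity(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Example vectors chosen arbitrarily for illustration.
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

lhs = euclidean(normalize(u), normalize(v))          # distance after normalizing
rhs = math.sqrt(2 - 2 * cosine_similarity(u, v))     # sqrt(2 - 2*cos(u, v))
print(math.isclose(lhs, rhs))
```

The identity follows from expanding |u - v|^2 = |u|^2 + |v|^2 - 2 u.v with |u| = |v| = 1.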
The number of trees when calling build(-1)
With build(-1), the total number of nodes ends up at roughly twice the number of training items; see https://github.com/spotify/annoy/issues/338
The on_disk_build issue
With on_disk_build, the index data is written to a file on disk while the index is being built.