Recommender Systems (Part 1): Classic Papers, Literature, and Industry Applications

http://semocean.com/%E6%8E%A8%E8%8D%90%E7%B3%BB%E7%BB%9F%E7%BB%8F%E5%85%B8%E8%AE%BA%E6%96%87%E6%96%87%E7%8C%AE%E5%8F%8A%E8%B5%84%E6%96%99/

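This post lists the papers and books I consulted while designing and building Baidu's keyword search recommendation engine, along with the recommender-system tools I evaluated. It also collects talks and slides on industry recommendation engines, from events I attended and ones I did not (with links to my network drive). The organization borrows from compilations made by some colleagues at the company.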

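Because recommendation engines are not really an independent discipline and are inseparable from machine learning and data mining, I have also listed some useful tools and books in those areas. I hope they help.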

I. Surveys and Overview Material

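Adomavicius G, Tuzhilin A. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions[J]. Knowledge and Data Engineering, IEEE Transactions on, 2005, 17(6): 734-749. A 2005 state-of-the-art survey of recommendation, organized by the content-based, CF, and hybrid taxonomy; it also covers the properties and metrics to watch when designing a recommendation engine. Very comprehensive.
Marlin B. Collaborative filtering: A machine learning perspective[D]. University of Toronto, 2004. Presents recommendation algorithms from the viewpoint of traditional machine-learning classification; readers with some machine-learning background will find it easy to follow.
Koren Y, Bell R. Advances in collaborative filtering[M]//Recommender Systems Handbook. Springer US, 2011: 145-186. The chapter of the Recommender Systems Handbook devoted to collaborative filtering. It introduces important recent advances, including matrix factorization, time-aware recommendation, neighborhood-based recommendation, and blends of several methods. The chapter itself is short, but the papers it cites are worth reading closely.
Su X, Khoshgoftaar T M. A survey of collaborative filtering techniques[J]. Advances in artificial intelligence, 2009, 2009: 4. A survey of collaborative filtering that covers the methods and evaluation criteria under the memory-based, model-based, and hybrid taxonomy, and compares results evaluated on Netflix data.
Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems[J]. Computer, 2009, 42(8): 30-37. Focuses on collaborative filtering via matrix factorization. If you have read Advances in collaborative filtering in the Recommender Systems Handbook, you can skip this one.
Pazzani M J, Billsus D. Content-based recommendation systems[M]//The adaptive web. Springer Berlin Heidelberg, 2007: 325-341. A high-level introduction to the strategy and architecture of content-based methods.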

1. Content-based Methods

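Content-based methods depend heavily on domain-specific feature extraction and processing for items. Music recommendation and keyword recommendation, for example, differ in many details of how content information is handled, so only a few survey-style articles on content-based methods are listed here.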

Pazzani M J, Billsus D. Content-based recommendation systems[M]//The adaptive web. Springer Berlin Heidelberg, 2007: 325-341. A high-level introduction to the strategy and architecture of content-based methods.
Lops P, de Gemmis M, Semeraro G. Content-based recommender systems: State of the art and trends[M]//Recommender Systems Handbook. Springer US, 2011: 73-105. The chapter of the Recommender Systems Handbook devoted to content-based algorithms.
Jannach D, Zanker M, Felfernig A, et al. Content-based recommendation[M]//Recommender systems: an introduction, Chapter 3. Cambridge University Press, 2010.

2. Collaborative Filtering Methods

1) Neighbourhood Based Methods

Sarwar B, Karypis G, Konstan J, et al. Item-based collaborative filtering recommendation algorithms[C]//Proceedings of the 10th international conference on World Wide Web. ACM, 2001: 285-295. The classic paper on item-based recommendation with KNN; it also introduces several similarity measures.
Linden G, Smith B, York J. Amazon.com recommendations: Item-to-item collaborative filtering[J]. Internet Computing, IEEE, 2003, 7(1): 76-80. The classic paper on Amazon's item-based algorithm.
Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing[C]//VLDB. 1999, 99: 518-529. LSH
Bell R M, Koren Y. Scalable collaborative filtering with jointly derived neighborhood interpolation weights[C]//Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on. IEEE, 2007: 43-52.
Indyk P, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality[C]//Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998: 604-613. LSH
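Buhler J. Efficient large-scale sequence comparison by locality-sensitive hashing[J]. Bioinformatics, 2001, 17(5): 419-428. An application of LSH.
Chen T, Zheng Z, Lu Q, et al. Feature-based matrix factorization[J]. arXiv preprint arXiv:1109.2271, 2011. The principles behind the SVDFeature tool developed by the Apex Lab at Shanghai Jiao Tong University. Its advantage is that you can study it side by side with the code.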
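
As a rough illustration of the item-based neighborhood idea in the Sarwar et al. and Linden et al. papers above, here is a minimal sketch using cosine similarity between item columns. The function names, the toy rating matrix, and the choice of k are my own assumptions for illustration; the papers discuss several other similarity measures and many practical refinements that are not shown.

```python
import numpy as np

def item_cosine_similarity(ratings):
    """Cosine similarity between item columns of a user x item matrix (0 = unrated)."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unrated items
    sim = (ratings.T @ ratings) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)       # an item is not its own neighbor
    return sim

def predict_rating(ratings, sim, user, item, k=2):
    """Similarity-weighted average of the user's ratings on the k nearest items."""
    rated = np.flatnonzero(ratings[user])
    rated = rated[rated != item]
    if rated.size == 0:
        return 0.0
    # Keep the k rated items most similar to the target item.
    top = rated[np.argsort(sim[item, rated])[::-1][:k]]
    weights = sim[item, top]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ ratings[user, top] / weights.sum())

# Hypothetical toy data: rows are users, columns are items.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

sim = item_cosine_similarity(ratings)
print(predict_rating(ratings, sim, user=0, item=2))
```

User 0 likes items 0 and 1 and dislikes item 3, which is the closest neighbor of item 2, so the predicted rating for item 2 comes out low.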

2) Model Based Methods

Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems[J]. Computer, 2009, 42(8): 30-37. Focuses on collaborative filtering via matrix factorization. If you have read Advances in collaborative filtering in the Recommender Systems Handbook, you can skip this one.
Singh A P, Gordon G J. A unified view of matrix factorization models[M]//Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2008: 358-373.
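
To make the matrix factorization idea in these papers a little more concrete, here is a minimal sketch that fits user and item factors by plain SGD on the observed ratings only. It omits the bias terms, implicit feedback, and temporal effects that Koren et al. discuss; the function names, hyperparameters, and toy ratings are assumptions chosen for illustration.

```python
import numpy as np

def factorize(triples, n_users, n_items, n_factors=2,
              lr=0.05, reg=0.02, epochs=300, seed=0):
    """Learn user factors P and item factors Q so that P[u] @ Q[i] approximates r."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(epochs):
        for u, i, r in triples:
            err = r - P[u] @ Q[i]
            pu = P[u].copy()  # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

def rmse(triples, P, Q):
    """Root mean squared error over the given (user, item, rating) triples."""
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in triples])))

# Hypothetical observed ratings as (user, item, rating).
triples = [
    (0, 0, 5), (0, 1, 4), (0, 3, 1),
    (1, 0, 4), (1, 1, 5), (1, 2, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 1), (3, 2, 4), (3, 3, 5),
]

P, Q = factorize(triples, n_users=4, n_items=4)
print("training RMSE:", rmse(triples, P, Q))
print("predicted rating for user 0, item 2:", float(P[0] @ Q[2]))
```

The last line predicts a rating that was never observed, which is the whole point of the factorization: unobserved cells are filled in from the learned factors.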

3) Hybrid Methods

Koren Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model[C]//Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008: 426-434. Combines matrix factorization with neighborhood-based methods.
Burke R. Hybrid recommender systems: Survey and experiments[J]. User modeling and user-adapted interaction, 2002, 12(4): 331-370. Describes frameworks for combining multiple recommendation algorithms.

II. Recommender Systems in Industry

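Netflix: Behind Netflix Video Recommendations: The Algorithm Knows What You Want to Watch
Netflix: Netflix Recommendations Beyond the 5 Stars
Hulu: Recommender System Algorithm and Architecture (Xiang Liang)
YouTube: Davidson J, Liebald B, Liu J, et al. The YouTube video recommendation system[C]//Proceedings of the fourth ACM conference on Recommender systems. ACM, 2010: 293-296. The main algorithm in YouTube's recommendation system. Baidu's keyword search recommendation system optimized it to support cascaded bipartite-graph recommendation over arbitrary types. For details, see the blog post "Google YouTube movie recommendation algorithm, and the cascaded bipartite-graph implementation in Baidu keyword search recommendation".
Douban: Several Issues in Personalized Recommender Systems (Wang Shoukun, Douban)
Douban: Finding the Way in Recommendation (A Wen, Douban)
Douban: Douban's Practice and Reflections in Recommendation
Baifendian: Quantifying Beauty, a Fashion Outfit Matching Engine
Weibo and Kaola FM: Recommendation Practice That Can't Stop (Chen Kaijiang)
Alibaba: Recommendation Technology for Tmall Double 11
Alibaba: The Taobao Recommender System
Dangdang: Search and Recommendation at Dangdang (Zhuang Hongbo)
Tudou: Personalized Video Recommender System at Tudou (Ming Hongtao)
360: Recommender System Practice at 360 (Yang Hao)
Shanda: Recommender Systems in Practice and How to Improve Their Effectiveness (Chen Yunwen)
Shanda: Development and Application of Intelligent Recommender Systems (Chen Yunwen)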

III. Books on Recommender Systems

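Segaran T. Programming collective intelligence: building smart web 2.0 applications[M]. O’Reilly Media, 2007. An entertaining introductory text with toy-level code you can try hands-on right away.
Ricci F, Rokach L, Shapira B, et al. Recommender systems handbook[M]. Springer, 2011. A recommender-systems book thick enough to use as a pillow, and one that belongs at your bedside; I have read a bit more than half of it. If you read and understand the whole book and its references, congratulations: you have a deep understanding of the field.
Jannach D, Zanker M, Felfernig A, et al. Recommender systems: an introduction[M]. Cambridge University Press, 2010. Can be regarded as a survey collection of recommender-system papers before 2010.
Celma O. Music recommendation and discovery[M]. Springer, 2010. Focuses squarely on music recommendation, including which features to select and how to account for musical factors in evaluation.
Agirre E, Edmonds P, eds. Word sense disambiguation: Algorithms and applications[M]. Springer Science+ Business Media, 2006. Worth consulting if you work on keyword or text recommendation.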

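P.S. To understand a field or tool in depth, find that field's "XX Handbook" and read it through carefully, with courage and without fear; you will then have a fair (reasonably deep) understanding of it. Of course, the effect is even better if you have a related project under way at the same time ^_^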

IV. Other Material

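I have always held that recommender systems are not an independent discipline: many of their techniques come straight from machine learning, data mining, and information retrieval (especially text-related search recommendation). So below I have also gathered material in these areas that I have read, looked into, or plan to read, at work or in my spare time.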

1. Data Mining

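Han J, Kamber M, Pei J. Data mining: concepts and techniques[M]. Morgan Kaufmann, 2006. The handbook of data mining: a textbook that is thick but easy to follow. (A reminder again: to understand a field, find its handbook and read it through patiently and carefully, and you will have a basic grasp of that field.)
Chakrabarti S. Mining the Web: Discovering knowledge from hypertext data[M]. Morgan Kaufmann, 2003. Covers most of the technology in a search engine, including the spider, index construction, the machine-learning algorithms inside, and information retrieval. It is also very practical: the spider I developed in Baidu's commercial search department followed the architecture described in it.
Liu B. Web data mining: exploring hyperlinks, contents, and usage data[M]. Springer, 2007. Where Mining the Web: Discovering knowledge from hypertext data leans toward the overall system and engineering of web mining, this book leans toward strategy. Having read both, you will have a fairly complete picture of the data-mining algorithms in search engines.
Wu X, Kumar V, Quinlan J R, et al. Top 10 algorithms in data mining[J]. Knowledge and Information Systems, 2008, 14(1): 1-37. Singles out and explains the top 10 data-mining algorithms selected in 2006.
Rajaraman A, Ullman J D. Mining of massive datasets[M]. Cambridge University Press, 2012. Describes how to do data mining with Hadoop; very practical if you have a Hadoop environment.
Feldman R, Sanger J. The text mining handbook: advanced approaches in analyzing unstructured data[M]. Cambridge University Press, 2007. The handbook of text mining.
Witten I H, Frank E. Data Mining: Practical machine learning tools and techniques[M]. Morgan Kaufmann, 2005. Introduces data mining using Weka; its biggest advantage is that Weka is open source.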

2. Machine Learning

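Tom M Mitchell, Machine Learning, McGraw-Hill Science/Engineering/Mat, 1997. A very early machine-learning book, well suited to beginners and easy to understand, though for industrial applications its algorithms are only toy-level.
Bishop C M. Pattern recognition and machine learning[M]. New York: Springer, 2006. A more advanced book with a fairly detailed theoretical treatment of each algorithm.
Course: Machine Learning (Stanford, Andrew Ng) http://v.163.com/special/opencourse/machinelearning.html. The famous open course on machine learning by Andrew Ng, in the subtitled version on NetEase; it works even better when studied together with the handouts and problem sets of Stanford CS229.
Liu T Y. Learning to rank for information retrieval[J]. Foundations and Trends in Information Retrieval, 2009, 3(3): 225-331. A fairly complete introduction to LTR, covering concepts and techniques. It also describes concrete open datasets in the field, criteria for choosing features, and so on, so you can run experiments on these data while learning the basic concepts.
http://archive.ics.uci.edu/ml/datasets.html Hosts many machine-learning datasets; excellent data for getting started.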

3. Information Retrieval

Agirre, Eneko, and Philip Glenny Edmonds, eds. Word sense disambiguation: Algorithms and applications. Vol. 33. Springer Science+ Business Media, 2006.
Manning C D, Raghavan P, Schütze H. Introduction to information retrieval[M]. Cambridge: Cambridge University Press, 2008.
Witten I H, Moffat A, Bell T C. Managing gigabytes: compressing and indexing documents and images[M]. Morgan Kaufmann, 1999. A very old book on search engines, but it still amazed me when I read it in 2009: it pulls off all sorts of tricks to process gigabytes of data with a few dozen megabytes of memory. Very impressive.
Liu T Y. Learning to rank for information retrieval[J]. Foundations and Trends in Information Retrieval, 2009, 3(3): 225-331.
Cao Z, Qin T, Liu T Y, et al. Learning to rank: from pairwise approach to listwise approach[C]//Proceedings of the 24th international conference on Machine learning. ACM, 2007: 129-136. Also attached: "tutorial-LTR by Hang Li" and "tutorial-LTR by TY Liu".

V. Classic Recommender-System Software

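Below is a collection of the recommender-system open-source projects (Open Source Software | Recommendation) that can currently be found on the Internet, in the hope that it helps readers interested in this field. (Text by Chen Yunwen)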

1. SVDFeature

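Developed by students at Shanghai Jiao Tong University (in C++). The code is rigorous and of high quality; we used it when competing in the KDD Cup and found it reliable and convenient. It is also the work of fellow Chinese developers, so it gets the top spot!

Project URL: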

http://svdfeature.apexlab.org/wiki/Main_Page

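SVDFeature contains a very flexible matrix factorization framework for recommendation that makes it easy to implement SVD, SVD++, and similar methods; it is among the most accurate single-model recommendation algorithms. The code is compact and can run fairly large-scale matrix factorization on a single machine with relatively little memory.

It also includes a logistic regression model, which is convenient for ensembling.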

2. Crab

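Project URL:

http://geektell.com/story/crab-recommender-systems-in-python/

The tutorial is here:

http://muricoca.github.io/crab/

Crab is open-source recommendation software written in Python that implements item-based and user-based collaborative filtering. More algorithms are said to be in development.

Crab's Python code looks clean and clear, and is worth a read.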

3. CofiRank

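An open-source recommender system implementing collaborative filtering in C++, but the authors do not seem to have updated it since 2009.

CofiRank depends on the Boost library, so building it can be troublesome. Not especially recommended.

Project URL: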

http://www.cofirank.org/

4. EasyRec

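A recommender system written in Java that feels more like a complete recommendation product: it includes data entry, administration, recommendation mining, offline analysis, and more, so the system as a whole is fairly complete.

Project URL: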

http://easyrec.org/

5. GraphLab

Project URL:

http://graphlab.org/

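GraphLab is a high-performance distributed graph processing and mining system written in C++. Its strength is parallel iterative computation (a weak spot for Hadoop).

Because of its distinctive capabilities, GraphLab is well known in industry.

GraphLab is very effective for random walks or graph-based recommendation algorithms on large data.

Although GraphLab (developed at CMU) is well known, applications with ordinary data volumes probably do not need it.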

6. Lenskit

http://lenskit.grouplens.org/

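This open-source recommender system, written in Java, comes from the group at the University of Minnesota in the United States that also created MovieLens, the well-known test dataset in the recommendation field.

Their recommender-systems team is highly influential in academia, and many new research ideas find their way into it.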

7. Mahout

URL:

http://mahout.apache.org/

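Mahout is very well known. It is a major project supported by the Apache Software Foundation, is widely used in China, and already has several Chinese-language books about it. Note that Mahout is a collection of distributed machine-learning algorithms, of which collaborative filtering is only one part. Besides the distributed collaborative-filtering implementation known as Taste (Hadoop-based, with a pure Java version as well), Mahout also has distributed implementations of other common machine-learning algorithms.

In addition, Sean Owen, one of Mahout's authors, built an experimental recommender system on top of Mahout called Myrrix; see:

http://myrrix.com/quick-start/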

8. MyMediaLite

http://mymedialite.net/index.html

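Written in C# on the .NET framework (a Java version also exists); its authors are mostly from universities in Germany, the UK, and elsewhere in Europe.

Besides recommendation algorithms for common scenarios, MyMediaLite also offers distinctive features such as Social Matrix Factorization.

Although it is built on .NET, it also provides APIs callable from scripting languages such as Python and Ruby.

Lars Schmidt, one of MyMediaLite's authors, gave a talk on their system at the KDD 2012 conference. Unfortunately, with the .NET development framework in gradual decline, MyMediaLite appeals mainly to Windows NT Server systems and is rarely used on LAMP sites.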

9. LibFM

Project URL:

http://www.libfm.org/

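The author is Steffen Rendle of Konstanz University in Germany, our old rival from last year's KDD Cup. He used LibFM in both Track 1 and Track 2 and did very well in each, which shows that LibFM is a very effective tool (even though we beat him on Track 1, hiahia).

As the name suggests, LibFM (a library for factorization machines) is a dedicated tool for factorization models such as matrix factorization. In particular, it implements MCMC (Markov Chain Monte Carlo) inference, which is more accurate than the common SGD (stochastic gradient descent) optimization, though also slower.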
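
For readers unfamiliar with factorization machines, the sketch below shows only the second-order model equation from Rendle's formulation, y = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, computed both naively and with the well-known linear-time rewrite. It is not LibFM's code or API, and it does not show MCMC or SGD training; the function names and random parameters are assumptions for illustration.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Second-order FM score with an explicit loop over all feature pairs."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return float(y)

def fm_predict_fast(x, w0, w, V):
    """Same score in O(k * n) using
    sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)."""
    xv = V.T @ x                   # shape (k,): sum_i v_if x_i for each factor f
    x2v2 = (V ** 2).T @ (x ** 2)   # shape (k,): sum_i v_if^2 x_i^2
    return float(w0 + w @ x + 0.5 * np.sum(xv ** 2 - x2v2))

# Hypothetical parameters: n features, k latent factors per feature.
rng = np.random.default_rng(0)
n, k = 6, 3
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(size=(n, k))

# A sparse feature vector, e.g. one-hot user id and item id plus one extra feature.
x = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.5])

print(fm_predict_naive(x, w0, w, V), fm_predict_fast(x, w0, w, V))
```

With only a one-hot user feature and a one-hot item feature active, the pairwise term reduces to the dot product of one user vector and one item vector, which is why matrix factorization appears as a special case.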

10. LibMF

Project URL:

http://www.csie.ntu.edu.tw/~cjlin/libmf/

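Note that LibMF and the LibFM above are two different open-source projects. LibMF comes from the renowned National Taiwan University, whose group has a strong reputation in machine learning: it has done very well in several consecutive KDD Cup competitions in recent years and won the championship multiple years in a row. Their style is very pragmatic; LIBSVM, LIBLINEAR, and other tools widely used in industry were developed by them, and their open-source code is highly efficient and of high quality.

LibMF makes a solid contribution to parallelizing matrix factorization. To address the locking problem and memory discontinuity that SGD suffers from in parallel computation, it proposes an efficient matrix factorization algorithm that divides the rating matrix into blocks according to the number of compute nodes and assigns the blocks to those nodes. The system is described in this paper (Best Paper Award at RecSys 2013):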

Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems. Proceedings of ACM Recommender Systems 2013.
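
The block idea described above can be illustrated with a small sketch. Two blocks of the rating matrix that share no rows and no columns touch disjoint user and item factors, so SGD can update them in parallel without locks. The code below uses a simple static diagonal schedule over a p x p grid, which is my own simplification and not LibMF's actual scheduler (the paper describes its own scheduling and memory-layout techniques); all names here are hypothetical.

```python
def block_of(u, i, n_users, n_items, p):
    """Map a (user, item) cell to its (row block, column block) in a p x p grid."""
    return (u * p // n_users, i * p // n_items)

def partition(triples, n_users, n_items, p):
    """Group (user, item, rating) triples by the block they fall into."""
    blocks = {(a, b): [] for a in range(p) for b in range(p)}
    for u, i, r in triples:
        blocks[block_of(u, i, n_users, n_items, p)].append((u, i, r))
    return blocks

def diagonal_schedule(p):
    """Round t gives worker w the block (w, (w + t) % p).
    Blocks within one round share no row block and no column block."""
    return [[(w, (w + t) % p) for w in range(p)] for t in range(p)]

# Hypothetical toy data: 6 users, 6 items, p = 3 workers.
triples = [(0, 0, 5), (1, 4, 3), (2, 2, 4), (3, 5, 1), (4, 1, 2), (5, 3, 5)]
blocks = partition(triples, n_users=6, n_items=6, p=3)

for t, round_blocks in enumerate(diagonal_schedule(3)):
    for w, blk in enumerate(round_blocks):
        print(f"round {t}, worker {w}: block {blk}, ratings {blocks[blk]}")
```

In a real implementation each worker would run SGD updates over the ratings in its block during a round; here the loop only prints the assignment.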