The Universal Recommender (UR) is a new type of collaborative filtering recommender based on an algorithm that can use data from a wide variety of user taste indicators—the Correlated Cross-Occurrence (CCO) algorithm. Unlike the matrix factorization embodied in tools like MLlib's ALS, the UR's CCO algorithm can ingest any number of user actions, events, profile data, and contextual information. It then serves results in a fast and scalable way. It also supports item properties for filtering and boosting recommendations and can therefore be considered a hybrid collaborative filtering and content-based recommender.
The use of multiple types of data fundamentally changes the way a recommender is used and, when employed correctly, provides a significant increase in recommendation quality versus using only one user event. Most recommenders, for instance, can only use "purchase" events. Using all we know about a user and their context allows us to predict their preferences much better.
User | Action | Item |
---|---|---|
u1 | view | t1 |
u1 | view | t2 |
u1 | view | t3 |
u1 | view | t5 |
u2 | view | t1 |
u2 | view | t3 |
u2 | view | t4 |
u2 | view | t5 |
u3 | view | t2 |
u3 | view | t3 |
u3 | view | t5 |
Organizing these events yields the following user-to-item relationships:
u1 => [ t1, t2, t3, t5 ]
u2 => [ t1, t3, t4, t5 ]
u3 => [ t2, t3, t5 ]
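As a quick sketch (toy Python; the variable names are made up), the matrix $P$ used below can be assembled from these view events:

```python
# Toy sketch: build the binary primary-action matrix P from the view events
# above. Rows are users u1..u3, columns are items t1..t5.
events = {
    ("u1", "t1"), ("u1", "t2"), ("u1", "t3"), ("u1", "t5"),
    ("u2", "t1"), ("u2", "t3"), ("u2", "t4"), ("u2", "t5"),
    ("u3", "t2"), ("u3", "t3"), ("u3", "t5"),
}
users = ["u1", "u2", "u3"]
items = ["t1", "t2", "t3", "t4", "t5"]

# P[u][i] = 1 if user u performed the primary action on item i, else 0
P = [[1 if (u, i) in events else 0 for i in items] for u in users]
for row in P:
    print(row)  # u1 -> [1, 1, 1, 0, 1], u2 -> [1, 0, 1, 1, 1], u3 -> [0, 1, 1, 0, 1]
```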
$r=(P^{T}P)h_{p}$
$h_{p}$ = a given user's history of the primary action (e.g., purchases)
Actions against a particular item can repeat over a user's history. How should that be expressed? For example, $h_{u1}=\begin{bmatrix}1 & 2 & 1 & 0 & 1\end{bmatrix}$, where the 2 means item 2 was purchased twice.
If history is represented this way, a question arises: recent actions and long-past actions carry different meanings. Someone who once bought crutches after an injury should not keep receiving crutch recommendations for that reason alone. Can LLR mitigate this kind of situation?
$P$ = the matrix formed from all users' histories of the primary action (primary event)
$P=\begin{bmatrix}1 & 1 & 1& 0 & 1\\ 1 & 0 & 1 & 1 &1 \\ 0& 1& 1 & 0 & 1\end{bmatrix}$
$(P^{T}P)$ = compares column to column using a log-likelihood based correlation test
Let's call ($P^{T}P$) an indicator matrix for some primary action like purchase
The Log-Likelihood Ratio (LLR) finds important/correlating cooccurrences and filters out the rest—a major improvement in quality over simple cooccurrences or other similarity metrics.
The LLR value is computed from the co-occurrence counts of two events and measures how strongly they are correlated:
$P^{T}\cdot P=\begin{bmatrix}- & 1 & 2 & 1 & 2\\ 1& - & 1 &1 &1\\2& 1& - & 1 &2 \\1& 1 & 1 & - &1\\2&1&2&1 & -\end{bmatrix}\overset{LLR}{\rightarrow}\begin{bmatrix}-& 1.05 & 3.82 & 1.05 &3.82 \\ 1.05 & - &1.05 &1.05 &1.05 \\ 3.82& 1.05 & - & 1.05&3.82 \\1.05&1.05 &1.05 & - &1.05 \\3.82& 1.05 & 3.82 & 1.05&-\end{bmatrix}$
Note: every user clicked ad a4, yet a4's LLR values are all 0. In other words, a4 is correlated with no post, which looks strange at first. This is actually a characteristic of LLR: it strongly penalizes popular events. Put simply, LLR decides that viewing t1 and clicking a4 co-occur not because the two are related, but merely because clicking a4 is itself a high-frequency event.
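The LLR test itself is compact. Below is a pure-Python sketch in the style of Mahout's LogLikelihood helper, which the UR relies on (this toy version and its example counts are for illustration only; note that 2x2 contingency tables like (2, 0, 0, 1) and (1, 1, 1, 0) produce values matching the 3.82 and 1.05 seen above):

```python
import math

# Dunning's log-likelihood ratio for a 2x2 contingency table,
# in the style of Mahout's LogLikelihood helper (illustrative sketch).

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy used by the G^2 statistic
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """k11: both events, k12: only A, k21: only B, k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

print(round(llr(2, 0, 0, 1), 2))  # 3.82
print(round(llr(1, 1, 1, 0), 2))  # 1.05
```

A perfectly popular event (everyone did both) scores 0, which is exactly the penalty on high-frequency events described above.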
$r=(P^{T}P)h_{p}$
$h_{p}$ = the user's history of the primary action $P$
Collaborative filtering recommendation based on CCO can then be expressed as:
$r=(P^{T}P)h_{p}+(P^{T}V)h_{v}+(P^{T}C)h_{c}+…$
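As a sanity check on the mechanics of this formula, here is a toy pure-Python computation of $r=(P^{T}P)h_{p}+(P^{T}V)h_{v}$ with two indicators; all matrices and histories are made-up illustrative data (real CCO also applies LLR filtering, omitted here):

```python
# Toy multi-indicator scoring: r = (P^T P) h_p + (P^T V) h_v

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

# 3 users x 4 items (made-up data)
P = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]   # primary action, e.g. purchase
V = [[1, 1, 1, 0], [0, 1, 1, 1], [1, 1, 0, 1]]   # secondary action, e.g. view
h_p = [1, 0, 1, 0]   # the target user's purchase history
h_v = [1, 1, 1, 0]   # the same user's view history

r = vadd(matvec(matmul(transpose(P), P), h_p),   # cooccurrence term
         matvec(matmul(transpose(P), V), h_v))   # cross-occurrence term
print(r)  # [8, 6, 8, 3]: per-item scores; recommend top unseen items
```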
Given strong data about user preferences across a general population, we can also use:

- Collaborative Topic Filtering
- Entity Preferences
Indicators can also be based on content similarity
$r=(TT^{T})h_{t}+I\cdot L$
$(TT^{T})$ is a calculation that compares every two documents to each other and finds the most similar, based on content alone
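For illustration, here is a toy version of this pairwise document comparison using cosine similarity over term-count vectors (the real rowSimilarity computation uses LLR-based filtering; the item vectors and names below are made up):

```python
import math

# Toy (T T^T)-style content comparison: each row of T is an item's
# term-count vector; score item pairs by cosine similarity.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

T = {  # hypothetical item -> term-count vector
    "t1": [2, 1, 0],
    "t2": [2, 1, 0],
    "t3": [0, 0, 3],
}
sims = {(i, j): cosine(T[i], T[j]) for i in T for j in T if i < j}
print(sims)  # t1 and t2 share all terms; t3 shares none
```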
- Cooccurrence and cross-occurrence: `SimilarityAnalysis.cooccurrences`
- Content or metadata: `SimilarityAnalysis.rowSimilarity`
- Intrinsic
"Universal" means one query on all indicators at once
$r=(P^{T}P)h_{p}+(P^{T}V)h_{v}+(P^{T}C)h_{c}+\cdots+(TT^{T})h_{t}+I\cdot L$
Unified query:
Once indicators are indexed as search fields, this entire equation becomes a single query
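Schematically, such a unified query might look like the following. This is an illustrative Elasticsearch-style body built in Python; the field names and structure are assumptions for the sketch, not the UR's actual mapping:

```python
import json

# Each indicator becomes a search field on item documents, and the user's
# per-indicator histories become OR'd term clauses (illustrative only).
user_history = {
    "purchase": ["t1", "t3"],    # h_p: items the user purchased
    "view": ["t1", "t2", "t3"],  # h_v: items the user viewed
}
query = {
    "query": {
        "bool": {
            "should": [  # one clause per indicator, scored together
                {"terms": {field: items}}
                for field, items in user_history.items()
            ]
        }
    },
    "size": 10,
}
print(json.dumps(query, indent=2))
```

One search engine round trip scores all indicators at once, which is where the speed comes from.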
Fast!
Solution to the "cold-start" problem—items with too short a lifespan or new users with no history
How is this solved?
v0.3.0—most current release
Randomize some of the returned recommendations; if they are acted upon, they become part of the new training data and are more likely to be recommended in the future.
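That randomization step can be sketched as an epsilon-greedy re-ranker (the function name, epsilon value, and data below are illustrative, not the UR's implementation):

```python
import random

# With probability epsilon, replace a recommendation slot with a random
# item not already in the list, so new/cold items can gather interactions.

def explore(recs, catalog, epsilon=0.1, rng=random):
    out = list(recs)
    for i in range(len(out)):
        if rng.random() < epsilon:
            out[i] = rng.choice([x for x in catalog if x not in out])
    return out

ranked = ["t3", "t5", "t1"]                  # scores from the recommender
catalog = ["t1", "t2", "t3", "t4", "t5"]     # all recommendable items
print(explore(ranked, catalog, epsilon=0.2))  # mostly ranked, occasionally swapped
```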
Visibility control: