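Flink has been popular lately, so I decided to try building a recommendation feature with it. To be honest, flink-ml is rather thin: it ships few algorithms and only supports Scala, and Flink 1.9 has removed flink-ml altogether, which suggests a major rework is on the way. Even so, Flink should prove very useful for real-time recommendation later on. Fortunately, item-based collaborative filtering is a relatively simple algorithm and not hard to implement. Let's start with the current overall recommendation architecture.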
First, a word on the similarity measure used here. Given two vectors:
X = (x1, x2, x3, …, xn), Y = (y1, y2, y3, …, yn)
the Euclidean distance is:
d(X, Y) = sqrt((x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2)
Clearly, the larger the value, the less similar the two are; if they are identical, the distance is 0.
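As a quick illustration of the measure, here is a minimal plain-Java sketch (no Flink involved). The class name `EuclideanDistance` and the sample vectors are made up for this example and are not part of the original job.

```java
// Minimal sketch: Euclidean distance between two equally long score vectors.
// The class name and the sample values are illustrative assumptions.
class EuclideanDistance {
    static double distance(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d; // accumulate (xi - yi)^2
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Identical vectors give distance 0.
        System.out.println(distance(new double[]{1, 2, 3}, new double[]{1, 2, 3})); // 0.0
        // A 3-4-5 triangle gives distance 5.
        System.out.println(distance(new double[]{0, 0}, new double[]{3, 4}));       // 5.0
    }
}
```

A smaller result means the two items were scored more alike, which is why the neighbour lists later in the job are sorted in ascending order.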
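Step one is to prepare the data. The data format is as follows:
actionObject is the ID of the listing (the house), and actionType is the user's action, such as an impression without a click, a click, a favorite, and so on.
The code below reads the data from HDFS, drops the view events, and converts the remaining actions into scores: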
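    // Imports this method needs. JSON/JSONObject are assumed to be Alibaba fastjson;
    // RecommendUtils is the author's own helper and is not shown in this post.
    import com.alibaba.fastjson.JSON;
    import com.alibaba.fastjson.JSONObject;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    public static DataSet<Tuple2<Tuple2<String, String>, Float>> getData(ExecutionEnvironment env, String path) {
        return env.readTextFile(path)
            // flatMap emits nothing for filtered-out actions, instead of returning null
            // from map() and removing the nulls with a separate filter().
            .flatMap(new FlatMapFunction<String, Tuple2<Tuple2<String, String>, Float>>() {
                @Override
                public void flatMap(String value, Collector<Tuple2<Tuple2<String, String>, Float>> out) throws Exception {
                    JSONObject jj = JSON.parseObject(value);
                    String actionType = jj.getString("actionType");
                    if (RecommendUtils.getValidAction(actionType)) {
                        // Output record: ((userId, actionObject), score)
                        out.collect(new Tuple2<>(
                            new Tuple2<>(jj.getString("userId"), jj.getString("actionObject")),
                            RecommendUtils.getScore(actionType)));
                    }
                }
            });
    }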
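The post does not show `RecommendUtils`, so the following is only a guess at what such a helper could look like. The action names (`view`, `click`, `collect`) and the weights are hypothetical placeholders, not the values used in the real job.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the RecommendUtils helper, which the post does not show.
// Action names and weights below are assumptions made purely for illustration.
class ActionScores {
    private static final Map<String, Float> SCORES = new HashMap<>();
    static {
        SCORES.put("click", 1.0f);   // assumed weight for a click
        SCORES.put("collect", 3.0f); // assumed weight for a favorite
        // "view" (impression without a click) is deliberately absent, so it is dropped.
    }

    // An action is valid only if it has a score.
    static boolean isValidAction(String actionType) {
        return SCORES.containsKey(actionType);
    }

    static float score(String actionType) {
        Float s = SCORES.get(actionType);
        if (s == null) {
            throw new IllegalArgumentException("no score defined for action: " + actionType);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(isValidAction("view"));  // false
        System.out.println(score("click"));         // 1.0
    }
}
```

Keeping the weights in one lookup table means the validity check and the score can never disagree about which actions count.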
After this simple cleaning, the data takes the following form:
Aggregate by the first two columns (userId and actionObject):
    .groupBy(0) // field 0 is the nested (userId, actionObject) tuple
    .reduce(new ReduceFunction<Tuple2<Tuple2<String, String>, Float>>() {
        @Override
        public Tuple2<Tuple2<String, String>, Float> reduce(
                Tuple2<Tuple2<String, String>, Float> value1,
                Tuple2<Tuple2<String, String>, Float> value2) throws Exception {
            // Both records share the same key, so keep it and add up the scores.
            return new Tuple2<>(new Tuple2<>(value1.f0.f0, value1.f0.f1), value1.f1 + value2.f1);
        }
    })
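The structure becomes:
At this point, the similarity (that is, the distance) between BJCY56167779_03 and BJCY56167779_04 should in theory be sqrt((4-3)^2 + (5-2)^2) = sqrt(10) ≈ 3.16. Let's keep going.
Drop the first column (the user ID); the format is as follows: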
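Because
(x1-y1)^2 + (x2-y2)^2 = x1^2 + y1^2 - 2x1y1 + x2^2 + y2^2 - 2x2y2 = x1^2 + y1^2 + x2^2 + y2^2 - 2(x1y1 + x2y2),
we first compute x1^2 + x2^2, the sum of squared scores for each item, and register the result as the item table: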
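To double-check the expansion, the sketch below computes the squared distance both ways using the two items from the example above (scores 4 and 5 for BJCY56167779_03, 3 and 2 for BJCY56167779_04). The class and method names are made up for this illustration.

```java
// Sketch: the squared distance computed directly and via the expanded form
// (sum of squares of x) + (sum of squares of y) - 2 * (dot product).
// Class and method names are illustrative assumptions.
class ExpansionCheck {
    static double direct(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - y[i]) * (x[i] - y[i]);
        }
        return sum;
    }

    static double expanded(double[] x, double[] y) {
        double sumSqX = 0.0, sumSqY = 0.0, dot = 0.0;
        for (int i = 0; i < x.length; i++) {
            sumSqX += x[i] * x[i]; // what the "item" table stores for X
            sumSqY += y[i] * y[i]; // what the "item" table stores for Y
            dot += x[i] * y[i];    // the cross term computed in the next step
        }
        return sumSqX + sumSqY - 2 * dot;
    }

    public static void main(String[] args) {
        double[] x = {4, 5}; // scores for BJCY56167779_03
        double[] y = {3, 2}; // scores for BJCY56167779_04
        System.out.println(direct(x, y));   // 10.0
        System.out.println(expanded(x, y)); // 41 + 13 - 2 * 22 = 10.0
    }
}
```

Splitting the formula this way is what lets the job compute the per-item part and the per-pair part in separate passes and combine them with a join.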
    // Square every score: (itemId, score) -> (itemId, score^2)
    .map(new MapFunction<Tuple2<String, Float>, Tuple2<String, Float>>() {
        @Override
        public Tuple2<String, Float> map(Tuple2<String, Float> value) throws Exception {
            return new Tuple2<>(value.f0, value.f1 * value.f1);
        }
    })
    // Sum the squares per item
    .groupBy(0)
    .reduce(new ReduceFunction<Tuple2<String, Float>>() {
        @Override
        public Tuple2<String, Float> reduce(Tuple2<String, Float> value1, Tuple2<String, Float> value2) throws Exception {
            return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
        }
    })
    // Convert to a POJO so the result can be registered as a table
    .map(new MapFunction<Tuple2<String, Float>, ItemDTO>() {
        @Override
        public ItemDTO map(Tuple2<String, Float> value) throws Exception {
            ItemDTO nd = new ItemDTO();
            nd.setItemId(value.f0);
            nd.setScore(value.f1);
            return nd;
        }
    });
    tableEnv.registerDataSet("item", itemdto); // register the table
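With that, the first half of the formula has been computed. Next we need the value of (x1y1 + x2y2).
Transform the original table once more, into the following format:
The code is as follows: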
    // value.f1 is the list of (itemId, score) entries grouped under the key in value.f0.
    // For every ordered pair of entries, emit ((itemA, itemB), scoreA * scoreB).
    .map(new MapFunction<Tuple2<String, List<Tuple2<String, Float>>>, List<Tuple2<Tuple2<String, String>, Float>>>() {
        @Override
        public List<Tuple2<Tuple2<String, String>, Float>> map(Tuple2<String, List<Tuple2<String, Float>>> value) throws Exception {
            List<Tuple2<String, Float>> ll = value.f1;
            List<Tuple2<Tuple2<String, String>, Float>> list = new ArrayList<>();
            for (int i = 0; i < ll.size(); i++) {
                for (int j = 0; j < ll.size(); j++) {
                    // Pairs with i == j are kept here; the SQL query later filters them out.
                    list.add(new Tuple2<>(
                        new Tuple2<>(ll.get(i).f0, ll.get(j).f0),
                        ll.get(i).f1 * ll.get(j).f1));
                }
            }
            return list;
        }
    })
    tableEnv.registerDataSet("item_relation", itemRelation); // register the table
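The cross term can be sketched in plain Java as follows. This is an illustration, not the author's code: the user IDs `u1` and `u2`, the `"a|b"` key format, and the step that sums the products across users are all assumptions, since the post only shows the per-key pair products.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: for every user, multiply the scores of each ordered item pair, then sum
// the products per pair across users. User IDs and the "a|b" key are assumptions.
class PairProducts {
    static Map<String, Double> crossTerms(Map<String, Map<String, Double>> userScores) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map<String, Double> items : userScores.values()) {
            for (Map.Entry<String, Double> a : items.entrySet()) {
                for (Map.Entry<String, Double> b : items.entrySet()) {
                    if (a.getKey().equals(b.getKey())) {
                        continue; // skip self pairs (the article's SQL filters them later)
                    }
                    out.merge(a.getKey() + "|" + b.getKey(), a.getValue() * b.getValue(), Double::sum);
                }
            }
        }
        return out;
    }

    // Two hypothetical users scoring the two items from the running example.
    static Map<String, Map<String, Double>> sample() {
        Map<String, Double> u1 = new LinkedHashMap<>();
        u1.put("BJCY56167779_03", 4.0);
        u1.put("BJCY56167779_04", 3.0);
        Map<String, Double> u2 = new LinkedHashMap<>();
        u2.put("BJCY56167779_03", 5.0);
        u2.put("BJCY56167779_04", 2.0);
        Map<String, Map<String, Double>> users = new LinkedHashMap<>();
        users.put("u1", u1);
        users.put("u2", u2);
        return users;
    }

    public static void main(String[] args) {
        // 4*3 + 5*2 = 22 for both orderings of the pair
        System.out.println(crossTerms(sample()));
    }
}
```

Note that the nested loop is quadratic in the number of items per user, so very active users dominate the cost; this is one place where the filtering mentioned at the end of the post would matter.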
Now we tie the whole formula together and finish the computation:
    // tb holds the sum of squares of the first item, tc that of the second item,
    // and ta holds the cross term for the pair.
    Table similarity = tableEnv.sqlQuery(
        "select ta.firstItem, ta.secondItem, "
        + "(sqrt(tb.score + tc.score - 2 * ta.relationScore)) as similarScore "
        + "from item tb "
        + "inner join item_relation ta on tb.itemId = ta.firstItem and ta.firstItem <> ta.secondItem "
        + "inner join item tc on tc.itemId = ta.secondItem ");
    DataSet<ItemSimilarDTO> ds = tableEnv.toDataSet(similarity, ItemSimilarDTO.class);
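For readers less familiar with the Table API, the join above can be mimicked in memory roughly like this. It is a sketch under assumptions: the `Relation` class, the `"first->second"` result key, and the clamp at zero (a guard against a slightly negative value from floating-point rounding, which the SQL does not have) are all additions for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two inner joins: look up the sum of squares for both items of
// each relation row and combine them with the cross term. Names are illustrative.
class SimilarityJoin {
    static final class Relation {
        final String firstItem;
        final String secondItem;
        final double relationScore;
        Relation(String firstItem, String secondItem, double relationScore) {
            this.firstItem = firstItem;
            this.secondItem = secondItem;
            this.relationScore = relationScore;
        }
    }

    static Map<String, Double> join(Map<String, Double> item, List<Relation> relations) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Relation r : relations) {
            if (r.firstItem.equals(r.secondItem)) {
                continue; // mirrors "ta.firstItem <> ta.secondItem"
            }
            Double first = item.get(r.firstItem);
            Double second = item.get(r.secondItem);
            if (first == null || second == null) {
                continue; // inner join: rows without a match on both sides are dropped
            }
            double squared = first + second - 2 * r.relationScore;
            // Clamp at 0 (an added guard) so rounding can never produce sqrt of a negative.
            result.put(r.firstItem + "->" + r.secondItem, Math.sqrt(Math.max(0.0, squared)));
        }
        return result;
    }

    // Values from the running example: sums of squares 41 and 13, cross term 22.
    static Map<String, Double> sample() {
        Map<String, Double> item = new LinkedHashMap<>();
        item.put("BJCY56167779_03", 41.0);
        item.put("BJCY56167779_04", 13.0);
        List<Relation> relations = new ArrayList<>();
        relations.add(new Relation("BJCY56167779_03", "BJCY56167779_04", 22.0));
        relations.add(new Relation("BJCY56167779_04", "BJCY56167779_03", 22.0));
        relations.add(new Relation("BJCY56167779_03", "BJCY56167779_03", 41.0));
        return join(item, relations);
    }

    public static void main(String[] args) {
        System.out.println(sample()); // both directions give sqrt(10), about 3.162
    }
}
```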
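The structure now becomes:
It feels like the finish line is close, but this structure still isn't what we want. We'd like something clearer, in the following format:
The code is as follows: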
    DataSet<RedisDataModel> redisResult = ds
        // (firstItem, secondItem, similarScore) -> (firstItem, (secondItem, similarScore))
        .map(new MapFunction<ItemSimilarDTO, Tuple2<String, Tuple2<String, Float>>>() {
            @Override
            public Tuple2<String, Tuple2<String, Float>> map(ItemSimilarDTO value) throws Exception {
                return new Tuple2<String, Tuple2<String, Float>>(
                    value.getFirstItem(),
                    new Tuple2<>(value.getSecondItem(), value.getSimilarScore().floatValue()));
            }
        })
        // Collect all neighbours of each item into one list
        .groupBy(0)
        .reduceGroup(new GroupReduceFunction<Tuple2<String, Tuple2<String, Float>>, Tuple2<String, List<RoomModel>>>() {
            @Override
            public void reduce(Iterable<Tuple2<String, Tuple2<String, Float>>> values,
                               Collector<Tuple2<String, List<RoomModel>>> out) throws Exception {
                List<RoomModel> list = new ArrayList<>();
                String key = null;
                for (Tuple2<String, Tuple2<String, Float>> t : values) {
                    key = t.f0;
                    RoomModel rm = new RoomModel();
                    rm.setRoomCode(t.f1.f0);
                    rm.setScore(t.f1.f1);
                    list.add(rm);
                }
                // Sort ascending: the smallest distance (most similar item) comes first
                Collections.sort(list, new Comparator<RoomModel>() {
                    @Override
                    public int compare(RoomModel o1, RoomModel o2) {
                        return o1.getScore().compareTo(o2.getScore());
                    }
                });
                out.collect(new Tuple2<>(key, list));
            }
        })
        // Wrap each list as a Redis record: key = prefix + itemId, value = JSON list
        .map(new MapFunction<Tuple2<String, List<RoomModel>>, RedisDataModel>() {
            @Override
            public RedisDataModel map(Tuple2<String, List<RoomModel>> value) throws Exception {
                RedisDataModel m = new RedisDataModel();
                m.setExpire(-1);
                m.setKey(JobConstants.REDIS_FLINK_ITEMCF_KEY_PREFIX + value.f0);
                m.setGlobal(true);
                m.setValue(JSON.toJSONString(value.f1));
                return m;
            }
        });
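Stripped of the Flink types, the group-and-sort step boils down to something like the sketch below. The `Row` class, the item IDs `A`, `B`, `C`, and the distances are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: group (firstItem, secondItem, distance) rows by firstItem and sort each
// group by ascending distance. All names and sample values are assumptions.
class NeighborLists {
    static final class Row {
        final String firstItem;
        final String secondItem;
        final double distance;
        Row(String firstItem, String secondItem, double distance) {
            this.firstItem = firstItem;
            this.secondItem = secondItem;
            this.distance = distance;
        }
    }

    static Map<String, List<Row>> groupAndSort(List<Row> rows) {
        Map<String, List<Row>> byItem = new LinkedHashMap<>();
        for (Row r : rows) {
            byItem.computeIfAbsent(r.firstItem, k -> new ArrayList<>()).add(r);
        }
        for (List<Row> neighbours : byItem.values()) {
            // Ascending order: the closest (most similar) item ends up first.
            neighbours.sort(Comparator.comparingDouble((Row r) -> r.distance));
        }
        return byItem;
    }

    static Map<String, List<Row>> sample() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("A", "B", 3.0));
        rows.add(new Row("A", "C", 1.5));
        rows.add(new Row("B", "A", 3.0));
        return groupAndSort(rows);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<Row>> e : sample().entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey() + ":");
            for (Row r : e.getValue()) {
                sb.append(' ').append(r.secondItem).append('=').append(r.distance);
            }
            System.out.println(sb); // "A: C=1.5 B=3.0", then "B: A=3.0"
        }
    }
}
```

Storing the list already sorted means a reader of the Redis value can take the first N entries as the top-N recommendations without sorting again.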
Finally, store the data in Redis so it is easy to query:
    // Connection settings are read from the application's configuration
    RedisOutputFormat redisOutput = RedisOutputFormat.buildRedisOutputFormat()
        .setHostMaster(AppConfig.getProperty(JobConstants.REDIS_HOST_MASTER))
        .setHostSentinel(AppConfig.getProperty(JobConstants.REDIS_HOST_SENTINELS))
        .setMaxIdle(Integer.parseInt(AppConfig.getProperty(JobConstants.REDIS_MAXIDLE)))
        .setMaxTotal(Integer.parseInt(AppConfig.getProperty(JobConstants.REDIS_MAXTOTAL)))
        .setMaxWaitMillis(Integer.parseInt(AppConfig.getProperty(JobConstants.REDIS_MAXWAITMILLIS)))
        .setTestOnBorrow(Boolean.parseBoolean(AppConfig.getProperty(JobConstants.REDIS_TESTONBORROW)))
        .finish();
    redisResult.output(redisOutput);
    env.execute("itemcf");
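And that's it. It turned out to be less difficult than expected. Of course, this is only a demo; a real deployment would also need data filtering, optimization of the multi-table joins, and so on.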