In essence, every moment in time and every spatial location is discretized into a spatiotemporal grid cell, and the expected revenue of that cell from its time slot until the end of the day is estimated from historical dispatch records (including drivers who took part in dispatching but received no order).
Key question: how is this expected revenue computed?
Dynamic programming idea: suppose the time horizon is the interval [0, T). First compute the expected revenue of every grid cell at time slot T-1 (at this point there is no future revenue, only the immediate revenue), which amounts to averaging the immediate revenue; then compute the expected revenue of every grid cell at time slot T-2; and so on, sweeping backward.
In this way, the expected revenue of every spatiotemporal grid cell up to the end of the day can be computed.
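To make the backward sweep concrete, here is a minimal Python sketch. It assumes historical dispatch records have already been flattened into transitions of the form (time slot, grid, reward, next time slot, next grid); the function name, the tuple layout, and the use of a plain dict as the value table are illustrative assumptions, not the paper's actual data schema.

```python
from collections import defaultdict

def estimate_values(transitions, num_slots):
    """Backward dynamic programming over spatiotemporal grid cells (a sketch).

    transitions: iterable of (t, grid, reward, t_next, grid_next) tuples built
                 from historical dispatch records; drivers that took part in
                 dispatching but received no order contribute reward 0.
    num_slots:   number of discrete time slots T in one day.
    Returns {(t, grid): expected revenue from slot t until the end of the day}.
    """
    # Bucket historical transitions by their starting spatiotemporal cell.
    samples = defaultdict(list)
    for t, grid, reward, t_next, grid_next in transitions:
        samples[(t, grid)].append((reward, t_next, grid_next))

    V = defaultdict(float)  # cells never visited default to 0 expected revenue
    # Sweep backward: slot T-1 has no future revenue, then T-2, and so on.
    for t in range(num_slots - 1, -1, -1):
        for (t0, grid), recs in samples.items():
            if t0 != t:
                continue
            # Average of immediate reward plus the (already computed) value
            # of the cell the transition lands in at a later time slot.
            V[(t, grid)] = sum(r + V[(tn, g)] for r, tn, g in recs) / len(recs)
    return V
```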
Key point: why is the value function obtained this way reasonable?
The resultant value function captures spatiotemporal patterns of both the demand side and the supply side. To make it clearer, as a special case, when using no discount and an episode-length of a day, the state-value function in fact corresponds to the expected revenue that this driver will earn on average from the current time until the end of the day.
The matching quality between each order and each driver is scored with a formula based on the learned value function.
The KM (Kuhn-Munkres) algorithm is then used to solve for the matching result.
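As a rough, undiscounted reading of that score, assigning order j to driver i is worth roughly the order's fare plus the value of the destination cell at the arrival time, minus the value of the driver's current cell; the paper's exact formula (with discounting) differs, so treat the scoring below as an assumption. The sketch uses scipy's linear_sum_assignment, which solves the same bipartite assignment problem the KM algorithm solves; the data structures are likewise illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dispatch(drivers, orders, V):
    """Match idle drivers to open orders by maximizing the total score.

    drivers: list of (t, grid) cells where each idle driver currently is.
    orders:  list of (fare, t_dest, grid_dest) tuples for open orders.
    V:       {(t, grid): expected revenue} table from the offline step.
    Returns a list of (driver_index, order_index) pairs.
    """
    # Score matrix: simplified, undiscounted advantage of each assignment.
    score = np.zeros((len(drivers), len(orders)))
    for i, (t, grid) in enumerate(drivers):
        for j, (fare, t_dest, grid_dest) in enumerate(orders):
            score[i, j] = fare + V.get((t_dest, grid_dest), 0.0) - V.get((t, grid), 0.0)

    # linear_sum_assignment minimizes cost, so negate the score to maximize it.
    rows, cols = linear_sum_assignment(-score)
    return list(zip(rows.tolist(), cols.tolist()))
```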
we adopted a customized A/B testing design that splits traffic according to large time slices (three or six hours). For example, a three-hour split sets the first three hours in Day 1 to run variant A and the next three hours for variant B. The order is then reversed for Day 2. Such experiments will last for two weeks to eliminate the daily difference. We select large time slices to observe long-term impacts generated by order dispatch approaches.
the performance improvement brought by the MDP method is consistent in all cities, with gains in global GMV and completion rate ranging from 0.5% to 5%. Consistent to the previous discoveries, the MDP method achieved its best performance gain in cities with high order-driver ratios. Meanwhile, the averaged dispatch time was nearly identical to the baseline method, indicating little sacrifice in user experience.
The spatiotemporal grid cell is defined as the state; dispatching or not dispatching as the action; and the expected revenue of a state as the state-value function.
The goal of reinforcement learning is to find the optimal policy, which is equivalent to finding the optimal value function. What is unusual about the dispatch setting is that each driver is modeled as an agent, yet the decisions are made by the platform, so a driver effectively has no policy of his own; or rather, through the dispatch mechanism, the drivers' policies are unified into maximizing the platform's expected revenue. Within the reinforcement learning framework, offline learning and online planning can therefore be regarded as the two steps of policy iteration: learning updates the value function, and planning performs the policy update. On closer inspection, however, this analogy still feels somewhat forced.
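Read this way, the system alternates between an offline policy-evaluation step and an online policy-improvement step. A schematic of that loop, reusing the sketches above and with run_one_day, collect_transitions, num_days and T as hypothetical placeholders, might look like:

```python
# Schematic policy-iteration view of the dispatch system (illustrative only).
# run_one_day and collect_transitions stand in for the production dispatch
# loop and its logging; estimate_values and dispatch are the sketches above.
V = {}  # start with an empty value table (unseen cells default to 0)
for day in range(num_days):
    # Policy improvement (online planning): dispatch greedily w.r.t. V.
    logs = run_one_day(policy=lambda drivers, orders: dispatch(drivers, orders, V))
    # Policy evaluation (offline learning): re-estimate V from the day's records.
    V = estimate_values(collect_transitions(logs), num_slots=T)
```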