In essence, every moment in time and every spatial location is discretized into a spatiotemporal grid cell, and the expected revenue of that cell from its time slot until the end of the day is estimated from historical dispatch records (including drivers who took part in dispatching but received no order).
Key question: how is this expected revenue computed?
Dynamic programming idea: suppose the time horizon is the interval [0, T). First compute the expected revenue of every grid cell at time slot T-1 (at this point there is no future revenue, only the immediate revenue), which amounts to averaging the immediate revenue; then compute the expected revenue of every grid cell at time slot T-2; and so on, sweeping backward.
In this way, the expected revenue of every spatiotemporal grid cell up to the end of the day can be computed.
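To make the backward sweep concrete, here is a minimal Python sketch. It assumes historical dispatch records have already been flattened into transitions of the form (time slot, grid, reward, next time slot, next grid); the function name, the tuple layout, and the use of a plain dict as the value table are illustrative assumptions, not the paper's actual data schema.

```python
from collections import defaultdict

def estimate_values(transitions, num_slots):
    """Backward dynamic programming over spatiotemporal grid cells (a sketch).

    transitions: iterable of (t, grid, reward, t_next, grid_next) tuples built
                 from historical dispatch records; drivers that took part in
                 dispatching but received no order contribute reward 0.
    num_slots:   number of discrete time slots T in one day.
    Returns {(t, grid): expected revenue from slot t until the end of the day}.
    """
    # Bucket historical transitions by their starting spatiotemporal cell.
    samples = defaultdict(list)
    for t, grid, reward, t_next, grid_next in transitions:
        samples[(t, grid)].append((reward, t_next, grid_next))

    V = defaultdict(float)  # cells never visited default to 0 expected revenue
    # Sweep backward: slot T-1 has no future revenue, then T-2, and so on.
    for t in range(num_slots - 1, -1, -1):
        for (t0, grid), recs in samples.items():
            if t0 != t:
                continue
            # Average of immediate reward plus the (already computed) value
            # of the cell the transition lands in at a later time slot.
            V[(t, grid)] = sum(r + V[(tn, g)] for r, tn, g in recs) / len(recs)
    return V
```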
Key point: why is the value function obtained this way reasonable?
The resultant value function captures spatiotemporal patterns of both the demand side and the supply side. To make it clearer, as a special case, when using no discount and an episode-length of a day, the state-value function in fact corresponds to the expected revenue that this driver will earn on average from the current time until the end of the day.
The matching quality between each order and each driver is scored with a formula based on the learned value function.
The KM (Kuhn-Munkres) algorithm is then used to solve for the matching result.
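As a rough, undiscounted reading of that score, assigning order j to driver i is worth roughly the order's fare plus the value of the destination cell at the arrival time, minus the value of the driver's current cell; the paper's exact formula (with discounting) differs, so treat the scoring below as an assumption. The sketch uses scipy's linear_sum_assignment, which solves the same bipartite assignment problem the KM algorithm solves; the data structures are likewise illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dispatch(drivers, orders, V):
    """Match idle drivers to open orders by maximizing the total score.

    drivers: list of (t, grid) cells where each idle driver currently is.
    orders:  list of (fare, t_dest, grid_dest) tuples for open orders.
    V:       {(t, grid): expected revenue} table from the offline step.
    Returns a list of (driver_index, order_index) pairs.
    """
    # Score matrix: simplified, undiscounted advantage of each assignment.
    score = np.zeros((len(drivers), len(orders)))
    for i, (t, grid) in enumerate(drivers):
        for j, (fare, t_dest, grid_dest) in enumerate(orders):
            score[i, j] = fare + V.get((t_dest, grid_dest), 0.0) - V.get((t, grid), 0.0)

    # linear_sum_assignment minimizes cost, so negate the score to maximize it.
    rows, cols = linear_sum_assignment(-score)
    return list(zip(rows.tolist(), cols.tolist()))
```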
we adopted a customized A/B testing design that splits traffic according to large time slices (three or six hours). For example, a three-hour split sets the first three hours in Day 1 to run variant A and the next three hours for variant B. The order is then reversed for Day 2. Such experiments will last for two weeks to eliminate the daily difference. We select large time slices to observe long-term impacts generated by order dispatch approaches.
the performance improvement brought by the MDP method is consistent in all cities, with gains in global GMV and completion rate ranging from 0.5% to 5%. Consistent to the previous discoveries, the MDP method achieved its best performance gain in cities with high order-driver ratios. Meanwhile, the averaged dispatch time was nearly identical to the baseline method, indicating little sacrifice in user experience.
The spatiotemporal grid cell is defined as the state; dispatching or not dispatching as the action; and the expected revenue of a state as the state-value function.
The goal of reinforcement learning is to find the optimal policy, which is equivalent to finding the optimal value function. What is unusual about the dispatch setting is that each driver is modeled as an agent, yet the decisions are made by the platform, so a driver effectively has no policy of his own; or rather, through the dispatch mechanism, the drivers' policies are unified into maximizing the platform's expected revenue. Within the reinforcement learning framework, offline learning and online planning can therefore be regarded as the two steps of policy iteration: learning updates the value function, and planning performs the policy update. On closer inspection, however, this analogy still feels somewhat forced.
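Read this way, the system alternates between an offline policy-evaluation step and an online policy-improvement step. A schematic of that loop, reusing the sketches above and with run_one_day, collect_transitions, num_days and T as hypothetical placeholders, might look like:

```python
# Schematic policy-iteration view of the dispatch system (illustrative only).
# run_one_day and collect_transitions stand in for the production dispatch
# loop and its logging; estimate_values and dispatch are the sketches above.
V = {}  # start with an empty value table (unseen cells default to 0)
for day in range(num_days):
    # Policy improvement (online planning): dispatch greedily w.r.t. V.
    logs = run_one_day(policy=lambda drivers, orders: dispatch(drivers, orders, V))
    # Policy evaluation (offline learning): re-estimate V from the day's records.
    V = estimate_values(collect_transitions(logs), num_slots=T)
```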