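The company needs to forecast daily site traffic for an upcoming period from the daily traffic data of a past period, so that the operations and ops teams can be alerted before a traffic peak arrives and prepare in advance.
This is a typical time-series forecasting problem. There are many approaches to it: rule-based methods, machine learning, classical statistical modeling, and so on.
This article focuses on the machine-learning approach.
Because my day-to-day work mainly uses the Spark stack for data processing, I use Spark ML here as well. There are of course many other machine-learning packages and libraries, and sklearn would do the job just as well. In fact, for the data-analysis stage I used pandas, numpy, and sklearn, which was more efficient.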
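The raw data is simple, with just two columns: PV (page views) and date.
Plot it as a line chart and take a look:
The chart shows a large overall difference before and after 2019-07, which was in fact caused by a business adjustment. Since the goal is to forecast PV for the next few days, the forecast has to be based on the current business; data that is too old is just noise, so it is discarded.
Take the most recent six months of data and look again:
This data is considerably more stable.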
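Viewed as a whole, the data is periodic with a period of one week, and PV is somewhat higher on weekdays than on weekends.
Viewed locally, holidays are peaks (though not every holiday is a peak; this again depends on the specific business, so a holiday table has to be compiled for your own business).
In addition, there are peaks on non-holidays, probably caused by hot events (there were many in February 2020, during the epidemic). Traffic peaks caused by hot events cannot be predicted, so we try to reduce the influence of such samples; the data processing below therefore includes a "hot-event removal" step.
Linear regression is chosen as the machine-learning model here, not because it is optimal, but because it is the natural first choice for trend prediction and makes a good baseline; other models can be tried later to improve on it.
From the analysis above, weekends, weekdays, and holidays have a strong effect on PV, so these can be used as features: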
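    day_of_week // day of the week, 1 to 7 (Spark's dayofweek convention: 1 = Sunday, 7 = Saturday)
    is_weekend  // whether the day is a weekend, 0 or 1; Saturday and Sunday are weekend days
    is_holiday  // whether the day is a holiday, 0 or 1; the holiday table is maintained per business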
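To make the three calendar features concrete, here is a minimal plain-Java sketch (not the article's Spark code). The class and method names are hypothetical; the numbering follows Spark SQL's `dayofweek` convention that the feature code relies on, and the holiday set in `main` simply reuses the two holidays named at the end of the article.

```java
import java.time.LocalDate;
import java.util.Set;

// Plain-Java sketch of the three calendar features; names are illustrative only.
final class TimeFeatures {

    // Spark SQL dayofweek() convention: 1 = Sunday, 2 = Monday, ..., 7 = Saturday.
    static int dayOfWeek(LocalDate day) {
        int iso = day.getDayOfWeek().getValue(); // ISO: 1 = Monday ... 7 = Sunday
        return iso % 7 + 1;
    }

    // 1 for Saturday or Sunday, otherwise 0.
    static int isWeekend(LocalDate day) {
        int d = dayOfWeek(day);
        return (d == 1 || d == 7) ? 1 : 0;
    }

    // 1 if the day is in the business-maintained holiday set, otherwise 0.
    static int isHoliday(LocalDate day, Set<LocalDate> holidays) {
        return holidays.contains(day) ? 1 : 0;
    }

    public static void main(String[] args) {
        // Illustrative holiday set: the two days the article calls holidays.
        Set<LocalDate> holidays = Set.of(LocalDate.of(2020, 5, 1), LocalDate.of(2020, 5, 4));
        LocalDate day = LocalDate.of(2019, 11, 16); // a Saturday
        System.out.println(dayOfWeek(day) + " " + isWeekend(day) + " " + isHoliday(day, holidays));
    }
}
```

The `iso % 7 + 1` mapping is the only non-obvious step: `java.time` numbers Monday as 1, while the SQL in this article treats 1 and 7 as the weekend.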
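Since the data is periodic:
Monday's PV is related to the average over all Mondays
Tuesday's PV is related to the average over all Tuesdays
...
So the mean for each day_of_week can be used as a feature.
Likewise, weekends and holidays get similar mean features.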
    day_of_week_avg // group by day_of_week, take the mean
    is_weekend_avg  // group by is_weekend, take the mean
    is_holiday_avg  // group by is_holiday, take the mean
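As a rough illustration of what "group by, take the mean" produces, the sketch below computes a per-group mean in plain Java. It is not the Spark implementation; `GroupMean` and `meanByKey` are hypothetical names and the numbers in `main` are toy values, not the article's data.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a group-by mean; names and data are illustrative only.
final class GroupMean {

    // Returns key -> mean of all values that share that key.
    static Map<Integer, Double> meanByKey(int[] keys, double[] values) {
        Map<Integer, double[]> acc = new TreeMap<>(); // key -> {sum, count}
        for (int i = 0; i < keys.length; i++) {
            double[] a = acc.computeIfAbsent(keys[i], k -> new double[2]);
            a[0] += values[i];
            a[1] += 1;
        }
        Map<Integer, Double> means = new TreeMap<>();
        acc.forEach((k, a) -> means.put(k, a[0] / a[1]));
        return means;
    }

    public static void main(String[] args) {
        int[] isWeekend = {0, 0, 1, 1, 0};   // toy group keys
        double[] pv = {10, 20, 4, 6, 30};    // toy PV values
        System.out.println(meanByKey(isWeekend, pv));
    }
}
```

Each row then receives the mean of its own group as a feature, which is what the left joins in the feature-processing code do.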
Analogous to the mean features, there are median features:
    day_of_week_med // group by day_of_week, take the median
    is_weekend_med  // group by is_weekend, take the median
    is_holiday_med  // group by is_holiday, take the median
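The median is less sensitive than the mean to a few extreme days, which matters later when spikes are clipped against it. The sketch below is a simple exact median in plain Java, assuming the lower middle element is taken for an even count; it is not the approximate algorithm behind Spark's `percentile_approx`, and the class name is hypothetical. The sample values are the first five PV values shown in the article.

```java
import java.util.Arrays;

// Plain-Java sketch of an exact median; not Spark's percentile_approx.
final class Median {

    // Median of a non-empty array; for an even count, the lower of the two middle elements.
    static long median(long[] values) {
        long[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        return sorted[(sorted.length - 1) / 2];
    }

    public static void main(String[] args) {
        long[] pv = {15156440L, 12633297L, 11818845L, 15130911L, 14332734L};
        System.out.println(median(pv));
    }
}
```

Applied once per group (per day_of_week, per is_weekend, per is_holiday), this yields the three `*_med` features.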
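The mean and median features reflect the overall picture; in practice, a given day's PV most likely depends on the PV of the last N days.
What should N be? That takes some experimentation. Here N ranges from 1 to 14, giving a set of features: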
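    lag_1  // shifted by 1 day, i.e. yesterday's PV
    lag_2  // shifted by 2 days, i.e. the PV two days ago
    ...
    lag_7  // shifted by 7 days, the PV on the same weekday last week
    ...
    lag_14 // shifted by 14 days, the PV on the same weekday two weeks ago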
What the data looks like after shifting:
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|      pv|       day|   lag_1|   lag_2|   lag_3|   lag_4|   lag_5|   lag_6|   lag_7|   lag_8|   lag_9|  lag_10|  lag_11|  lag_12|  lag_13|  lag_14|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|15156440|2019-11-01|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|12633297|2019-11-02|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|11818845|2019-11-03|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|15130911|2019-11-04|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|14332734|2019-11-05|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|15972959|2019-11-06|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|16366371|2019-11-07|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|
|16969708|2019-11-08|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|
|12983425|2019-11-09|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|
|11759009|2019-11-10|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|
|13700888|2019-11-11|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|
|15490684|2019-11-12|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|
|15275479|2019-11-13|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|
|14978239|2019-11-14|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|
|17422016|2019-11-20|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|
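The shift behind this table can be sketched in a few lines of plain Java. This is only an illustration of the idea, assuming the series is already sorted by day; `LagFeatures` and `lag` are hypothetical names, and the input is the first four PV values from the table.

```java
import java.util.Arrays;

// Plain-Java sketch of a lag (shift) feature over a day-ordered series.
final class LagFeatures {

    // out[i] = pv[i - k]; positions with no day k steps back stay null.
    // Assumes k >= 1 and pv is sorted by day ascending.
    static Long[] lag(long[] pv, int k) {
        Long[] out = new Long[pv.length];
        for (int i = k; i < pv.length; i++) {
            out[i] = pv[i - k];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] pv = {15156440L, 12633297L, 11818845L, 15130911L};
        System.out.println(Arrays.toString(lag(pv, 1)));
        System.out.println(Arrays.toString(lag(pv, 2)));
    }
}
```

The leading nulls are why the first 14 days end up without a full set of lag features.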
Feature-processing code:
    import java.text.SimpleDateFormat;
    import java.util.Arrays;
    import java.util.Date;
    import java.util.stream.Collectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Defined elsewhere: the SparkSession `spark`, the StructType `schema` (columns pv, day),
    // and `holidays`, a comma-separated string of yyyy-MM-dd dates.
    // DateUtils comes from Apache Commons Lang.
    Dataset<Row> df = spark.read().schema(schema).option("header", "false").csv("file:///Users/sun/Downloads/pv_data.csv");
    df.createOrReplaceTempView("tmp");
    df = spark.sql("select * from tmp where day>='2019-11-01'");
    df.createOrReplaceTempView("tmp");

    // Append the days to be forecast (placeholder rows with pv = 0):
    int predDays = 7;
    String lastDay = spark.sql("select max(day) as day from tmp").first().getAs("day");
    Date lastDate = DateUtils.parseDate(lastDay, new String[]{"yyyy-MM-dd"});
    String sql = "select pv, day from tmp";
    for (int i=0; i<predDays; i++) {
        Date date = DateUtils.addDays(lastDate, (i + 1));
        String day = new SimpleDateFormat("yyyy-MM-dd").format(date);
        sql += " union (select 0, '" + day + "' from tmp limit 1)";
    }
    sql += " order by day asc";
    df = spark.sql(sql);
    df.createOrReplaceTempView("tmp");

    // Lag features:
    int lagStart = 1;
    int lagEnd = 14;
    sql = "select *, ";
    for (int i=lagStart; i<=lagEnd; i++) {
        sql += "lag(pv, " + i + ") over (partition by null order by day) as lag_" + i;
        if (i <= lagEnd - 1) sql += ", ";
    }
    sql += " from tmp";
    df = spark.sql(sql);
    df.createOrReplaceTempView("tmp");

    // Calendar features (dayofweek: 1 = Sunday, 7 = Saturday):
    sql = "select *, " +
        "dayofweek(day) as day_of_week, " +
        "case when dayofweek(day)==1 or dayofweek(day)==7 then 1 else 0 end as is_weekend, " +
        "case when day in (" + Arrays.asList(holidays.split(",")).stream().map(s -> "'" + s + "'").collect(Collectors.joining(",")) + ") then 1 else 0 end as is_holiday " +
        "from tmp";
    df = spark.sql(sql);
    df.createOrReplaceTempView("tmp");

    // Mean features:
    sql = "select tmp.*, t1.day_of_week_avg, t2.is_weekend_avg, t3.is_holiday_avg from tmp " +
        "left join (select day_of_week, avg(pv) as day_of_week_avg from tmp group by day_of_week) as t1 on tmp.day_of_week = t1.day_of_week " +
        "left join (select is_weekend, avg(pv) as is_weekend_avg from tmp group by is_weekend) as t2 on tmp.is_weekend = t2.is_weekend " +
        "left join (select is_holiday, avg(pv) as is_holiday_avg from tmp group by is_holiday) as t3 on tmp.is_holiday = t3.is_holiday ";
    df = spark.sql(sql);
    df.createOrReplaceTempView("tmp");

    // Median features:
    sql = "select tmp.*, t1.day_of_week_med, t2.is_weekend_med, t3.is_holiday_med from tmp " +
        "left join (select day_of_week, percentile_approx(pv, 0.5) as day_of_week_med from tmp group by day_of_week) as t1 on tmp.day_of_week = t1.day_of_week " +
        "left join (select is_weekend, percentile_approx(pv, 0.5) as is_weekend_med from tmp group by is_weekend) as t2 on tmp.is_weekend = t2.is_weekend " +
        "left join (select is_holiday, percentile_approx(pv, 0.5) as is_holiday_med from tmp group by is_holiday) as t3 on tmp.is_holiday = t3.is_holiday ";
    df = spark.sql(sql);
    df.createOrReplaceTempView("tmp");
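The "append the days to be forecast" step only needs the next few calendar dates after the last observed day. A minimal sketch using `java.time` instead of `DateUtils` is shown below; `FutureDays` and `after` are hypothetical names, and the last observed day `2020-04-27` is inferred from the forecast table at the end of the article, not stated in it.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: list the n calendar days that follow the last observed day.
final class FutureDays {

    // lastDay must be in ISO yyyy-MM-dd form; the result uses the same format.
    static List<String> after(String lastDay, int n) {
        LocalDate last = LocalDate.parse(lastDay);
        List<String> days = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            days.add(last.plusDays(i).toString());
        }
        return days;
    }

    public static void main(String[] args) {
        // "2020-04-27" is an inferred last day, used here for illustration.
        System.out.println(after("2020-04-27", 7));
    }
}
```

Each returned date would become one placeholder row with pv = 0, so that the lag and calendar features can be computed for it like for any other day.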
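As mentioned earlier, some samples are not holidays yet have very high PV, possibly because of hot events.
There are roughly two cases: 1. the operations team ran a campaign that drove a surge in traffic; 2. hot events in society at large (compare Weibo's trending searches).
In fact, further data analysis showed that the main cause was PV fluctuation brought on indirectly by the epidemic.
Unlike holidays, hot events follow no schedule and are somewhat random and sudden. To keep things simple, we adopt a fixed strategy for handling the outliers.
The strategy used here: if the PV on a non-holiday is more than 1.5 times the median for that day of the week, replace it with that median. The code is as follows: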
    // Outlier handling:
    // A non-holiday whose traffic exceeds 1.5x the median is treated as anomalous
    // (possibly caused by a hot event) and is replaced with the median
    df = spark.sql("select *, " +
        "if(is_holiday=0 and pv>day_of_week_med*1.5, day_of_week_med, pv) as y " +
        "from tmp order by day asc");
    df = df.na().drop(); // drops the leading rows whose lag columns are null
    df.createOrReplaceTempView("tmp");

    // Lag values that are 0 or missing: replace with day_of_week_avg
    sql = "select *, ";
    for (int i=lagStart; i<=lagEnd; i++) {
        sql += "case when lag_" + i + ">0 then lag_"+i + " else day_of_week_avg end as lag_" + i + "_fix";
        if (i <= lagEnd - 1) sql += ", ";
    }
    sql += " from tmp";
    df = spark.sql(sql);
    df.createOrReplaceTempView("tmp");

    // Save the data:
    df.write().option("header", "true").csv("file:///Users/sun/Downloads/df");
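The clipping rule inside the SQL `if(...)` can be isolated as a small pure function, which makes the threshold easy to check by hand. The sketch below is plain Java with hypothetical names; the median 18144580 is the day_of_week_med value that appears for day_of_week 6 in the sample output.

```java
// Plain-Java sketch of the outlier rule: clip non-holiday spikes to the median.
final class OutlierClip {

    // Returns the training target y for one day.
    // A non-holiday with pv above 1.5x the day-of-week median is replaced by that median.
    static long clip(int isHoliday, long pv, long dayOfWeekMedian) {
        boolean outlier = isHoliday == 0 && pv > dayOfWeekMedian * 1.5;
        return outlier ? dayOfWeekMedian : pv;
    }

    public static void main(String[] args) {
        long median = 18144580L;
        System.out.println(clip(0, 30000000L, median)); // non-holiday spike: clipped
        System.out.println(clip(1, 30000000L, median)); // holiday: kept as-is
        System.out.println(clip(0, 16900067L, median)); // ordinary day: kept as-is
    }
}
```

Holiday peaks are deliberately left untouched, since those are exactly the peaks the model is supposed to learn.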
Sample of the resulting data:
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+--------------------+---------------+--------------+--------------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| pv| day| lag_1| lag_2| lag_3| lag_4| lag_5| lag_6| lag_7| lag_8| lag_9| lag_10| lag_11| lag_12| lag_13| lag_14|day_of_week|is_weekend|is_holiday| day_of_week_avg| is_weekend_avg| is_holiday_avg|day_of_week_med|is_weekend_med|is_holiday_med| y| lag_1_fix| lag_2_fix| lag_3_fix| lag_4_fix| lag_5_fix| lag_6_fix| lag_7_fix| lag_8_fix| lag_9_fix| lag_10_fix| lag_11_fix| lag_12_fix| lag_13_fix| lag_14_fix|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+--------------------+---------------+--------------+--------------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| 6| 0| 0|2.2269334259259257E7|2.1140308681818184E7| 2.047994621387283E7| 18144580| 17914823| 17128256|16900067| 1.4978239E7| 1.5275479E7| 1.5490684E7| 1.3700888E7| 1.1759009E7| 1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7| 1.515644E7|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297| 7| 1| 0|2.2007671444444444E7|2.1197627611111112E7| 2.047994621387283E7| 15728601| 15623119| 17128256|15668745| 1.6900067E7| 1.4978239E7| 1.5275479E7| 1.5490684E7| 1.3700888E7| 1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845| 1| 1| 0|2.0387583777777776E7|2.1197627611111112E7| 2.047994621387283E7| 15245430| 15623119| 17128256|15102373| 1.5668745E7| 1.6900067E7| 1.4978239E7| 1.5275479E7| 1.5490684E7| 1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911| 2| 0| 0|1.9976350222222224E7|2.1140308681818184E7| 2.047994621387283E7| 16614896| 17914823| 17128256|16475787| 1.5102373E7| 1.5668745E7| 1.6900067E7| 1.4978239E7| 1.5275479E7| 1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734| 3| 0| 0|2.0061554769230768E7|2.1140308681818184E7| 2.047994621387283E7| 17121601| 17914823| 17128256|16946753| 1.6475787E7| 1.5102373E7| 1.5668745E7| 1.6900067E7| 1.4978239E7| 1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|
The lag_*, *_avg, and *_med features are PV values on the order of tens of millions, so they are normalized:
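    // Normalization: rescale the lag_*, *_avg and *_med features
    // Note: for the forecast rows, the lag_N_fix columns built above may be better inputs than the raw lag_N.
    VectorAssembler vectorAssembler = new VectorAssembler()
        .setInputCols(new String[]{"lag_1", "lag_2", "lag_3", "lag_4", "lag_5", "lag_6", "lag_7", "lag_8", "lag_9", "lag_10", "lag_11", "lag_12", "lag_13", "lag_14", "day_of_week_avg", "is_weekend_avg", "is_holiday_avg", "day_of_week_med", "is_weekend_med", "is_holiday_med"})
        .setOutputCol("feature_vec");
    df = vectorAssembler.transform(df);
    MinMaxScaler scaler = new MinMaxScaler()
        .setInputCol("feature_vec")
        .setOutputCol("features"); // named "features" to match the output table and the training code
    df = scaler.fit(df).transform(df);
    df.createOrReplaceTempView("tmp"); // the training query reads from the "tmp" view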
VectorAssembler merges Dataset columns into a single Vector column (the algorithm APIs used later require a vector as input);
MinMaxScaler rescales each feature to the [0,1] range.
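For intuition, min-max scaling of a single feature column to [0,1] is just (x - min) / (max - min). The sketch below shows that arithmetic in plain Java with hypothetical names and toy numbers; it is not Spark's implementation. Returning 0.5 for a constant column is an assumption based on Spark's documented rule of 0.5 * (max + min) of the target range.

```java
import java.util.Arrays;

// Plain-Java sketch of min-max scaling for one feature column, target range [0, 1].
final class MinMax {

    // Maps the column minimum to 0 and the maximum to 1.
    // A constant column maps to 0.5 (assumed to mirror Spark's documented behavior).
    static double[] scale(double[] col) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : col) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) {
            out[i] = (max == min) ? 0.5 : (col[i] - min) / (max - min);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(scale(new double[]{10, 20, 30}))); // toy values
    }
}
```

Because min and max come from the data the scaler was fitted on, a single very large value compresses all other values toward 0.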
Result:
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+
| pv| day| lag_1| lag_2| lag_3| lag_4| lag_5| lag_6| lag_7| lag_8| lag_9| lag_10| lag_11| lag_12| lag_13| lag_14|day_of_week|is_weekend|is_holiday| day_of_week_avg| is_weekend_avg| is_holiday_avg|day_of_week_med|is_weekend_med|is_holiday_med| y| lag_1_fix| lag_2_fix| lag_3_fix| lag_4_fix| lag_5_fix| lag_6_fix| lag_7_fix| lag_8_fix| lag_9_fix| lag_10_fix| lag_11_fix| lag_12_fix| lag_13_fix| lag_14_fix| feature_vec| features|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440| 6| 0| 0|2.2269334259259257E7|2.1140308681818184E7|2.047994621387283E7| 18144580| 17914823| 17128256|16900067|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7| 1.515644E7|[1.4978239E7,1.52...|[0.02878242285919...|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297| 7| 1| 0|2.2007671444444444E7|2.1197627611111112E7|2.047994621387283E7| 15728601| 15623119| 17128256|15668745|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7|[1.6900067E7,1.49...|[0.05339190949495...|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845| 1| 1| 0|2.0387583777777776E7|2.1197627611111112E7|2.047994621387283E7| 15245430| 15623119| 17128256|15102373|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|[1.5668745E7,1.69...|[0.03762452432660...|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911| 2| 0| 0|1.9976350222222224E7|2.1140308681818184E7|2.047994621387283E7| 16614896| 17914823| 17128256|16475787|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|[1.5102373E7,1.56...|[0.03037198967476...|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734| 3| 0| 0|2.0061554769230768E7|2.1140308681818184E7|2.047994621387283E7| 17121601| 17914823| 17128256|16946753|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|[1.6475787E7,1.51...|[0.04795889832547...|
|17422016|2019-11-20|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959| 4| 0| 0| 2.172384296153846E7|2.1140308681818184E7|2.047994621387283E7| 17928108| 17914823| 17128256|17422016|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|[1.6946753E7,1.64...|[0.05398973536338...|
|18010112|2019-11-21|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371| 5| 0| 0|2.1671804769230768E7|2.1140308681818184E7|2.047994621387283E7| 17962984| 17914823| 17128256|18010112|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|[1.7422016E7,1.69...|[0.06007559655750...|
|17935725|2019-11-22|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708| 6| 0| 0|2.2269334259259257E7|2.1140308681818184E7|2.047994621387283E7| 18144580| 17914823| 17128256|17935725|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|[1.8010112E7,1.74...|[0.06760631244495...|
|15623119|2019-11-23|17935725|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425| 7| 1| 0|2.2007671444444444E7|2.1197627611111112E7|2.047994621387283E7| 15728601| 15623119| 17128256|15623119|1.7935725E7|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|[1.7935725E7,1.80...|[0.06665376836589...|
|14637174|2019-11-24|15623119|17935725|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009| 1| 1| 0|2.0387583777777776E7|2.1197627611111112E7|2.047994621387283E7| 15245430| 15623119| 17128256|14637174|1.5623119E7|1.7935725E7|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|[1.5623119E7,1.79...|[0.03704027202242...|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+
features is the processed feature column.
    // Training: use the data up to and including lastDay
    Dataset<Row> trainDataset = spark.sql("select day, features, pv, y from tmp where day<='" + lastDay + "' order by day asc");
    double maxR2 = 0.0D;
    double bestParam = 0.0D;
    LinearRegressionModel bestModel = null;
    // Search for the best parameter:
    for (int i=1; i<=10; i++) {
        LinearRegression lr = new LinearRegression()
            .setLabelCol("y")
            .setFeaturesCol("features")
            .setMaxIter(10000)
            .setRegParam(0.03)            // regularization strength (not a step size)
            .setElasticNetParam(0.1 * i); // L1/L2 mixing ratio: 0 = pure L2, 1 = pure L1
        LinearRegressionModel model = lr.fit(trainDataset);
        LinearRegressionTrainingSummary trainingSummary = model.summary();
        System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
        System.out.println("r2: " + trainingSummary.r2());
        if(trainingSummary.r2() > maxR2) {
            bestParam = 0.1 * i;
            maxR2 = trainingSummary.r2();
            bestModel = model;
        }
    }
    System.out.println("best param -> " + bestParam);
    System.out.println("best r2 -> " + maxR2);
LinearRegression is used here, and the main parameter tuned is the one set by setElasticNetParam; see the documentation for detailed parameter descriptions.
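For reference, the objective that Spark ML's LinearRegression documents for elastic-net regularization can be written as below, with the intercept omitted for brevity. Here \lambda corresponds to the value passed to setRegParam (0.03 in this article) and \alpha to the value passed to setElasticNetParam (searched from 0.1 to 1.0); the symbols n, w, x_i, and y_i are notation introduced for this sketch.

```latex
\min_{w}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i - y_i\right)^2
\;+\;
\lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
```

With \alpha = 0 the penalty is pure L2 (ridge), and with \alpha = 1 it is pure L1 (lasso), so the search in the training loop sweeps from a mostly-L2 penalty to a pure L1 penalty at a fixed overall strength.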
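The search runs from 0.1 to 1.0 and looks for the value that gives the highest model r2 (as reported by the training summary, that is, on the training data); the model at that value is taken as the best model.
In the end, an elasticnet value of 0.5 turned out best, with r2 = 0.7008858790650143.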
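As a reminder of what the r2 figure measures, the coefficient of determination is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain Java on toy numbers; the class name is hypothetical, and it illustrates the standard definition, not Spark's internal code.

```java
// Plain-Java sketch of the coefficient of determination (R^2).
final class R2 {

    // r2 = 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
    // Assumes y is non-empty and not constant (otherwise the denominator is 0).
    static double r2(double[] y, double[] pred) {
        double mean = 0.0;
        for (double v : y) {
            mean += v;
        }
        mean /= y.length;

        double ssRes = 0.0; // residual sum of squares
        double ssTot = 0.0; // total sum of squares around the mean
        for (int i = 0; i < y.length; i++) {
            ssRes += (y[i] - pred[i]) * (y[i] - pred[i]);
            ssTot += (y[i] - mean) * (y[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        double[] y = {1, 2, 3};                           // toy targets
        System.out.println(r2(y, new double[]{1, 2, 3})); // perfect fit
        System.out.println(r2(y, new double[]{2, 2, 2})); // always predicting the mean
    }
}
```

An r2 of about 0.70 therefore means the model explains roughly 70% of the variance of y on the data it was evaluated on.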
Forecast the next 7 days:
    Dataset<Row> predDataset = spark.sql("select day, features, pv, y from tmp where day>'" + lastDay + "' order by day asc");
    bestModel.setPredictionCol("pv_pred");
    bestModel.transform(predDataset).show();
The result is as follows:
+----------+--------------------+---+---+--------------------+
|       day|            features| pv|  y|             pv_pred|
+----------+--------------------+---+---+--------------------+
|2020-04-28|[0.03159798985245...|  0|  0|1.7333708553490087E7|
|2020-04-29|[0.11516156320975...|  0|  0|1.7833363920196097E7|
|2020-04-30|[0.11449520118456...|  0|  0|1.7624262847742468E7|
|2020-05-01|[0.12214671526351...|  0|  0|3.6077728160918914E7|
|2020-05-02|[0.11879605768944...|  0|  0| 1.518647529881512E7|
|2020-05-03|[0.09805043124337...|  0|  0|1.5407320504048364E7|
|2020-05-04|[0.09278448304737...|  0|  0|  3.56043256732697E7|
+----------+--------------------+---+---+--------------------+
May 1 and May 4 are holidays, and traffic peaks are expected on those two days.
Author: Sun, engineer at 易企秀