Predicting Stock Prices with News: Hedge Fund Two Sigma Seeks an Intelligent Solution

[TOC]

WeChat public account: AIKaggle. Suggestions and criticism are welcome; if you need resources, leave a message on the account. If you find AIKaggle helpful, give this post a tap on "Wow" (在看).

Introduction

As an interlude in our series on the past and present of boosting algorithms, today we cover a very interesting Kaggle competition: using news data to predict the financial market. It was launched on Kaggle by the hedge fund Two Sigma with a $100,000 prize pool, and it attracted many strong Kagglers who contributed ideas. This article introduces the competition's background, data, submission process, and evaluation metric, then presents a solution built around EDA (exploratory data analysis): time-series analysis, data visualization, and outlier handling, ending with a submission. The kernel was written by Andrew Lukyanenko. In one sentence: two hours of EDA, five minutes of modeling.

Can we predict stock price performance by analyzing news content? Today, multi-dimensional data lets investors make better decisions, and the main challenge lies in extracting useful signals from this ocean of information and putting them to use. Two Sigma hosted a Kaggle competition on predicting stock prices from news data, giving Kagglers a chance to push this research forward, and the results could have significant economic impact worldwide. The competition data comes from the following sources:

  • Intrinio provides the market data.
  • Thomson Reuters provides the news data.

Evaluation Metric

  • In this competition you must predict a signed confidence value \(\hat y_{ti} \in [-1,1]\), which is multiplied by the 10-day market-adjusted return of the given assetCode. If you expect a stock to show a large positive return relative to the broad market over the next ten days, assign it a large positive confidenceValue (close to $1.0$). If you expect a negative return, assign it a large negative confidenceValue (close to $-1.0$). If you are unsure, assign a confidenceValue near zero.
  • For each day $t$ in the evaluation period, we compute \(x_t = \sum_{i}\hat y_{ti} r_{ti} u_{ti}\), where $r_{ti}$ is the 10-day market-adjusted return of asset $i$ on day $t$, and $u_{ti}$ is a 0/1 universe indicator (see the data description for details) marking whether the asset is included in scoring on that day.
  • Your submission score is the mean of the daily $x_t$ divided by their standard deviation: \(score = \frac{\bar x_t}{\sigma(x_t)}\). If the standard deviation of predictions is 0, the score is defined as 0. A minimal sketch of this computation follows.
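
To make the metric concrete, here is a minimal sketch of the scoring computation, assuming a dataframe with one row per (day, asset) and illustrative column names time, y_hat, r, and u (the names are ours, not part of the official API):

import pandas as pd

def competition_score(df: pd.DataFrame) -> float:
    # x_t = sum_i y_hat_ti * r_ti * u_ti, aggregated within each day t
    x_t = (df['y_hat'] * df['r'] * df['u']).groupby(df['time']).sum()
    sigma = x_t.std()
    # by definition the score is 0 when the standard deviation is 0
    return 0.0 if sigma == 0 else x_t.mean() / sigma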

Submission Instructions

Submission Process

  • This is a two-stage, Kernels-only competition in which the second stage is genuine prediction of the future. In stage one, participants build models and the leaderboard reflects scores over a historical period. When stage one ends, code is frozen and the leaderboard switches to scores on future data: Kaggle re-runs each participant's selected kernel on the future data and re-submits the submission file that kernel generates.

  • All submissions go through the Kernels environment, which provides a custom Python module that participants must use to access the competition data, make predictions, and write a properly formatted submission file. The module ensures that models see no future information when making predictions. For simplicity, a submission file for this competition covers the historical period, the stage-one period, and the future stage-two period, so at any given time there is only one "valid" submission file (participants predict the time spans of both stages at once). During scoring, Kaggle ignores predicted values outside the current stage.

Submission File

  • You must submit directly from Kaggle Kernels. The kernel environment formats and creates the submission file automatically when you call env.write_submission_file(); there is no need to create it by hand.
  • The submission format is as follows:
time,assetCode,confidenceValue 
2017-01-03,RPXC.O,0.1 
2017-01-04,RPXC.O,0.02 
2017-01-05,RPXC.O,-0.3
etc.

Data Description

  • In this competition you will predict future stock prices from two data sources:
  1. Market data (2007 to present), provided by Intrinio: financial market information such as opening price, closing price, trading volume, computed returns, etc.
  2. News data (2007 to present), provided by Thomson Reuters: news articles and alerts about assets, including article details, sentiment, and other commentary.
  • Each asset is identified by an assetCode (note that a single company may have multiple assetCodes). Depending on your purpose, you can JOIN the market data and news data on assetCode, assetName, and time; a minimal sketch of such a join follows.
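
Below is a minimal sketch of such a join, assuming a reasonably recent pandas and dataframes shaped like the competition data (the helper name join_market_news is ours). The news assetCodes field is a set-like string, so we expand it to one code per row before merging on code and calendar date:

import pandas as pd

def join_market_news(market_df: pd.DataFrame, news_df: pd.DataFrame) -> pd.DataFrame:
    news = news_df.copy()
    # "{'AAPL.O', 'AAPL.OQ'}" -> ['AAPL.O', 'AAPL.OQ'] -> one row per code
    news['assetCode'] = news['assetCodes'].map(lambda s: list(eval(s)))
    news = news.explode('assetCode')
    # align on the calendar date so a day's news maps to that day's market row
    news['date'] = news['time'].dt.date
    market = market_df.copy()
    market['date'] = market['time'].dt.date
    return market.merge(news, on=['assetCode', 'date'], how='left', suffixes=('', '_news'))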

Market data


  • The data covers a subset of US-listed financial instruments. The set of included instruments changes daily and is determined by trading volume and the availability of information, which means instruments enter and leave the data over time. Gaps in the provided data therefore do not necessarily mean the data does not exist (those rows may simply be excluded by the selection criteria). The market data contains a variety of returns computed over different time spans, and all returns in this data share the following properties (a toy example follows the list):
  • Returns are computed either open-to-open (from one trading day's opening time to another's opening time) or close-to-close (from one trading day's closing time to another's closing time).
  • Returns are either raw (not adjusted against any benchmark) or market-residualized (Mktres), meaning the movement of the market as a whole has been factored out, leaving only the movement intrinsic to the instrument.
  • Returns can be calculated over any interval; 1-day and 10-day horizons are provided here.
  • Returns are tagged "Prev" if they look backward in time and "Next" if they look forward.
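
As a toy illustration of these naming conventions only (the competition ships all return columns precomputed; this is not the official calculation), a 1-day raw close-to-close return and its forward-looking counterpart could be built like this:

import pandas as pd

df = pd.DataFrame({
    'assetCode': ['A.N'] * 3,
    'time': pd.to_datetime(['2016-01-04', '2016-01-05', '2016-01-06']),
    'close': [100.0, 101.0, 99.0],
}).sort_values(['assetCode', 'time'])
# 'Prev' looks backward: (close_t - close_{t-1}) / close_{t-1}
df['returnsClosePrevRaw1'] = df.groupby('assetCode')['close'].pct_change()
# 'Next' looks forward: the same 1-day return, seen from the earlier day
df['returnsCloseNextRaw1'] = df.groupby('assetCode')['returnsClosePrevRaw1'].shift(-1)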

In the market data, you will find the following columns:

  • time(datetime64[ns, UTC]) - the current time (in the market data, all rows are stamped 22:00 UTC)
  • assetCode(object) - a unique ID for the asset
  • assetName(category) - the name corresponding to a group of assetCodes. May be "Unknown" if the corresponding assetCode has no rows in the news data.
  • universe(float64) - a boolean indicating whether the instrument is included in scoring that day. Not provided outside the training time period. A given day's trading universe is the set of instruments available for trading that day (the scoring function only scores instruments inside the universe). The universe changes daily.
  • volume(float64) - trading volume for the day
  • close(float64) - the day's closing price (not adjusted for splits or dividends)
  • open(float64) - the day's opening price (not adjusted for splits or dividends)
  • returnsClosePrevRaw1(float64) - see the returns explanation above
  • returnsOpenPrevRaw1(float64) - see the returns explanation above
  • returnsClosePrevMktres1(float64) - see the returns explanation above
  • returnsOpenPrevMktres1(float64) - see the returns explanation above
  • returnsClosePrevRaw10(float64) - see the returns explanation above
  • returnsOpenPrevRaw10(float64) - see the returns explanation above
  • returnsClosePrevMktres10(float64) - see the returns explanation above
  • returnsOpenPrevMktres10(float64) - see the returns explanation above
  • returnsOpenNextMktres10(float64) - the market-residualized return over the next 10 days. This is the target variable used in competition scoring. The market data has been filtered so that returnsOpenNextMktres10 is never null.

News data

The news data contains news articles and asset-level information.

  • time(datetime64[ns, UTC]) - UTC timestamp showing when the data was available on the feed
  • sourceTimestamp(datetime64[ns, UTC]) - UTC timestamp of this news item's creation
  • firstCreated(datetime64[ns, UTC]) - UTC timestamp for the first version of the item
  • sourceId(object) - an ID for the news item
  • headline(object) - the item's headline
  • urgency(int8) - the item type (1: alert, 3: article)
  • takeSequence(int16) - the take sequence number of the news item, starting at 1. For a given story, alerts and articles have separate sequences.
  • provider(category) - identifier of the organization that provided the news item (e.g., RTRS for Reuters News, BSW for Business Wire)
  • subjects(category) - topic codes and company identifiers relevant to this news item. Topic codes describe the item's subject matter and can cover asset classes, geographies, events, industries/sectors, and other types.
  • audiences(category) - identifies which desktop news product(s) the item belongs to; these are typically tailored to specific audiences (e.g., "M" for Money International News Service, "FB" for French General News Service)
  • bodySize(int32) - size of the current version of the story body
  • companyCount(int8) - number of companies explicitly listed in the news item
  • headlineTag(object) - the Thomson Reuters headline tag for the news item
  • marketCommentary(bool) - boolean indicating whether the item discusses general market conditions
  • sentenceCount(int16) - total number of sentences in the item
  • wordCount(int32) - total number of words in the item
  • assetCodes(category) - asset codes mentioned in the item
  • assetName(category) - the asset's name
  • firstMentionSentence(int16) - the first sentence in which the scored asset is mentioned:
    • 1: the headline
    • 2: the first sentence of the story body
    • 3: the second sentence of the body, and so on
    • 0: the scored asset was not found in the headline or body of the news item
  • relevance(float32) - a decimal number indicating how relevant the news item is to the asset, ranging from 0 to 1. If the asset is mentioned in the headline, relevance is set to 1. When the item is an alert (urgency == 1), relevance should instead be gauged from firstMentionSentence.

There are quite a few more, broadly similar columns; for brevity they are not listed here. After the sketch below, we walk through the EDA from the kernel.
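
A possible proxy, purely our own heuristic and not an official formula, that scores an alert's relevance by how early the asset is first mentioned might look like this:

import numpy as np

def proxy_relevance(relevance, urgency, first_mention_sentence):
    fms = np.asarray(first_mention_sentence, dtype=float)
    # firstMentionSentence == 0 means the asset never appears -> treat as irrelevant
    alert_relevance = np.where(fms > 0, 1.0 / np.maximum(fms, 1.0), 0.0)
    # keep the provided relevance for articles; use the proxy for alerts
    return np.where(np.asarray(urgency) == 1, alert_relevance, np.asarray(relevance))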

Solution

Preparing the Data

Input: import the packages we need

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import datetime
import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
stop = set(stopwords.words('english'))


import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.metrics import accuracy_score

Input: the official way to get the data

# official way to get the data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()
print('Done!')

Output: data loaded successfully

Loading the data... This could take a minute.
Done!
Done!

Input: we have two parts to the data, market data and news data; let's explore each in turn.

(market_train_df, news_train_df) = env.get_training_data()

Market Data

This is a very interesting dataset containing the stock prices of many companies over more than a decade. Looking at the data itself, we can see long-term trends, companies emerging and declining, and much more. Input: print the dimensions of the market data.

print(f'{market_train_df.shape[0]} samples and {market_train_df.shape[1]} features in the training market dataset.')

Output: number of samples and number of features

4072956 samples and 16 features in the training market dataset.

Input: see what the first five rows look like

market_train_df.head()

Output: dataframe of the first five rows

Input: randomly select 10 assets and visualize how their closing prices change over time.

data = []
for asset in np.random.choice(market_train_df['assetName'].unique(), 10):
    asset_df = market_train_df[(market_train_df['assetName'] == asset)]

    data.append(go.Scatter(
        x = asset_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = asset_df['close'].values,
        name = asset
    ))
layout = go.Layout(dict(title = "Closing prices of 10 random assets",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: closing-price curves over time for the 10 randomly selected assets

Input: trend curves of the closing-price quantiles

data = []
#market_train_df['close'] = market_train_df['close'] / 20
for i in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    price_df = market_train_df.groupby('time')['close'].quantile(i).reset_index()

    data.append(go.Scatter(
        x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = price_df['close'].values,
        name = f'{i} quantile'
    ))
layout = go.Layout(dict(title = "Trends of closing prices by quantiles",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Analysis: being able to watch the market fall and rise again is quite thrilling. When the market suffers a severe decline, notice that the higher-quantile prices keep increasing over time while the lower-quantile prices fall. Perhaps the gap between rich and poor keeps widening…

Input: look at the price drops in detail. Compute each day's difference between closing and opening price, and the mean of the daily standard deviation of that difference.

market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
grouped = market_train_df.groupby('time').agg({'price_diff': ['std', 'min']}).reset_index()

print(f"Average standard deviation of price change within a day is {grouped['price_diff']['std'].mean():.4f}.")

Output: the average daily standard deviation is 1.0335

Average standard deviation of price change within a day is 1.0335.

Input: visualize the ten dates with the largest standard deviation

g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
g['min_text'] = 'Maximum price drop: ' + (-1 * g['price_diff']['min']).astype(str)
trace = go.Scatter(
    x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = g['price_diff']['std'].values,
    mode='markers',
    marker=dict(
        size = g['price_diff']['std'].values,
        color = g['price_diff']['std'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = g['min_text'].values
    #text = f"Maximum price drop: {g['price_diff']['min'].values}"
    #g['time'].dt.strftime(date_format='%Y-%m-%d').values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Top 10 months by standard deviation of price change within a day',
    hovermode= 'closest',
    yaxis=dict(
        title= 'price_diff',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

Output: one date stands out with a very large deviation. Let's speculate about the cause: could it be that prices swing wildly when the market crashes? That hardly seems plausible, since no market crash happened in January 2010... It is more likely the result of outliers, which we need to handle next.

Input: inspect the 10 rows with the largest price differences

market_train_df.sort_values('price_diff')[:10]

Output: (table of the 10 rows with the largest price differences)

Analysis: we can see a "Towers Watson & Co" stock priced at almost 10,000… which is very likely the kind of outlier we are looking for. But what about Bank of New York Mellon Corp? Checking Yahoo Finance: the Bank of New York Mellon Corp figures are consistent with the competition data.

Analysis: Archrock Inc shows a price equal to 999, a number that also looks suspicious. Checking Archrock Inc's data on Yahoo Finance confirms it: another outlier found.

Input: how many rows show a daily price move of 20% or more?

market_train_df['close_to_open'] =  np.abs(market_train_df['close'] / market_train_df['open'])
print(f"In {(market_train_df['close_to_open'] >= 1.2).sum()} lines price increased by 20% or more.")
print(f"In {(market_train_df['close_to_open'] <= 0.8).sum()} lines price decreased by 20% or more.")

輸出:

In 1211 lines price increased by 20% or more.
In 778 lines price decreased by 20% or more.

Input: keep digging into the strange cases. How many rows show the price doubling or halving within a single day?

print(f"In {(market_train_df['close_to_open'] >= 2).sum()} lines price increased by 100% or more.")
print(f"In {(market_train_df['close_to_open'] <= 0.5).sum()} lines price decreased by 50% or more.")

輸出:

In 38 lines price increased by 100% or more.
In 16 lines price decreased by 50% or more.

Input: let's assume that rows where the price doubles or halves within a day are outliers that need replacing. A quick fix is to replace the outlying value in these rows with the company's mean opening or closing price (the median or mode would work as well).

market_train_df['assetName_mean_open'] = market_train_df.groupby('assetName')['open'].transform('mean')
market_train_df['assetName_mean_close'] = market_train_df.groupby('assetName')['close'].transform('mean')

# if the open price is too far from this company's mean open price, replace it;
# otherwise replace the close price (label-based access avoids relying on column order).
for i, row in market_train_df.loc[market_train_df['close_to_open'] >= 2].iterrows():
    if np.abs(row['assetName_mean_open'] - row['open']) > np.abs(row['assetName_mean_close'] - row['close']):
        market_train_df.at[i, 'open'] = row['assetName_mean_open']
    else:
        market_train_df.at[i, 'close'] = row['assetName_mean_close']

for i, row in market_train_df.loc[market_train_df['close_to_open'] <= 0.5].iterrows():
    if np.abs(row['assetName_mean_open'] - row['open']) > np.abs(row['assetName_mean_close'] - row['close']):
        market_train_df.at[i, 'open'] = row['assetName_mean_open']
    else:
        market_train_df.at[i, 'close'] = row['assetName_mean_close']

Input: re-visualize the deviations

market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
grouped = market_train_df.groupby(['time']).agg({'price_diff': ['std', 'min']}).reset_index()
g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
g['min_text'] = 'Maximum price drop: ' + (-1 * np.round(g['price_diff']['min'], 2)).astype(str)
trace = go.Scatter(
    x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = g['price_diff']['std'].values,
    mode='markers',
    marker=dict(
        size = g['price_diff']['std'].values * 5,
        color = g['price_diff']['std'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = g['min_text'].values
    #text = f"Maximum price drop: {g['price_diff']['min'].values}"
    #g['time'].dt.strftime(date_format='%Y-%m-%d').values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Top 10 months by standard deviation of price change within a day',
    hovermode= 'closest',
    yaxis=dict(
        title= 'price_diff',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

Output: this looks normal now

Input: examine the target variable

data = []
for i in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    price_df = market_train_df.groupby('time')['returnsOpenNextMktres10'].quantile(i).reset_index()

    data.append(go.Scatter(
        x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = price_df['returnsOpenNextMktres10'].values,
        name = f'{i} quantile'
    ))
layout = go.Layout(dict(title = "Trends of returnsOpenNextMktres10 by quantiles",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"),)
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Input: we can see that the quantiles deviate widely while the mean changes little. Let's keep only the data from 2010 onward and now look at the mean of the target variable.

data = []
market_train_df = market_train_df.loc[market_train_df['time'] >= '2010-01-01 22:00:00+0000']

price_df = market_train_df.groupby('time')['returnsOpenNextMktres10'].mean().reset_index()

data.append(go.Scatter(
    x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = price_df['returnsOpenNextMktres10'].values,
    name = 'mean'
))
layout = go.Layout(dict(title = "Trend of returnsOpenNextMktres10 mean",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"),)
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: the swings look large, but in fact they all stay below 8%, rather like random noise…

Input: examine the variables prefixed with "returns"

data = []
for col in ['returnsClosePrevRaw1', 'returnsOpenPrevRaw1',
       'returnsClosePrevMktres1', 'returnsOpenPrevMktres1',
       'returnsClosePrevRaw10', 'returnsOpenPrevRaw10',
       'returnsClosePrevMktres10', 'returnsOpenPrevMktres10',
       'returnsOpenNextMktres10']:
    df = market_train_df.groupby('time')[col].mean().reset_index()
    data.append(go.Scatter(
        x = df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = df[col].values,
        name = col
    ))
    
layout = go.Layout(dict(title = "Trend of mean values",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"),)
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: the trailing 10-day returns appear to fluctuate the most.

News Data

news_train_df.head()
print(f'{news_train_df.shape[0]} samples and {news_train_df.shape[1]} features in the training news dataset.')

Output:

9328827 samples and 35 features in the training news dataset.

Input: the file is too large to process all the text directly, so let's first look at a word cloud built from the last 1,000,000 headlines.

text = ' '.join(news_train_df['headline'].str.lower().values[-1000000:])
wordcloud = WordCloud(max_font_size=None, stopwords=stop, background_color='white',
                      width=1200, height=1000).generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud)
plt.title('Top words in headline')
plt.axis("off")
plt.show()

Output: (word cloud of the most frequent headline words)

Input: about urgency

# Let's also limit the time period
news_train_df = news_train_df.loc[news_train_df['time'] >= '2010-01-01 22:00:00+0000']
(news_train_df['urgency'].value_counts() / 1000000).plot('bar');
plt.xticks(rotation=30);
plt.title('Urgency counts (mln)');

Output: rows with urgency 2 are almost nonexistent.

Input: statistics on words per sentence

news_train_df['sentence_word_count'] =  news_train_df['wordCount'] / news_train_df['sentenceCount']
plt.boxplot(news_train_df['sentence_word_count'][news_train_df['sentence_word_count'] < 40]);

Output: no obvious outliers; most sentences run 15-25 words.

Input: count news items per provider

news_train_df['provider'].value_counts().head(10)

Output: as we can see, Reuters is the largest provider.

RTRS    5517624
PRN      503267
BSW      472612
GNW      145309
MKW      129621
LSE       64250
HIIS      56489
RNS       39833
CNW       30779
ONE       25233
Name: provider, dtype: int64

Input: headline tag types

(news_train_df['headlineTag'].value_counts() / 1000)[:10].plot('barh');
plt.title('headlineTag counts (thousands)');

Output: missing tags are quite common

Input: sentiment analysis

for i, j in zip([-1, 0, 1], ['negative', 'neutral', 'positive']):
    df_sentiment = news_train_df.loc[news_train_df['sentimentClass'] == i, 'assetName']
    print(f'Top mentioned companies for {j} sentiment are:')
    print(df_sentiment.value_counts().head(5))
    print('')

Output: Apple is the top-1 company for both positive and negative sentiment.

Top mentioned companies for negative sentiment are:
Apple Inc                  22518
JPMorgan Chase & Co        20647
BP PLC                     19328
Goldman Sachs Group Inc    17955
Bank of America Corp       17704
Name: assetName, dtype: int64

Top mentioned companies for neutral sentiment are:
HSBC Holdings PLC    19462
Credit Suisse AG     14632
Deutsche Bank AG     12959
Barclays PLC         12414
Apple Inc            10994
Name: assetName, dtype: int64

Top mentioned companies for positive sentiment are:
Apple Inc                19020
Barclays PLC             18051
Royal Dutch Shell PLC    15484
General Electric Co      14163
Boeing Co                14080
Name: assetName, dtype: int64

Modeling

Input: add some features that may help the model train better, for example the daily price swing (the ratio of closing to opening price), then normalize (to reduce the influence of absolute magnitudes on the result).

#%%time
# code mostly taken from this kernel: https://www.kaggle.com/ashishpatel26/bird-eye-view-of-two-sigma-xgb

def data_prep(market_df,news_df):
    market_df['time'] = market_df.time.dt.date
    market_df['returnsOpenPrevRaw1_to_volume'] = market_df['returnsOpenPrevRaw1'] / market_df['volume']
    market_df['close_to_open'] = market_df['close'] / market_df['open']
    market_df['volume_to_mean'] = market_df['volume'] / market_df['volume'].mean()
    news_df['sentence_word_count'] =  news_df['wordCount'] / news_df['sentenceCount']
    news_df['time'] = news_df.time.dt.hour
    news_df['sourceTimestamp']= news_df.sourceTimestamp.dt.hour
    news_df['firstCreated'] = news_df.firstCreated.dt.date
    news_df['assetCodesLen'] = news_df['assetCodes'].map(lambda x: len(eval(x)))  # number of codes mentioned
    news_df['assetCodes'] = news_df['assetCodes'].map(lambda x: list(eval(x))[0])  # keep only the first code
    news_df['headlineLen'] = news_df['headline'].apply(lambda x: len(x))
    news_df['asset_sentiment_count'] = news_df.groupby(['assetName', 'sentimentClass'])['time'].transform('count')
    news_df['asset_sentence_mean'] = news_df.groupby(['assetName', 'sentenceCount'])['time'].transform('mean')
    lbl = {k: v for v, k in enumerate(news_df['headlineTag'].unique())}
    news_df['headlineTagT'] = news_df['headlineTag'].map(lbl)
    kcol = ['firstCreated', 'assetCodes']
    news_df = news_df.groupby(kcol, as_index=False).mean()

    market_df = pd.merge(market_df, news_df, how='left', left_on=['time', 'assetCode'], 
                            right_on=['firstCreated', 'assetCodes'])

    lbl = {k: v for v, k in enumerate(market_df['assetCode'].unique())}
    market_df['assetCodeT'] = market_df['assetCode'].map(lbl)
    
    market_df = market_df.dropna(axis=0)
    
    return market_df

market_train_df.drop(['price_diff', 'assetName_mean_open', 'assetName_mean_close'], axis=1, inplace=True)
market_train = data_prep(market_train_df, news_train_df)
print(market_train.shape)
up = market_train.returnsOpenNextMktres10 >= 0

fcol = [c for c in market_train.columns if c not in ['assetCode', 'assetCodes', 'assetCodesLen', 'assetName', 'assetCodeT',
                                             'firstCreated', 'headline', 'headlineTag', 'marketCommentary', 'provider',
                                             'returnsOpenNextMktres10', 'sourceId', 'subjects', 'time', 'time_x', 'universe','sourceTimestamp']]

X = market_train[fcol].values
up = up.values
r = market_train.returnsOpenNextMktres10.values

# Scaling of X values
mins = np.min(X, axis=0)
maxs = np.max(X, axis=0)
rng = maxs - mins
X = 1 - ((maxs - X) / rng)
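
A quick note on the scaling above: 1 - ((maxs - X) / rng) simplifies algebraically to (X - mins) / rng, i.e., ordinary min-max scaling of each feature into [0, 1]. The RuntimeWarnings in the output below most likely come from columns containing NaN or infinite values (e.g., ratios with a zero denominator).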

Output:

(611261, 54)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:51: RuntimeWarning:

invalid value encountered in subtract

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:51: RuntimeWarning:

invalid value encountered in true_divide

Input: train LightGBM (next time we can go through what the various boosting parameters mean in detail)

X_train, X_test, up_train, up_test, r_train, r_test = model_selection.train_test_split(X, up, r, test_size=0.1, random_state=99)

# xgb_up = XGBClassifier(n_jobs=4,
#                        n_estimators=300,
#                        max_depth=3,
#                        eta=0.15,
#                        random_state=42)
params = {'learning_rate': 0.01, 'max_depth': 12, 'boosting': 'gbdt', 'objective': 'binary', 'metric': 'auc', 'is_training_metric': True, 'seed': 42}
model = lgb.train(params, train_set=lgb.Dataset(X_train, label=up_train), num_boost_round=2000,
                  valid_sets=[lgb.Dataset(X_train, label=up_train), lgb.Dataset(X_test, label=up_test)],
                  verbose_eval=100, early_stopping_rounds=100)

Training log: watch the AUC

Training until validation scores don't improve for 100 rounds.
[100]	valid_0's auc: 0.570258	valid_1's auc: 0.566332
[200]	valid_0's auc: 0.573703	valid_1's auc: 0.567868
[300]	valid_0's auc: 0.577024	valid_1's auc: 0.568927
[400]	valid_0's auc: 0.580109	valid_1's auc: 0.569985
[500]	valid_0's auc: 0.582933	valid_1's auc: 0.570694
[600]	valid_0's auc: 0.585372	valid_1's auc: 0.571191
[700]	valid_0's auc: 0.58784	valid_1's auc: 0.571578
[800]	valid_0's auc: 0.590147	valid_1's auc: 0.571726
[900]	valid_0's auc: 0.592448	valid_1's auc: 0.571908
[1000]	valid_0's auc: 0.594658	valid_1's auc: 0.57203
[1100]	valid_0's auc: 0.596887	valid_1's auc: 0.572259
[1200]	valid_0's auc: 0.598918	valid_1's auc: 0.572422
[1300]	valid_0's auc: 0.601052	valid_1's auc: 0.572563
[1400]	valid_0's auc: 0.603196	valid_1's auc: 0.57269
[1500]	valid_0's auc: 0.605227	valid_1's auc: 0.572756
[1600]	valid_0's auc: 0.60723	valid_1's auc: 0.572837
[1700]	valid_0's auc: 0.609211	valid_1's auc: 0.572897
[1800]	valid_0's auc: 0.611181	valid_1's auc: 0.573038
[1900]	valid_0's auc: 0.613095	valid_1's auc: 0.573162
[2000]	valid_0's auc: 0.615015	valid_1's auc: 0.573307
Did not meet early stopping. Best iteration is:
[2000]	valid_0's auc: 0.615015	valid_1's auc: 0.573307

Input: inspect feature importance

def generate_color():
    color = '#{:02x}{:02x}{:02x}'.format(*map(lambda x: np.random.randint(0, 255), range(3)))
    return color

df = pd.DataFrame({'imp': model.feature_importance(), 'col':fcol})
df = df.sort_values(['imp','col'], ascending=[True, False])
data = [df]
for dd in data:  
    colors = []
    for i in range(len(dd)):
         colors.append(generate_color())

    data = [
        go.Bar(
        orientation = 'h',
        x=dd.imp,
        y=dd.col,
        name='Features',
        textfont=dict(size=20),
            marker=dict(
            color= colors,
            line=dict(
                color='#000000',
                width=0.5
            ),
            opacity = 0.87
        )
    )
    ]
    layout= go.Layout(
        title= 'Feature Importance of LGB',
        xaxis= dict(title='Importance', ticklen=5, zeroline=False, gridwidth=2),
        yaxis=dict(title='Feature', ticklen=5, gridwidth=2),
        showlegend=True
    )

    py.iplot(dict(data=data,layout=layout), filename='horizontal-bar')

Input: make the submission

days = env.get_prediction_days()
import time

n_days = 0
prep_time = 0
prediction_time = 0
packaging_time = 0
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    n_days +=1
    if n_days % 50 == 0:
        print(n_days,end=' ')
    
    t = time.time()
    market_obs_df = data_prep(market_obs_df, news_obs_df)
    market_obs_df = market_obs_df[market_obs_df.assetCode.isin(predictions_template_df.assetCode)]
    X_live = market_obs_df[fcol].values
    X_live = 1 - ((maxs - X_live) / rng)
    prep_time += time.time() - t
    
    t = time.time()
    lp = model.predict(X_live)
    prediction_time += time.time() -t
    
    t = time.time()
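    # map the predicted up-probability from [0, 1] to a signed confidence in [-1, 1]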
    confidence = 2 * lp -1
    preds = pd.DataFrame({'assetCode':market_obs_df['assetCode'],'confidence':confidence})
    predictions_template_df = predictions_template_df.merge(preds,how='left').drop('confidenceValue',axis=1).fillna(0).rename(columns={'confidence':'confidenceValue'})
    env.predict(predictions_template_df)
    packaging_time += time.time() - t
    
env.write_submission_file()

Output:

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning:

invalid value encountered in true_divide

50 100 150 
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning:

invalid value encountered in subtract

200 250 300 350 400 450 500 550 600 Your submission file has been saved. Once you `Commit` your Kernel and it finishes running, you can submit the file to the competition from the Kernel Viewer `Output` tab.

Summary

  • This kernel spent a great deal of time on EDA: it handled the outliers properly and analyzed the data thoroughly, and high-quality data absolutely lifts the final result by a good margin. The modeling part simply calls LightGBM (probably a pragmatic trade-off between effort and payoff). When the chance comes, we will talk in detail about how the parameters of the various boosting implementations are derived from theory and how they ultimately affect model performance.