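WeChat official account: AIKaggle. Suggestions and criticism are welcome. If you need resources, leave a message on the official account; if AIKaggle has helped you, tap the "Wow" button.

As an interlude in the series on the past and present of Boosting algorithms, today's post covers a very interesting Kaggle competition: using news data to predict financial markets. The competition was launched on Kaggle by the hedge fund Two Sigma with a prize pool of US$100,000, and it attracted many top Kagglers who shared their ideas. This post introduces the competition's background, data, submission process, and evaluation. It then walks through one solution, covering exploratory data analysis (EDA), including time-series analysis, data visualization, and outlier handling, and ends with a submission. The kernel was written by Andrew Lukyanenko. In a nutshell: two hours of EDA, five minutes of modeling.

Can we predict stock price performance by analyzing news content? Today, data from many sources lets investors make better decisions; the challenge lies in extracting the useful information from this ocean of data and putting it to work. Two Sigma hosted this competition on Kaggle so that Kagglers could advance this line of research and explore how news can be used to predict stock prices. The results could have a significant economic impact worldwide.

The data for this competition comes from the following sources:

- Market data is provided by Intrinio.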
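Each prediction is scored against the 10-day market-adjusted return of the given `assetCode`. If you think a stock will have a large positive return relative to the broad market over the next ten days, assign it a large positive `confidenceValue` (close to $1.0$). If you think it will have a negative return, assign it a large negative `confidenceValue` (close to $-1.0$). If you are unsure, assign a `confidenceValue` close to zero.

The `universe` boolean variable $u_{ti}$ (see the data description for details) indicates whether asset $i$ is included in scoring on day $t$.

This is a Kernels-only, two-stage competition in which the second stage is a genuine prediction of the future. In stage one, participants build models and the leaderboard reflects scores on a historical period. At the end of stage one, code is frozen and the leaderboard switches to showing scores on future data. Kaggle re-runs the Kernels that participants selected on the future data and re-submits the submission files those Kernels produce.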
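To make the roles of `confidenceValue` and the universe flag concrete, here is a minimal sketch of a score calculation. It assumes the metric has the form described on the competition's evaluation page as we understand it: each day's value is the sum over assets of confidence times return times universe flag, and the final score is the mean of the daily values divided by their standard deviation. The function name, the toy numbers, the use of the sample standard deviation, and the handling of a zero spread are all assumptions made for illustration.

```python
from statistics import mean, stdev

def competition_score(days):
    """days: list of days; each day is a list of (confidence, return, in_universe)."""
    daily = []
    for rows in days:
        # x_t = sum_i confidence_ti * return_ti * universe_ti
        daily.append(sum(c * r * u for c, r, u in rows))
    spread = stdev(daily)  # sample standard deviation (an assumption)
    # Treat a zero spread as a score of 0 to avoid dividing by zero (assumed convention)
    return mean(daily) / spread if spread > 0 else 0.0

toy_days = [
    [(0.8, 0.02, 1), (-0.5, -0.01, 1), (0.9, 0.10, 0)],  # last asset is outside the universe
    [(0.3, -0.02, 1), (-0.7, -0.03, 1)],
    [(1.0, 0.01, 1), (0.0, 0.05, 1)],
]
print(round(competition_score(toy_days), 4))
```

Under this form, a model is rewarded for being right consistently across days, not for a few large wins, because volatile daily values inflate the denominator.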
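All submissions in this competition are made through the Kernels environment. The environment provides a custom Python module that participants must use to access the competition data, make predictions, and write a properly formatted submission file. The module ensures that a model has no access to future information when it makes predictions. For simplicity, a submission file covers both the historical stage-one period and the future stage-two period. This means that only one submission file is "active" at any given time (participants predict over the time spans of both stages at once, as shown in the figure above). During scoring, Kaggle ignores predictions that fall outside the current stage.

When `env.write_submission_file()` is called, the Kernels environment formats and creates the submission file automatically, so there is no need to build it by hand. The file looks like this:

    time,assetCode,confidenceValue
    2017-01-03,RPXC.O,0.1
    2017-01-04,RPXC.O,0.02
    2017-01-05,RPXC.O,-0.3
    etc.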
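As a quick sanity check on that layout, the sketch below parses a few submission rows with Python's standard `csv` module and verifies that every `confidenceValue` lies in $[-1, 1]$. The helper name `check_submission` is hypothetical, and because the Kernels environment writes the real file for you, this only illustrates the format; it is not part of the competition API.

```python
import csv
import io

SAMPLE = """time,assetCode,confidenceValue
2017-01-03,RPXC.O,0.1
2017-01-04,RPXC.O,0.02
2017-01-05,RPXC.O,-0.3
"""

def check_submission(text):
    """Return parsed rows, raising ValueError on an out-of-range value (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        value = float(row["confidenceValue"])
        if not -1.0 <= value <= 1.0:
            raise ValueError(f"confidenceValue out of range: {row}")
        rows.append((row["time"], row["assetCode"], value))
    return rows

rows = check_submission(SAMPLE)
print(len(rows), rows[-1])
```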
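Each asset is identified by an `assetCode` (note that a single company may have multiple `assetCode`s). Depending on what you want to do, you can use `assetCode`, `assetName`, or `time` to join the market data with the news data.

Within the market data, you will find the following columns: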
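- `time` (datetime64[ns, UTC]) - the current time (in the market data, all rows are taken at 22:00 UTC)
- `assetCode` (object) - a unique ID of an asset
- `assetName` (category) - the name that corresponds to a group of `assetCodes`. It may be "Unknown" if the corresponding `assetCode` has no rows in the news data.
- `universe` (float64) - a boolean indicating whether the instrument is included in scoring on that day. This value is not provided outside the training data period. The trading universe on a given date is the set of instruments available for trading, and the scoring function does not score instruments outside the universe. The universe changes daily.
- `volume` (float64) - trading volume in shares for the day
- `close` (float64) - the close price for the day (not adjusted for splits or dividends)
- `open` (float64) - the open price for the day (not adjusted for splits or dividends)
- `returnsClosePrevRaw1` (float64) - see the returns explanation above
- `returnsOpenPrevRaw1` (float64) - see the returns explanation above
- `returnsClosePrevMktres1` (float64) - see the returns explanation above
- `returnsOpenPrevMktres1` (float64) - see the returns explanation above
- `returnsClosePrevRaw10` (float64) - see the returns explanation above
- `returnsOpenPrevRaw10` (float64) - see the returns explanation above
- `returnsClosePrevMktres10` (float64) - see the returns explanation above
- `returnsOpenPrevMktres10` (float64) - see the returns explanation above
- `returnsOpenNextMktres10` (float64) - the market-residualized return over the next 10 days. This is the target variable used in competition scoring. The market data has been filtered so that `returnsOpenNextMktres10` is never null.

The news data contains information about news articles and assets: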
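- `time` (datetime64[ns, UTC]) - UTC timestamp showing when the data became available on the feed
- `sourceTimestamp` (datetime64[ns, UTC]) - UTC timestamp of when this news item was created
- `firstCreated` (datetime64[ns, UTC]) - UTC timestamp for the first version of the item
- `sourceId` (object) - an ID for the news item
- `headline` (object) - the item's headline
- `urgency` (int8) - the story type (1: alert, 3: article)
- `takeSequence` (int16) - the take sequence number of the news item, starting at 1. For a given story, alerts and articles have separate sequences.
- `provider` (category) - identifier of the organization that provided the news item (e.g. RTRS for Reuters News, BSW for Business Wire)
- `subjects` (category) - topic codes and company identifiers related to this news item. Topic codes describe the item's subject matter; they can cover asset classes, geographies, events, industries/sectors, and other types.
- `audiences` (category) - identifies which desktop news product(s) the news item belongs to. These are typically tailored to specific audiences (e.g. "M" for Money International News Service and "FB" for French General News Service).
- `bodySize` (int32) - size of the current version of the story body
- `companyCount` (int8) - number of companies explicitly listed in the news item
- `headlineTag` (object) - the Thomson Reuters headline tag for the news item
- `marketCommentary` (bool) - boolean indicating whether the item discusses general market conditions
- `sentenceCount` (int16) - total number of sentences in the news item
- `wordCount` (int32) - total number of words in the news item
- `assetCodes` (category) - asset codes mentioned in the item
- `assetName` (category) - name of the asset
- `firstMentionSentence` (int16) - the first sentence in which the scored asset is mentioned
- `relevance` (float32) - a decimal number indicating the relevance of the news item to the asset, ranging from 0 to 1. If the asset is mentioned in the headline, relevance is set to 1. When the item is an alert (`urgency == 1`), relevance should be gauged by `firstMentionSentence` instead.

There are many more columns of a similar kind; they are omitted here for brevity. Next, we walk through the EDA from one kernel.

Input: import the packages we need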

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    import datetime
    import lightgbm as lgb
    from scipy import stats
    from scipy.sparse import hstack, csr_matrix
    from sklearn.model_selection import train_test_split
    from wordcloud import WordCloud
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.util import ngrams
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import StandardScaler
    stop = set(stopwords.words('english'))
    import plotly.offline as py
    py.init_notebook_mode(connected=True)
    import plotly.graph_objs as go
    import plotly.tools as tls
    from xgboost import XGBClassifier
    from sklearn import model_selection
    from sklearn.metrics import accuracy_score

Input: the official way to get the data

    # official way to get the data
    from kaggle.competitions import twosigmanews
    env = twosigmanews.make_env()
    print('Done!')

Output: loaded successfully

    Loading the data... This could take a minute.
    Done!
    Done!

Input: there are two datasets, market data and news data; we explore each in turn.

    (market_train_df, news_train_df) = env.get_training_data()

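This is a very interesting dataset containing stock prices for many companies over more than a decade. Let's look at the data itself: we can see long-term trends, companies rising and declining, and much more.

Input: print the dimensions of the market data

    print(f'{market_train_df.shape[0]} samples and {market_train_df.shape[1]} features in the training market dataset.')

Output: number of samples and number of features

    4072956 samples and 16 features in the training market dataset.

Input: see what the first five rows look like

    market_train_df.head()

Output: a DataFrame with the first five rows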
Input: randomly pick 10 assets and plot how their closing prices change over time.

    data = []
    for asset in np.random.choice(market_train_df['assetName'].unique(), 10):
        asset_df = market_train_df[(market_train_df['assetName'] == asset)]
        data.append(go.Scatter(
            x = asset_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
            y = asset_df['close'].values,
            name = asset
        ))
    layout = go.Layout(dict(title = "Closing prices of 10 random assets",
                      xaxis = dict(title = 'Month'),
                      yaxis = dict(title = 'Price (USD)'),
                      ),legend=dict(orientation="h"))
    py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: closing-price curves over time for the 10 randomly chosen assets

![plot2.PNG](https://upload-images.jianshu.io/upload_images/19514105-ca6033e21c4c752b.PNG?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
Input: trends of the closing-price quantiles

    data = []
    #market_train_df['close'] = market_train_df['close'] / 20
    for i in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
        price_df = market_train_df.groupby('time')['close'].quantile(i).reset_index()
        data.append(go.Scatter(
            x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
            y = price_df['close'].values,
            name = f'{i} quantile'
        ))
    layout = go.Layout(dict(title = "Trends of closing prices by quantiles",
                      xaxis = dict(title = 'Month'),
                      yaxis = dict(title = 'Price (USD)'),
                      ),legend=dict(orientation="h"))
    py.iplot(dict(data=data, layout=layout), filename='basic-line')

Analysis: it is exciting to watch the market fall and then rise again. Around the sharp market drops, note that the higher-quantile prices increase over time while the lower-quantile prices decrease. Perhaps the gap between rich and poor is widening...
Input: look at the price drops in more detail. Compute the difference between each day's close and open prices, then average the daily standard deviation of that difference.

    market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
    grouped = market_train_df.groupby('time').agg({'price_diff': ['std', 'min']}).reset_index()
    print(f"Average standard deviation of price change within a day in {grouped['price_diff']['std'].mean():.4f}.")

Output: the average daily standard deviation is 1.0335

    Average standard deviation of price change within a day in 1.0335.

Input: visualize the ten days with the largest deviation (the data is grouped by `time`, so each point is one trading day)

    g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
    g['min_text'] = 'Maximum price drop: ' + (-1 * g['price_diff']['min']).astype(str)
    trace = go.Scatter(
        x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = g['price_diff']['std'].values,
        mode='markers',
        marker=dict(
            size = g['price_diff']['std'].values,
            color = g['price_diff']['std'].values,
            colorscale='Portland',
            showscale=True
        ),
        text = g['min_text'].values
        #text = f"Maximum price drop: {g['price_diff']['min'].values}"
        #g['time'].dt.strftime(date_format='%Y-%m-%d').values
    )
    data = [trace]
    layout= go.Layout(
        autosize= True,
        title= 'Top 10 days by standard deviation of price change within a day',
        hovermode= 'closest',
        yaxis=dict(
            title= 'price_diff',
            ticklen= 5,
            gridwidth= 2,
        ),
        showlegend= False
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig,filename='scatter2010')

Output: one date shows a very large deviation. A guess at the cause: do prices swing wildly when the market crashes? That does not seem plausible here, because no market crash happened in January 2010... The spike is probably caused by outliers, which we need to handle next.
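Input: look at the 10 rows with the largest price drop between open and close

    market_train_df.sort_values('price_diff')[:10]

Output:

Analysis: the price of "Towers Watson & Co" stock is almost 10000... This is very likely the kind of outlier we are looking for. But what about Bank of New York Mellon Corp? Checking Yahoo's data shows that the Bank of New York Mellon Corp figures match those provided by the competition.

Archrock Inc has a price equal to 999, which also looks suspicious. Checking Archrock Inc on Yahoo confirms it: we have found another outlier.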
Input: how many rows have a daily price move of 20% or more

    market_train_df['close_to_open'] = np.abs(market_train_df['close'] / market_train_df['open'])
    print(f"In {(market_train_df['close_to_open'] >= 1.2).sum()} lines price increased by 20% or more.")
    print(f"In {(market_train_df['close_to_open'] <= 0.8).sum()} lines price decreased by 20% or more.")

Output:

    In 1211 lines price increased by 20% or more.
    In 778 lines price decreased by 20% or more.

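Input: keep digging for odd cases. How many rows have a close that is at least double the open, or at most half of it? (The kernel's messages call both cases a move of "100% or more"; a ratio of 0.5 is strictly a 50% drop.)

    print(f"In {(market_train_df['close_to_open'] >= 2).sum()} lines price increased by 100% or more.")
    print(f"In {(market_train_df['close_to_open'] <= 0.5).sum()} lines price decreased by 100% or more.")

Output:

    In 38 lines price increased by 100% or more.
    In 16 lines price decreased by 100% or more.
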
Input: let's assume that rows where the price at least doubled or halved within a day are outliers that need replacing. A quick fix is to replace the outlying value in these rows with the company's mean open or close price (the median or mode would also work).

    market_train_df['assetName_mean_open'] = market_train_df.groupby('assetName')['open'].transform('mean')
    market_train_df['assetName_mean_close'] = market_train_df.groupby('assetName')['close'].transform('mean')

    # if open price is too far from mean open price for this company, replace it. Otherwise replace close price.
    # .loc with column names is used here instead of positional .iloc[i, 5] / .iloc[i, 4],
    # which only works while the index is 0..n-1 and the column order is unchanged.
    for i, row in market_train_df.loc[market_train_df['close_to_open'] >= 2].iterrows():
        if np.abs(row['assetName_mean_open'] - row['open']) > np.abs(row['assetName_mean_close'] - row['close']):
            market_train_df.loc[i, 'open'] = row['assetName_mean_open']
        else:
            market_train_df.loc[i, 'close'] = row['assetName_mean_close']

    for i, row in market_train_df.loc[market_train_df['close_to_open'] <= 0.5].iterrows():
        if np.abs(row['assetName_mean_open'] - row['open']) > np.abs(row['assetName_mean_close'] - row['close']):
            market_train_df.loc[i, 'open'] = row['assetName_mean_open']
        else:
            market_train_df.loc[i, 'close'] = row['assetName_mean_close']

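The row-by-row replacement can also be written without `iterrows`. The sketch below applies the same rule (replace whichever of open or close lies further from the company's mean) to a small made-up table using boolean masks. The toy prices and the helper name `fix_outliers` are assumptions for illustration and are not part of the original kernel.

```python
import pandas as pd

def fix_outliers(df, upper=2.0, lower=0.5):
    """Replace the more deviant of open/close with the asset's mean price (sketch)."""
    df = df.copy()
    mean_open = df.groupby('assetName')['open'].transform('mean')
    mean_close = df.groupby('assetName')['close'].transform('mean')
    ratio = (df['close'] / df['open']).abs()
    suspicious = (ratio >= upper) | (ratio <= lower)
    # Decide which side of the pair is further from its own per-asset mean
    open_is_worse = (mean_open - df['open']).abs() > (mean_close - df['close']).abs()
    fix_open = suspicious & open_is_worse
    fix_close = suspicious & ~open_is_worse
    df.loc[fix_open, 'open'] = mean_open[fix_open]
    df.loc[fix_close, 'close'] = mean_close[fix_close]
    return df

toy = pd.DataFrame({
    'assetName': ['A', 'A', 'A', 'B', 'B'],
    'open':  [10.0, 11.0, 999.0, 50.0, 51.0],   # 999.0 is the planted outlier
    'close': [10.5, 10.8, 10.9, 50.5, 50.0],
})
cleaned = fix_outliers(toy)
print(cleaned)
```

Note that, as in the loop version, the per-asset mean still includes the outlier itself, so the replacement value is pulled toward it; the median mentioned above may be the more robust choice.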
Input: visualize the deviation again

    market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
    grouped = market_train_df.groupby(['time']).agg({'price_diff': ['std', 'min']}).reset_index()
    g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
    g['min_text'] = 'Maximum price drop: ' + (-1 * np.round(g['price_diff']['min'], 2)).astype(str)
    trace = go.Scatter(
        x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = g['price_diff']['std'].values,
        mode='markers',
        marker=dict(
            size = g['price_diff']['std'].values * 5,
            color = g['price_diff']['std'].values,
            colorscale='Portland',
            showscale=True
        ),
        text = g['min_text'].values
        #text = f"Maximum price drop: {g['price_diff']['min'].values}"
        #g['time'].dt.strftime(date_format='%Y-%m-%d').values
    )
    data = [trace]
    layout= go.Layout(
        autosize= True,
        title= 'Top 10 days by standard deviation of price change within a day',
        hovermode= 'closest',
        yaxis=dict(
            title= 'price_diff',
            ticklen= 5,
            gridwidth= 2,
        ),
        showlegend= False
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig,filename='scatter2010')

Output: it looks normal now
Input: look at the target variable

    data = []
    for i in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
        price_df = market_train_df.groupby('time')['returnsOpenNextMktres10'].quantile(i).reset_index()
        data.append(go.Scatter(
            x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
            y = price_df['returnsOpenNextMktres10'].values,
            name = f'{i} quantile'
        ))
    layout = go.Layout(dict(title = "Trends of returnsOpenNextMktres10 by quantiles",
                      xaxis = dict(title = 'Month'),
                      yaxis = dict(title = 'Return'),  # the plotted values are returns, not prices
                      ),legend=dict(orientation="h"),)
    py.iplot(dict(data=data, layout=layout), filename='basic-line')

Input: we can see that the quantiles deviate a lot while the mean changes little. We keep only the data from 2010 onward and look at the target variable again.

    data = []
    market_train_df = market_train_df.loc[market_train_df['time'] >= '2010-01-01 22:00:00+0000']
    price_df = market_train_df.groupby('time')['returnsOpenNextMktres10'].mean().reset_index()
    data.append(go.Scatter(
        x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = price_df['returnsOpenNextMktres10'].values,
        name = 'mean'  # was f'{i} quantile', which reused a stale loop variable
    ))
    layout = go.Layout(dict(title = "Trend of returnsOpenNextMktres10 mean",
                      xaxis = dict(title = 'Month'),
                      yaxis = dict(title = 'Return'),  # the plotted values are returns, not prices
                      ),legend=dict(orientation="h"),)
    py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: the fluctuations look large, but they are in fact all below 8%, much like random noise...
Input: look at the variables prefixed with `returns`

    data = []
    for col in ['returnsClosePrevRaw1', 'returnsOpenPrevRaw1',
                'returnsClosePrevMktres1', 'returnsOpenPrevMktres1',
                'returnsClosePrevRaw10', 'returnsOpenPrevRaw10',
                'returnsClosePrevMktres10', 'returnsOpenPrevMktres10',
                'returnsOpenNextMktres10']:
        df = market_train_df.groupby('time')[col].mean().reset_index()
        data.append(go.Scatter(
            x = df['time'].dt.strftime(date_format='%Y-%m-%d').values,
            y = df[col].values,
            name = col
        ))
    layout = go.Layout(dict(title = "Trend of mean values",
                      xaxis = dict(title = 'Month'),
                      yaxis = dict(title = 'Return'),  # the plotted values are returns, not prices
                      ),legend=dict(orientation="h"),)
    py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: the returns over the previous 10 days appear to fluctuate the most.
Input: now the news data, its first rows and its dimensions

    news_train_df.head()
    print(f'{news_train_df.shape[0]} samples and {news_train_df.shape[1]} features in the training news dataset.')

Output

    9328827 samples and 35 features in the training news dataset.

Input: the file is too large to process the text directly, so we first look at a word cloud built from the last 1,000,000 headlines.

    text = ' '.join(news_train_df['headline'].str.lower().values[-1000000:])
    wordcloud = WordCloud(max_font_size=None, stopwords=stop, background_color='white',
                          width=1200, height=1000).generate(text)
    plt.figure(figsize=(12, 8))
    plt.imshow(wordcloud)
    plt.title('Top words in headline')
    plt.axis("off")
    plt.show()

Output:
Input: about `urgency`

    # Let's also limit the time period
    news_train_df = news_train_df.loc[news_train_df['time'] >= '2010-01-01 22:00:00+0000']
    # kind= is passed by keyword; recent pandas versions reject a positional plot kind
    (news_train_df['urgency'].value_counts() / 1000000).plot(kind='bar');
    plt.xticks(rotation=30);
    plt.title('Urgency counts (mln)');

Output: there are almost no rows with an urgency of 2.
Input: statistics on the number of words per sentence

    news_train_df['sentence_word_count'] = news_train_df['wordCount'] / news_train_df['sentenceCount']
    plt.boxplot(news_train_df['sentence_word_count'][news_train_df['sentence_word_count'] < 40]);

Output: there are no obvious outliers; most sentences have 15-25 words.
Input: the most frequent news providers

    news_train_df['provider'].value_counts().head(10)

Output: Reuters is by far the largest provider.

    RTRS    5517624
    PRN      503267
    BSW      472612
    GNW      145309
    MKW      129621
    LSE       64250
    HIIS      56489
    RNS       39833
    CNW       30779
    ONE       25233
    Name: provider, dtype: int64

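Input: headline tag types

    # kind= is passed by keyword; recent pandas versions reject a positional plot kind
    (news_train_df['headlineTag'].value_counts() / 1000)[:10].plot(kind='barh');
    plt.title('headlineTag counts (thousands)');

Output: a large share of the headline tags are missing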
Input: sentiment analysis

    for i, j in zip([-1, 0, 1], ['negative', 'neutral', 'positive']):
        df_sentiment = news_train_df.loc[news_train_df['sentimentClass'] == i, 'assetName']
        print(f'Top mentioned companies for {j} sentiment are:')
        print(df_sentiment.value_counts().head(5))
        print('')

Output: Apple ranks first for both positive and negative sentiment.

    Top mentioned companies for negative sentiment are:
    Apple Inc                  22518
    JPMorgan Chase & Co        20647
    BP PLC                     19328
    Goldman Sachs Group Inc    17955
    Bank of America Corp       17704
    Name: assetName, dtype: int64

    Top mentioned companies for neutral sentiment are:
    HSBC Holdings PLC    19462
    Credit Suisse AG     14632
    Deutsche Bank AG     12959
    Barclays PLC         12414
    Apple Inc            10994
    Name: assetName, dtype: int64

    Top mentioned companies for positive sentiment are:
    Apple Inc                19020
    Barclays PLC             18051
    Royal Dutch Shell PLC    15484
    General Electric Co      14163
    Boeing Co                14080
    Name: assetName, dtype: int64

Input: add some features that may help the model train better, for example the daily price move (the ratio of close to open), then min-max scale the features to reduce the influence of their absolute magnitudes on the result.

    #%%time
    # code mostly taken from this kernel: https://www.kaggle.com/ashishpatel26/bird-eye-view-of-two-sigma-xgb

    def data_prep(market_df,news_df):
        # market features
        market_df['time'] = market_df.time.dt.date
        market_df['returnsOpenPrevRaw1_to_volume'] = market_df['returnsOpenPrevRaw1'] / market_df['volume']
        market_df['close_to_open'] = market_df['close'] / market_df['open']
        market_df['volume_to_mean'] = market_df['volume'] / market_df['volume'].mean()
        # news features
        news_df['sentence_word_count'] = news_df['wordCount'] / news_df['sentenceCount']
        news_df['time'] = news_df.time.dt.hour
        news_df['sourceTimestamp']= news_df.sourceTimestamp.dt.hour
        news_df['firstCreated'] = news_df.firstCreated.dt.date
        news_df['assetCodesLen'] = news_df['assetCodes'].map(lambda x: len(eval(x)))
        news_df['assetCodes'] = news_df['assetCodes'].map(lambda x: list(eval(x))[0])
        news_df['headlineLen'] = news_df['headline'].apply(lambda x: len(x))
        news_df['assetCodesLen'] = news_df['assetCodes'].apply(lambda x: len(x))
        news_df['asset_sentiment_count'] = news_df.groupby(['assetName', 'sentimentClass'])['time'].transform('count')
        news_df['asset_sentence_mean'] = news_df.groupby(['assetName', 'sentenceCount'])['time'].transform('mean')
        lbl = {k: v for v, k in enumerate(news_df['headlineTag'].unique())}
        news_df['headlineTagT'] = news_df['headlineTag'].map(lbl)
        # aggregate news per (date, asset) and join it onto the market rows
        kcol = ['firstCreated', 'assetCodes']
        news_df = news_df.groupby(kcol, as_index=False).mean()

        market_df = pd.merge(market_df, news_df, how='left', left_on=['time', 'assetCode'],
                             right_on=['firstCreated', 'assetCodes'])

        lbl = {k: v for v, k in enumerate(market_df['assetCode'].unique())}
        market_df['assetCodeT'] = market_df['assetCode'].map(lbl)

        market_df = market_df.dropna(axis=0)

        return market_df

    market_train_df.drop(['price_diff', 'assetName_mean_open', 'assetName_mean_close'], axis=1, inplace=True)
    market_train = data_prep(market_train_df, news_train_df)
    print(market_train.shape)
    up = market_train.returnsOpenNextMktres10 >= 0

    fcol = [c for c in market_train.columns if c not in ['assetCode', 'assetCodes', 'assetCodesLen', 'assetName', 'assetCodeT',
                                                 'firstCreated', 'headline', 'headlineTag', 'marketCommentary', 'provider',
                                                 'returnsOpenNextMktres10', 'sourceId', 'subjects', 'time', 'time_x', 'universe','sourceTimestamp']]

    X = market_train[fcol].values
    up = up.values
    r = market_train.returnsOpenNextMktres10.values

    # Scaling of X values (min-max scaling; equivalent to (X - mins) / rng)
    mins = np.min(X, axis=0)
    maxs = np.max(X, axis=0)
    rng = maxs - mins
    X = 1 - ((maxs - X) / rng)

Output

    (611261, 54)
    /opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:51: RuntimeWarning: invalid value encountered in subtract
    /opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:51: RuntimeWarning: invalid value encountered in true_divide

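The `RuntimeWarning` messages in this output are what NumPy emits when an arithmetic operation produces an invalid result, for instance `0 / 0` when a column is constant (so `maxs - mins` is zero) or `inf - inf`. We have not checked which columns trigger them in this kernel, so the following is only a sketch of one way to guard the scaling: it uses the algebraically equivalent form `(X - mins) / rng` and maps constant columns to 0. The function name `min_max_scale` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def min_max_scale(X, mins=None, maxs=None):
    """Scale columns to [0, 1]; constant columns become 0 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mins = np.min(X, axis=0) if mins is None else mins
    maxs = np.max(X, axis=0) if maxs is None else maxs
    rng = maxs - mins
    safe_rng = np.where(rng == 0, 1.0, rng)  # avoid 0 / 0 for constant columns
    # (X - mins) / rng is the same transform as the kernel's 1 - (maxs - X) / rng
    return (X - mins) / safe_rng, mins, maxs

X_toy = np.array([[1.0, 5.0, 10.0],
                  [2.0, 5.0, 30.0],
                  [3.0, 5.0, 20.0]])  # the middle column is constant
X_scaled, mins, maxs = min_max_scale(X_toy)
print(X_scaled)
```

Passing the training `mins` and `maxs` back in applies the same transform to new data, which mirrors how the submission loop reuses `maxs` and `rng`.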
Input: train LightGBM (a future post can go into what the parameters of the various boosting algorithms mean)

    X_train, X_test, up_train, up_test, r_train, r_test = model_selection.train_test_split(X, up, r, test_size=0.1, random_state=99)

    # xgb_up = XGBClassifier(n_jobs=4,
    #                        n_estimators=300,
    #                        max_depth=3,
    #                        eta=0.15,
    #                        random_state=42)

    params = {'learning_rate': 0.01, 'max_depth': 12, 'boosting': 'gbdt', 'objective': 'binary', 'metric': 'auc', 'is_training_metric': True, 'seed': 42}
    # verbose_eval / early_stopping_rounds work with the LightGBM version used in this kernel;
    # newer releases expect the lgb.log_evaluation and lgb.early_stopping callbacks instead.
    model = lgb.train(params, train_set=lgb.Dataset(X_train, label=up_train), num_boost_round=2000,
                      valid_sets=[lgb.Dataset(X_train, label=up_train), lgb.Dataset(X_test, label=up_test)],
                      verbose_eval=100, early_stopping_rounds=100)

Training log: watch the AUC (`valid_0` is the training set, `valid_1` the held-out 10%)

    Training until validation scores don't improve for 100 rounds.
    [100]	valid_0's auc: 0.570258	valid_1's auc: 0.566332
    [200]	valid_0's auc: 0.573703	valid_1's auc: 0.567868
    [300]	valid_0's auc: 0.577024	valid_1's auc: 0.568927
    [400]	valid_0's auc: 0.580109	valid_1's auc: 0.569985
    [500]	valid_0's auc: 0.582933	valid_1's auc: 0.570694
    [600]	valid_0's auc: 0.585372	valid_1's auc: 0.571191
    [700]	valid_0's auc: 0.58784	valid_1's auc: 0.571578
    [800]	valid_0's auc: 0.590147	valid_1's auc: 0.571726
    [900]	valid_0's auc: 0.592448	valid_1's auc: 0.571908
    [1000]	valid_0's auc: 0.594658	valid_1's auc: 0.57203
    [1100]	valid_0's auc: 0.596887	valid_1's auc: 0.572259
    [1200]	valid_0's auc: 0.598918	valid_1's auc: 0.572422
    [1300]	valid_0's auc: 0.601052	valid_1's auc: 0.572563
    [1400]	valid_0's auc: 0.603196	valid_1's auc: 0.57269
    [1500]	valid_0's auc: 0.605227	valid_1's auc: 0.572756
    [1600]	valid_0's auc: 0.60723	valid_1's auc: 0.572837
    [1700]	valid_0's auc: 0.609211	valid_1's auc: 0.572897
    [1800]	valid_0's auc: 0.611181	valid_1's auc: 0.573038
    [1900]	valid_0's auc: 0.613095	valid_1's auc: 0.573162
    [2000]	valid_0's auc: 0.615015	valid_1's auc: 0.573307
    Did not meet early stopping. Best iteration is:
    [2000]	valid_0's auc: 0.615015	valid_1's auc: 0.573307

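For readers less familiar with the metric in this log: AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 is chance level and a validation value of about 0.573 is a modest edge. The sketch below computes AUC directly from that definition on made-up labels and scores. The function name `pairwise_auc` and the data are illustrative assumptions, and the quadratic loop is written for clarity, not speed.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.55, 0.2]
print(pairwise_auc(labels, scores))  # 7 of the 9 pairs are ordered correctly
```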
Input: look at feature importance

    def generate_color():
        color = '#{:02x}{:02x}{:02x}'.format(*map(lambda x: np.random.randint(0, 255), range(3)))
        return color

    df = pd.DataFrame({'imp': model.feature_importance(), 'col':fcol})
    df = df.sort_values(['imp','col'], ascending=[True, False])
    data = [df]
    for dd in data:
        # one random color per bar
        colors = []
        for i in range(len(dd)):
            colors.append(generate_color())

        data = [
            go.Bar(
                orientation = 'h',
                x=dd.imp,
                y=dd.col,
                name='Features',
                textfont=dict(size=20),
                marker=dict(
                    color= colors,
                    line=dict(
                        color='#000000',
                        width=0.5
                    ),
                    opacity = 0.87
                )
            )
        ]
        layout= go.Layout(
            title= 'Feature Importance of LGB',
            xaxis= dict(title='Columns', ticklen=5, zeroline=False, gridwidth=2),
            yaxis=dict(title='Value Count', ticklen=5, gridwidth=2),
            showlegend=True
        )

        py.iplot(dict(data=data,layout=layout), filename='horizontal-bar')

Input: submit

    days = env.get_prediction_days()
    import time

    n_days = 0
    prep_time = 0
    prediction_time = 0
    packaging_time = 0
    for (market_obs_df, news_obs_df, predictions_template_df) in days:
        n_days +=1
        if n_days % 50 == 0:
            print(n_days,end=' ')

        # build features for the current day and scale them with the training statistics
        t = time.time()
        market_obs_df = data_prep(market_obs_df, news_obs_df)
        market_obs_df = market_obs_df[market_obs_df.assetCode.isin(predictions_template_df.assetCode)]
        X_live = market_obs_df[fcol].values
        X_live = 1 - ((maxs - X_live) / rng)
        prep_time += time.time() - t

        t = time.time()
        lp = model.predict(X_live)
        prediction_time += time.time() -t

        # map the predicted probability to a confidence in [-1, 1]
        t = time.time()
        confidence = 2 * lp -1
        preds = pd.DataFrame({'assetCode':market_obs_df['assetCode'],'confidence':confidence})
        predictions_template_df = predictions_template_df.merge(preds,how='left').drop('confidenceValue',axis=1).fillna(0).rename(columns={'confidence':'confidenceValue'})
        env.predict(predictions_template_df)
        packaging_time += time.time() - t

    env.write_submission_file()

Output:

    /opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning: invalid value encountered in true_divide
    50 100 150
    /opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning: invalid value encountered in subtract
    200 250 300 350 400 450 500 550 600
    Your submission file has been saved. Once you `Commit` your Kernel and it finishes running, you can submit the file to the competition from the Kernel Viewer `Output` tab.

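The submission loop turns the model's predicted probability of a non-negative return, `lp`, into a signed confidence with `2 * lp - 1`, and gives a confidence of 0 to any asset in the template that has no prediction (the left merge followed by `fillna(0)`). A minimal sketch of that mapping on made-up values follows; the helper name `to_confidence`, the asset codes, and the probabilities are invented for illustration.

```python
import pandas as pd

def to_confidence(template, asset_codes, probabilities):
    """Map P(up) in [0, 1] to confidence in [-1, 1]; assets without a prediction get 0."""
    preds = pd.DataFrame({
        'assetCode': asset_codes,
        'confidence': [2 * p - 1 for p in probabilities],
    })
    out = template.drop(columns='confidenceValue').merge(preds, on='assetCode', how='left')
    out['confidenceValue'] = out.pop('confidence').fillna(0.0)
    return out

template = pd.DataFrame({'assetCode': ['AAA.N', 'BBB.O', 'CCC.N'],
                         'confidenceValue': [0.0, 0.0, 0.0]})
result = to_confidence(template, ['AAA.N', 'CCC.N'], [0.75, 0.4])
print(result)
```

A probability of 0.5 maps to a confidence of 0, which matches the competition's guidance to stay near zero when unsure.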