01-NLP-04-04

Predicting Financial Market Movements from Daily News (Advanced)

In this tutorial, we will use FastText for classification.

In [53]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from datetime import date
 

Inspecting the data

First, we read in the data. I have provided a dataset that has already been combined.

In [54]:
data = pd.read_csv('../input/Combined_News_DJIA.csv') 
 

Now we can take a look at what the data looks like.

In [55]:
data.head() 
Out[55]:
  Date Label Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 ... Top16 Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25
0 2008-08-08 0 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... ... b'Georgia Invades South Ossetia - if Russia ge... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge"
1 2008-08-11 1 b'Why wont America and Nato help us? If they w... b'Bush puts foot down on Georgian conflict' b"Jewish Georgian minister: Thanks to Israeli ... b'Georgian army flees in disarray as Russians ... b"Olympic opening ceremony fireworks 'faked'" b'What were the Mossad with fraudulent New Zea... b'Russia angered by Israeli military sale to G... b'An American citizen living in S.Ossetia blam... ... b'Israel and the US behind the Georgian aggres... b'"Do not believe TV, neither Russian nor Geor... b'Riots are still going on in Montreal (Canada... b'China to overtake US as largest manufacturer' b'War in South Ossetia [PICS]' b'Israeli Physicians Group Condemns State Tort... b' Russia has just beaten the United States ov... b'Perhaps *the* question about the Georgia - R... b'Russia is so much better at war' b"So this is what it's come to: trading sex fo...
2 2008-08-12 0 b'Remember that adorable 9-year-old who sang a... b"Russia 'ends Georgia operation'" b'"If we had no sexual harassment we would hav... b"Al-Qa'eda is losing support in Iraq because ... b'Ceasefire in Georgia: Putin Outmaneuvers the... b'Why Microsoft and Intel tried to kill the XO... b'Stratfor: The Russo-Georgian War and the Bal... b"I'm Trying to Get a Sense of This Whole Geor... ... b'U.S. troops still in Georgia (did you know t... b'Why Russias response to Georgia was right' b'Gorbachev accuses U.S. of making a "serious ... b'Russia, Georgia, and NATO: Cold War Two' b'Remember that adorable 62-year-old who led y... b'War in Georgia: The Israeli connection' b'All signs point to the US encouraging Georgi... b'Christopher King argues that the US and NATO... b'America: The New Mexico?' b"BBC NEWS | Asia-Pacific | Extinction 'by man...
3 2008-08-13 0 b' U.S. refuses Israel weapons to attack Iran:... b"When the president ordered to attack Tskhinv... b' Israel clears troops who killed Reuters cam... b'Britain\'s policy of being tough on drugs is... b'Body of 14 year old found in trunk; Latest (... b'China has moved 10 *million* quake survivors... b"Bush announces Operation Get All Up In Russi... b'Russian forces sink Georgian ships ' ... b'Elephants extinct by 2020?' b'US humanitarian missions soon in Georgia - i... b"Georgia's DDOS came from US sources" b'Russian convoy heads into Georgia, violating... b'Israeli defence minister: US against strike ... b'Gorbachev: We Had No Choice' b'Witness: Russian forces head towards Tbilisi... b' Quarter of Russians blame U.S. for conflict... b'Georgian president says US military will ta... b'2006: Nobel laureate Aleksander Solzhenitsyn...
4 2008-08-14 1 b'All the experts admit that we should legalis... b'War in South Osetia - 89 pictures made by a ... b'Swedish wrestler Ara Abrahamian throws away ... b'Russia exaggerated the death toll in South O... b'Missile That Killed 9 Inside Pakistan May Ha... b"Rushdie Condemns Random House's Refusal to P... b'Poland and US agree to missle defense deal. ... b'Will the Russians conquer Tblisi? Bet on it,... ... b'Bank analyst forecast Georgian crisis 2 days... b"Georgia confict could set back Russia's US r... b'War in the Caucasus is as much the product o... b'"Non-media" photos of South Ossetia/Georgia ... b'Georgian TV reporter shot by Russian sniper ... b'Saudi Arabia: Mother moves to block child ma... b'Taliban wages war on humanitarian aid workers' b'Russia: World "can forget about" Georgia\'s... b'Darfur rebels accuse Sudan of mounting major... b'Philippines : Peace Advocate say Muslims nee...

5 rows × 27 columns

 

The data is simple and intuitive. If Label is 1, the DJIA rose or stayed the same that day. If Label is 0, the DJIA fell that day.
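The labelling rule described above can be sketched on a handful of closing prices. The prices below are invented, and comparing each close with the previous day's close is our assumption about how the column was built, so treat this as an illustration rather than the dataset's actual recipe.

```python
# Hypothetical closing prices, invented purely for illustration.
closes = [100.0, 101.5, 99.0, 99.0, 98.2]


def make_labels(prices):
    """Label each day after the first: 1 if the close rose or stayed flat, else 0."""
    return [1 if today >= yesterday else 0
            for yesterday, today in zip(prices, prices[1:])]


labels = make_labels(closes)
print(labels)
```

Note that a flat day counts as 1, which matches the "rose or stayed the same" wording.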

 

Splitting into training and test sets

Now we can split the data into training and testing data.

In [56]:
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']
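The split above compares dates as plain strings. That works because the Date column uses the YYYY-MM-DD layout, where alphabetical order and chronological order coincide. A small sketch with made-up dates, checking the string comparison against real date objects:

```python
from datetime import date

# A few made-up dates in the same YYYY-MM-DD layout as the Date column.
days = ["2008-08-08", "2014-12-31", "2015-01-01", "2016-07-01"]

# String comparison, mirroring the two conditions in the cell above.
train_days = [d for d in days if d < "2015-01-01"]
test_days = [d for d in days if d > "2014-12-31"]

# The same split using real date objects, for comparison.
cutoff = date(2015, 1, 1)
train_by_date = [d for d in days if date.fromisoformat(d) < cutoff]

print(train_days)
print(test_days)
```

The two conditions together cover every date exactly once, so no row is lost or duplicated. With any other date format (for example DD/MM/YYYY) the string comparison would no longer be safe.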
 

Next, we treat each news headline as a separate sentence and gather them all together:

In [57]:
X_train = train[train.columns[2:]]
corpus = X_train.values.flatten().astype(str)

X_train = X_train.values.astype(str)
X_train = np.array([' '.join(x) for x in X_train])

X_test = test[test.columns[2:]]
X_test = X_test.values.astype(str)
X_test = np.array([' '.join(x) for x in X_test])

y_train = train['Label'].values
y_test = test['Label'].values
 

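Note that we need three things here:

corpus is all of the text that is "visible" to us. We assume each headline is one sentence, so after calling flatten() on all of them we get a list of sentences.

X_train and X_test, on the other hand, must not be flattened this way, because they have to stay aligned with y_train and y_test.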

In [58]:
corpus[:3] 
Out[58]:
array([ 'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war"',
       "b'BREAKING: Musharraf to be impeached.'",
       "b'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)'"], 
      dtype='<U312')
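The b'...' and b"..." wrappers visible in this output are leftovers from Python bytes objects that were saved as text. The notebook removes them later, token by token. One alternative, sketched here under our own assumptions (the helper name strip_bytes_repr is hypothetical and not part of the original notebook), is to strip the wrapper from each headline before tokenizing:

```python
import re

# Matches a whole string of the form b'...' or b"..." and captures the inside.
_BYTES_REPR = re.compile(r"""^b(['"])(.*)\1$""", re.DOTALL)


def strip_bytes_repr(text):
    """Return the text inside a b'...' / b"..." wrapper, or the input unchanged."""
    match = _BYTES_REPR.match(text)
    return match.group(2) if match else text


print(strip_bytes_repr("b'BREAKING: Musharraf to be impeached.'"))
```

This sketch only removes the outer wrapper. It does not undo escape sequences such as \' inside a headline, and it does not touch strings that lack the wrapper.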
In [59]:
X_train[:1] 
Out[59]:
array([ 'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war" b\'BREAKING: Musharraf to be impeached.\' b\'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)\' b\'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire\' b"Afghan children raped with \'impunity,\' U.N. official says - this is sick, a three year old was raped and they do nothing" b\'150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets.\' b"Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO\'s side" b"The \'enemy combatent\' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it." b\'Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO]\' b\'Did the U.S. Prep Georgia for War with Russia?\' b\'Rice Gives Green Light for Israel to Attack Iran: Says U.S. has no veto over Israeli military ops\' b\'Announcing:Class Action Lawsuit on Behalf of American Public Against the FBI\' b"So---Russia and Georgia are at war and the NYT\'s top story is opening ceremonies of the Olympics?  What a fucking disgrace and yet further proof of the decline of journalism." b"China tells Bush to stay out of other countries\' affairs" b\'Did World War III start today?\' b\'Georgia Invades South Ossetia - if Russia gets involved, will NATO absorb Georgia and unleash a full scale war?\' b\'Al-Qaeda Faces Islamist Backlash\' b\'Condoleezza Rice: "The US would not act to prevent an Israeli strike on Iran." 
Israeli Defense Minister Ehud Barak: "Israel is prepared for uncompromising victory in the case of military hostilities."\' b\'This is a busy day:  The European Union has approved new sanctions against Iran in protest at its nuclear programme.\' b"Georgia will withdraw 1,000 soldiers from Iraq to help fight off Russian forces in Georgia\'s breakaway region of South Ossetia" b\'Why the Pentagon Thinks Attacking Iran is a Bad Idea - US News &amp; World Report\' b\'Caucasus in crisis: Georgia invades South Ossetia\' b\'Indian shoe manufactory  - And again in a series of "you do not like your work?"\' b\'Visitors Suffering from Mental Illnesses Banned from Olympics\' b"No Help for Mexico\'s Kidnapping Surge"'], 
      dtype='<U4424')
In [60]:
y_train[:5] 
Out[60]:
array([0, 1, 0, 0, 1])
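The difference between corpus and X_train shown in the outputs above can be reproduced on toy data. The sketch below uses plain Python lists instead of NumPy, and the headlines are made up:

```python
# Two days with two headlines each (made-up text), plus one label per day.
rows = [["stocks rally", "oil falls"],
        ["rates rise", "gold steady"]]
labels = [1, 0]

# corpus: every headline becomes its own sentence, so day boundaries are lost.
corpus = [headline for row in rows for headline in row]

# X: one joined string per day, so X[i] still lines up with labels[i].
X = [" ".join(row) for row in rows]

print(len(corpus), len(X), len(labels))
```

corpus ends up with one entry per headline, while X keeps one entry per day and therefore the same length as the label list.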
 

Now let's split the text into individual words:

Again, corpus and X_train are handled differently.

In [61]:
from nltk.tokenize import word_tokenize

corpus = [word_tokenize(x) for x in corpus]
X_train = [word_tokenize(x) for x in X_train]
X_test = [word_tokenize(x) for x in X_test]
 

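After tokenizing, we can see that although corpus and X are both two-dimensional arrays, they now mean different things.

In corpus, each inner list is a single sentence.

In X, each inner list is a single data point (one per label).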

In [62]:
X_train[:2] 
Out[62]:
[['b',
  "''",
  'Georgia',
  "'downs",
  'two',
  'Russian',
  'warplanes',
  "'",
  'as',
  'countries',
  'move',
  'to',
  'brink',
  'of',
  'war',
  "''",
  "b'BREAKING",
  ':',
  'Musharraf',
  'to',
  'be',
  'impeached',
  '.',
  "'",
  "b'Russia",
  'Today',
  ':',
  'Columns',
  'of',
  'troops',
  'roll',
  'into',
  'South',
  'Ossetia',
  ';',
  'footage',
  'from',
  'fighting',
  '(',
  'YouTube',
  ')',
  "'",
  "b'Russian",
  'tanks',
  'are',
  'moving',
  'towards',
  'the',
  'capital',
  'of',
  'South',
  'Ossetia',
  ',',
  'which',
  'has',
  'reportedly',
  'been',
  'completely',
  'destroyed',
  'by',
  'Georgian',
  'artillery',
  'fire',
  "'",
  'b',
  "''",
  'Afghan',
  'children',
  'raped',
  'with',
  "'impunity",
  ',',
  "'",
  'U.N.',
  'official',
  'says',
  '-',
  'this',
  'is',
  'sick',
  ',',
  'a',
  'three',
  'year',
  'old',
  'was',
  'raped',
  'and',
  'they',
  'do',
  'nothing',
  "''",
  "b'150",
  'Russian',
  'tanks',
  'have',
  'entered',
  'South',
  'Ossetia',
  'whilst',
  'Georgia',
  'shoots',
  'down',
  'two',
  'Russian',
  'jets',
  '.',
  "'",
  'b',
  "''",
  'Breaking',
  ':',
  'Georgia',
  'invades',
  'South',
  'Ossetia',
  ',',
  'Russia',
  'warned',
  'it',
  'would',
  'intervene',
  'on',
  'SO',
  "'s",
  'side',
  "''",
  'b',
  "''",
  'The',
  "'enemy",
  'combatent',
  "'",
  'trials',
  'are',
  'nothing',
  'but',
  'a',
  'sham',
  ':',
  'Salim',
  'Haman',
  'has',
  'been',
  'sentenced',
  'to',
  '5',
  '1/2',
  'years',
  ',',
  'but',
  'will',
  'be',
  'kept',
  'longer',
  'anyway',
  'just',
  'because',
  'they',
  'feel',
  'like',
  'it',
  '.',
  "''",
  "b'Georgian",
  'troops',
  'retreat',
  'from',
  'S.',
  'Osettain',
  'capital',
  ',',
  'presumably',
  'leaving',
  'several',
  'hundred',
  'people',
  'killed',
  '.',
  '[',
  'VIDEO',
  ']',
  "'",
  "b'Did",
  'the',
  'U.S.',
  'Prep',
  'Georgia',
  'for',
  'War',
  'with',
  'Russia',
  '?',
  "'",
  "b'Rice",
  'Gives',
  'Green',
  'Light',
  'for',
  'Israel',
  'to',
  'Attack',
  'Iran',
  ':',
  'Says',
  'U.S.',
  'has',
  'no',
  'veto',
  'over',
  'Israeli',
  'military',
  'ops',
  "'",
  "b'Announcing",
  ':',
  'Class',
  'Action',
  'Lawsuit',
  'on',
  'Behalf',
  'of',
  'American',
  'Public',
  'Against',
  'the',
  'FBI',
  "'",
  'b',
  "''",
  'So',
  '--',
  '-Russia',
  'and',
  'Georgia',
  'are',
  'at',
  'war',
  'and',
  'the',
  'NYT',
  "'s",
  'top',
  'story',
  'is',
  'opening',
  'ceremonies',
  'of',
  'the',
  'Olympics',
  '?',
  'What',
  'a',
  'fucking',
  'disgrace',
  'and',
  'yet',
  'further',
  'proof',
  'of',
  'the',
  'decline',
  'of',
  'journalism',
  '.',
  "''",
  'b',
  "''",
  'China',
  'tells',
  'Bush',
  'to',
  'stay',
  'out',
  'of',
  'other',
  'countries',
  "'",
  'affairs',
  "''",
  "b'Did",
  'World',
  'War',
  'III',
  'start',
  'today',
  '?',
  "'",
  "b'Georgia",
  'Invades',
  'South',
  'Ossetia',
  '-',
  'if',
  'Russia',
  'gets',
  'involved',
  ',',
  'will',
  'NATO',
  'absorb',
  'Georgia',
  'and',
  'unleash',
  'a',
  'full',
  'scale',
  'war',
  '?',
  "'",
  "b'Al-Qaeda",
  'Faces',
  'Islamist',
  'Backlash',
  "'",
  "b'Condoleezza",
  'Rice',
  ':',
  '``',
  'The',
  'US',
  'would',
  'not',
  'act',
  'to',
  'prevent',
  'an',
  'Israeli',
  'strike',
  'on',
  'Iran',
  '.',
  "''",
  'Israeli',
  'Defense',
  'Minister',
  'Ehud',
  'Barak',
  ':',
  '``',
  'Israel',
  'is',
  'prepared',
  'for',
  'uncompromising',
  'victory',
  'in',
  'the',
  'case',
  'of',
  'military',
  'hostilities',
  '.',
  "''",
  "'",
  "b'This",
  'is',
  'a',
  'busy',
  'day',
  ':',
  'The',
  'European',
  'Union',
  'has',
  'approved',
  'new',
  'sanctions',
  'against',
  'Iran',
  'in',
  'protest',
  'at',
  'its',
  'nuclear',
  'programme',
  '.',
  "'",
  'b',
  "''",
  'Georgia',
  'will',
  'withdraw',
  '1,000',
  'soldiers',
  'from',
  'Iraq',
  'to',
  'help',
  'fight',
  'off',
  'Russian',
  'forces',
  'in',
  'Georgia',
  "'s",
  'breakaway',
  'region',
  'of',
  'South',
  'Ossetia',
  "''",
  "b'Why",
  'the',
  'Pentagon',
  'Thinks',
  'Attacking',
  'Iran',
  'is',
  'a',
  'Bad',
  'Idea',
  '-',
  'US',
  'News',
  '&',
  'amp',
  ';',
  'World',
  'Report',
  "'",
  "b'Caucasus",
  'in',
  'crisis',
  ':',
  'Georgia',
  'invades',
  'South',
  'Ossetia',
  "'",
  "b'Indian",
  'shoe',
  'manufactory',
  '-',
  'And',
  'again',
  'in',
  'a',
  'series',
  'of',
  '``',
  'you',
  'do',
  'not',
  'like',
  'your',
  'work',
  '?',
  "''",
  "'",
  "b'Visitors",
  'Suffering',
  'from',
  'Mental',
  'Illnesses',
  'Banned',
  'from',
  'Olympics',
  "'",
  'b',
  "''",
  'No',
  'Help',
  'for',
  'Mexico',
  "'s",
  'Kidnapping',
  'Surge',
  "''"],
 ["b'Why",
  'wont',
  'America',
  'and',
  'Nato',
  'help',
  'us',
  '?',
  'If',
  'they',
  'wont',
  'help',
  'us',
  'now',
  ',',
  'why',
  'did',
  'we',
  'help',
  'them',
  'in',
  'Iraq',
  '?',
  "'",
  "b'Bush",
  'puts',
  'foot',
  'down',
  'on',
  'Georgian',
  'conflict',
  "'",
  'b',
  "''",
  'Jewish',
  'Georgian',
  'minister',
  ':',
  'Thanks',
  'to',
  'Israeli',
  'training',
  ',',
  'we',
  "'re",
  'fending',
  'off',
  'Russia',
  '``',
  "b'Georgian",
  'army',
  'flees',
  'in',
  'disarray',
  'as',
  'Russians',
  'advance',
  '-',
  'Gori',
  'abandoned',
  'to',
  'Russia',
  'without',
  'a',
  'shot',
  'fired',
  "'",
  'b',
  "''",
  'Olympic',
  'opening',
  'ceremony',
  'fireworks',
  "'faked",
  "'",
  "''",
  "b'What",
  'were',
  'the',
  'Mossad',
  'with',
  'fraudulent',
  'New',
  'Zealand',
  'Passports',
  'doing',
  'in',
  'Iraq',
  '?',
  "'",
  "b'Russia",
  'angered',
  'by',
  'Israeli',
  'military',
  'sale',
  'to',
  'Georgia',
  "'",
  "b'An",
  'American',
  'citizen',
  'living',
  'in',
  'S.Ossetia',
  'blames',
  'U.S.',
  'and',
  'Georgian',
  'leaders',
  'for',
  'the',
  'genocide',
  'of',
  'innocent',
  'people',
  "'",
  "b'Welcome",
  'To',
  'World',
  'War',
  'IV',
  '!',
  'Now',
  'In',
  'High',
  'Definition',
  '!',
  "'",
  'b',
  "''",
  'Georgia',
  "'s",
  'move',
  ',',
  'a',
  'mistake',
  'of',
  'monumental',
  'proportions',
  '``',
  "b'Russia",
  'presses',
  'deeper',
  'into',
  'Georgia',
  ';',
  'U.S.',
  'says',
  'regime',
  'change',
  'is',
  'goal',
  "'",
  "b'Abhinav",
  'Bindra',
  'wins',
  'first',
  'ever',
  'Individual',
  'Olympic',
  'Gold',
  'Medal',
  'for',
  'India',
  "'",
  'b',
  "'",
  'U.S.',
  'ship',
  'heads',
  'for',
  'Arctic',
  'to',
  'define',
  'territory',
  "'",
  "b'Drivers",
  'in',
  'a',
  'Jerusalem',
  'taxi',
  'station',
  'threaten',
  'to',
  'quit',
  'rather',
  'than',
  'work',
  'for',
  'their',
  'new',
  'boss',
  '-',
  'an',
  'Arab',
  "'",
  "b'The",
  'French',
  'Team',
  'is',
  'Stunned',
  'by',
  'Phelps',
  'and',
  'the',
  '4x100m',
  'Relay',
  'Team',
  "'",
  "b'Israel",
  'and',
  'the',
  'US',
  'behind',
  'the',
  'Georgian',
  'aggression',
  '?',
  "'",
  'b',
  "'",
  "''",
  'Do',
  'not',
  'believe',
  'TV',
  ',',
  'neither',
  'Russian',
  'nor',
  'Georgian',
  '.',
  'There',
  'are',
  'much',
  'more',
  'victims',
  "''",
  "'",
  "b'Riots",
  'are',
  'still',
  'going',
  'on',
  'in',
  'Montreal',
  '(',
  'Canada',
  ')',
  'because',
  'police',
  'murdered',
  'a',
  'boy',
  'on',
  'Saturday',
  '.',
  "'",
  "b'China",
  'to',
  'overtake',
  'US',
  'as',
  'largest',
  'manufacturer',
  "'",
  "b'War",
  'in',
  'South',
  'Ossetia',
  '[',
  'PICS',
  ']',
  "'",
  "b'Israeli",
  'Physicians',
  'Group',
  'Condemns',
  'State',
  'Torture',
  "'",
  'b',
  "'",
  'Russia',
  'has',
  'just',
  'beaten',
  'the',
  'United',
  'States',
  'over',
  'the',
  'head',
  'with',
  'Peak',
  'Oil',
  "'",
  "b'Perhaps",
  '*the*',
  'question',
  'about',
  'the',
  'Georgia',
  '-',
  'Russia',
  'conflict',
  "'",
  "b'Russia",
  'is',
  'so',
  'much',
  'better',
  'at',
  'war',
  "'",
  'b',
  "''",
  'So',
  'this',
  'is',
  'what',
  'it',
  "'s",
  'come',
  'to',
  ':',
  'trading',
  'sex',
  'for',
  'food',
  '.',
  "''"]]
In [63]:
corpus[:2] 
Out[63]:
[['b',
  "''",
  'Georgia',
  "'downs",
  'two',
  'Russian',
  'warplanes',
  "'",
  'as',
  'countries',
  'move',
  'to',
  'brink',
  'of',
  'war',
  "''"],
 ["b'BREAKING", ':', 'Musharraf', 'to', 'be', 'impeached', '.', "'"]]
 

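Preprocessing

We apply some preprocessing to make our text more uniform:

  • Lowercasing

  • Removing stop words

  • Removing numbers and symbols

  • Lemmatization

We combine these steps into a single function: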

In [64]:
# Stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')

# Numbers
import re
def hasNumbers(inputString):
    return bool(re.search(r'\d', inputString))

# Special symbols
def isSymbol(inputString):
    return bool(re.match(r'[^\w]', inputString))

# Lemmatization
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def check(word):
    """
    Return True if the word should be kept,
    False if it should be removed.
    """
    word = word.lower()
    if word in stop:
        return False
    elif hasNumbers(word) or isSymbol(word):
        return False
    else:
        return True

# Combine the methods above
def preprocessing(sen):
    res = []
    for word in sen:
        if check(word):
            # This step only strips the markers left over from Python bytes stored as str.
            # The data was not cleaned properly beforehand; other cases will not have this issue.
            word = word.lower().replace("b'", '').replace('b"', '').replace('"', '').replace("'", '')
            res.append(wordnet_lemmatizer.lemmatize(word))
    return res
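To see which tokens the two regex checks above filter out, here is a small sketch that needs only the standard library. The functions mirror hasNumbers and isSymbol, renamed to snake_case for this illustration, and the sample tokens are taken from the tokenized output shown earlier.

```python
import re


def has_numbers(token):
    # True if any digit appears anywhere in the token.
    return bool(re.search(r"\d", token))


def is_symbol(token):
    # re.match anchors at the start, so only the first character is tested.
    return bool(re.match(r"[^\w]", token))


for token in ["1,000", "''", "'downs", "b'BREAKING", "U.N."]:
    print(repr(token), has_numbers(token), is_symbol(token))
```

Two consequences are worth noting. A token such as 'downs is dropped entirely because it starts with a quote. A token such as b'BREAKING passes both checks and is only cleaned afterwards by the replace calls. It is also compared against the stop-word list before the b' marker is removed, which would explain why words like 'the' still appear in the processed output.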
 

Let's process all three of our datasets:

In [65]:
corpus = [preprocessing(x) for x in corpus]
X_train = [preprocessing(x) for x in X_train]
X_test = [preprocessing(x) for x in X_test]
 

Let's look at the data again after processing:

In [66]:
print(corpus[553])
print(X_train[523])
 
['north', 'korean', 'leader', 'kim', 'jong-il', 'confirmed', 'ill']
['two', 'redditors', 'climbing', 'mt', 'kilimanjaro', 'charity', 'bidding', 'peak', 'nt', 'squander', 'opportunity', 'let', 'upvotes', 'something', 'awesome', 'estimated', 'take', 'year', 'clear', 'lao', 'explosive', 'remnant', 'left', 'behind', 'united', 'state', 'bomber', 'year', 'ago', 'people', 'died', 'unexploded', 'ordnance', 'since', 'conflict', 'ended', 'fidel', 'ahmadinejad', 'slandering', 'jew', 'mossad', 'america', 'israel', 'intelligence', 'agency', 'target', 'united', 'state', 'intensively', 'among', 'nation', 'considered', 'friendly', 'washington', 'israel', 'lead', 'others', 'active', 'espionage', 'directed', 'american', 'company', 'defense', 'department', 'australian', 'election', 'day', 'poll', 'rural/regional', 'independent', 'member', 'parliament', 'support', 'labor', 'minority', 'goverment', 'julia', 'gillard', 'prime', 'minister', 'france', 'plan', 'raise', 'retirement', 'age', 'set', 'strike', 'britain', 'parliament', 'police', 'murdoch', 'paper', 'adviser', 'pm', 'implicated', 'voicemail', 'hacking', 'scandal', 'british', 'policeman', 'jailed', 'month', 'cell', 'attack', 'woman', 'rest', 'email', 'display', 'fundemental', 'disdain', 'pluralistic', 'america', 'reveals', 'chilling', 'level', 'islamophobia', 'hatemongering', 'church', 'plan', 'burn', 'quran', 'endanger', 'troop', 'u', 'commander', 'warns', 'freed', 'journalist', 'tricked', 'captor', 'twitter', 'access', 'manila', 'water', 'crisis', 'expose', 'impact', 'privatisation', 'july', 'week-long', 'rationing', 'water', 'highlighted', 'reality', 'million', 'people', 'denied', 'basic', 'right', 'potable', 'water', 'sanitation', 'private', 'firm', 'rake', 'profit', 'expense', 'weird', 'uk', 'police', 'ask', 'help', 'case', 'slain', 'intelligence', 'agent', 'greenpeace', 'japan', 'anti-whaling', 'activist', 'found', 'guilty', 'theft', 'captured', 'journalist', 'trick', 'captor', 'revealing', 'alive', 'creepy', 'biometric', 'id', 'forced', 'onto', 'india', 'billion', 'inhabitant', 'fear', 
'loss', 'privacy', 'government', 'abuse', 'abound', 'india', 'gear', 'biometrically', 'identify', 'number', 'billion', 'inhabitant', 'china', 'young', 'officer', 'syndrome', 'china', 'military', 'spending', 'growing', 'fast', 'overtaken', 'strategy', 'said', 'professor', 'huang', 'jing', 'school', 'public', 'policy', 'young', 'officer', 'taking', 'control', 'strategy', 'like', 'young', 'officer', 'japan', 'mexican', 'soldier', 'open', 'fire', 'family', 'car', 'military', 'checkpoint', 'killing', 'father', 'son', 'death', 'toll', 'continues', 'climb', 'guatemala', 'landslide', 'foreign', 'power', 'stop', 'interfering', 'case', 'iranian', 'woman', 'sentenced', 'death', 'stoning', 'iran', 'foreign', 'ministry', 'said', 'mexican', 'official', 'gunman', 'behind', 'massacre', 'killed', 'tv', 'anchor', 'stabbed', 'death', 'outside', 'kabul', 'home', 'mosque', 'menace', 'confined', 'lower', 'manhattan', 'many', 'european', 'country', 'similar', 'alarm', 'sounded', 'muslim', 'coming', 'french', 'citizen', 'barred', 'american', 'military', 'base', 'dutch', 'neo-nazi', 'donates', 'sperm', 'white', 'dutch', 'neo-nazi', 'offered', 'donate', 'sperm', 'four', 'fertility', 'clinic', 'netherlands', 'effort', 'promote', 'call', 'strong', 'white', 'race']
 

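Training the NLP model

With these clean datasets, we can now build our NLP model.

Here we will use FastText.

I have already covered the theory in the course slides, so here we take a closer look at how to use it in practice.

Because the paper was only recently published, many community contributors are providing code to get an open-source Python version working as soon as possible (I am one of them).

Of course, the Facebook team has already released the source code (C++) on GitHub, so we can use a Python wrapper as an interface to call it conveniently.

First, as we discussed, FT treats the label as just another element and feeds it into the word2vec network.

So we need to insert the label into our "sentences":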

In [67]:
for i in range(len(y_train)):
    label = '__label__' + str(y_train[i])
    X_train[i].append(label)

print(X_train[49])
 
['the', 'man', 'podium', 'dutch', 'non-profit', 'reproductive', 'health', 'organization', 'sail', 'ship', 'around', 'world', 'anchoring', 'international', 'water', 'provide', 'abortion', 'woman', 'country', 'abortion', 'banned', 'b', 'grand', 'ayatollah', 'issue', 'decree', 'calling', 'muslim', 'defend', 'iraq', 'christian', 'marx', 'da', 'kapital', 'sale', 'soar', 'among', 'young', 'german', 'a', 'man', 'england', 'killed', 'wife', 'changed', 'facebook', 'relationship', 'status', 'single', 'georgia', 'used', 'cluster', 'bomb', 'august', 'war', 'arctic', 'temperature', 'break', 'all-time', 'recorded', 'high', 'reddit', 'please', 'send', 'help', 'uk', 'politician', 'insane', 'apparently', 'monitoring', 'mobile', 'web', 'record', 'would', 'giving', 'licence', 'terrorist', 'kill', 'people', 'wow', 'secret', 'coded', 'message', 'embedded', 'child', 'pornographic', 'image', 'paedophile', 'website', 'exploited', 'secure', 'way', 'passing', 'information', 'terrorist', 'england', 'run', 'honey', 'christmas', 'catastrophic', 'honeybee', 'decline', 'b', 'iran', 'stop', 'executing', 'youth', 'china', 'watch', 'internet', 'caf', 'customer', 'web', 'crackdown', 'china\\', 'medium', 'freedom', 'reduced', 'new', 'measure', 'include', 'camera', 'internet', 'cafe', 'picture', 'taken', 'user', 'bali', 'bombing', 'new', 'suspect', 'hindu', 'american', 'foundation', 'petition', 'ny', 'time', 'focus', 'much', 'activity', 'christian', 'missionary', 'india', 'anti-christian', 'violence', 'a', 'quick', 'overview', 'islamic', 'terror', 'organization', 'get', 'funding', 'last', 'titantic', 'survivor', 'auction', 'memento', 'pay', 'nursing', 'home', 'better', 'hungary', 'get', 'loan', 'avert', 'meltdown', 'sao', 'paolo', 'hundred', 'black-clad', 'military', 'police', 'fired', 'teargas', 'stun', 'grenade', 'rubber', 'bullet', 'striking', 'civilian', 'officer', 'seeking', 'percent', 'pay', 'raise', 'austrailian', 'historian', 'arrested', 'holocaust', 'denial', 'defense', 'secretary', 'gate', 
'said', 'prepared', 'reconciliation', 'taliban', 'part', 'political', 'outcome', 'afghanistan', 'is', 'switzerland', 'next', 'iceland', 'switzerland', 'forced', 'take', 'emergency', 'measure', 'yesterday', 'shore', 'two', 'biggest', 'lender', 'prevent', 'collapse', 'confidence', 'country\\', 'banking', 'system', 'police', 'battle', 'police', 'sao', 'paulo', 'civilian', 'killed', 'nato', 'air', 'strike', 'afghanistan', 'villager', 'the', 'west', 'loss', 'afghanistan', '__label__0']
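The output above shows the label token appended as the last element of the token list. A minimal sketch of the resulting one-line-per-example format follows. Both helper names are hypothetical, and the only convention assumed is the __label__ prefix used in the cell above.

```python
def to_fasttext_line(tokens, label, prefix="__label__"):
    """Join tokens with spaces and append the label token, e.g. '... __label__1'."""
    return " ".join(list(tokens) + [prefix + str(label)])


def split_fasttext_line(line, prefix="__label__"):
    """Inverse helper: return (tokens, label_string) for a line built above."""
    *tokens, last = line.split()
    if not last.startswith(prefix):
        raise ValueError("line does not end with a label token")
    return tokens, last[len(prefix):]


line = to_fasttext_line(["north", "korea", "halt"], 1)
print(line)
print(split_fasttext_line(line))
```

Keeping an inverse helper around makes it easy to check that no label was lost or attached to the wrong row before training.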
 

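Next, we save the data to files. The FastText module we use here is only a Python interface; the actual work is still done through the C++ implementation.

We need to save three things:

The training set, with labels

The test set, without labels

The labels, in a separate file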

In [68]:
X_train = [' '.join(x) for x in X_train]

print(X_train[12])
 
north korea halt denuclearisation u fails remove list state sponsoring terrorism child among dead u airstrike afghanistan the russian parliament voted overwhelmingly officially recognize independence abkhazia south ossetia violent animal right activist set fire scientist home little protection available scientist nbc censored olympic champion matthew mitcham gay un say convincing evidence show u airstrike afghanistan killed people including child italy try outlaw islam mystery virus kill israeli group peace say settlement construction occupied west bank nearly doubled since last year b revealed britain secret propaganda war al-qaida b israel settlement surge draw rice criticism solar powered carbon neutral pyramid house million people dubai russia claim proof genocide how nato transformed military alliance quasi-united nation cartwheeling banned school philly-area activist released china jeff said slapped around threatend saying want head cut want shot b vatican describes hindu attack christian orphanage god protester tell tale beijing detention- sleep deprivation threat oh python kill zookeeper kelly murdered say uk intelligence insider b fury image myra hindley appears british film olympics party b north korea suspend nuclear disablement german suspect bayer pesticide beehive collapse research terrorism invaluable fear arrest top u diplomat escape gun attack pakistan __label__1
 

The test set is handled the same way.

In [69]:
X_test = [' '.join(x) for x in X_test]

with open('../input/train_ft.txt', 'w') as f:
    for sen in X_train:
        f.write(sen+'\n')

with open('../input/test_ft.txt', 'w') as f:
    for sen in X_test:
        f.write(sen+'\n')

with open('../input/test_label_ft.txt', 'w') as f:
    for label in y_test:
        f.write(str(label)+'\n')
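After writing the files, it can be worth reading one back to confirm there is exactly one example per line. The sketch below does this in a temporary directory with two made-up sentences. Passing encoding="utf-8" explicitly is our choice here, so that the result does not depend on the platform default; the cell above relies on that default.

```python
import os
import tempfile

# Two made-up training lines in the same format as train_ft.txt.
sentences = ["north korea halt __label__1", "oil price fall __label__0"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_ft.txt")

    # Write one example per line.
    with open(path, "w", encoding="utf-8") as f:
        for sen in sentences:
            f.write(sen + "\n")

    # Read the file back and drop the trailing newline from each line.
    with open(path, encoding="utf-8") as f:
        read_back = [line.rstrip("\n") for line in f]

round_trip_ok = read_back == sentences
print(round_trip_ok)
```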
 

Calling the FastText module

In [95]:
import fasttext

clf = fasttext.supervised('../input/train_ft.txt', 'model', dim=256, ws=5, neg=5, epoch=100,
                          min_count=10, lr=0.1, lr_update_rate=1000, bucket=200000)
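The fasttext.supervised call above comes from an early community Python wrapper. As far as we are aware, the package now published by the fastText team exposes fasttext.train_supervised instead, with camelCase parameter names (minCount, lrUpdateRate) and predicted labels that keep the __label__ prefix. The sketch below is written under those assumptions. The helper names are hypothetical, and the training function is not exercised here because it needs the package and a training file.

```python
# Keyword names that differ between the legacy wrapper used above and
# fasttext.train_supervised (assumed mapping; other names are passed through).
LEGACY_TO_CURRENT = {"min_count": "minCount", "lr_update_rate": "lrUpdateRate"}


def translate_kwargs(legacy_kwargs):
    """Rename legacy snake_case keywords to their assumed camelCase equivalents."""
    return {LEGACY_TO_CURRENT.get(k, k): v for k, v in legacy_kwargs.items()}


def labels_to_int(predicted, prefix="__label__"):
    """Turn [['__label__1'], ['0'], ...] into [1, 0, ...]; the prefix is optional."""
    result = []
    for top in predicted:
        label = top[0]
        if label.startswith(prefix):
            label = label[len(prefix):]
        result.append(int(label))
    return result


def train_with_current_api(train_path, **legacy_kwargs):
    """Train a supervised model through the newer API (assumes it is installed)."""
    import fasttext  # imported lazily so the helpers above work without it
    return fasttext.train_supervised(input=train_path, **translate_kwargs(legacy_kwargs))


print(translate_kwargs({"dim": 256, "min_count": 10, "lr_update_rate": 1000}))
```

With the newer package, predict is expected to return a pair of labels and probabilities, so labels_to_int would be applied to the first element of that pair.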
 

After training our FT model, we can evaluate it on the test set.

In [96]:
y_scores = []

# Use predict to get the predicted class for each test example
labels = clf.predict(X_test)
y_preds = np.array(labels).flatten().astype(int)

# Take a look
print(len(y_test))
print(y_test)
print(len(y_preds))
print(y_preds)

from sklearn import metrics

# Compute the AUC
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_preds, pos_label=1)
print(metrics.auc(fpr, tpr))
 
378
[1 0 0 1 1 0 0 0 0 0 1 1 1 1 0 1 0 0 1 0 1 1 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1
 0 0 1 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1
 0 1 1 1 0 1 0 0 1 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1
 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 0 1
 0 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1
 1 0 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 1
 0 1 0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 0 1 0 1 1 0 0 1 1 1 1 1
 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1 1 0
 0 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 0 0 0 1 0 1 1
 0 1 0 0 1 1 1 1]
378
[0 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 0 1 1
 1 1 1 0 1 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0
 1 1 0 1 0 0 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1
 1 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 1 0 0 1 1 1 1 1
 0 0 1 1 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 1 0 1 1
 1 1 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 0 0 1 0
 1 0 1 0 0 0 1 1 0 0 1 1 1 1 0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 1 1 1
 1 1 0 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1
 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 1 1 1 0
 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0
 1 1 1 0 1 1 0 1]
0.463877688172
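An AUC of about 0.46 is slightly below the 0.5 expected from random guessing. Because the cell above passes hard 0/1 predictions rather than scores to roc_curve, the ROC curve has a single interior point, and the AUC reduces to the average of the true positive rate and the true negative rate. A standard-library sketch on made-up labels illustrates this:

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Average of the true positive rate and the true negative rate."""
    tpr = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / y_true.count(1)
    tnr = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / y_true.count(0)
    return (tpr + tnr) / 2


# Made-up labels and hard 0/1 predictions, for illustration only.
y_true = [1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 1]

print(auc_by_pairs(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

If the model's predicted probabilities were used as scores instead of the hard labels, the ROC curve would have more points and the AUC would reflect how well the examples are ranked, not just how they fall on either side of one threshold.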
 

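As before, parameter tuning or resampling could improve our results here.

Of course, FT is itself a word2vec-style model, with a binary-tree-like classifier attached at the end.

As a result, it does not give very good results on small datasets; attaching our own SVM works better.

With large amounts of data and a large number of labels, however, its strengths become apparent.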
