Reference: An Introduction to Text Mining using Twitter Streaming API and Python
Reference: How to Register a Twitter App in 8 Easy Steps
Key Methods:
twitter_streaming.py: this file is used to extract information from Twitter.
# Import the necessary classes from the tweepy library.
# Note: StreamListener exists in tweepy 3.x; it was removed in tweepy 4.0.
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Variables that contain the user credentials to access the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"


# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)


if __name__ == '__main__':

    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # This line filters the Twitter stream to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])
You can use the following command (run from CMD) to save the output to a file.
python twitter_streaming.py > twitter_data.txt
Then we read the text file above and parse each line as JSON.
import json

tweets_data_path = r"..\twitter_data.txt"

tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except ValueError:
            # Skip blank or malformed lines.
            continue
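The same parsing loop can be sketched on an in-memory sample, which makes it easier to see which lines get skipped. The sample lines and the helper name parse_json_lines are illustrative assumptions, not part of the original script. Because on_error prints the status code to stdout, the redirected file may also contain a bare number; such a line is valid JSON but not a tweet, so this sketch keeps only JSON objects.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of strings, keeping only lines that hold a JSON object."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank lines are skipped.
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Truncated or non-JSON lines are skipped.
            continue
        if isinstance(record, dict):
            # A bare number such as an error status is valid JSON but not a tweet.
            records.append(record)
    return records


# Hypothetical sample standing in for twitter_data.txt.
sample_lines = [
    '{"text": "learning python", "lang": "en"}',
    '',
    '420',
    '{"text": "broken',
]
parsed = parse_json_lines(sample_lines)
print(len(parsed))  # 1
```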
The data are stored in tweets_data, and we can get specific information with the following scripts.
Reference: python JSON only get keys in first level
# Get the text content and language of the first 9 tweets
num = 0
for tweet in tweets_data:
    num += 1
    if num == 10:
        break
    else:
        tweet_text = tweet["text"]
        tweet_lang = tweet["lang"]
        print(str(num))
        print(tweet_lang)
        print(tweet_text)
        print()

# Get all the top-level keys of the first tweet
tweets_data[0].keys()
We can also extract a specific key with list(), map() and a lambda expression, as in the following scripts.
Reference: Using map and lambda together in Python
>>> a = list(map(lambda tweet: tweet['text'], tweets_data))
>>> len(a)
1633
>>> a[0]
'RT @neet_se: 案件數って點だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'
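A list comprehension is an equivalent way to write the map() and lambda call, and it makes it easy to skip records that lack the key. This is only a sketch: sample_tweets is made-up data, and the idea that some records in the file have no "text" field (for example, non-tweet messages) is an assumption.

```python
# Made-up records; the second one has no "text" key.
sample_tweets = [
    {"text": "python tutorial", "lang": "en"},
    {"limit": {"track": 5}},
    {"text": "ruby on rails", "lang": "en"},
]

# Same result as list(map(lambda tweet: tweet["text"], ...)) for complete
# records, but records without a "text" key are skipped instead of
# raising KeyError.
texts = [tweet["text"] for tweet in sample_tweets if "text" in tweet]
print(texts)  # ['python tutorial', 'ruby on rails']
```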
We can also use set() to get the unique values of the list.
Reference: Python set() function
Reference: How to count the occurrences of repeated items in a Python list
>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> len(langs)
1633
>>> set(langs)
{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}
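set() tells us which values occur but not how often. If the counts are needed before moving to pandas, collections.Counter from the standard library is one option. The list below is made-up sample data, not the real langs list.

```python
from collections import Counter

# Made-up language codes standing in for the langs list.
sample_langs = ["en", "ja", "en", "es", "en", "ja"]

lang_counts = Counter(sample_langs)

# most_common(n) returns the n most frequent values, highest count first.
print(lang_counts.most_common(2))  # [('en', 3), ('ja', 2)]
```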
Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.
>>> import pandas as pd
>>> tweets = pd.DataFrame()
>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] != None else None, tweets_data))
>>> tweets['lang'].value_counts()
en     1119
ja      278
es      113
pt       36
und      26
...
Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.
>>> tweets_by_lang = tweets['lang'].value_counts()
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Languages', fontsize=15)
Text(0.5, 0, 'Languages')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 languages')
>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
>>> plt.show()
Next, we will create a chart describing the Top 5 countries from which the tweets were sent.
>>> tweets_by_country = tweets['country'].value_counts()
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Countries', fontsize=15)
Text(0.5, 0, 'Countries')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 countries')
>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
>>> plt.show()
Our main goals in these text mining tasks are to compare the popularity of the Python, Ruby and JavaScript programming languages and to retrieve programming tutorial links. We will do this in 3 steps:
First, we will create a function that checks if a specific keyword is present in a text. We will do this by using regular expressions.
Python provides a library for regular expressions called re. We will start by importing this library.
Next, we will create a function called word_in_text(word, text). This function returns True if word is found in text; otherwise it returns False.
>>> import re
>>> def word_in_text(word, text):
...     word = word.lower()
...     text = text.lower()
...     match = re.search(word, text)
...     if match:
...         return True
...     return False
...
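As a quick check of how the function behaves, here it is again in script form with two calls. The sample sentence is made up; the function body follows the definition above, only condensed into a single return.

```python
import re


def word_in_text(word, text):
    # Lower-case both sides so the search is case-insensitive.
    word = word.lower()
    text = text.lower()
    return re.search(word, text) is not None


print(word_in_text("python", "Learning PYTHON today"))  # True
print(word_in_text("ruby", "Learning PYTHON today"))    # False
```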
Next, we will add 3 columns to our tweets DataFrame by calling pandas.Series.apply() on the text column.
>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))
We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:
>>> print(tweets['python'].value_counts()[True])
447
>>> print(tweets['ruby'].value_counts()[True])
529
>>> print(tweets['javascript'].value_counts()[True])
275
We can make a simple comparison chart by executing the following:
>>> prg_langs = ['python', 'ruby', 'javascript']
>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')
<BarContainer object of 3 artists>
>>> # Setting axis labels and ticks
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()
This shows that the keyword ruby is the most popular, followed by python and then javascript. However, the tweets DataFrame contains every tweet that contains one of the 3 keywords; it is not restricted to tweets about the programming languages. For example, many tweets that contain the keyword ruby are related to the Rubygate political scandal. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.
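Part of the overcounting comes from substring matching: a search for ruby also matches Rubygate. A whole-word variant is one possible way to reduce such hits. The helper name whole_word_in_text and the sample sentences are hypothetical, and this sketch does not replace the keyword filtering done in the next section, since a tweet can mention Ruby as a whole word without being about programming.

```python
import re


def whole_word_in_text(word, text):
    """Return True only if `word` appears as a whole word (hypothetical helper)."""
    # re.escape guards against regex metacharacters in the keyword;
    # \b anchors the match at word boundaries.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(whole_word_in_text("ruby", "New Ruby tutorial released"))    # True
print(whole_word_in_text("ruby", "More on the Rubygate scandal"))  # False
```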
We are interested in targeting tweets that are related to programming languages. Such tweets often have one of the 2 keywords: "programming" or "tutorial". We will add 2 columns to our tweets DataFrame to hold this information.
>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))
We will add an additional column called relevant that takes the value True if the tweet has either the "programming" or the "tutorial" keyword; otherwise it takes the value False.
>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))
We can print the counts of relevant tweets by executing the commands below.
>>> print(tweets['programming'].value_counts()[True])
55
>>> print(tweets['tutorial'].value_counts()[True])
22
>>> print(tweets['relevant'].value_counts()[True])
74
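Note that 55 + 22 = 77, not 74. A tweet that contains both keywords is counted once in relevant, so the difference is the number of tweets that contain both. The arithmetic below uses only the three counts printed above; the variable names are illustrative.

```python
# Counts taken from the output above.
programming_count = 55
tutorial_count = 22
relevant_count = 74  # tweets with "programming" OR "tutorial"

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
both_count = programming_count + tutorial_count - relevant_count
print(both_count)  # 3
```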
We can now compare the popularity of the programming languages by executing the commands below.
tweets[tweets['relevant'] == True]['python'] # select the python column for the rows where relevant is True
>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])
31
>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
8
>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])
11
Python is the most popular with a count of 31, followed by javascript with a count of 11 and ruby with a count of 8. We can make a comparison chart by executing the following:
>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True], tweets[tweets['relevant'] == True]['ruby'].value_counts()[True], tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]]
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width,alpha=1,color='g')
<BarContainer object of 3 artists>
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
>>> ax.set_xticklabels(prg_langs)
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()
Now that we have extracted the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses a regular expression to retrieve the first link in a text that starts with "http://" or "https://" (the pattern also accepts links that start with "www."). This function returns the URL if one is found; otherwise it returns an empty string.
>>> def extract_link(text):
...     regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
...     match = re.search(regex, text)
...     if match:
...         return match.group()
...     return ''
...
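extract_link returns only the first match. If a tweet can carry several links, re.findall with the same pattern is one way to collect all of them. The helper name extract_all_links and the sample text are hypothetical; the pattern is the one used in extract_link.

```python
import re

# Same pattern as in extract_link.
LINK_REGEX = r'https?://[^\s<>"]+|www\.[^\s<>"]+'


def extract_all_links(text):
    """Return every link in `text`, in order of appearance (hypothetical helper)."""
    # The pattern has no capture groups, so findall returns the full matches.
    return re.findall(LINK_REGEX, text)


sample_text = "Python tutorial https://t.co/abc123 and docs at www.example.org"
print(extract_all_links(sample_text))      # ['https://t.co/abc123', 'www.example.org']
print(extract_all_links("no links here"))  # []
```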
Next, we will add a column called link to our tweets DataFrame. This column will contain the URL extracted from each tweet, or an empty string if the tweet has none.
>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))
Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of the tweets DataFrame and contains all relevant tweets that have a link.
We take a subset of the original DataFrame.
>>> tweets_relevant = tweets[tweets['relevant'] == True]
>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']
We can now print out all links for python, ruby, and javascript by executing the commands below:
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])
40      https://t.co/zoAgyQuMAZ
105     https://t.co/ogaPbuIbEW
274     https://t.co/y4sUmovFOn
329     https://t.co/A030fqWeWA
339     https://t.co/LaaVc5T2rQ
391     https://t.co/8bYvlziCZb
413     https://t.co/8bYvlziCZb
436     https://t.co/EByqxT1qyN
444     https://t.co/8bYvlziCZb
445     https://t.co/5Jujg6h31B
462     https://t.co/UrFHlOaJYf
476     https://t.co/5Jujg6h31B
477     https://t.co/EByqxT1qyN
589     https://t.co/UrFHlOaJYf
603     https://t.co/5Jujg6h31B
822     https://t.co/Oc21FrzQc5
1060    https://t.co/qOAIuKfyD0
1097    https://t.co/qOAIuKfyD0
1248    https://t.co/V3ZNKuYsK7
1278    https://t.co/qOAIuKfyD0
1411    https://t.co/szHRHavQKy
1594    https://t.co/X6KWMlzlv6
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])
782     https://t.co/JgY40r2NSo
833     https://t.co/JgY40r2NSo
1177    https://t.co/xycOG3ndi9
1254    https://t.co/xycOG3ndi9
1293    https://t.co/LMHW050TGs
1328    https://t.co/SS4DzEnSBZ
1393    https://t.co/NZlUce5Ne8
1619    https://t.co/e4nwrn3N2j
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])
130     https://t.co/AbJFaSI0B8
286     https://t.co/7dNBIsQ5Gq
467     https://t.co/3YIK588j8t
471     https://t.co/vjBJWWzvfv
830     https://t.co/T4mUjwUcgL
1093    https://t.co/wvLZLjuVKF
1180    https://t.co/luxL2qbxte
1526    https://t.co/G3ZTFL0RKv
Name: link, dtype: object
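Several links appear more than once in the output, possibly because the same tweet was retweeted (that explanation is an assumption). If only distinct tutorial links are wanted, one simple option is to deduplicate while keeping the order of first appearance. The list below copies a few of the python links from the output; the variable names are illustrative.

```python
# A few links copied from the python output, including repeats.
links = [
    "https://t.co/8bYvlziCZb",
    "https://t.co/EByqxT1qyN",
    "https://t.co/8bYvlziCZb",
    "https://t.co/5Jujg6h31B",
    "https://t.co/EByqxT1qyN",
]

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this removes repeats without reordering the links.
unique_links = list(dict.fromkeys(links))
print(len(unique_links))  # 3
```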