Spark Machine Learning 03 Spark上數據的獲取、處理與準備

時間 2020-05-20

標籤 spark machine learning 數據獲取處理準備欄目 Spark 简体版

原文原文鏈接

Chap 03 Spark上數據的獲取處理

Spark上數據的獲取、處理與準備html

MovieStream，數據包括網站提供的電影數據、用戶的服務信息數據以及行爲數據。python

這些數據涉及電影和相關內容（好比標題、分類、圖片、演員和導演）、用戶信息（好比用戶屬性、位置和其餘信息）以及用戶活動數據（好比瀏覽數、預覽的標題和次數、評級、評論，以及如贊、分享之類的社交數據，還有包括像Facebook和Twitter之類的社交網絡屬性）。算法

其外部數據來源則可能包括天氣和地理定位信息，以及如IMDB和Rotten Tomators之類的第三方電影評級與評論信息等。sql

一個預測精準的好模型有着極高的商業價值（Netflix Prize 和 Kaggle 上機器學習比賽的成功就是很好的見證）shell

focus on數組

數據的處理、清理、探索和可視化方法；網絡
原始數據轉換爲可用於機器學習算法特徵的各類技術；app
學習如何使用外部庫或Spark內置函數來正則化輸入特徵.dom

3.1 獲取公開數據集

UCL機器學習知識庫機器學習

包括近300個不一樣大小和類型的數據集，可用於分類、迴歸、聚類和推薦系統任務。數據集列表位於：http://archive.ics.uci.edu/ml/。

Amazon AWS公開數據集

包含的一般是大型數據集，可經過Amazon S3訪問。這些數據集包括人類基因組項目、Common Crawl網頁語料庫、維基百科數據和Google Books Ngrams。
相關信息可參見：http://aws.amazon.com/publicdatasets/。

Kaggle

這裏集合了Kaggle舉行的各類機器學習競賽所用的數據集。
它們覆蓋分類、迴歸、排名、推薦系統以及圖像分析領域，可從Competitions區域下載：http://www.kaggle.com/competitions。

KDnuggets

這裏包含一個詳細的公開數據集列表，其中一些上面提到過的。
該列表位於：http://www.kdnuggets.com/datasets/index.html。

MovieLens 100k數據集

MovieLens 100k數據集包含表示多個用戶對多部電影的10萬次評級數據，也包含電影元數據和用戶屬性信息

http://files.grouplens.org/datasets/movielens/ml-100k.zip

ml-100k/ u.user（用戶屬性文件）、u.item（電影元數據）和u.data（用戶對電影的評級）

>unzip ml-100k.zip
  inflating: ml-100k/allbut.pl
  inflating: ml-100k/mku.sh
  inflating: ml-100k/README
  ...
  inflating: ml-100k/ub.base
  inflating: ml-100k/ub.test

u.user

user.id、age、gender、occupation、ZIP code

>head -5 u.user
  1|24|M|technician|85711
  2|53|F|other|94043
  3|23|M|writer|32067
  4|24|M|technician|43537
  5|33|F|other|15213

u.item

movie id、title、release date以及若干與IMDB link和電影分類相關的屬性

>head -5 u.item
  1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20 Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
  2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
  3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
  4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
  5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title- exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0

u.data

user id、movie id、rating（從1到5）和timestamp屬性，各屬性間用製表符（t）分隔

>head -5 u.data
196    242    3    881250949
186    302    3    891717742
22     377    1    878887116
244    51     2    880606923
166    346    1    886397596

3.2 探索與可視化數據

IPython的安裝方法可參考以下指引：http://ipython.org/install.html。

若是這是你第一次使用IPython，這裏有一個教程：http://ipython.org/ipython-doc/stable/interactive/tutorial.html。

>IPYTHON=1 IPYTHON_OPTS="--pylab" ./bin/pyspark

終端裏的IPython 2.3.1 -- An enhanced Interactive Python和Using matplotlib backend: MacOSX輸出行表示IPython和pylab均已被PySpark啓用。

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.10 (default, Jul 14 2015 19:46:27)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]:

能夠將樣本代碼輸入到IPython終端，也可經過IPython提供的Notebook 應用來完成。Notebook支持HTML顯示，且在IPython終端的基礎上提供了一些加強功能，如即時繪圖、HTML標記，以及獨立運行代碼片斷的功能。

IPython Notebook 使用指南：http://ipython.org/ipython-doc/stable/interactive/notebook.html

3.2.1 探索用戶數據

user_data = sc.textFile("/Users/hp/ghome/ml/ml-100k/u.user")
user_data.first()
user_data.take(5)

user_fields = user_data.map(lambda line: line.split("|"))
num_users = user_fields.map(lambda fields: fields[0]).count()
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
print "Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)

Output

Users: 943, genders: 2, occupations: 21, ZIP codes: 795

matplotlib的hist個直方圖，以分析用戶年齡的分佈狀況：

age distribution

ages = user_fields.map(lambda x: int(x[1])).collect()
hist(ages, bins=20, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)

occupation distribution

count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()

x_axis1 = np.array([c[0] for c in count_by_occupation])

y_axis1 = np.array([c[1] for c in count_by_occupation])

print x_axis1
[u'administrator' u'retired' u'lawyer' u'none' u'student' u'technician'
 u'programmer' u'salesman' u'homemaker' u'writer' u'doctor'
 u'entertainment' u'marketing' u'executive' u'scientist' u'educator'
 u'healthcare' u'librarian' u'artist' u'other' u'engineer']

print y_axis1
[ 79  14  12   9 196  27  66  12   7  45   7  18  26  32  31  95  16  51
  28 105  67]

plt.xticks(rotation=30)之類的代碼是美化條形圖

pos = np.arange(len(x_axis))
width = 1.0

ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)

plt.bar(pos, y_axis, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)

Spark對RDD提供了一個名爲countByValue的便捷函數

count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
print "Map-reduce approach:"
print dict(count_by_occupation2)
print ""
print "countByValue approach:"
print dict(count_by_occupation)

3.2.2 探索電影數據

movie_data = sc.textFile("/PATH/ml-100k/u.item")
print movie_data.first()
num_movies = movie_data.count()
print "Movies: %d" % num_movies

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
Movies: 1682

def convert_year(x):
  try:
    return int(x[-4:])
  except:
    return 1900

movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))

years_filtered = years.filter(lambda x: x != 1900)

movie_ages = years_filtered.map(lambda yr: 1998-yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
hist(values, bins=bins, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)

3.2.3 探索評級數據

rating_data = sc.textFile("/Users/hp/ghome/ml/ml-100k/u.data")
print rating_data.first()
num_ratings = rating_data.count()
print "Ratings: %d" % num_ratings

rating_data = rating_data.map(lambda line: line.split("\t"))
ratings = rating_data.map(lambda fields: int(fields[2]))
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
mean_rating = ratings.reduce(lambda x, y: x + y) / num_ratings
median_rating = np.median(ratings.collect())
ratings_per_user = num_ratings / num_users
ratings_per_movie = num_ratings / num_movies
print "Min rating: %d" % min_rating
print "Max rating: %d" % max_rating
print "Average rating: %2.2f" % mean_rating
print "Median rating: %d" % median_rating
print "Average # of ratings per user: %2.2f" % ratings_per_user
print "Average # of ratings per movie: %2.2f" % ratings_per_movie

Max rating: 5
Average rating: 3.00
Median rating: 4
Average # of ratings per user: 106.00
Average # of ratings per movie: 59.00

Spark對RDD也提供一個名爲states的函數。該函數包含一個數值變量用於作相似的統計：

ratings.stats()

其輸出爲：
(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)

count_by_rating = ratings.countByValue()
x_axis = np.array(count_by_rating.keys())
y_axis = np.array([float(c) for c in count_by_rating.values()])
# 這裏對y軸正則化，使它表示百分比
y_axis_normed = y_axis / y_axis.sum()
pos = np.arange(len(x_axis))
width = 1.0

ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)

plt.bar(pos, y_axis_normed, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)

各個用戶評級次數的分佈狀況

user_ratings_grouped = rating_data.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()

user_ratings_byuser = user_ratings_grouped.map(lambda (k, v): (k, len(v)))
user_ratings_byuser.take(10)

Out[91]:
[(2, 62),
 (4, 24),
 (6, 211),
 (8, 59),
 (10, 184),
 (12, 51),
 (14, 98),
 (16, 140),
 (18, 277),
 (20, 48)]

user_ratings_byuser_local = user_ratings_byuser.map(lambda (k, v): v).collect()
hist(user_ratings_byuser_local, bins=200, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)

3.3 處理與轉換數據

非規整數據和缺失數據的填充

3.4 從數據中提取有用特徵

在完成對數據的初步探索、處理和清理後，即可從中提取可供機器學習模型訓練用的特徵。

特徵（feature）指那些用於模型訓練的變量。每一行數據包含可供提取到訓練樣本中的各類信息。

幾乎全部機器學習模型都是與用向量表示的數值特徵打交道；需將原始數據轉換爲數值。

特徵能夠歸納地分爲以下幾種。

數值特徵（numerical feature）：這些特徵一般爲實數或整數，好比以前例子中提到的年齡。
類別特徵（categorical feature）：咱們數據集中的用戶性別、職業或電影類別即是這類。
文本特徵（text feature）：它們派生自數據中的文本內容，好比電影名、描述或是評論。
其餘特徵：... 地理位置則可由經緯度或地理散列（geohash）表示。

3.4.1 數值特徵

原始的數值和一個數值特徵之間的區別是什麼？

機器學習模型中所學習的是各個特徵所對應的向量的權值。這些權值在特徵值到輸出或是目標變量（指在監督學習模型中）is very important。

當數值特徵仍處於原始形式時，其可用性相對較低，但能夠轉化爲更有用的表示形式。

如 (位置信息 : 原始位置信息（好比用經緯度表示的），信息可用性很低。然若對位置進行聚合（好比聚焦爲一個city or country），和特定輸出之間存在某種關聯。

3.4.2 類別特徵

將類別特徵表示爲數字形式，常可藉助 k 之1（1-of-k）方法進行

好比，可取occupation 全部可能取值：

all_occupations = user_fields.map(lambda fields: fields[3]). distinct().collect()
all_occupations.sort()

然可依次對各可能的職業分配序號（注意從0開始編號）：

idx = 0
all_occupations_dict = {}
for o in all_occupations:
    all_occupations_dict[o] = idx
    idx +=1
# 看一下「k之1」編碼會對新的例子分配什麼值
print "Encoding of 'doctor': %d" % all_occupations_dict['doctor']
print "Encoding of 'programmer': %d" % all_occupations_dict['programmer']

其輸出以下：

Encoding of 'doctor': 2
Encoding of 'programmer': 14

3.4.3 派生特徵

從原始數據派生特徵的例子包括計算平均值、中位值、方差、和、差、最大值或最小值以及計數。從電影的發行年份和當前年份派生了新的movie age特徵的。這類轉換背後的想法經常是對數值數據進行某種歸納，並指望它能讓模型學習更容易。

數值特徵到類別特徵的轉換也很常見，好比劃分爲區間特徵。進行這類轉換的變量常見的有年齡、地理位置和時間。

如：將時間戳轉爲類別特

電影評級發生的時間

['afternoon', 'evening', 'morning', 'morning', 'morning']

3.4.4 文本特徵

文本特徵也是一種類別特徵或派生特徵

NLP 即是專一於文本內容的處理、表示和建模的一個領域。

介紹一種簡單且標準化的文本特徵提取方法。該方法被稱爲詞袋（bag-of-word）表示法。

詞袋法將一段文本視爲由其中的文本或數字組成的集合，其處理過程以下。

bag-of-word

(1) 分詞（tokenization）

首先會應用某些分詞方法來將文本分隔爲一個由詞（通常如單詞、數字等）組成的集合。

(2) 刪除停用詞（stop words removal)

刪除常見的單詞，好比the、and和but（這些詞被稱做停用詞）。

(3) 提取詞幹（stemming）：

是指將各個詞簡化爲其基本的形式或者幹詞。常見的例子如複數變爲單數（好比dogs變爲dog等）。提取的方法有不少種，文本處理算法庫中經常會包括多種詞幹提取方法。

(4) 向量化（vectorization） ：

向量來表示處理好的詞。二元向量多是最爲簡單的表示方式。它用1和0來分別表示是否存在某個詞。從根本上說，這與以前提到的 k 之1編碼相同。與 k 之1相同，它須要一個詞的字典來實現詞到索引序號的映射。隨着遇到的詞增多，各類詞可能達數百萬。由此，使用稀疏矩陣來表示就很關鍵。這種表示只記錄某個詞是否出現過，從而節省內存和磁盤空間，以及計算時間。

提取簡單的文本特徵

參見 : http://www.ituring.com.cn/tupubarticle/5567

如今每個電影標題都被轉換爲一個稀疏向量。

3.4.5 正則化特徵

在將特徵提取爲向量形式後，一種常見的預處理方式是將數值數據正則化（normalization）。其背後的思想是將各個數值特徵進行轉換，以將它們的值域規範到一個標準區間內。正則化的方法有以下幾種。

正則化特徵：這其實是對數據集中的單個特徵進行轉換。好比減去平均值（特徵對齊）或是進行標準的正則轉換（以使得該特徵的平均值和標準差分別爲0和1）。
正則化特徵向量：這一般是對數據中的某一行的全部特徵進行轉換，以讓轉換後的特徵向量的長度標準化。也就是縮放向量中的各個特徵以使得向量的範數爲1（常指一階或二階範數）。

向量正則化可經過numpy的norm函數來實現。具體來講，先計算一個隨機向量的二階範數，而後讓向量中的每個元素都除該範數，從而獲得正則化後的向量：

np.random.seed(42)
x = np.random.randn(10)
norm_x_2 = np.linalg.norm(x)
normalized_x = x / norm_x_2
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x:\n%s" % normalized_x
print "2-Norm of normalized_x: %2.4f" % np.linalg.norm(normalized_x)

其輸出應該以下（上面將隨機種子的值設爲42，保證每次運行的結果相同）：

x: [ 0.49671415 -0.1382643  0.64768854  1.52302986 -0.23415337 -0.23413696
1.57921282  0.76743473 -0.46947439  0.54256004]
2-Norm of x: 2.5908
Normalized x: [ 0.19172213 -0.05336737  0.24999534  0.58786029 -0.09037871 -0.09037237  0.60954584  0.29621508 -0.1812081  0.20941776]
2-Norm of normalized_x: 1.0000

用 MLlib 正則化特徵

Spark在其MLlib機器學習庫中內置了一些函數用於特徵的縮放和標準化。它們包括供標準正態變換的StandardScaler，以及提供與上述相同的特徵向量正則化的 Normalizer。

比較一下MLlib的Normalizer與咱們本身函數的結果：

from pyspark.mllib.feature import Normalizer
normalizer = Normalizer()
vector =sc.parallelize([x])

在導入所需的類後，會要初始化Normalizer（其默認使用與以前相同的二階範數）。注意用Spark時，大部分狀況下Normalizer所需的輸入爲一個RDD（它包含numpy數值或MLlib向量）。做爲舉例，咱們會從x向量建立一個單元素的RDD。

以後將會對咱們的RDD調用Normalizer的transform函數。因爲該RDD只含有一個向量，可經過first函數來返回向量到驅動程序。接着調用toArray函數來將該向量轉換爲numpy數組：

normalized_x_mllib = normalizer.transform(vector).first().toArray()
#最後來看一下以前打印過的那些值，並作個比較：

print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x MLlib:\n%s" % normalized_x_mllib
print "2-Norm of normalized_x_mllib: %2.4f" % np.linalg.norm(normalized_x_mllib)

相比本身編寫的函數，使用 MLlib內置的函數更方便

3.4.6 用軟件包提取特徵

特徵提取可藉助的軟件包有scikit-learn、gensim、scikit-image、matplotlib、Python的NLTK、Java編寫的OpenNLP以及用Scala編寫的Breeze和Chalk。Breeze自Spark 1.0開始就成爲Spark的一部分了。Breeze有線性代數功能。