1. Install the tool (Anaconda, which bundles IPython)
Choose the version you need.
2. Installation steps
(1) Grant execute permission:
chmod u+x ./Anaconda2-4.2.0-Linux-x86_64.sh
(2) Run the installer, pressing ENTER at the prompts:
[root@hadoop161 tool]# ./Anaconda2-4.2.0-Linux-x86_64.sh
Welcome to Anaconda2 4.2.0 (by Continuum Analytics, Inc.)
In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
================
Anaconda License
================
Copyright 2016, Continuum Analytics, Inc.
All rights reserved under the 3-clause BSD License:
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of Continuum Analytics, Inc. nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL CONTINUUM ANALYTICS, INC. BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.
Notice of Third Party Software Licenses
=======================================
Anaconda contains open source software packages from third parties. These
are available on an "as is" basis and subject to their individual license
agreements. These licenses are available in Anaconda or at
http://docs.continuum.io/anaconda/pkg-docs . Any binary packages of these
third party tools you obtain via Anaconda are subject to their individual
licenses as well as the Anaconda license. Continuum reserves the right to
change which third party tools are provided in Anaconda.
In particular, Anaconda contains re-distributable, run-time, shared-library
files from the Intel (TM) Math Kernel Library ("MKL binaries"). You are
specifically authorized to use the MKL binaries with your installation of
Anaconda. You are also authorized to redistribute the MKL binaries with
Anaconda or in the conda package that contains them. If needed,
instructions for removing the MKL binaries after installation of Anaconda
are available at http://www.continuum.io.
Cryptography Notice
===================
This distribution includes cryptographic software. The country in which you
currently reside may have restrictions on the import, possession, use,
and/or re-export to another country, of encryption software. BEFORE using
any encryption software, please check your country's laws, regulations and
policies concerning the import, possession, or use, and re-export of
encryption software, to see if this is permitted. See the Wassenaar
Arrangement <http://www.wassenaar.org/> for more information.
Continuum Analytics has self-classified this software as Export Commodity
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms. The form and manner of this distribution makes it eligible for
export under the License Exception ENC Technology Software Unrestricted
(TSU) exception (see the BIS Export Administration Regulations, Section
740.13) for both object code and source code.
The following packages are included in this distribution that relate to
cryptography:
openssl
The OpenSSL Project is a collaborative effort to develop a robust,
commercial-grade, full-featured, and Open Source toolkit implementing the
Transport Layer Security (TLS) and Secure Sockets Layer (SSL) protocols as
well as a full-strength general purpose cryptography library.
pycrypto
A collection of both secure hash functions (such as SHA256 and RIPEMD160),
and various encryption algorithms (AES, DES, RSA, ElGamal, etc.).
pyopenssl
A thin Python wrapper around (a subset of) the OpenSSL library.
kerberos (krb5, non-Windows platforms)
A network authentication protocol designed to provide strong authentication
for client/server applications by using secret-key cryptography.
cryptography
A Python library which exposes cryptographic recipes and primitives.
Do you approve the license terms? [yes|no]
>>>
Please answer 'yes' or 'no':
>>> yes
Anaconda2 will now be installed into this location:
/root/anaconda2
- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below
[/root/anaconda2] >>>
PREFIX=/root/anaconda2
installing: python-2.7.12-1 ...
installing: _license-1.1-py27_1 ...
installing: _nb_ext_conf-0.3.0-py27_0 ...
installing: alabaster-0.7.9-py27_0 ...
installing: anaconda-clean-1.0.0-py27_0 ...
installing: anaconda-client-1.5.1-py27_0 ...
installing: anaconda-navigator-1.3.1-py27_0 ...
installing: argcomplete-1.0.0-py27_1 ...
installing: astroid-1.4.7-py27_0 ...
installing: astropy-1.2.1-np111py27_0 ...
installing: babel-2.3.4-py27_0 ...
installing: backports-1.0-py27_0 ...
installing: backports_abc-0.4-py27_0 ...
installing: beautifulsoup4-4.5.1-py27_0 ...
installing: bitarray-0.8.1-py27_0 ...
installing: blaze-0.10.1-py27_0 ...
installing: bokeh-0.12.2-py27_0 ...
installing: boto-2.42.0-py27_0 ...
installing: bottleneck-1.1.0-np111py27_0 ...
installing: cairo-1.12.18-6 ...
installing: cdecimal-2.3-py27_2 ...
installing: cffi-1.7.0-py27_0 ...
installing: chest-0.2.3-py27_0 ...
installing: click-6.6-py27_0 ...
installing: cloudpickle-0.2.1-py27_0 ...
installing: clyent-1.2.2-py27_0 ...
installing: colorama-0.3.7-py27_0 ...
installing: configobj-5.0.6-py27_0 ...
installing: configparser-3.5.0-py27_0 ...
installing: contextlib2-0.5.3-py27_0 ...
installing: cryptography-1.5-py27_0 ...
installing: curl-7.49.0-1 ...
installing: cycler-0.10.0-py27_0 ...
installing: cython-0.24.1-py27_0 ...
installing: cytoolz-0.8.0-py27_0 ...
installing: dask-0.11.0-py27_0 ...
installing: datashape-0.5.2-py27_0 ...
installing: dbus-1.10.10-0 ...
installing: decorator-4.0.10-py27_0 ...
installing: dill-0.2.5-py27_0 ...
installing: docutils-0.12-py27_2 ...
installing: dynd-python-0.7.2-py27_0 ...
installing: entrypoints-0.2.2-py27_0 ...
installing: enum34-1.1.6-py27_0 ...
installing: et_xmlfile-1.0.1-py27_0 ...
installing: expat-2.1.0-0 ...
installing: fastcache-1.0.2-py27_1 ...
installing: filelock-2.0.6-py27_0 ...
installing: flask-0.11.1-py27_0 ...
installing: flask-cors-2.1.2-py27_0 ...
installing: fontconfig-2.11.1-6 ...
installing: freetype-2.5.5-1 ...
installing: funcsigs-1.0.2-py27_0 ...
installing: functools32-3.2.3.2-py27_0 ...
installing: futures-3.0.5-py27_0 ...
installing: get_terminal_size-1.0.0-py27_0 ...
installing: gevent-1.1.2-py27_0 ...
installing: glib-2.43.0-1 ...
installing: greenlet-0.4.10-py27_0 ...
installing: grin-1.2.1-py27_3 ...
installing: gst-plugins-base-1.8.0-0 ...
installing: gstreamer-1.8.0-0 ...
installing: h5py-2.6.0-np111py27_2 ...
installing: harfbuzz-0.9.39-1 ...
installing: hdf5-1.8.17-1 ...
installing: heapdict-1.0.0-py27_1 ...
installing: icu-54.1-0 ...
installing: idna-2.1-py27_0 ...
installing: imagesize-0.7.1-py27_0 ...
installing: ipaddress-1.0.16-py27_0 ...
installing: ipykernel-4.5.0-py27_0 ...
installing: ipython-5.1.0-py27_0 ...
installing: ipython_genutils-0.1.0-py27_0 ...
installing: ipywidgets-5.2.2-py27_0 ...
installing: itsdangerous-0.24-py27_0 ...
installing: jbig-2.1-0 ...
installing: jdcal-1.2-py27_1 ...
installing: jedi-0.9.0-py27_1 ...
installing: jinja2-2.8-py27_1 ...
installing: jpeg-8d-2 ...
installing: jsonschema-2.5.1-py27_0 ...
installing: jupyter-1.0.0-py27_3 ...
installing: jupyter_client-4.4.0-py27_0 ...
installing: jupyter_console-5.0.0-py27_0 ...
installing: jupyter_core-4.2.0-py27_0 ...
installing: lazy-object-proxy-1.2.1-py27_0 ...
installing: libdynd-0.7.2-0 ...
installing: libffi-3.2.1-0 ...
installing: libgcc-4.8.5-2 ...
installing: libgfortran-3.0.0-1 ...
installing: libpng-1.6.22-0 ...
installing: libsodium-1.0.10-0 ...
installing: libtiff-4.0.6-2 ...
installing: libxcb-1.12-0 ...
installing: libxml2-2.9.2-0 ...
installing: libxslt-1.1.28-0 ...
installing: llvmlite-0.13.0-py27_0 ...
installing: locket-0.2.0-py27_1 ...
installing: lxml-3.6.4-py27_0 ...
installing: markupsafe-0.23-py27_2 ...
installing: matplotlib-1.5.3-np111py27_0 ...
installing: mistune-0.7.3-py27_0 ...
installing: mkl-11.3.3-0 ...
installing: mkl-service-1.1.2-py27_2 ...
installing: mpmath-0.19-py27_1 ...
installing: multipledispatch-0.4.8-py27_0 ...
installing: nb_anacondacloud-1.2.0-py27_0 ...
installing: nb_conda-2.0.0-py27_0 ...
installing: nb_conda_kernels-2.0.0-py27_0 ...
installing: nbconvert-4.2.0-py27_0 ...
installing: nbformat-4.1.0-py27_0 ...
installing: nbpresent-3.0.2-py27_0 ...
installing: networkx-1.11-py27_0 ...
installing: nltk-3.2.1-py27_0 ...
installing: nose-1.3.7-py27_1 ...
installing: notebook-4.2.3-py27_0 ...
installing: numba-0.28.1-np111py27_0 ...
installing: numexpr-2.6.1-np111py27_0 ...
installing: numpy-1.11.1-py27_0 ...
installing: odo-0.5.0-py27_1 ...
installing: openpyxl-2.3.2-py27_0 ...
installing: openssl-1.0.2j-0 ...
installing: pandas-0.18.1-np111py27_0 ...
installing: partd-0.3.6-py27_0 ...
installing: patchelf-0.9-0 ...
installing: path.py-8.2.1-py27_0 ...
installing: pathlib2-2.1.0-py27_0 ...
installing: patsy-0.4.1-py27_0 ...
installing: pep8-1.7.0-py27_0 ...
installing: pexpect-4.0.1-py27_0 ...
installing: pickleshare-0.7.4-py27_0 ...
installing: pillow-3.3.1-py27_0 ...
installing: pip-8.1.2-py27_0 ...
installing: pixman-0.32.6-0 ...
installing: pkginfo-1.3.2-py27_0 ...
installing: ply-3.9-py27_0 ...
installing: prompt_toolkit-1.0.3-py27_0 ...
installing: psutil-4.3.1-py27_0 ...
installing: ptyprocess-0.5.1-py27_0 ...
installing: py-1.4.31-py27_0 ...
installing: pyasn1-0.1.9-py27_0 ...
installing: pycairo-1.10.0-py27_0 ...
installing: pycosat-0.6.1-py27_1 ...
installing: pycparser-2.14-py27_1 ...
installing: pycrypto-2.6.1-py27_4 ...
installing: pycurl-7.43.0-py27_0 ...
installing: pyflakes-1.3.0-py27_0 ...
installing: pygments-2.1.3-py27_0 ...
installing: pylint-1.5.4-py27_1 ...
installing: pyopenssl-16.0.0-py27_0 ...
installing: pyparsing-2.1.4-py27_0 ...
installing: pyqt-5.6.0-py27_0 ...
installing: pytables-3.2.3.1-np111py27_0 ...
installing: pytest-2.9.2-py27_0 ...
installing: python-dateutil-2.5.3-py27_0 ...
installing: pytz-2016.6.1-py27_0 ...
installing: pyyaml-3.12-py27_0 ...
installing: pyzmq-15.4.0-py27_0 ...
installing: qt-5.6.0-0 ...
installing: qtawesome-0.3.3-py27_0 ...
installing: qtconsole-4.2.1-py27_1 ...
installing: qtpy-1.1.2-py27_0 ...
installing: readline-6.2-2 ...
installing: redis-3.2.0-0 ...
installing: redis-py-2.10.5-py27_0 ...
installing: requests-2.11.1-py27_0 ...
installing: rope-0.9.4-py27_1 ...
installing: scikit-image-0.12.3-np111py27_1 ...
installing: scikit-learn-0.17.1-np111py27_2 ...
installing: scipy-0.18.1-np111py27_0 ...
installing: setuptools-27.2.0-py27_0 ...
installing: simplegeneric-0.8.1-py27_1 ...
installing: singledispatch-3.4.0.3-py27_0 ...
installing: sip-4.18-py27_0 ...
installing: six-1.10.0-py27_0 ...
installing: snowballstemmer-1.2.1-py27_0 ...
installing: sockjs-tornado-1.0.3-py27_0 ...
installing: sphinx-1.4.6-py27_0 ...
installing: spyder-3.0.0-py27_0 ...
installing: sqlalchemy-1.0.13-py27_0 ...
installing: sqlite-3.13.0-0 ...
installing: ssl_match_hostname-3.4.0.2-py27_1 ...
installing: statsmodels-0.6.1-np111py27_1 ...
installing: sympy-1.0-py27_0 ...
installing: terminado-0.6-py27_0 ...
installing: tk-8.5.18-0 ...
installing: toolz-0.8.0-py27_0 ...
installing: tornado-4.4.1-py27_0 ...
installing: traitlets-4.3.0-py27_0 ...
installing: unicodecsv-0.14.1-py27_0 ...
installing: wcwidth-0.1.7-py27_0 ...
installing: werkzeug-0.11.11-py27_0 ...
installing: wheel-0.29.0-py27_0 ...
installing: widgetsnbextension-1.2.6-py27_0 ...
installing: wrapt-1.10.6-py27_0 ...
installing: xlrd-1.0.0-py27_0 ...
installing: xlsxwriter-0.9.3-py27_0 ...
installing: xlwt-1.1.2-py27_0 ...
installing: xz-5.2.2-0 ...
installing: yaml-0.1.6-0 ...
installing: zeromq-4.1.4-0 ...
installing: zlib-1.2.8-3 ...
installing: anaconda-4.2.0-np111py27_0 ...
installing: ruamel_yaml-0.11.14-py27_0 ...
installing: conda-4.2.9-py27_0 ...
installing: conda-build-2.0.2-py27_0 ...
Python 2.7.12 :: Continuum Analytics, Inc.
creating default environment...
installation finished.
Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /root/.bashrc ? [yes|no]
[no] >>> yes
Prepending PATH=/root/anaconda2/bin to PATH in /root/.bashrc
A backup will be made to: /root/.bashrc-anaconda2.bak
For this change to become active, you have to open a new terminal.
Thank you for installing Anaconda2!
Share your notebooks and packages on Anaconda Cloud!
Sign up for free: https://anaconda.org
[root@hadoop161 tool]#
3. Launch the Spark client shell
Edit the ~/.bashrc file and add the following lines:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Then run source ~/.bashrc; starting pyspark will now launch an IPython Notebook.
4. Run the code
(1) Start Hadoop's HDFS.
(2) Start Spark: change into the Spark installation directory
cd /hadoop/spark-2.0.1-bin-hadoop2.7
and launch the shell:
bin/pyspark
You should see the notebook start up (screenshot omitted).
(3) Open http://localhost:8889/tree# in a browser,
choose New ----> Python [default],
and a notebook command prompt appears.
Now we can start running the examples.
5. Run the code from the book
(1) Fetch the first line of the user data
From a local file:
PATH = "file:///spark/sparkdata"
user_data = sc.textFile("%s/ml-100k/u.user" % PATH)
#user_data = sc.textFile("file:///spark/sparkdata/ml-100k/u.user")
user_data.first()
u'1|24|M|technician|85711'
From a file on HDFS:
user_data = sc.textFile("/spark/sparkdata/ml-100k/u.user")
user_data.first()
(2) Count the users, genders, occupations, and ZIP codes
user_fields = user_data.map(lambda line: line.split("|"))
num_users = user_fields.map(lambda fields: fields[0]).count()
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
print "Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)
Users: 943, genders: 2, occupations: 21, ZIP codes: 795
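The same pipeline can be sketched locally in plain Python, which shows what the map/distinct/count chain computes. The sample rows below are hypothetical, in the u.user format (id|age|gender|occupation|zip):

```python
# Hypothetical sample rows in the u.user format: id|age|gender|occupation|zip
sample = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
]

fields = [line.split("|") for line in sample]     # like user_data.map(split)
num_users = len(fields)                           # like .count()
num_genders = len({f[2] for f in fields})         # like .distinct().count()
num_occupations = len({f[3] for f in fields})
num_zipcodes = len({f[4] for f in fields})
print(num_users, num_genders, num_occupations, num_zipcodes)  # 3 2 3 3
```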
(3) Analyze user ages with a histogram
%pylab inline
ages = user_fields.map(lambda x: int(x[1])).collect()
hist(ages, bins=20, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
(4) Distribution of user occupations
count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
(5) countByValue counts how many times each distinct value occurs in an RDD
count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
print "Map-reduce approach:"
print dict(count_by_occupation)
print ""
print "countByValue approach:"
print dict(count_by_occupation2)
Output:
Map-reduce approach:
{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}
countByValue approach:
{u'administrator': 79, u'writer': 45, u'retired': 14, u'lawyer': 12, u'doctor': 7, u'marketing': 26, u'executive': 32, u'none': 9, u'entertainment': 18, u'healthcare': 16, u'scientist': 31, u'student': 196, u'educator': 95, u'technician': 27, u'librarian': 51, u'programmer': 66, u'artist': 28, u'salesman': 12, u'other': 105, u'homemaker': 7, u'engineer': 67}
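Locally, countByValue corresponds to collections.Counter, and the map-reduce approach to summing (value, 1) pairs per key; both give the same dictionary. The occupation list below is a hypothetical stand-in:

```python
from collections import Counter

# Hypothetical occupation column, as fields[3] would yield
occupations = ["technician", "writer", "writer", "other", "technician", "writer"]

# countByValue() on an RDD of values corresponds to Counter on a local list
count_by_value = Counter(occupations)

# the map-reduce approach: emit (value, 1) pairs and sum per key
counts = {}
for occ in occupations:
    counts[occ] = counts.get(occ, 0) + 1

print(dict(count_by_value))  # {'technician': 2, 'writer': 3, 'other': 1}
```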
(6) Fetch a line of the movie data
movie_data = sc.textFile("%s/ml-100k/u.item" % PATH)
print movie_data.first()
num_movies = movie_data.count()
print "Movies: %d" % num_movies
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
Movies: 1682
(7) Distribution of movie ages
Handle malformed years first:
def convert_year(x):
    try:
        return int(x[-4:])
    except:
        return 1900 # there is a 'bad' data point with a blank year, which we set to 1900 and will filter out later
movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
# we filter out any 'bad' data points here
years_filtered = years.filter(lambda x: x != 1900)
# plot the movie ages histogram
movie_ages = years_filtered.map(lambda yr: 1998-yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
hist(values, bins=bins, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)
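The convert_year helper and the age computation can be checked locally. The dates below follow the u.item format (DD-MMM-YYYY), with one blank value standing in for the known bad record:

```python
def convert_year(x):
    # same helper as above: the last 4 characters hold the year,
    # and a blank/bad field falls back to the sentinel 1900
    try:
        return int(x[-4:])
    except:
        return 1900

dates = ["01-Jan-1995", "01-Jan-1994", ""]        # last entry is malformed
years = [convert_year(d) for d in dates]
years_filtered = [y for y in years if y != 1900]  # drop the sentinel
movie_ages = [1998 - y for y in years_filtered]
print(movie_ages)  # [3, 4]
```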
(8) Analyze the rating data
rating_data_raw = sc.textFile("%s/ml-100k/u.data" % PATH)
print rating_data_raw.first()
num_ratings = rating_data_raw.count()
print "Ratings: %d" % num_ratings
There are 100,000 ratings in total:
196 242 3 881250949
Ratings: 100000
(9) Basic statistics: maximum, minimum, mean, and median ratings
rating_data = rating_data_raw.map(lambda line: line.split("\t"))
ratings = rating_data.map(lambda fields: int(fields[2]))
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
median_rating = np.median(ratings.collect())
ratings_per_user = num_ratings / num_users
ratings_per_movie = num_ratings / num_movies
print "Min rating: %d" % min_rating
print "Max rating: %d" % max_rating
print "Average rating: %2.2f" % mean_rating
print "Median rating: %d" % median_rating
print "Average # of ratings per user: %2.2f" % ratings_per_user
print "Average # of ratings per movie: %2.2f" % ratings_per_movie
Output:
Min rating: 1
Max rating: 5
Average rating: 3.53
Median rating: 4
Average # of ratings per user: 106.00
Average # of ratings per movie: 59.00
Spark's stats function provides similar information:
# we can also use the stats function to get some similar information to the above
ratings.stats()
(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
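ratings.stats() reports count, mean, population standard deviation, max, and min. The same quantities can be computed locally with the statistics module; the sample below is a tiny hypothetical list of rating values:

```python
import statistics

ratings = [3, 5, 1, 4, 4, 3]  # hypothetical sample of rating values

stats = {
    "count": len(ratings),
    "mean": statistics.mean(ratings),
    "stdev": statistics.pstdev(ratings),  # Spark's stats() reports the population stdev
    "max": max(ratings),
    "min": min(ratings),
}
print(stats)
```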
(10) Histogram of the rating distribution
# create plot of counts by rating value
count_by_rating = ratings.countByValue()
x_axis = np.array(count_by_rating.keys())
y_axis = np.array([float(c) for c in count_by_rating.values()])
# we normalize the y-axis here to percentages
y_axis_normed = y_axis / y_axis.sum()
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis_normed, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
The resulting bar chart is rendered inline in the notebook (figure omitted).
(11) Compute the number of ratings per user
# to compute the distribution of ratings per user, we first group the ratings by user id
user_ratings_grouped = rating_data.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
# then, for each key (user id), we find the size of the set of ratings, which gives us the # ratings for that user
user_ratings_byuser = user_ratings_grouped.map(lambda (k, v): (k, len(v)))
user_ratings_byuser.take(5)
[(2, 62), (4, 24), (6, 211), (8, 59), (10, 184)]
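The groupByKey-then-len step amounts to counting ratings per user id; a local sketch with a defaultdict, over hypothetical (user_id, rating) pairs:

```python
from collections import defaultdict

# Hypothetical (user_id, rating) pairs, as rating_data would yield
pairs = [(1, 5), (2, 3), (1, 4), (3, 2), (1, 3), (2, 5)]

grouped = defaultdict(list)  # like groupByKey(): collect values per key
for user, rating in pairs:
    grouped[user].append(rating)

# like .map(lambda (k, v): (k, len(v)))
ratings_per_user = {user: len(vals) for user, vals in grouped.items()}
print(ratings_per_user)  # {1: 3, 2: 2, 3: 1}
```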
(12) Histogram of ratings per user
# and finally plot the histogram
user_ratings_byuser_local = user_ratings_byuser.map(lambda (k, v): v).collect()
hist(user_ratings_byuser_local, bins=200, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)
3.3 Processing and transforming data
1. Common approaches:
(1) Filter out or remove records with malformed or missing values: this is often necessary, but it discards the good information those records contain.
(2) Fill in malformed or missing data.
(3) Use techniques that are robust to outliers (e.g. robust regression).
(4) Transform the potential outliers.
For example, impute the bad release years (set to the sentinel 1900 earlier) with the median of the valid years:
# rebuild the years array from the earlier convert_year step
years_pre_processed = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x)).collect()
years_pre_processed_array = np.array(years_pre_processed)
mean_year = np.mean(years_pre_processed_array[years_pre_processed_array!=1900])
median_year = np.median(years_pre_processed_array[years_pre_processed_array!=1900])
idx_bad_data = np.where(years_pre_processed_array==1900)[0]
years_pre_processed_array[idx_bad_data] = median_year
print "Mean year of release: %d" % mean_year
print "Median year of release: %d" % median_year
print "Index of '1900' after assigning median: %s" % np.where(years_pre_processed_array == 1900)[0]
Mean year of release: 1989
Median year of release: 1995
Index of '1900' after assigning median: []
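The median-imputation step above can be sketched with NumPy on a small hypothetical array of years:

```python
import numpy as np

# Hypothetical pre-processed years; 1900 marks the bad record
years = np.array([1995, 1994, 1900, 1996])

good = years[years != 1900]
median_year = np.median(good)       # median of the valid years
idx_bad = np.where(years == 1900)[0]
years[idx_bad] = median_year        # impute the sentinel with the median
print(years.tolist())  # [1995, 1994, 1995, 1996]
```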
3.4 Extracting features from the data
1. Features: the variables used to train a model. Almost all machine learning models work with numerical features represented as vectors, so the raw data must be converted into numbers.
2. Feature types:
(1) Numerical features: real numbers or integers.
(2) Categorical features: e.g. a user's gender or occupation; values drawn from a fixed set.
(3) Text features: derived from the textual content in the data, such as movie titles, descriptions, or reviews.
(4) Other features: e.g. images, video, audio, geographic coordinates.
As an example, 1-of-k encode the occupation categorical feature:
all_occupations = user_fields.map(lambda fields: fields[3]).distinct().collect()
all_occupations.sort()
# create a new dictionary to hold the occupations, and assign the "1-of-k" indexes
idx = 0
all_occupations_dict = {}
for o in all_occupations:
    all_occupations_dict[o] = idx
    idx += 1
# try a few examples to see what "1-of-k" encoding is assigned
print "Encoding of 'doctor': %d" % all_occupations_dict['doctor']
print "Encoding of 'programmer': %d" % all_occupations_dict['programmer']
Output:
Encoding of 'doctor': 2
Encoding of 'programmer': 14
# create a vector representation for "programmer" and encode it into a binary vector
K = len(all_occupations_dict)
binary_x = np.zeros(K)
k_programmer = all_occupations_dict['programmer']
binary_x[k_programmer] = 1
print "Binary feature vector: %s" % binary_x
print "Length of binary vector: %d" % K
Output:
Binary feature vector: [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0.]
Length of binary vector: 21
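The two steps above (index assignment, then a zero vector with a single 1) can be wrapped into one small helper. The occupation list below is a hypothetical subset, not the full 21-category list:

```python
def one_of_k(value, categories):
    # map each sorted category to an index, then set a single 1 in a zero vector
    index = {c: i for i, c in enumerate(sorted(categories))}
    vec = [0.0] * len(categories)
    vec[index[value]] = 1.0
    return vec

occupations = ["doctor", "programmer", "artist"]  # hypothetical subset
print(one_of_k("programmer", occupations))  # [0.0, 0.0, 1.0]
```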
Converting a timestamp into a categorical feature:
# a function to extract the timestamps (in seconds) from the dataset
def extract_datetime(ts):
    import datetime
    return datetime.datetime.fromtimestamp(ts)
timestamps = rating_data.map(lambda fields: int(fields[3]))
hour_of_day = timestamps.map(lambda ts: extract_datetime(ts).hour)
hour_of_day.take(5)
[7, 11, 23, 21, 21]
The code below is modified: the 'night' range in the book's original source is wrong.
# a function for assigning "time-of-day" bucket given an hour of the day
def assign_tod(hr):
    # buckets are non-overlapping and cover hours 0-23; hour 7 is assigned
    # to 'night', matching the sample output below
    times_of_day = {
        'morning' : range(8, 12),
        'lunch' : range(12, 14),
        'afternoon' : range(14, 18),
        'evening' : range(18, 23),
        'night' : [23, 0, 1, 2, 3, 4, 5, 6, 7]
    }
    for k, v in times_of_day.iteritems():
        if hr in v:
            return k
# now apply the "time of day" function to the "hour of day" RDD
time_of_day = hour_of_day.map(lambda hr: assign_tod(hr))
time_of_day.take(5)
['night', 'morning', 'night', 'evening', 'evening']
Text feature processing: the bag-of-words approach
(1) Tokenization
(2) Stop-word removal
(3) Stemming
(4) Vectorization
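A minimal stdlib-only sketch of those four steps. The stop-word list and the crude suffix-stripping "stemmer" are illustrative assumptions, not what the book or any NLP library actually uses:

```python
STOP_WORDS = {"the", "a", "of"}  # assumed tiny stop-word list

def stem(term):
    # crude illustrative stemmer: strip a plural 's'
    return term[:-1] if term.endswith("s") else term

def bag_of_words(text, vocab):
    terms = text.lower().split(" ")                    # (1) tokenization
    terms = [t for t in terms if t not in STOP_WORDS]  # (2) stop-word removal
    terms = [stem(t) for t in terms]                   # (3) stemming
    vec = [0] * len(vocab)                             # (4) vectorization
    for t in terms:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

vocab = {"toy": 0, "story": 1, "room": 2}
print(bag_of_words("The Toy Story of Rooms", vocab))  # [1, 1, 1]
```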
Test extract_title on the first 5 titles:
# we define a function to extract just the title from the raw movie title, removing the year of release
def extract_title(raw):
    import re
    grps = re.search("\((\w+)\)", raw) # this regular expression finds the non-word (numbers) between parentheses
    if grps:
        return raw[:grps.start()].strip() # we strip the trailing whitespace from the title
    else:
        return raw
# first lets extract the raw movie titles from the movie fields
raw_titles = movie_fields.map(lambda fields: fields[1])
# next, we strip away the "year of release" to leave us with just the title text
# let's test our title extraction function on the first 5 titles
for raw_title in raw_titles.take(5):
    print extract_title(raw_title)
Toy Story
GoldenEye
Four Rooms
Get Shorty
Copycat
Whitespace tokenization splits each title into terms:
# ok that looks good! let's apply it to all the titles
movie_titles = raw_titles.map(lambda m: extract_title(m))
# next we tokenize the titles into terms. We'll use simple whitespace tokenization
title_terms = movie_titles.map(lambda t: t.split(" "))
print title_terms.take(5)
[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'], [u'Get', u'Shorty'], [u'Copycat']]
Build a term dictionary and look up the indexes of a few terms:
# next we would like to collect all the possible terms, in order to build out dictionary of term <-> index mappings
all_terms = title_terms.flatMap(lambda x: x).distinct().collect()
# create a new dictionary to hold the terms, and assign the "1-of-k" indexes
idx = 0
all_terms_dict = {}
for term in all_terms:
    all_terms_dict[term] = idx
    idx += 1
num_terms = len(all_terms_dict)
print "Total number of terms: %d" % num_terms
print "Index of term 'Dead': %d" % all_terms_dict['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict['Rooms']
Output:
Total number of terms: 2645
Index of term 'Dead': 147
Index of term 'Rooms': 1963
Spark's zipWithIndex achieves the same result:
# we could also use Spark's 'zipWithIndex' RDD function to create the term dictionary
all_terms_dict2 = title_terms.flatMap(lambda x: x).distinct().zipWithIndex().collectAsMap()
print "Index of term 'Dead': %d" % all_terms_dict2['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict2['Rooms']
Output:
Index of term 'Dead': 147
Index of term 'Rooms': 1963
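On a local list, zipWithIndex followed by collectAsMap corresponds to building a dict with enumerate. The term list below is a hypothetical stand-in for the distinct title terms:

```python
terms = ["Toy", "Story", "GoldenEye", "Rooms"]  # hypothetical distinct terms

# zipWithIndex().collectAsMap() on an RDD corresponds to this locally:
term_dict = {term: idx for idx, term in enumerate(terms)}
print(term_dict["Rooms"])  # 3
```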
# this function takes a list of terms and encodes it as a scipy sparse vector using an approach
# similar to the 1-of-k encoding
def create_vector(terms, term_dict):
    from scipy import sparse as sp
    x = sp.csc_matrix((1, num_terms))
    for t in terms:
        if t in term_dict:
            idx = term_dict[t]
            x[0, idx] = 1
    return x
all_terms_bcast = sc.broadcast(all_terms_dict)
term_vectors = title_terms.map(lambda terms: create_vector(terms, all_terms_bcast.value))
term_vectors.take(5)
[<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Column format>,
<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>,
<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Column format>,
<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Column format>,
<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>]
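The same sparse 1-of-k idea can be sketched without scipy by storing only the non-zero entries in a plain dict mapping column index to value (illustrative only; the code above uses scipy's csc_matrix):

```python
def create_sparse_vector(terms, term_dict):
    # only the non-zero entries are stored, as {column_index: value}
    vec = {}
    for t in terms:
        if t in term_dict:
            vec[term_dict[t]] = 1.0
    return vec

term_dict = {"Toy": 0, "Story": 1, "Rooms": 2}  # hypothetical term dictionary
print(create_sparse_vector(["Toy", "Story"], term_dict))  # {0: 1.0, 1: 1.0}
```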
Normalizing features: this transforms a single feature in the dataset, for example subtracting the mean (centering) or scaling to a unit norm. Here a random vector is scaled to unit length:
np.random.seed(42)
x = np.random.randn(10)
norm_x_2 = np.linalg.norm(x)
normalized_x = x / norm_x_2
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x:\n%s" % normalized_x
print "2-Norm of normalized_x: %2.4f" % np.linalg.norm(normalized_x)
Output:
x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696
1.57921282 0.76743473 -0.46947439 0.54256004]
2-Norm of x: 2.5908
Normalized x:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237
0.60954584 0.29621508 -0.1812081 0.20941776]
2-Norm of normalized_x: 1.0000
Normalizing feature vectors: transform all features of a given row so that the resulting feature vector has a standardized (unit) length. MLlib provides a Normalizer for this:
from pyspark.mllib.feature import Normalizer
normalizer = Normalizer()
vector = sc.parallelize([x])
normalized_x_mllib = normalizer.transform(vector).first().toArray()
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x MLlib:\n%s" % normalized_x_mllib
print "2-Norm of normalized_x_mllib: %2.4f" % np.linalg.norm(normalized_x_mllib)
Output:
x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696
1.57921282 0.76743473 -0.46947439 0.54256004]
2-Norm of x: 2.5908
Normalized x MLlib:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237
0.60954584 0.29621508 -0.1812081 0.20941776]
2-Norm of normalized_x_mllib: 1.0000
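Normalizer defaults to the L2 norm, so the transform is simply x / ||x||2. A stdlib-only check that the result always has unit norm (the random vector here is hypothetical, not the NumPy-seeded x above):

```python
import math
import random

random.seed(42)
x = [random.gauss(0, 1) for _ in range(10)]  # hypothetical random vector

norm = math.sqrt(sum(v * v for v in x))      # 2-norm of x
normalized = [v / norm for v in x]           # what Normalizer().transform computes

unit = math.sqrt(sum(v * v for v in normalized))
print(round(unit, 4))  # 1.0
```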