1. Install the tool (Anaconda, which bundles IPython)
Choose the version you need.
2. Installation steps
(1) Grant execute permission:
chmod u+x ./Anaconda2-4.2.0-Linux-x86_64.sh
(2) Run the installer, pressing ENTER at the prompts:
[root@hadoop161 tool]# ./Anaconda2-4.2.0-Linux-x86_64.sh
Welcome to Anaconda2 4.2.0 (by Continuum Analytics, Inc.)
In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
================
Anaconda License
================
Copyright 2016, Continuum Analytics, Inc.
All rights reserved under the 3-clause BSD License:
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of Continuum Analytics, Inc. nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL CONTINUUM ANALYTICS, INC. BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.
Notice of Third Party Software Licenses
=======================================
Anaconda contains open source software packages from third parties. These
are available on an "as is" basis and subject to their individual license
agreements. These licenses are available in Anaconda or at
http://docs.continuum.io/anaconda/pkg-docs . Any binary packages of these
third party tools you obtain via Anaconda are subject to their individual
licenses as well as the Anaconda license. Continuum reserves the right to
change which third party tools are provided in Anaconda.
In particular, Anaconda contains re-distributable, run-time, shared-library
files from the Intel (TM) Math Kernel Library ("MKL binaries"). You are
specifically authorized to use the MKL binaries with your installation of
Anaconda. You are also authorized to redistribute the MKL binaries with
Anaconda or in the conda package that contains them. If needed,
instructions for removing the MKL binaries after installation of Anaconda
are available at http://www.continuum.io.
Cryptography Notice
===================
This distribution includes cryptographic software. The country in which you
currently reside may have restrictions on the import, possession, use,
and/or re-export to another country, of encryption software. BEFORE using
any encryption software, please check your country's laws, regulations and
policies concerning the import, possession, or use, and re-export of
encryption software, to see if this is permitted. See the Wassenaar
Arrangement <http://www.wassenaar.org/> for more information.
Continuum Analytics has self-classified this software as Export Commodity
Control Number (ECCN) 5D002.C.1, which includes information security
software using or performing cryptographic functions with asymmetric
algorithms. The form and manner of this distribution makes it eligible for
export under the License Exception ENC Technology Software Unrestricted
(TSU) exception (see the BIS Export Administration Regulations, Section
740.13) for both object code and source code.
The following packages are included in this distribution that relate to
cryptography:
openssl
The OpenSSL Project is a collaborative effort to develop a robust,
commercial-grade, full-featured, and Open Source toolkit implementing the
Transport Layer Security (TLS) and Secure Sockets Layer (SSL) protocols as
well as a full-strength general purpose cryptography library.
pycrypto
A collection of both secure hash functions (such as SHA256 and RIPEMD160),
and various encryption algorithms (AES, DES, RSA, ElGamal, etc.).
pyopenssl
A thin Python wrapper around (a subset of) the OpenSSL library.
kerberos (krb5, non-Windows platforms)
A network authentication protocol designed to provide strong authentication
for client/server applications by using secret-key cryptography.
cryptography
A Python library which exposes cryptographic recipes and primitives.
Do you approve the license terms? [yes|no]
>>>
Please answer 'yes' or 'no':
>>> yes
Anaconda2 will now be installed into this location:
/root/anaconda2
- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below
[/root/anaconda2] >>>
PREFIX=/root/anaconda2
installing: python-2.7.12-1 ...
installing: _license-1.1-py27_1 ...
installing: _nb_ext_conf-0.3.0-py27_0 ...
installing: alabaster-0.7.9-py27_0 ...
installing: anaconda-clean-1.0.0-py27_0 ...
installing: anaconda-client-1.5.1-py27_0 ...
installing: anaconda-navigator-1.3.1-py27_0 ...
installing: argcomplete-1.0.0-py27_1 ...
installing: astroid-1.4.7-py27_0 ...
installing: astropy-1.2.1-np111py27_0 ...
installing: babel-2.3.4-py27_0 ...
installing: backports-1.0-py27_0 ...
installing: backports_abc-0.4-py27_0 ...
installing: beautifulsoup4-4.5.1-py27_0 ...
installing: bitarray-0.8.1-py27_0 ...
installing: blaze-0.10.1-py27_0 ...
installing: bokeh-0.12.2-py27_0 ...
installing: boto-2.42.0-py27_0 ...
installing: bottleneck-1.1.0-np111py27_0 ...
installing: cairo-1.12.18-6 ...
installing: cdecimal-2.3-py27_2 ...
installing: cffi-1.7.0-py27_0 ...
installing: chest-0.2.3-py27_0 ...
installing: click-6.6-py27_0 ...
installing: cloudpickle-0.2.1-py27_0 ...
installing: clyent-1.2.2-py27_0 ...
installing: colorama-0.3.7-py27_0 ...
installing: configobj-5.0.6-py27_0 ...
installing: configparser-3.5.0-py27_0 ...
installing: contextlib2-0.5.3-py27_0 ...
installing: cryptography-1.5-py27_0 ...
installing: curl-7.49.0-1 ...
installing: cycler-0.10.0-py27_0 ...
installing: cython-0.24.1-py27_0 ...
installing: cytoolz-0.8.0-py27_0 ...
installing: dask-0.11.0-py27_0 ...
installing: datashape-0.5.2-py27_0 ...
installing: dbus-1.10.10-0 ...
installing: decorator-4.0.10-py27_0 ...
installing: dill-0.2.5-py27_0 ...
installing: docutils-0.12-py27_2 ...
installing: dynd-python-0.7.2-py27_0 ...
installing: entrypoints-0.2.2-py27_0 ...
installing: enum34-1.1.6-py27_0 ...
installing: et_xmlfile-1.0.1-py27_0 ...
installing: expat-2.1.0-0 ...
installing: fastcache-1.0.2-py27_1 ...
installing: filelock-2.0.6-py27_0 ...
installing: flask-0.11.1-py27_0 ...
installing: flask-cors-2.1.2-py27_0 ...
installing: fontconfig-2.11.1-6 ...
installing: freetype-2.5.5-1 ...
installing: funcsigs-1.0.2-py27_0 ...
installing: functools32-3.2.3.2-py27_0 ...
installing: futures-3.0.5-py27_0 ...
installing: get_terminal_size-1.0.0-py27_0 ...
installing: gevent-1.1.2-py27_0 ...
installing: glib-2.43.0-1 ...
installing: greenlet-0.4.10-py27_0 ...
installing: grin-1.2.1-py27_3 ...
installing: gst-plugins-base-1.8.0-0 ...
installing: gstreamer-1.8.0-0 ...
installing: h5py-2.6.0-np111py27_2 ...
installing: harfbuzz-0.9.39-1 ...
installing: hdf5-1.8.17-1 ...
installing: heapdict-1.0.0-py27_1 ...
installing: icu-54.1-0 ...
installing: idna-2.1-py27_0 ...
installing: imagesize-0.7.1-py27_0 ...
installing: ipaddress-1.0.16-py27_0 ...
installing: ipykernel-4.5.0-py27_0 ...
installing: ipython-5.1.0-py27_0 ...
installing: ipython_genutils-0.1.0-py27_0 ...
installing: ipywidgets-5.2.2-py27_0 ...
installing: itsdangerous-0.24-py27_0 ...
installing: jbig-2.1-0 ...
installing: jdcal-1.2-py27_1 ...
installing: jedi-0.9.0-py27_1 ...
installing: jinja2-2.8-py27_1 ...
installing: jpeg-8d-2 ...
installing: jsonschema-2.5.1-py27_0 ...
installing: jupyter-1.0.0-py27_3 ...
installing: jupyter_client-4.4.0-py27_0 ...
installing: jupyter_console-5.0.0-py27_0 ...
installing: jupyter_core-4.2.0-py27_0 ...
installing: lazy-object-proxy-1.2.1-py27_0 ...
installing: libdynd-0.7.2-0 ...
installing: libffi-3.2.1-0 ...
installing: libgcc-4.8.5-2 ...
installing: libgfortran-3.0.0-1 ...
installing: libpng-1.6.22-0 ...
installing: libsodium-1.0.10-0 ...
installing: libtiff-4.0.6-2 ...
installing: libxcb-1.12-0 ...
installing: libxml2-2.9.2-0 ...
installing: libxslt-1.1.28-0 ...
installing: llvmlite-0.13.0-py27_0 ...
installing: locket-0.2.0-py27_1 ...
installing: lxml-3.6.4-py27_0 ...
installing: markupsafe-0.23-py27_2 ...
installing: matplotlib-1.5.3-np111py27_0 ...
installing: mistune-0.7.3-py27_0 ...
installing: mkl-11.3.3-0 ...
installing: mkl-service-1.1.2-py27_2 ...
installing: mpmath-0.19-py27_1 ...
installing: multipledispatch-0.4.8-py27_0 ...
installing: nb_anacondacloud-1.2.0-py27_0 ...
installing: nb_conda-2.0.0-py27_0 ...
installing: nb_conda_kernels-2.0.0-py27_0 ...
installing: nbconvert-4.2.0-py27_0 ...
installing: nbformat-4.1.0-py27_0 ...
installing: nbpresent-3.0.2-py27_0 ...
installing: networkx-1.11-py27_0 ...
installing: nltk-3.2.1-py27_0 ...
installing: nose-1.3.7-py27_1 ...
installing: notebook-4.2.3-py27_0 ...
installing: numba-0.28.1-np111py27_0 ...
installing: numexpr-2.6.1-np111py27_0 ...
installing: numpy-1.11.1-py27_0 ...
installing: odo-0.5.0-py27_1 ...
installing: openpyxl-2.3.2-py27_0 ...
installing: openssl-1.0.2j-0 ...
installing: pandas-0.18.1-np111py27_0 ...
installing: partd-0.3.6-py27_0 ...
installing: patchelf-0.9-0 ...
installing: path.py-8.2.1-py27_0 ...
installing: pathlib2-2.1.0-py27_0 ...
installing: patsy-0.4.1-py27_0 ...
installing: pep8-1.7.0-py27_0 ...
installing: pexpect-4.0.1-py27_0 ...
installing: pickleshare-0.7.4-py27_0 ...
installing: pillow-3.3.1-py27_0 ...
installing: pip-8.1.2-py27_0 ...
installing: pixman-0.32.6-0 ...
installing: pkginfo-1.3.2-py27_0 ...
installing: ply-3.9-py27_0 ...
installing: prompt_toolkit-1.0.3-py27_0 ...
installing: psutil-4.3.1-py27_0 ...
installing: ptyprocess-0.5.1-py27_0 ...
installing: py-1.4.31-py27_0 ...
installing: pyasn1-0.1.9-py27_0 ...
installing: pycairo-1.10.0-py27_0 ...
installing: pycosat-0.6.1-py27_1 ...
installing: pycparser-2.14-py27_1 ...
installing: pycrypto-2.6.1-py27_4 ...
installing: pycurl-7.43.0-py27_0 ...
installing: pyflakes-1.3.0-py27_0 ...
installing: pygments-2.1.3-py27_0 ...
installing: pylint-1.5.4-py27_1 ...
installing: pyopenssl-16.0.0-py27_0 ...
installing: pyparsing-2.1.4-py27_0 ...
installing: pyqt-5.6.0-py27_0 ...
installing: pytables-3.2.3.1-np111py27_0 ...
installing: pytest-2.9.2-py27_0 ...
installing: python-dateutil-2.5.3-py27_0 ...
installing: pytz-2016.6.1-py27_0 ...
installing: pyyaml-3.12-py27_0 ...
installing: pyzmq-15.4.0-py27_0 ...
installing: qt-5.6.0-0 ...
installing: qtawesome-0.3.3-py27_0 ...
installing: qtconsole-4.2.1-py27_1 ...
installing: qtpy-1.1.2-py27_0 ...
installing: readline-6.2-2 ...
installing: redis-3.2.0-0 ...
installing: redis-py-2.10.5-py27_0 ...
installing: requests-2.11.1-py27_0 ...
installing: rope-0.9.4-py27_1 ...
installing: scikit-image-0.12.3-np111py27_1 ...
installing: scikit-learn-0.17.1-np111py27_2 ...
installing: scipy-0.18.1-np111py27_0 ...
installing: setuptools-27.2.0-py27_0 ...
installing: simplegeneric-0.8.1-py27_1 ...
installing: singledispatch-3.4.0.3-py27_0 ...
installing: sip-4.18-py27_0 ...
installing: six-1.10.0-py27_0 ...
installing: snowballstemmer-1.2.1-py27_0 ...
installing: sockjs-tornado-1.0.3-py27_0 ...
installing: sphinx-1.4.6-py27_0 ...
installing: spyder-3.0.0-py27_0 ...
installing: sqlalchemy-1.0.13-py27_0 ...
installing: sqlite-3.13.0-0 ...
installing: ssl_match_hostname-3.4.0.2-py27_1 ...
installing: statsmodels-0.6.1-np111py27_1 ...
installing: sympy-1.0-py27_0 ...
installing: terminado-0.6-py27_0 ...
installing: tk-8.5.18-0 ...
installing: toolz-0.8.0-py27_0 ...
installing: tornado-4.4.1-py27_0 ...
installing: traitlets-4.3.0-py27_0 ...
installing: unicodecsv-0.14.1-py27_0 ...
installing: wcwidth-0.1.7-py27_0 ...
installing: werkzeug-0.11.11-py27_0 ...
installing: wheel-0.29.0-py27_0 ...
installing: widgetsnbextension-1.2.6-py27_0 ...
installing: wrapt-1.10.6-py27_0 ...
installing: xlrd-1.0.0-py27_0 ...
installing: xlsxwriter-0.9.3-py27_0 ...
installing: xlwt-1.1.2-py27_0 ...
installing: xz-5.2.2-0 ...
installing: yaml-0.1.6-0 ...
installing: zeromq-4.1.4-0 ...
installing: zlib-1.2.8-3 ...
installing: anaconda-4.2.0-np111py27_0 ...
installing: ruamel_yaml-0.11.14-py27_0 ...
installing: conda-4.2.9-py27_0 ...
installing: conda-build-2.0.2-py27_0 ...
Python 2.7.12 :: Continuum Analytics, Inc.
creating default environment...
installation finished.
Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /root/.bashrc ? [yes|no]
[no] >>> yes
Prepending PATH=/root/anaconda2/bin to PATH in /root/.bashrc
A backup will be made to: /root/.bashrc-anaconda2.bak
For this change to become active, you have to open a new terminal.
Thank you for installing Anaconda2!
Share your notebooks and packages on Anaconda Cloud!
Sign up for free: https://anaconda.org
[root@hadoop161 tool]#
3. Launch the Spark client shell
Edit the ~/.bashrc file and add the following lines:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Then run source ~/.bashrc; starting pyspark will now launch an IPython Notebook.
4. Run the code
(1) Start Hadoop's HDFS.
(2) Start Spark: change into the Spark installation directory
cd /hadoop/spark-2.0.1-bin-hadoop2.7
and launch the shell:
bin/pyspark
You should see the notebook start up (screenshot omitted).
(3) Open http://localhost:8889/tree# in a browser,
choose New ----> Python [default],
and a notebook command prompt appears.
Now we can start running the examples.
5. Run the code from the book
(1) Fetch the first line of the user data
From a local file:
PATH = "file:///spark/sparkdata"
user_data = sc.textFile("%s/ml-100k/u.user" % PATH)
#user_data = sc.textFile("file:///spark/sparkdata/ml-100k/u.user")
user_data.first()
u'1|24|M|technician|85711'
From a file on HDFS:
user_data = sc.textFile("/spark/sparkdata/ml-100k/u.user")
user_data.first()
(2) Count the users, genders, occupations, and ZIP codes
user_fields = user_data.map(lambda line: line.split("|"))
num_users = user_fields.map(lambda fields: fields[0]).count()
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
print "Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes)
Users: 943, genders: 2, occupations: 21, ZIP codes: 795
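The same pipeline can be sketched locally in plain Python, which shows what the map/distinct/count chain computes. The sample rows below are hypothetical, in the u.user format (id|age|gender|occupation|zip):

```python
# Hypothetical sample rows in the u.user format: id|age|gender|occupation|zip
sample = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
]

fields = [line.split("|") for line in sample]     # like user_data.map(split)
num_users = len(fields)                           # like .count()
num_genders = len({f[2] for f in fields})         # like .distinct().count()
num_occupations = len({f[3] for f in fields})
num_zipcodes = len({f[4] for f in fields})
print(num_users, num_genders, num_occupations, num_zipcodes)  # 3 2 3 3
```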
(3) Analyze user ages with a histogram
%pylab inline
ages = user_fields.map(lambda x: int(x[1])).collect()
hist(ages, bins=20, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
(4) Distribution of user occupations
count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
x_axis1 = np.array([c[0] for c in count_by_occupation])
y_axis1 = np.array([c[1] for c in count_by_occupation])
x_axis = x_axis1[np.argsort(y_axis1)]
y_axis = y_axis1[np.argsort(y_axis1)]
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
(5) countByValue counts how many times each distinct value occurs in an RDD
count_by_occupation2 = user_fields.map(lambda fields: fields[3]).countByValue()
print "Map-reduce approach:"
print dict(count_by_occupation)
print ""
print "countByValue approach:"
print dict(count_by_occupation2)
Output:
Map-reduce approach:
{u'administrator': 79, u'retired': 14, u'lawyer': 12, u'healthcare': 16, u'marketing': 26, u'executive': 32, u'scientist': 31, u'student': 196, u'technician': 27, u'librarian': 51, u'programmer': 66, u'salesman': 12, u'homemaker': 7, u'engineer': 67, u'none': 9, u'doctor': 7, u'writer': 45, u'entertainment': 18, u'other': 105, u'educator': 95, u'artist': 28}
countByValue approach:
{u'administrator': 79, u'writer': 45, u'retired': 14, u'lawyer': 12, u'doctor': 7, u'marketing': 26, u'executive': 32, u'none': 9, u'entertainment': 18, u'healthcare': 16, u'scientist': 31, u'student': 196, u'educator': 95, u'technician': 27, u'librarian': 51, u'programmer': 66, u'artist': 28, u'salesman': 12, u'other': 105, u'homemaker': 7, u'engineer': 67}
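Locally, countByValue corresponds to collections.Counter, and the map-reduce approach to summing (value, 1) pairs per key; both give the same dictionary. The occupation list below is a hypothetical stand-in:

```python
from collections import Counter

# Hypothetical occupation column, as fields[3] would yield
occupations = ["technician", "writer", "writer", "other", "technician", "writer"]

# countByValue() on an RDD of values corresponds to Counter on a local list
count_by_value = Counter(occupations)

# the map-reduce approach: emit (value, 1) pairs and sum per key
counts = {}
for occ in occupations:
    counts[occ] = counts.get(occ, 0) + 1

print(dict(count_by_value))  # {'technician': 2, 'writer': 3, 'other': 1}
```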
(6) Fetch a line of the movie data
movie_data = sc.textFile("%s/ml-100k/u.item" % PATH)
print movie_data.first()
num_movies = movie_data.count()
print "Movies: %d" % num_movies
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
Movies: 1682
(7) Distribution of movie ages
Handle malformed years first:
def convert_year(x):
    try:
        return int(x[-4:])
    except:
        return 1900 # there is a 'bad' data point with a blank year, which we set to 1900 and will filter out later
movie_fields = movie_data.map(lambda lines: lines.split("|"))
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
# we filter out any 'bad' data points here
years_filtered = years.filter(lambda x: x != 1900)
# plot the movie ages histogram
movie_ages = years_filtered.map(lambda yr: 1998-yr).countByValue()
values = movie_ages.values()
bins = movie_ages.keys()
hist(values, bins=bins, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)
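The convert_year helper and the age computation can be checked locally. The dates below follow the u.item format (DD-MMM-YYYY), with one blank value standing in for the known bad record:

```python
def convert_year(x):
    # same helper as above: the last 4 characters hold the year,
    # and a blank/bad field falls back to the sentinel 1900
    try:
        return int(x[-4:])
    except:
        return 1900

dates = ["01-Jan-1995", "01-Jan-1994", ""]        # last entry is malformed
years = [convert_year(d) for d in dates]
years_filtered = [y for y in years if y != 1900]  # drop the sentinel
movie_ages = [1998 - y for y in years_filtered]
print(movie_ages)  # [3, 4]
```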
(8) Analyze the rating data
rating_data_raw = sc.textFile("%s/ml-100k/u.data" % PATH)
print rating_data_raw.first()
num_ratings = rating_data_raw.count()
print "Ratings: %d" % num_ratings
There are 100,000 ratings in total:
196 242 3 881250949
Ratings: 100000
(9) Basic statistics: maximum, minimum, mean, and median ratings
rating_data = rating_data_raw.map(lambda line: line.split("\t"))
ratings = rating_data.map(lambda fields: int(fields[2]))
max_rating = ratings.reduce(lambda x, y: max(x, y))
min_rating = ratings.reduce(lambda x, y: min(x, y))
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
median_rating = np.median(ratings.collect())
ratings_per_user = num_ratings / num_users
ratings_per_movie = num_ratings / num_movies
print "Min rating: %d" % min_rating
print "Max rating: %d" % max_rating
print "Average rating: %2.2f" % mean_rating
print "Median rating: %d" % median_rating
print "Average # of ratings per user: %2.2f" % ratings_per_user
print "Average # of ratings per movie: %2.2f" % ratings_per_movie
Output:
Min rating: 1
Max rating: 5
Average rating: 3.53
Median rating: 4
Average # of ratings per user: 106.00
Average # of ratings per movie: 59.00
Spark's stats function provides similar information:
# we can also use the stats function to get some similar information to the above
ratings.stats()
(count: 100000, mean: 3.52986, stdev: 1.12566797076, max: 5.0, min: 1.0)
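ratings.stats() reports count, mean, population standard deviation, max, and min. The same quantities can be computed locally with the statistics module; the sample below is a tiny hypothetical list of rating values:

```python
import statistics

ratings = [3, 5, 1, 4, 4, 3]  # hypothetical sample of rating values

stats = {
    "count": len(ratings),
    "mean": statistics.mean(ratings),
    "stdev": statistics.pstdev(ratings),  # Spark's stats() reports the population stdev
    "max": max(ratings),
    "min": min(ratings),
}
print(stats)
```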
(10) Histogram of the rating distribution
# create plot of counts by rating value
count_by_rating = ratings.countByValue()
x_axis = np.array(count_by_rating.keys())
y_axis = np.array([float(c) for c in count_by_rating.values()])
# we normalize the y-axis here to percentages
y_axis_normed = y_axis / y_axis.sum()
pos = np.arange(len(x_axis))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
plt.bar(pos, y_axis_normed, width, color='lightblue')
plt.xticks(rotation=30)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)
The resulting bar chart is rendered inline in the notebook (figure omitted).
(11) Compute the number of ratings per user
# to compute the distribution of ratings per user, we first group the ratings by user id
user_ratings_grouped = rating_data.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
# then, for each key (user id), we find the size of the set of ratings, which gives us the # ratings for that user
user_ratings_byuser = user_ratings_grouped.map(lambda (k, v): (k, len(v)))
user_ratings_byuser.take(5)
[(2, 62), (4, 24), (6, 211), (8, 59), (10, 184)]
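The groupByKey-then-len step amounts to counting ratings per user id; a local sketch with a defaultdict, over hypothetical (user_id, rating) pairs:

```python
from collections import defaultdict

# Hypothetical (user_id, rating) pairs, as rating_data would yield
pairs = [(1, 5), (2, 3), (1, 4), (3, 2), (1, 3), (2, 5)]

grouped = defaultdict(list)  # like groupByKey(): collect values per key
for user, rating in pairs:
    grouped[user].append(rating)

# like .map(lambda (k, v): (k, len(v)))
ratings_per_user = {user: len(vals) for user, vals in grouped.items()}
print(ratings_per_user)  # {1: 3, 2: 2, 3: 1}
```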
(12) Histogram of ratings per user
# and finally plot the histogram
user_ratings_byuser_local = user_ratings_byuser.map(lambda (k, v): v).collect()
hist(user_ratings_byuser_local, bins=200, color='lightblue', normed=True)
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16,10)
3.3 Processing and transforming data
1. Common approaches:
(1) Filter out or remove records with malformed or missing values: this is often necessary, but it discards the good information those records contain.
(2) Fill in malformed or missing data.
(3) Use techniques that are robust to outliers (e.g. robust regression).
(4) Transform the potential outliers.
For example, impute the bad release years (set to the sentinel 1900 earlier) with the median of the valid years:
# rebuild the years array from the earlier convert_year step
years_pre_processed = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x)).collect()
years_pre_processed_array = np.array(years_pre_processed)
mean_year = np.mean(years_pre_processed_array[years_pre_processed_array!=1900])
median_year = np.median(years_pre_processed_array[years_pre_processed_array!=1900])
idx_bad_data = np.where(years_pre_processed_array==1900)[0]
years_pre_processed_array[idx_bad_data] = median_year
print "Mean year of release: %d" % mean_year
print "Median year of release: %d" % median_year
print "Index of '1900' after assigning median: %s" % np.where(years_pre_processed_array == 1900)[0]
Mean year of release: 1989
Median year of release: 1995
Index of '1900' after assigning median: []
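The median-imputation step above can be sketched with NumPy on a small hypothetical array of years:

```python
import numpy as np

# Hypothetical pre-processed years; 1900 marks the bad record
years = np.array([1995, 1994, 1900, 1996])

good = years[years != 1900]
median_year = np.median(good)       # median of the valid years
idx_bad = np.where(years == 1900)[0]
years[idx_bad] = median_year        # impute the sentinel with the median
print(years.tolist())  # [1995, 1994, 1995, 1996]
```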
3.4 Extracting features from the data
1. Features: the variables used to train a model. Almost all machine learning models work with numerical features represented as vectors, so the raw data must be converted into numbers.
2. Feature types:
(1) Numerical features: real numbers or integers.
(2) Categorical features: e.g. a user's gender or occupation; values drawn from a fixed set.
(3) Text features: derived from the textual content in the data, such as movie titles, descriptions, or reviews.
(4) Other features: e.g. images, video, audio, geographic coordinates.
As an example, 1-of-k encode the occupation categorical feature:
all_occupations = user_fields.map(lambda fields: fields[3]).distinct().collect()
all_occupations.sort()
# create a new dictionary to hold the occupations, and assign the "1-of-k" indexes
idx = 0
all_occupations_dict = {}
for o in all_occupations:
    all_occupations_dict[o] = idx
    idx += 1
# try a few examples to see what "1-of-k" encoding is assigned
print "Encoding of 'doctor': %d" % all_occupations_dict['doctor']
print "Encoding of 'programmer': %d" % all_occupations_dict['programmer']
Output:
Encoding of 'doctor': 2
Encoding of 'programmer': 14
# create a vector representation for "programmer" and encode it into a binary vector
K = len(all_occupations_dict)
binary_x = np.zeros(K)
k_programmer = all_occupations_dict['programmer']
binary_x[k_programmer] = 1
print "Binary feature vector: %s" % binary_x
print "Length of binary vector: %d" % K
Output:
Binary feature vector: [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0.]
Length of binary vector: 21
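The two steps above (index assignment, then a zero vector with a single 1) can be wrapped into one small helper. The occupation list below is a hypothetical subset, not the full 21-category list:

```python
def one_of_k(value, categories):
    # map each sorted category to an index, then set a single 1 in a zero vector
    index = {c: i for i, c in enumerate(sorted(categories))}
    vec = [0.0] * len(categories)
    vec[index[value]] = 1.0
    return vec

occupations = ["doctor", "programmer", "artist"]  # hypothetical subset
print(one_of_k("programmer", occupations))  # [0.0, 0.0, 1.0]
```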
Converting a timestamp into a categorical feature:
# a function to extract the timestamps (in seconds) from the dataset
def extract_datetime(ts):
    import datetime
    return datetime.datetime.fromtimestamp(ts)
timestamps = rating_data.map(lambda fields: int(fields[3]))
hour_of_day = timestamps.map(lambda ts: extract_datetime(ts).hour)
hour_of_day.take(5)
[7, 11, 23, 21, 21]
The code below is modified: the 'night' range in the book's original source is wrong.
# a function for assigning "time-of-day" bucket given an hour of the day
def assign_tod(hr):
    # buckets are non-overlapping and cover hours 0-23; hour 7 is assigned
    # to 'night', matching the sample output below
    times_of_day = {
        'morning' : range(8, 12),
        'lunch' : range(12, 14),
        'afternoon' : range(14, 18),
        'evening' : range(18, 23),
        'night' : [23, 0, 1, 2, 3, 4, 5, 6, 7]
    }
    for k, v in times_of_day.iteritems():
        if hr in v:
            return k
# now apply the "time of day" function to the "hour of day" RDD
time_of_day = hour_of_day.map(lambda hr: assign_tod(hr))
time_of_day.take(5)
['night', 'morning', 'night', 'evening', 'evening']
Text feature processing: the bag-of-words approach
(1) Tokenization
(2) Stop-word removal
(3) Stemming
(4) Vectorization
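A minimal stdlib-only sketch of those four steps. The stop-word list and the crude suffix-stripping "stemmer" are illustrative assumptions, not what the book or any NLP library actually uses:

```python
STOP_WORDS = {"the", "a", "of"}  # assumed tiny stop-word list

def stem(term):
    # crude illustrative stemmer: strip a plural 's'
    return term[:-1] if term.endswith("s") else term

def bag_of_words(text, vocab):
    terms = text.lower().split(" ")                    # (1) tokenization
    terms = [t for t in terms if t not in STOP_WORDS]  # (2) stop-word removal
    terms = [stem(t) for t in terms]                   # (3) stemming
    vec = [0] * len(vocab)                             # (4) vectorization
    for t in terms:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

vocab = {"toy": 0, "story": 1, "room": 2}
print(bag_of_words("The Toy Story of Rooms", vocab))  # [1, 1, 1]
```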
Test extract_title on the first 5 titles:
# we define a function to extract just the title from the raw movie title, removing the year of release
def extract_title(raw):
    import re
    grps = re.search("\((\w+)\)", raw) # this regular expression finds the non-word (numbers) between parentheses
    if grps:
        return raw[:grps.start()].strip() # we strip the trailing whitespace from the title
    else:
        return raw
# first lets extract the raw movie titles from the movie fields
raw_titles = movie_fields.map(lambda fields: fields[1])
# next, we strip away the "year of release" to leave us with just the title text
# let's test our title extraction function on the first 5 titles
for raw_title in raw_titles.take(5):
    print extract_title(raw_title)
Toy Story
GoldenEye
Four Rooms
Get Shorty
Copycat
Whitespace tokenization splits each title into terms:
# ok that looks good! let's apply it to all the titles
movie_titles = raw_titles.map(lambda m: extract_title(m))
# next we tokenize the titles into terms. We'll use simple whitespace tokenization
title_terms = movie_titles.map(lambda t: t.split(" "))
print title_terms.take(5)
[[u'Toy', u'Story'], [u'GoldenEye'], [u'Four', u'Rooms'], [u'Get', u'Shorty'], [u'Copycat']]
Build a term dictionary and look up the indexes of a few terms:
# next we would like to collect all the possible terms, in order to build out dictionary of term <-> index mappings
all_terms = title_terms.flatMap(lambda x: x).distinct().collect()
# create a new dictionary to hold the terms, and assign the "1-of-k" indexes
idx = 0
all_terms_dict = {}
for term in all_terms:
    all_terms_dict[term] = idx
    idx += 1
num_terms = len(all_terms_dict)
print "Total number of terms: %d" % num_terms
print "Index of term 'Dead': %d" % all_terms_dict['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict['Rooms']
Output:
Total number of terms: 2645
Index of term 'Dead': 147
Index of term 'Rooms': 1963
Spark's zipWithIndex achieves the same result:
# we could also use Spark's 'zipWithIndex' RDD function to create the term dictionary
all_terms_dict2 = title_terms.flatMap(lambda x: x).distinct().zipWithIndex().collectAsMap()
print "Index of term 'Dead': %d" % all_terms_dict2['Dead']
print "Index of term 'Rooms': %d" % all_terms_dict2['Rooms']
Output:
Index of term 'Dead': 147
Index of term 'Rooms': 1963
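On a local list, zipWithIndex followed by collectAsMap corresponds to building a dict with enumerate. The term list below is a hypothetical stand-in for the distinct title terms:

```python
terms = ["Toy", "Story", "GoldenEye", "Rooms"]  # hypothetical distinct terms

# zipWithIndex().collectAsMap() on an RDD corresponds to this locally:
term_dict = {term: idx for idx, term in enumerate(terms)}
print(term_dict["Rooms"])  # 3
```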
# this function takes a list of terms and encodes it as a scipy sparse vector using an approach
# similar to the 1-of-k encoding
def create_vector(terms, term_dict):
    from scipy import sparse as sp
    x = sp.csc_matrix((1, num_terms))
    for t in terms:
        if t in term_dict:
            idx = term_dict[t]
            x[0, idx] = 1
    return x
all_terms_bcast = sc.broadcast(all_terms_dict)
term_vectors = title_terms.map(lambda terms: create_vector(terms, all_terms_bcast.value))
term_vectors.take(5)
[<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Column format>,
<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>,
<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Column format>,
<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Column format>,
<1x2645 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>]
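The same sparse 1-of-k idea can be sketched without scipy by storing only the non-zero entries in a plain dict mapping column index to value (illustrative only; the code above uses scipy's csc_matrix):

```python
def create_sparse_vector(terms, term_dict):
    # only the non-zero entries are stored, as {column_index: value}
    vec = {}
    for t in terms:
        if t in term_dict:
            vec[term_dict[t]] = 1.0
    return vec

term_dict = {"Toy": 0, "Story": 1, "Rooms": 2}  # hypothetical term dictionary
print(create_sparse_vector(["Toy", "Story"], term_dict))  # {0: 1.0, 1: 1.0}
```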
Normalizing features: this transforms a single feature in the dataset, for example subtracting the mean (centering) or scaling to a unit norm. Here a random vector is scaled to unit length:
np.random.seed(42)
x = np.random.randn(10)
norm_x_2 = np.linalg.norm(x)
normalized_x = x / norm_x_2
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x:\n%s" % normalized_x
print "2-Norm of normalized_x: %2.4f" % np.linalg.norm(normalized_x)
Output:
x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696
1.57921282 0.76743473 -0.46947439 0.54256004]
2-Norm of x: 2.5908
Normalized x:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237
0.60954584 0.29621508 -0.1812081 0.20941776]
2-Norm of normalized_x: 1.0000
Normalizing feature vectors: transform all features of a given row so that the resulting feature vector has a standardized (unit) length. MLlib provides a Normalizer for this:
from pyspark.mllib.feature import Normalizer
normalizer = Normalizer()
vector = sc.parallelize([x])
normalized_x_mllib = normalizer.transform(vector).first().toArray()
print "x:\n%s" % x
print "2-Norm of x: %2.4f" % norm_x_2
print "Normalized x MLlib:\n%s" % normalized_x_mllib
print "2-Norm of normalized_x_mllib: %2.4f" % np.linalg.norm(normalized_x_mllib)
Output:
x:
[ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337 -0.23413696
1.57921282 0.76743473 -0.46947439 0.54256004]
2-Norm of x: 2.5908
Normalized x MLlib:
[ 0.19172213 -0.05336737 0.24999534 0.58786029 -0.09037871 -0.09037237
0.60954584 0.29621508 -0.1812081 0.20941776]
2-Norm of normalized_x_mllib: 1.0000
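Normalizer defaults to the L2 norm, so the transform is simply x / ||x||2. A stdlib-only check that the result always has unit norm (the random vector here is hypothetical, not the NumPy-seeded x above):

```python
import math
import random

random.seed(42)
x = [random.gauss(0, 1) for _ in range(10)]  # hypothetical random vector

norm = math.sqrt(sum(v * v for v in x))      # 2-norm of x
normalized = [v / norm for v in x]           # what Normalizer().transform computes

unit = math.sqrt(sum(v * v for v in normalized))
print(round(unit, 4))  # 1.0
```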