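Original article: Step by step approach to perform data analysis using Python
You have decided to learn Python, but you have no prior programming experience. So you are often unsure where to start, because there is so much Python to learn. These are common questions among beginners who are starting out with Python for data analysis:
- How long does it take to learn Python?
- How much Python do I need to learn before I can do data analysis?
- What are the best books or courses for learning Python?
- Do I have to become an expert Python programmer in order to work with data sets?
This kind of confusion is understandable when you start learning a new skill, as the author of "Learn Anything in 20 Hours" points out. Don't be afraid: I will show you how to get started quickly, without having to become a Python programming "ninja".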
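Don't make the mistake I made
Before I started using Python, I had a misconception about data analysis with Python: I believed I had to be thoroughly proficient in Python programming. So I took Udacity's introductory Python programming course, completed the Python tutorial on Codecademy, and read several Python programming books. This went on for 3 months (about 3 hours a day on average), during which I learned Python by completing small software projects. Writing code is fun, but my goal was not to become a Python developer; it was to do data analysis with Python. Later I realized that I had spent a lot of time learning software development in Python rather than data analysis.
After a few hours of careful thought, I found that I needed to learn 5 Python libraries to solve a range of data analysis problems effectively. I then started learning those libraries one by one.
In my view, you do not need to master building good software in Python in order to do data analysis efficiently.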
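Ignore resources aimed at a general audience
There are many excellent Python books and online courses, but I do not recommend some of them, because they are aimed at a general audience rather than at people doing data analysis. There are also many books on "scientific programming with Python", but they are oriented toward various mathematical topics rather than toward data analysis and statistics. Don't waste your time reading Python books written for a general audience.
Before going any further, first set up your programming environment, and then learn how to use the IPython notebook.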
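Learning path
Start with Codecademy and complete all of the exercises there. At 3 hours a day, you should finish them within 20 days. Codecademy covers the basic concepts of Python. It is not project-oriented the way Udacity is, but that does not matter, because your goal is to do data science, not to develop software in Python.
Once you have completed the Codecademy exercises, take a look at this IPython notebook: the Python essentials tutorial (I have provided a download link in the summary section).
It covers some concepts that are not mentioned on Codecademy. You can work through this tutorial in 1 to 2 hours.
You now know enough of the basics to start learning the Python libraries.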
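Numpy
Start with Numpy, because it is the fundamental package for scientific computing with Python. A good grasp of Numpy will help you use other tools, such as Pandas, effectively.
I have prepared an IPython notebook that covers some basic Numpy concepts. The tutorial covers the most frequently used Numpy operations, such as N-dimensional arrays, indexing, array slicing, integer indexing, array conversion, universal functions, data processing using arrays, common statistical methods, and so on.
Index Numpy: a recommended place to look up the usage of unfamiliar Numpy functions.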
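As a rough illustration of the operations listed above, here is a small sketch. It is not taken from the author's notebook, and the array values and variable names are made up for the example.

```python
import numpy as np

# A small 2-D array (3 rows, 4 columns); the values are arbitrary.
a = np.arange(12).reshape(3, 4)

# Basic indexing and slicing
first_row = a[0]          # row 0
sub_block = a[1:, :2]     # rows 1 and 2, columns 0 and 1

# Integer (fancy) indexing: pick rows 2 and 0, in that order
picked = a[[2, 0]]

# Transposing swaps the two axes
a_t = a.T

# Universal functions operate element-wise
roots = np.sqrt(a)

# Common statistics, over the whole array and along one axis
total_mean = a.mean()
col_sums = a.sum(axis=0)

print(sub_block)
print(picked)
print(col_sums)
```

Slices such as `a[1:, :2]` are views onto the original array, while integer indexing returns a copy, which is worth keeping in mind when modifying the results.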
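Pandas
Pandas provides high-level data structures and manipulation tools that make data analysis in Python faster and easier.
The tutorial covers Series, DataFrames, dropping entries from an axis, handling missing data, and so on.
Index Pandas: a recommended place to look up the usage of unfamiliar functions.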
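The topics above can be sketched in a few lines. This is a minimal example, not the author's tutorial; the column names, index labels and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labelled array.
s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

# A DataFrame is a table of labelled columns.
df = pd.DataFrame(
    {"x": [1.0, 2.0, np.nan], "y": [10.0, np.nan, 30.0], "z": [7, 8, 9]},
    index=["r0", "r1", "r2"],
)

# Drop entries from an axis: a row (axis=0) or a column (axis=1).
no_r1 = df.drop("r1", axis=0)
no_z = df.drop("z", axis=1)

# Missing data: count it, drop incomplete rows, or fill the gaps.
n_missing = int(df.isna().sum().sum())
complete_rows = df.dropna()
filled = df.fillna(0.0)

print(no_r1)
print(complete_rows)
print(filled)
```

Note that `drop`, `dropna` and `fillna` return new objects by default and leave `df` unchanged.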
Matplotlib
This is a four-part Matplotlib tutorial.
Part 1:
The first part introduces the basic functionality of Matplotlib and the basic figure types.
Simple Plotting example
%matplotlib inline
import matplotlib.pyplot as plt  # import the Matplotlib plotting interface
import numpy as np
x = range(100)
# print(x)  # print and check what x is
y = [val**2 for val in x]
# print(y)
plt.plot(x, y)  # plot y against x

fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in axes:
    ax.plot(x, y, 'r')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title('title')
fig.tight_layout()
Using Numpy
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
plt.plot(x, y)

x = np.linspace(-3, 2, 200)
Y = x ** 2 - 2 * x + 1.
plt.plot(x, Y)

# plotting multiple plots
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
z = np.cos(x)
plt.plot(x, y)
plt.plot(x, z)
plt.show()
# Matplotlib picks a different color for each plot.
cd C:\Users\tk\Desktop\Matplot
data = np.loadtxt('numpy.txt')
plt.plot(data[:, 0], data[:, 1])  # plot the second column against the first
# The text in numpy.txt should look like this:
# 0 0
# 1 1
# 2 4
# 4 16
# 5 25
# 6 36

data1 = np.loadtxt('scipy.txt')  # load the file
print(data1.T)
for val in data1.T:  # loop over each row of data1.T, that is, each column of data1
    plt.plot(data1[:, 0], val)  # data1[:, 0] is the first row of data1.T
# The data in scipy.txt looks like this:
# 0 0 6
# 1 1 5
# 2 4 4
# 4 16 3
# 5 25 2
# 6 36 1
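To see what np.loadtxt returns without creating a file on disk, the sketch below feeds it the same rows through an in-memory buffer. Using io.StringIO as a stand-in for numpy.txt is an assumption made for this example, not something the original notebook does.

```python
import io
import numpy as np

# Stand-in for a whitespace-separated text file such as numpy.txt.
text = """0 0
1 1
2 4
4 16
5 25
6 36
"""

data = np.loadtxt(io.StringIO(text))

print(data.shape)     # (rows, columns)
x_col = data[:, 0]    # first column
y_col = data[:, 1]    # second column

# Transposing turns columns into rows, so iterating over data.T
# yields one column of the file at a time.
for column in data.T:
    print(column)
```

Because the loop over the transposed array also visits the first column, a plot built this way draws the first column against itself as well as against the others.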
Scatter Plots and Bar Graphs
sct = np.random.rand(20, 2)
print(sct)
plt.scatter(sct[:, 0], sct[:, 1])  # scatter plot of the two columns

ghj = [5, 10, 15, 20, 25]
it = [1, 2, 3, 4, 5]
plt.bar(ghj, it)  # simple bar graph

ghj = [5, 10, 15, 20, 25]
it = [1, 2, 3, 4, 5]
plt.bar(ghj, it, width=5)  # width changes the thickness of the bars; the default is 0.8 units

ghj = [5, 10, 15, 20, 25]
it = [1, 2, 3, 4, 5]
plt.barh(ghj, it)  # barh draws a horizontal bar graph
Multiple bar charts
new_list = [[5., 25., 50., 20.], [4., 23., 51., 17.], [6., 22., 52., 19.]]
x = np.arange(4)
plt.bar(x + 0.00, new_list[0], color='b', width=0.25)
plt.bar(x + 0.25, new_list[1], color='r', width=0.25)
plt.bar(x + 0.50, new_list[2], color='g', width=0.25)
# plt.show()

# Stacked bar charts
p = [5., 30., 45., 22.]
q = [5., 25., 50., 20.]
x = range(4)
plt.bar(x, p, color='b')
plt.bar(x, q, color='y', bottom=p)

# plotting more than 2 values
A = np.array([5., 30., 45., 22.])
B = np.array([5., 25., 50., 20.])
C = np.array([1., 2., 1., 1.])
X = np.arange(4)
plt.bar(X, A, color='b')
plt.bar(X, B, color='g', bottom=A)
plt.bar(X, C, color='r', bottom=A + B)  # the third series sits on top of A + B
plt.show()

black_money = np.array([5., 30., 45., 22.])
white_money = np.array([5., 25., 50., 20.])
z = np.arange(4)
plt.barh(z, black_money, color='g')
plt.barh(z, -white_money, color='r')  # negating the values draws the bars to the left, giving a back-to-back chart
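In the stacked bar charts above, each layer's bottom is the sum of all the layers beneath it. For more than a few series this can be computed with a cumulative sum. The sketch below is one possible way to do it; the names layers and bottoms are hypothetical, and the Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# The same three series as in the stacked example, as rows of one array.
layers = np.array([
    [5., 30., 45., 22.],
    [5., 25., 50., 20.],
    [1., 2., 1., 1.],
])
x = np.arange(layers.shape[1])

# bottoms[i] is the sum of all layers below layer i (zeros for the first layer).
bottoms = np.vstack([np.zeros(layers.shape[1]),
                     np.cumsum(layers, axis=0)[:-1]])

fig, ax = plt.subplots()
for heights, bottom, color in zip(layers, bottoms, ["b", "g", "r"]):
    ax.bar(x, heights, bottom=bottom, color=color)

print(bottoms)
```

The loop replaces the hand-written `bottom=A` and `bottom=A + B` arguments, so adding a fourth series only means adding a row and a color.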
Other Plots
# Pie charts
y = [5, 25, 45, 65]
plt.pie(y)

# Histograms
d = np.random.randn(100)
plt.hist(d, bins=20)

d = np.random.randn(100)
plt.boxplot(d)
# 1) The bar inside the box is the median of the distribution (red in the author's output; the color depends on the Matplotlib version).
# 2) The box runs from the lower quartile to the upper quartile, so it contains the middle 50 percent of the data.
#    The median always lies inside the box, though not necessarily at its exact center.

d = np.random.randn(100, 5)  # generate multiple box plots, one per column
plt.boxplot(d)
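The quantities a box plot draws can also be computed directly. The following sketch uses np.percentile and assumes Matplotlib's default whisker setting (whis=1.5); the seeded generator is there only to make the example repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, so the run is repeatable
d = rng.standard_normal(100)

# The box spans q1 to q3, with a bar at the median.
q1, median, q3 = np.percentile(d, [25, 50, 75])
iqr = q3 - q1                    # interquartile range: the height of the box

# With whis=1.5, points beyond these fences are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = d[(d < lower_fence) | (d > upper_fence)]

print(q1, median, q3)
print(len(outliers))
```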
Part 2:
This part covers how to adjust the style and color of a figure, for example markers, line thickness, line patterns and color maps.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

p = np.random.standard_normal((50, 2))
p += np.array((-1, 1))  # center the distribution at (-1, 1)
q = np.random.standard_normal((50, 2))
q += np.array((1, 1))  # center the distribution at (1, 1)
plt.scatter(p[:, 0], p[:, 1], color='.25')
plt.scatter(q[:, 0], q[:, 1], color='.75')

dd = np.random.standard_normal((50, 2))
plt.scatter(dd[:, 0], dd[:, 1], color='1.0', edgecolor='0.0')  # edgecolor controls the color of the marker edge
Custom Color for Bar charts,Pie charts and box plots:
The bar graph below plots x (0 to 49) against y (50 random integers between 1 and 99), with a different shade of gray for each range of values. To do this we create a list of four colors (color_set); the list comprehension then picks one of those four colors for each of the 50 values.
vals = np.random.randint(1, 100, size=50)  # integers from 1 to 99 (random_integers is deprecated)
color_set = ['.00', '.25', '.50', '.75']
color_lists = [color_set[(len(color_set) * val) // 100] for val in vals]
c = plt.bar(np.arange(50), vals, color=color_lists)

hi = np.random.randint(1, 9, size=10)  # ten integers from 1 to 8
color_set = ['.00', '.25', '.50', '.75']
plt.pie(hi, colors=color_set)  # colors accepts a sequence of colors
plt.show()
# If there are fewer colors than values, pyplot.pie() simply cycles through the color list.
# Here we gave four colors for a pie chart with ten wedges, so the colors repeat.

values = np.random.randn(100)
w = plt.boxplot(values)
for att, lines in w.items():  # boxplot returns a dictionary of lists of lines
    for l in lines:
        l.set_color('k')
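The index arithmetic in the list comprehension, and the way a short color list is reused, can be checked without drawing anything. This is a small sketch; the helper name shade_for is made up for the illustration.

```python
from itertools import cycle, islice

color_set = ['.00', '.25', '.50', '.75']

def shade_for(val, n_colors=len(color_set), upper=100):
    """Map an integer in [0, upper) to an index into the color list."""
    return (n_colors * val) // upper

# Values 0-24 map to index 0, 25-49 to index 1, and so on.
indices = [shade_for(v) for v in (0, 24, 25, 50, 75, 99)]
print(indices)

# With fewer colors than wedges, the list is reused from the start,
# which is how a four-color list covers ten pie wedges.
ten_wedge_colors = list(islice(cycle(color_set), 10))
print(ten_wedge_colors)
```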
Color Maps
know more about hsv
# how to color scatter plots
# Colormaps are defined in the matplotlib.cm module. This module provides
# functions to create and use colormaps. It also provides an exhaustive choice of predefined color maps.
import matplotlib.cm as cm
N = 256
angle = np.linspace(0, 8 * 2 * np.pi, N)
radius = np.linspace(.5, 1., N)
X = radius * np.cos(angle)
Y = radius * np.sin(angle)
plt.scatter(X, Y, c=angle, cmap=cm.hsv)

# Color in bar graphs
import matplotlib.cm as cm
import matplotlib.colors as col  # needed for col.Normalize below
vals = np.random.randint(1, 100, size=50)  # integers from 1 to 99
cmap = cm.ScalarMappable(col.Normalize(0, 99), cm.binary)
plt.bar(np.arange(len(vals)), vals, color=cmap.to_rgba(vals))
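To make the bar-graph coloring less opaque, the sketch below shows the two steps that ScalarMappable combines: Normalize rescales the data to the range 0 to 1, and the colormap turns that number into an RGBA color. The four sample values are chosen for illustration, and the colormap is passed by name.

```python
import matplotlib.cm as cm
import matplotlib.colors as col
import numpy as np

vals = np.array([0, 33, 66, 99])

# Normalize rescales data values linearly from [0, 99] to [0, 1].
norm = col.Normalize(vmin=0, vmax=99)

# ScalarMappable pairs the normalization with a colormap;
# the 'binary' colormap runs from white (0) to black (1).
mapper = cm.ScalarMappable(norm=norm, cmap='binary')
rgba = mapper.to_rgba(vals)   # one (red, green, blue, alpha) row per value

print(norm(vals))
print(rgba)
```

The resulting array can be passed as the color argument of plt.bar, so taller bars come out darker.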
Line Styles
# Three curves in black, each drawn with a different line style
def pq(I, mu, sigma):
    a = 1. / (sigma * np.sqrt(2. * np.pi))
    b = -1. / (2. * sigma ** 2)
    return a * np.exp(b * (I - mu) ** 2)

I = np.linspace(-6, 6, 1024)
plt.plot(I, pq(I, 0., 1.), color='k', linestyle='solid')
plt.plot(I, pq(I, 0., .5), color='k', linestyle='dashed')
plt.plot(I, pq(I, 0., .25), color='k', linestyle='dashdot')

N = 15
A = np.random.random(N)
B = np.random.random(N)
X = np.arange(N)
plt.bar(X, A, color='.75')
plt.bar(X, A + B, bottom=A, color='w', edgecolor='k', linestyle='dashed')  # white bars need an edge color for the dashed outline to show
plt.show()

def gf(X, mu, sigma):
    a = 1. / (sigma * np.sqrt(2. * np.pi))
    b = -1. / (2. * sigma ** 2)
    return a * np.exp(b * (X - mu) ** 2)

X = np.linspace(-6, 6, 1024)
for i in range(64):
    samples = np.random.standard_normal(50)
    mu, sigma = np.mean(samples), np.std(samples)
    plt.plot(X, gf(X, mu, sigma), color='.75', linewidth=.5)
plt.plot(X, gf(X, 0., 1.), color='.00', linewidth=3.)
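The functions pq and gf above both evaluate the normal probability density with mean mu and standard deviation sigma. As a quick sanity check (a sketch, with the function renamed gaussian for clarity), the area under each curve should be close to 1, and a smaller sigma should give a taller, narrower peak.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    a = 1. / (sigma * np.sqrt(2. * np.pi))
    b = -1. / (2. * sigma ** 2)
    return a * np.exp(b * (x - mu) ** 2)

x = np.linspace(-6, 6, 1024)
dx = x[1] - x[0]

for sigma in (1., .5, .25):
    y = gaussian(x, 0., sigma)
    area = np.sum(y) * dx   # simple Riemann-sum approximation of the integral
    print(sigma, y.max(), area)
```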
Fill surfaces with pattern
N = 15
A = np.random.random(N)
B = np.random.random(N)
X = np.arange(N)
plt.bar(X, A, color='w', hatch='x')
plt.bar(X, A + B, bottom=A, color='r', hatch='/')
# some other hatch values are: / \ | - + x o O . *
Marker styles
X = np.linspace(-6, 6, 1024)
Ya = np.sinc(X)
Yb = np.sinc(X) + 1
plt.plot(X, Ya, marker='o', color='.75')
plt.plot(X, Yb, marker='^', color='.00', markevery=32)  # draws a marker on every 32nd point only

# Marker size
A = np.random.standard_normal((50, 2))
A += np.array((-1, 1))
B = np.random.standard_normal((50, 2))
B += np.array((1, 1))
plt.scatter(A[:, 0], A[:, 1], color='k', s=25.0)
plt.scatter(B[:, 0], B[:, 1], color='g', s=100.0)  # the size of the marker is set with the 's' argument
Custom marker shapes (to be revisited later)
# more about markers
X = np.linspace(-6, 6, 1024)
Y = np.sinc(X)
plt.plot(X, Y, color='r', marker='o', markersize=9, markevery=30, markerfacecolor='w', linewidth=3.0, markeredgecolor='b')

import matplotlib as mpl
mpl.rc('lines', linewidth=3)
mpl.rc('xtick', color='w')  # color of x axis numbers
mpl.rc('ytick', color='w')  # color of y axis numbers
mpl.rc('axes', facecolor='g', edgecolor='y')  # color of axes
mpl.rc('figure', facecolor='.00', edgecolor='w')  # color of figure
mpl.rc('axes', prop_cycle=mpl.cycler(color=['y', 'r']))  # color of plots (prop_cycle replaces the older color_cycle setting)
x = np.linspace(0, 7, 1024)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
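Part 3:
Annotating figures: this part also covers working with several plots and controlling the axis range, the aspect ratio and the axes.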
Annotation
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-6, 6, 1024)
Y = np.sinc(X)
plt.title('A simple marker exercise')  # add a title
plt.xlabel('array variables')  # add an x label
plt.ylabel(' random variables')  # add a y label
plt.text(-5, 0.4, 'Matplotlib')  # -5 is the x value and 0.4 is the y value
plt.plot(X, Y, color='r', marker='o', markersize=9, markevery=30, markerfacecolor='w', linewidth=3.0, markeredgecolor='b')

def pq(I, mu, sigma):
    a = 1. / (sigma * np.sqrt(2. * np.pi))
    b = -1. / (2. * sigma ** 2)
    return a * np.exp(b * (I - mu) ** 2)

I = np.linspace(-6, 6, 1024)
plt.plot(I, pq(I, 0., 1.), color='k', linestyle='solid')
plt.plot(I, pq(I, 0., .5), color='k', linestyle='dashed')
plt.plot(I, pq(I, 0., .25), color='k', linestyle='dashdot')
# A dictionary of styles for the text box
design = {
    'facecolor': 'y',  # color used for the text box
    'edgecolor': 'g',
    'boxstyle': 'round'
}
plt.text(-4, 1.5, 'Matplot Lib', bbox=design)
plt.plot(X, Y, c='k')
plt.show()
# 'boxstyle' sets the style of the box, which can be either 'round' or 'square'
# 'pad': if 'boxstyle' is set to 'square', it defines the amount of padding between the text and the box's sides
Alignment Control
The text is bound by a box. This box is used to relatively align the text to the coordinates passed to pyplot.text(). Using the verticalalignment and horizontalalignment parameters (respective shortcut equivalents are va and ha), we can control how the alignment is done.
The vertical alignment options are as follows:
'center': This is relative to the center of the textbox
'top': This is relative to the upper side of the textbox
'bottom': This is relative to the lower side of the textbox
'baseline': This is relative to the text's baseline
The figure loaded below illustrates the vertical alignment options ('bottom', 'baseline', 'center' and 'top'):
cd C:\Users\tk\Desktop
from IPython.display import Image
Image(filename='text alignment.png')
The horizontal alignment options are as follows:
'center': This is relative to the center of the textbox
'left': This is relative to the left side of the textbox
'right': This is relative to the right-hand side of the textbox
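The alignment options are easiest to understand by placing the same kind of label at one height with each vertical alignment in turn. The following sketch does that; the coordinates and the reference line are arbitrary choices for the illustration, and the Agg backend is used only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(0, 4)
ax.set_ylim(0, 2)

# Reference line through the anchor points passed to text().
ax.axhline(1.0, color='.75')

# Same anchor height (y = 1.0), four different vertical alignments.
labels = []
for i, va in enumerate(['top', 'center', 'baseline', 'bottom']):
    t = ax.text(0.5 + i, 1.0, va, va=va, ha='center')
    labels.append(t)
```

In a notebook the figure shows each word sitting below, across, or above the gray line, depending on which part of its text box is pinned to the anchor point.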
X = np.linspace(-4, 4, 1024)
Y = .25 * (X + 4.) * (X + 1.) * (X - 2.)
plt.annotate('Big Data', ha='center', va='bottom', xytext=(-1.5, 3.0), xy=(0.75, -2.7),
             arrowprops={'facecolor': 'green', 'shrink': 0.05, 'edgecolor': 'black'})  # arrow properties
plt.plot(X, Y)

# The available arrow styles are shown in this image:
from IPython.display import Image
Image(filename='arrows.png')
Legend properties:
'loc': This is the location of the legend. The default value is 'best', which will place it automatically. Other valid values are 'upper right', 'upper left', 'lower left', 'lower right', 'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'.
'shadow': This can be either True or False, and it renders the legend with a shadow effect.
'fancybox': This can be either True or False and renders the legend with a rounded box.
'title': This renders the legend with the title passed as a parameter.
'ncol': This forces the passed value to be the number of columns for the legend
x = np.linspace(0, 6, 1024)
y1 = np.sin(x)
y2 = np.cos(x)
plt.xlabel('Sin Wave')
plt.ylabel('Cos Wave')
plt.plot(x, y1, c='b', lw=3.0, label='Sin(x)')  # labels are specified here
plt.plot(x, y2, c='r', lw=3.0, ls='--', label='Cos(x)')
plt.legend(loc='best', shadow=True, fancybox=False, title='Waves', ncol=1)  # displays the labels
plt.grid(True, lw=2, ls='--', c='.75')  # adds grid lines to the figure
plt.show()
Shapes
# Paths for several kinds of shapes are available in the matplotlib.patches module
import matplotlib.patches as patches
dis = patches.Circle((0, 0), radius=1.0, color='.75')
plt.gca().add_patch(dis)  # add the shape to the current axes so that it is drawn
dis = patches.Rectangle((2.5, -.5), 2.0, 1.0, color='.75')  # patches.Rectangle((x, y), width, height)
plt.gca().add_patch(dis)
dis = patches.Ellipse((0, -2.0), 2.0, 1.0, angle=45, color='.00')
plt.gca().add_patch(dis)
dis = patches.FancyBboxPatch((2.5, -2.5), 2.0, 1.0, boxstyle='roundtooth', color='g')
plt.gca().add_patch(dis)
plt.grid(True)
plt.axis('scaled')  # uses the same scale on both axes and fits the limits to the shapes
plt.show()
# FancyBboxPatch: this is like a rectangle but takes an additional boxstyle parameter
# (either 'larrow', 'rarrow', 'round', 'round4', 'roundtooth', 'sawtooth', or 'square')

import matplotlib.patches as patches
theta = np.linspace(0, 2 * np.pi, 8)  # generates an array of angles
vertical = np.vstack((np.cos(theta), np.sin(theta))).transpose()  # vstack stacks the two arrays; the transpose gives one (x, y) pair per row
# print(vertical)  # print and see how the array looks
plt.gca().add_patch(patches.Polygon(vertical, color='y'))
plt.axis('scaled')
plt.grid(True)
plt.show()
# The matplotlib.patches.Polygon() constructor takes a list of coordinates as input, that is, the vertices of the polygon

# a polygon can be inscribed in a circle
theta = np.linspace(0, 2 * np.pi, 6)  # generates an array of angles
vertical = np.vstack((np.cos(theta), np.sin(theta))).transpose()  # one (x, y) pair per row
# print(vertical)  # print and see how the array looks
plt.gca().add_patch(plt.Circle((0, 0), radius=1.0, color='b'))
plt.gca().add_patch(plt.Polygon(vertical, fill=None, lw=4.0, ls='dashed', edgecolor='w'))
plt.axis('scaled')
plt.grid(True)
plt.show()
Ticks in Matplotlib
# In matplotlib, ticks are small marks on both the axes of a figure
import matplotlib.ticker as ticker
X = np.linspace(-12, 12, 1024)
Y = .25 * (X + 4.) * (X + 1.) * (X - 2.)
pl = plt.axes()  # the object that manages the axes of a figure
pl.xaxis.set_major_locator(ticker.MultipleLocator(5))
pl.xaxis.set_minor_locator(ticker.MultipleLocator(1))
plt.plot(X, Y, c='y')
plt.grid(True, which='major')  # which can take three values: 'minor', 'major' and 'both'
plt.show()

name_list = ('Omar', 'Serguey', 'Max', 'Zhou', 'Abidin')
value_list = np.random.randint(0, 99, size=len(name_list))
pos_list = np.arange(len(name_list))
ax = plt.axes()
ax.xaxis.set_major_locator(ticker.FixedLocator(pos_list))
ax.xaxis.set_major_formatter(ticker.FixedFormatter(name_list))
plt.bar(pos_list, value_list, color='.75', align='center')
plt.show()
Part 4:
This part covers some more complex figures.
Working with figures
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

T = np.linspace(-np.pi, np.pi, 1024)
fig, (ax0, ax1) = plt.subplots(ncols=2)  # two axes side by side
ax0.plot(np.sin(2 * T), np.cos(0.5 * T), c='k')
ax1.plot(np.cos(3 * T), np.sin(T), c='k')
plt.show()

Setting aspect ratio
T = np.linspace(0, 2 * np.pi, 1024)
plt.plot(2. * np.cos(T), np.sin(T), c='k', lw=3.)
plt.gca().set_aspect('equal')  # remove this line of code and see how the figure looks
plt.show()

X = np.linspace(-6, 6, 1024)
Y1, Y2 = np.sinc(X), np.cos(X)
plt.figure(figsize=(10.24, 2.56))  # sets the size of the figure, in inches
plt.plot(X, Y1, c='r', lw=3.)
plt.plot(X, Y2, c='.75', lw=3.)
plt.show()

X = np.linspace(-6, 6, 1024)
plt.ylim(-.5, 1.5)
plt.plot(X, np.sinc(X), c='k')
plt.show()

X = np.linspace(-6, 6, 1024)
Y = np.sinc(X)
X_sub = np.linspace(-3, 3, 1024)  # x coordinates for the inset plot
Y_sub = np.sinc(X_sub)  # y coordinates for the inset plot
plt.plot(X, Y, c='b')
sub_axes = plt.axes([.6, .6, .25, .25])  # [left, bottom, width, height] of the inset, as fractions of the figure
sub_axes.plot(X_sub, Y_sub, c='r')
plt.show()
Log Scale
X = np.linspace(1, 10, 1024)
plt.yscale('log')  # set the y scale to log; use plt.xscale('log') for the x axis
plt.plot(X, X, c='k', lw=2., label=r'$f(x)=x$')
plt.plot(X, 10 ** X, c='.75', ls='--', lw=2., label=r'$f(x)=10^x$')
plt.plot(X, np.log(X), c='.75', lw=2., label=r'$f(x)=\log(x)$')
plt.legend()
plt.show()
# The logarithm base is 10 by default, but it can be changed with the optional parameter base (basex and basey in older Matplotlib versions).
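As a short illustration of changing the logarithm base, the sketch below draws a base-2 exponential on a base-2 log axis, where it appears as a straight line. It assumes Matplotlib 3.3 or newer, where the keyword is base; the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

X = np.linspace(1, 10, 1024)

fig, ax = plt.subplots()
ax.set_yscale('log', base=2)   # assumes Matplotlib 3.3+; older versions used basey=2
ax.plot(X, 2 ** X, c='k', label=r'$f(x)=2^x$')
ax.legend()
```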
Polar Coordinates
T = np.linspace(0, 2 * np.pi, 1024)
plt.axes(polar=True)  # use polar coordinates
plt.plot(T, 1. + .25 * np.sin(16 * T), c='k')
plt.show()

import matplotlib.patches as patches  # import the patches module from matplotlib
ax = plt.axes(polar=True)
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
radius = .25 + .75 * np.random.random(size=len(theta))
points = np.vstack((theta, radius)).transpose()
plt.gca().add_patch(patches.Polygon(points, color='.75'))
plt.show()
x = np.linspace(-6, 6, 1024)
y = np.sin(x)
plt.plot(x, y)
plt.savefig('bigdata.png', transparent=True)  # savefig writes the figure to a file
# This creates a file named bigdata.png. Its size in pixels is the figure size in inches multiplied by the dpi
# (800 x 600 pixels under the older defaults of an 8 x 6 inch figure saved at 100 dpi), with 8 bits per color channel.

theta = np.linspace(0, 2 * np.pi, 8)
points = np.vstack((np.cos(theta), np.sin(theta))).T
plt.figure(figsize=(6.0, 6.0))
plt.gca().add_patch(plt.Polygon(points, color='r'))
plt.axis('scaled')
plt.grid(True)
plt.savefig('pl.png', dpi=300)  # try 'pl.pdf' or 'pl.svg'
# dpi is dots per inch: a 6 x 6 inch figure at 300 dpi gives 1800 x 1800 pixels
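The relationship between figure size, dpi and pixel dimensions can be verified by saving a figure and reading the size back from the PNG header. This is a sketch under a few assumptions: the figure is saved to an in-memory buffer rather than to pl.png, and the plotted line is a placeholder.

```python
import io
import struct
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6.0, 6.0))      # size in inches
fig.gca().plot([0, 1], [0, 1])            # placeholder content

buf = io.BytesIO()
fig.savefig(buf, format='png', dpi=300)   # 300 dots per inch

# A PNG file stores its width and height as two big-endian
# 32-bit integers in bytes 16 to 23 of the file.
width, height = struct.unpack('>II', buf.getvalue()[16:24])
print(width, height)
```

Passing bbox_inches='tight' to savefig crops the image to its contents, so the simple inches-times-dpi rule no longer gives the exact size in that case.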