Python Data Analysis and Data Mining for Beginners, Chapter 2: pandas, Section 1: pandas Basics Q&A 1-15

These notes follow an online course on YouTube; the playlist is at https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y

 

Python pandas Q&A video series by Data School

YouTube playlist and GitHub repository

Table of contents

  1. What is pandas?
  2. How do I read a tabular data file into pandas?
  3. How do I select a pandas Series from a DataFrame?
  4. Why do some pandas commands end with parentheses (and others don't)?
  5. How do I rename columns in a pandas DataFrame?
  6. How do I remove columns from a pandas DataFrame?
  7. How do I sort a pandas DataFrame or a Series?
  8. How do I filter rows of a pandas DataFrame by column value?
  9. How do I apply multiple filter criteria to a pandas DataFrame?
  10. Your pandas questions answered!
  11. How do I use the "axis" parameter in pandas?
  12. How do I use string methods in pandas?
  13. How do I change the data type of a pandas Series?
  14. When should I use a "groupby" in pandas?
  15. How do I explore a pandas Series?
  16. How do I handle missing values in pandas?
  17. What do I need to know about the pandas index? (Part 1)
  18. What do I need to know about the pandas index? (Part 2)
  19. How do I select multiple rows and columns from a pandas DataFrame?
  20. When should I use the "inplace" parameter in pandas?
  21. How do I make my pandas DataFrame smaller and faster?
  22. How do I use pandas with scikit-learn to create Kaggle submissions?
  23. More of your pandas questions answered!
  24. How do I create dummy variables in pandas?
  25. How do I work with dates and times in pandas?
  26. How do I find and remove duplicate rows in pandas?
  27. How do I avoid a SettingWithCopyWarning in pandas?
  28. How do I change display options in pandas?
  29. How do I create a pandas DataFrame from another object?
  30. How do I apply a function to a pandas Series or DataFrame?
In [1]:
 
 
 
 
 
# conventional way to import pandas
import pandas as pd
 
 
 

2. How do I read a tabular data file into pandas? (video)

In [2]:
 
 
 
 
 
# read a dataset of Chipotle orders directly from a URL and store the result in a DataFrame
url1 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/chipotle.tsv"
# read_table assumes tab-separated values by default
orders = pd.read_table(url1)
 
 
In [3]:
 
 
 
 
 
# examine the first 5 rows
orders.head()
 
 
Out[3]:
  order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
 

Documentation for read_table

In [4]:
 
 
 
 
 
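# read a dataset of movie reviewers (overriding the default parameter values of read_table)
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zipcode']  # column names
url2 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/u.user"
# define the URL
# optional variant: skiprows and skipfooter drop lines from the start and end of the file
# users = pd.read_table(url2, sep='|', header=None, names=user_cols, skiprows=2, skipfooter=3)
# sep: the delimiter; header=None: the file has no header row; names: the column names
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)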
 
 
In [5]:
 
 
 
 
 
# examine the first 5 rows
users.head()
 
 
Out[5]:
  user_id age gender occupation zipcode
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
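
The effect of sep, header and names can be checked without a network connection. The following is a minimal sketch that assumes a small in-memory sample (the `raw` string and the `users_demo` name are made up for illustration) shaped like the pipe-separated file above:

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the reviewers file: pipe-separated, no header row
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zipcode']

# sep sets the delimiter, header=None says there is no header row,
# and names supplies the column labels
users_demo = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=user_cols)

print(users_demo.shape)
print(list(users_demo.columns))
```

Without header=None, pandas would treat the first data row as column names, so the first reviewer would silently disappear from the data.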
 
 

3. How do I select a pandas Series from a DataFrame? (video)

In [6]:
 
 
 
 
 
# read a dataset of UFO reports into a DataFrame
url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"  # define the URL
ufo = pd.read_table(url3, sep=',')
 
 
In [7]:
 
 
 
 
 
# read_csv is equivalent to read_table, except that it assumes a comma separator by default
ufo = pd.read_csv(url3)
 
 
In [8]:
 
 
 
 
 
# examine the first 5 rows
ufo.head()
 
 
Out[8]:
  City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
In [9]:
 
 
 
 
 
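# select the 'City' Series using bracket notation
ufo['City']
# or equivalently, use dot notation: quicker to type, but it fails when the name contains a space or clashes with a Python keyword or a DataFrame attribute
ufo.City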
 
 
Out[9]:
0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
5                 Valley City
6                 Crater Lake
7                        Alma
8                     Eklutna
9                     Hubbard
10                    Fontana
11                   Waterloo
12                     Belton
13                     Keokuk
14                  Ludington
15                Forest Home
16                Los Angeles
17                  Hapeville
18                     Oneida
19                 Bering Sea
20                   Nebraska
21                        NaN
22                        NaN
23                  Owensboro
24                 Wilderness
25                  San Diego
26                 Wilderness
27                     Clovis
28                 Los Alamos
29               Ft. Duschene
                 ...         
18211                 Holyoke
18212                  Carson
18213                Pasadena
18214                  Austin
18215                El Campo
18216            Garden Grove
18217           Berthoud Pass
18218              Sisterdale
18219            Garden Grove
18220             Shasta Lake
18221                Franklin
18222          Albrightsville
18223              Greenville
18224                 Eufaula
18225             Simi Valley
18226           San Francisco
18227           San Francisco
18228              Kingsville
18229                 Chicago
18230             Pismo Beach
18231             Pismo Beach
18232                    Lodi
18233               Anchorage
18234                Capitola
18235          Fountain Hills
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, Length: 18241, dtype: object
 

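Bracket notation always works, whereas dot notation has limitations:

  • Dot notation doesn't work if there are spaces in the Series name
  • Dot notation doesn't work if the Series has the same name as a DataFrame method or attribute (like 'head' or 'shape')
  • Dot notation can't be used to define the name of a new Series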
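
The limitations listed above can be reproduced on a tiny DataFrame. This is only a sketch: `ufo_demo` and its 'shape' column are made up for illustration, with 'shape' chosen deliberately to clash with the DataFrame attribute of the same name.

```python
import pandas as pd

# Hypothetical miniature of the UFO data
ufo_demo = pd.DataFrame({
    'City': ['Ithaca', 'Holyoke'],
    'Shape Reported': ['TRIANGLE', 'OVAL'],
    'shape': ['a', 'b'],  # clashes with the DataFrame.shape attribute
})

# For a simple column name, both notations return the same Series
same = ufo_demo['City'].equals(ufo_demo.City)

# A name containing a space is reachable only through brackets
shapes = ufo_demo['Shape Reported']

# Dot notation resolves to the DataFrame attribute, not the column
attr = ufo_demo.shape      # a (rows, columns) tuple
col = ufo_demo['shape']    # the actual column

print(same, attr, list(col))
```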
In [10]:
 
 
 
 
 
# create a new 'Location' Series (bracket notation is required on the left side when defining a new Series)
ufo['Location'] = ufo.City + ', ' + ufo.State
ufo.head()
 
 
Out[10]:
  City Colors Reported Shape Reported State Time Location
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00 Ithaca, NY
1 Willingboro NaN OTHER NJ 6/30/1930 20:00 Willingboro, NJ
2 Holyoke NaN OVAL CO 2/15/1931 14:00 Holyoke, CO
3 Abilene NaN DISK KS 6/1/1931 13:00 Abilene, KS
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00 New York Worlds Fair, NY
 

[Back to top]

 

4. Why do some pandas commands end with parentheses (and others don't)? (video)

In [11]:
 
 
 
 
 
# read a dataset of top-rated IMDb movies into a DataFrame
url4="https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv"
movies = pd.read_csv(url4)
 
 
 

Methods end with parentheses, while attributes don't:

In [12]:
 
 
 
 
 
# example method: show the first 5 rows
movies.head()
 
 
Out[12]:
  star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....
In [13]:
 
 
 
 
 
# example method: calculate summary statistics
movies.describe()
 
 
Out[13]:
  star_rating duration
count 979.000000 979.000000
mean 7.889785 120.979571
std 0.336069 26.218010
min 7.400000 64.000000
25% 7.600000 102.000000
50% 7.800000 117.000000
75% 8.100000 134.000000
max 9.300000 242.000000
In [14]:
 
 
 
 
 
movies.describe(include=['object'])
 
 
Out[14]:
  title content_rating genre actors_list
count 979 976 979 979
unique 975 12 16 969
top Dracula R Drama [u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
freq 2 460 278 6
In [15]:
 
 
 
 
 
# example attribute: number of rows and columns
movies.shape
 
 
Out[15]:
(979, 6)
In [16]:
 
 
 
 
 
# example attribute: data type of each column
movies.dtypes
 
 
Out[16]:
star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object
In [17]:
 
 
 
 
 
# use an optional parameter of the describe method to summarize only the 'object' columns
movies.describe(include=['object'])
 
 
Out[17]:
  title content_rating genre actors_list
count 979 976 979 979
unique 975 12 16 969
top Dracula R Drama [u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
freq 2 460 278 6
 

Documentation for describe
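
A short sketch of the method/attribute distinction, assuming a made-up three-row `movies_demo` DataFrame rather than the real IMDb data:

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame
movies_demo = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'duration': [142, 175, 200],
})

# Attributes describe the object and are accessed without parentheses
n_rows, n_cols = movies_demo.shape

# Methods perform an action and are called with parentheses (and optional arguments)
first_two = movies_demo.head(2)
summary = movies_demo.describe()

# Leaving the parentheses off a method returns the method object instead of calling it
not_called = movies_demo.head
print(callable(not_called))
```

One way to think about it: attributes describe what the object is, while methods are things the object can do, which is why only methods can accept arguments such as `include=['object']`.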

[Back to top]

 

5. How do I rename columns in a pandas DataFrame? (video)

In [18]:
 
 
 
 
 
# examine the column names
ufo.columns
 
 
Out[18]:
Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time',
       'Location'],
      dtype='object')
In [19]:
 
 
 
 
 
 
# rename two of the columns by using the 'rename' method
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
ufo.columns
 
 
Out[19]:
Index(['City', 'Colors_Reported', 'Shape_Reported', 'State', 'Time',
       'Location'],
      dtype='object')
 

Documentation for rename

In [20]:
 
 
 
 
 
# replace all of the column names by overwriting the 'columns' attribute
ufo = pd.read_table(url3, sep=',')
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
ufo.columns
 
 
Out[20]:
Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')
In [21]:
 
 
 
 
 
# replace the column names during the file reading process by using the 'names' parameter
ufo = pd.read_csv(url3, header=0, names=ufo_cols)
ufo.columns
 
 
Out[21]:
Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')
 

Documentation for read_csv

In [22]:
 
 
 
 
 
ufo.columns = ufo.columns.str.replace(' ', '_')  # replace every space in the column names with an underscore
ufo.columns
 
 
Out[22]:
Index(['city', 'colors_reported', 'shape_reported', 'state', 'time'], dtype='object')
 

Documentation for str.replace
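
The renaming approaches in this section can be compared side by side. The sketch below assumes a one-row DataFrame `df` invented for the purpose; chaining `str.lower()` before `str.replace` is an extra step not used above, added only to show that the string methods compose.

```python
import pandas as pd

# Hypothetical one-row frame with the same style of column names as the UFO data
df = pd.DataFrame([[1, 2, 3]], columns=['City', 'Colors Reported', 'Shape Reported'])

# rename returns a new DataFrame unless inplace=True is passed
renamed = df.rename(columns={'Colors Reported': 'Colors_Reported'})

# Clean every column name in one pass: lower-case, then spaces to underscores
cleaned = df.copy()
cleaned.columns = cleaned.columns.str.lower().str.replace(' ', '_')

print(list(df.columns))       # the original is untouched
print(list(renamed.columns))
print(list(cleaned.columns))
```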

[Back to top]

 

6. How do I remove columns from a pandas DataFrame? (video)

In [35]:
 
 
 
 
 
ufo = pd.read_table(url3, sep=',')
ufo.head()
 
 
Out[35]:
  City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
In [37]:
 
 
 
 
 
# remove a single column (axis=1 refers to columns); inplace=True modifies the original object instead of creating a new one
ufo.drop('Colors Reported', axis=1, inplace=True)
ufo.head()
 
 
Out[37]:
  City Shape Reported State Time
0 Ithaca TRIANGLE NY 6/1/1930 22:00
1 Willingboro OTHER NJ 6/30/1930 20:00
2 Holyoke OVAL CO 2/15/1931 14:00
3 Abilene DISK KS 6/1/1931 13:00
4 New York Worlds Fair LIGHT NY 4/18/1933 19:00
 

Documentation for drop

In [38]:
 
 
 
 
 
# remove multiple columns at once
ufo.drop(['City', 'State'], axis=1, inplace=True)
ufo.head()
 
 
Out[38]:
  Shape Reported Time
0 TRIANGLE 6/1/1930 22:00
1 OTHER 6/30/1930 20:00
2 OVAL 2/15/1931 14:00
3 DISK 6/1/1931 13:00
4 LIGHT 4/18/1933 19:00
In [39]:
 
 
 
 
 
# remove multiple rows at once (axis=0 refers to rows)
ufo.drop([0, 1], axis=0, inplace=True)
ufo.head()
# the two rows are removed by index label; axis=0 is the default, but writing it out is recommended
 
 
Out[39]:
  Shape Reported Time
2 OVAL 2/15/1931 14:00
3 DISK 6/1/1931 13:00
4 LIGHT 4/18/1933 19:00
5 DISK 9/15/1934 15:30
6 CIRCLE 6/15/1935 0:00
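
Because the cells above use inplace=True, the file has to be re-read to get the dropped data back. A minimal sketch of the non-destructive alternative, assuming a made-up three-row `df`; the `columns=` keyword is an alternative spelling of axis=1 available in recent pandas versions:

```python
import pandas as pd

# Hypothetical miniature of the UFO data
df = pd.DataFrame({'City': ['Ithaca', 'Holyoke', 'Abilene'],
                   'State': ['NY', 'CO', 'KS'],
                   'Time': ['22:00', '14:00', '13:00']})

no_state = df.drop('State', axis=1)          # drop a column by label
no_first_two = df.drop([0, 1], axis=0)       # drop rows by index label
same_as_axis1 = df.drop(columns=['State'])   # keyword alternative to axis=1

print(df.shape)  # the original keeps all rows and columns
```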
 

7. How do I sort a pandas DataFrame or a Series? (video)

In [40]:
 
 
 
 
 
movies.head()
 
 
Out[40]:
  star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....
 

Note: None of the sorting methods below affect the underlying data. (In other words, the sorting is temporary.)

In [41]:
 
 
 
 
 
# sort the 'title' Series in ascending order (returns a Series)
movies.title.sort_values().head()
 
 
Out[41]:
542     (500) Days of Summer
5               12 Angry Men
201         12 Years a Slave
698                127 Hours
110    2001: A Space Odyssey
Name: title, dtype: object
In [42]:
 
 
 
 
 
# sort the 'title' Series in descending order instead
movies.title.sort_values(ascending=False).head()
 
 
Out[42]:
864               [Rec]
526                Zulu
615          Zombieland
677              Zodiac
955    Zero Dark Thirty
Name: title, dtype: object
 

Documentation for sort_values for a Series. (Prior to version 0.17, use order instead.)

In [43]:
 
 
 
 
 
# sort the entire DataFrame by the 'title' Series (returns a DataFrame)
movies.sort_values('title').head()
 
 
Out[43]:
  star_rating title content_rating genre duration actors_list
542 7.8 (500) Days of Summer PG-13 Comedy 95 [u'Zooey Deschanel', u'Joseph Gordon-Levitt', ...
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
201 8.1 12 Years a Slave R Biography 134 [u'Chiwetel Ejiofor', u'Michael Kenneth Willia...
698 7.6 127 Hours R Adventure 94 [u'James Franco', u'Amber Tamblyn', u'Kate Mara']
110 8.3 2001: A Space Odyssey G Mystery 160 [u'Keir Dullea', u'Gary Lockwood', u'William S...
In [44]:
 
 
 
 
 
# sort in descending order instead
movies.sort_values('title', ascending=False).head()
 
 
Out[44]:
  star_rating title content_rating genre duration actors_list
864 7.5 [Rec] R Horror 78 [u'Manuela Velasco', u'Ferran Terraza', u'Jorg...
526 7.8 Zulu UNRATED Drama 138 [u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac...
615 7.7 Zombieland R Comedy 88 [u'Jesse Eisenberg', u'Emma Stone', u'Woody Ha...
677 7.7 Zodiac R Crime 157 [u'Jake Gyllenhaal', u'Robert Downey Jr.', u'M...
955 7.4 Zero Dark Thirty R Drama 157 [u'Jessica Chastain', u'Joel Edgerton', u'Chri...
 

Documentation for sort_values for a DataFrame. (Prior to version 0.17, use sort instead.)

In [45]:
 
 
 
 
 
# sort the DataFrame first by 'content_rating', then by 'duration'
movies.sort_values(['content_rating', 'duration']).head()
 
 
Out[45]:
  star_rating title content_rating genre duration actors_list
713 7.6 The Jungle Book APPROVED Animation 78 [u'Phil Harris', u'Sebastian Cabot', u'Louis P...
513 7.8 Invasion of the Body Snatchers APPROVED Horror 80 [u'Kevin McCarthy', u'Dana Wynter', u'Larry Ga...
272 8.1 The Killing APPROVED Crime 85 [u'Sterling Hayden', u'Coleen Gray', u'Vince E...
703 7.6 Dracula APPROVED Horror 85 [u'Bela Lugosi', u'Helen Chandler', u'David Ma...
612 7.7 A Hard Day's Night APPROVED Comedy 87 [u'John Lennon', u'Paul McCartney', u'George H...
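
That sorting leaves the original data untouched can be verified directly. This sketch assumes a made-up three-row `df`; passing a list to `ascending` (one flag per sort key) goes slightly beyond the cells above and is included as an illustration only.

```python
import pandas as pd

# Hypothetical miniature of the movies data
df = pd.DataFrame({'title': ['Zulu', 'Alien', 'Brazil'],
                   'content_rating': ['R', 'R', 'PG'],
                   'duration': [138, 117, 132]})

# sort_values returns a sorted copy; df itself keeps its order
by_title = df.sort_values('title')

# ascending accepts one flag per sort key:
# content_rating ascending, then duration descending within each rating
mixed = df.sort_values(['content_rating', 'duration'], ascending=[True, False])

print(list(df.title))        # original order
print(list(by_title.title))  # alphabetical order
print(list(mixed.index))     # original row labels travel with the rows
```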
 

8. How do I filter rows of a pandas DataFrame by column value? (video)

In [46]:
 
 
 
 
 
movies.head()
 
 
Out[46]:
  star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....
In [47]:
 
 
 
 
 
# examine the number of rows and columns
movies.shape
 
 
Out[47]:
(979, 6)
 

Goal: Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes.

In [48]:
 
 
 
 
 
# first, a more laborious approach: use a for loop to build a list with one boolean per DataFrame row
# each element is True if the corresponding row satisfies the condition, and False otherwise
booleans = []
for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)
 
 
In [49]:
 
 
 
 
 
# confirm that the list is the same length as the DataFrame
len(booleans)
 
 
Out[49]:
979
In [50]:
 
 
 
 
 
# examine the first five list elements
booleans[0:5]
 
 
Out[50]:
[False, False, True, False, False]
In [51]:
 
 
 
 
 
# convert the list to a Series
is_long = pd.Series(booleans)
is_long.head()
 
 
Out[51]:
0    False
1    False
2     True
3    False
4    False
dtype: bool
In [52]:
 
 
 
 
 
# use bracket notation with the boolean Series to tell the DataFrame which rows to display
movies[is_long]
 
 
Out[52]:
  star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...
In [53]:
 
 
 
 
 
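# simplify the steps above: no need to write a for loop to create 'is_long'
is_long = movies.duration >= 200
movies[is_long]  # pandas filters the rows using this boolean Series
# or equivalently, write it in one line (no need to create the 'is_long' object)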
movies[movies.duration >= 200]
 
 
Out[53]:
  star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...
In [54]:
 
 
 
 
 
# select the 'genre' Series from the filtered DataFrame
movies[movies.duration >= 200].genre
# or equivalently, use 'loc' to pick the rows and the column in one step
movies.loc[movies.duration >= 200, 'genre']
 
 
Out[54]:
2          Crime
7      Adventure
17         Drama
78         Crime
85     Adventure
142    Adventure
157        Drama
204    Adventure
445    Adventure
476        Drama
630    Biography
767       Action
Name: genre, dtype: object
 

Documentation for loc
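
The filtering steps in this section boil down to building a boolean Series and passing it to the DataFrame. A minimal sketch, assuming a made-up four-row `df`:

```python
import pandas as pd

# Hypothetical miniature of the movies data
df = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                   'genre': ['Crime', 'Drama', 'Drama', 'Action'],
                   'duration': [142, 207, 96, 205]})

# Comparing a Series with a scalar is element-wise and yields a boolean Series
is_long = df.duration >= 200

# Rows where the mask is True are kept; index labels are preserved
long_movies = df[is_long]

# loc takes the row mask and the column label together
long_genres = df.loc[is_long, 'genre']

print(list(is_long))
print(list(long_genres.index))
```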

[Back to top]

 

9. How do I apply multiple filter criteria to a pandas DataFrame? (video)

In [55]:
 
 
 
 
 
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()
 
 
Out[55]:
  star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....
In [56]:
 
 
 
 
 
# filter the DataFrame to only show movies with a 'duration' of at least 200 minutes
movies[movies.duration >= 200]
 
 
Out[56]:
  star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78 8.4 Once Upon a Time in America R Crime 229 [u'Robert De Niro', u'James Woods', u'Elizabet...
85 8.4 Lawrence of Arabia PG Adventure 216 [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142 8.3 Lagaan: Once Upon a Time in India PG Adventure 224 [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204 8.1 Ben-Hur G Adventure 212 [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445 7.9 The Ten Commandments APPROVED Adventure 220 [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
630 7.7 Malcolm X PG-13 Biography 202 [u'Denzel Washington', u'Angela Bassett', u'De...
767 7.6 It's a Mad, Mad, Mad, Mad World APPROVED Action 205 [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...
 

Understanding logical operators:

  • and: True only if both sides of the operator are True
  • or: True if either side of the operator is True
In [57]:
 
 
 
 
 
print(True and True)
print(True and False)
print(False and False)
 
 
 
True
False
False
In [58]:
 
 
 
 
 
print(True or True)
print(True or False)
print(False or False)
 
 
 
True
True
False
 

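Rules for specifying multiple filter criteria in pandas:

  • use & instead of and
  • use | instead of or
  • add parentheses around each condition to specify evaluation order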
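
Why these rules exist can be seen by trying the alternative. The sketch below assumes a made-up four-row `df` and shows that Python's `and` raises an error on Series, while `&` and `|` combine the conditions element-wise:

```python
import pandas as pd

# Hypothetical miniature of the movies data
df = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                   'genre': ['Crime', 'Drama', 'Drama', 'Action'],
                   'duration': [142, 207, 96, 205]})

# & keeps rows where both conditions hold; | keeps rows where at least one holds
long_drama = df[(df.duration >= 200) & (df.genre == 'Drama')]
long_or_drama = df[(df.duration >= 200) | (df.genre == 'Drama')]

# Python's `and` needs a single True/False, which a whole Series cannot supply
try:
    df[(df.duration >= 200) and (df.genre == 'Drama')]
    raised = False
except ValueError:
    raised = True

print(list(long_drama.title), list(long_or_drama.title), raised)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>=` and `==`, so without them Python would try to evaluate `200 & df.genre` first.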

 

Goal: Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'

In [59]:
 
 
 
 
 
# use the '&' operator to specify that both conditions are required
movies[(movies.duration >=200) & (movies.genre == 'Drama')]
 
 
Out[59]:
  star_rating title content_rating genre duration actors_list
17 8.7 Seven Samurai UNRATED Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
157 8.2 Gone with the Wind G Drama 238 [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
476 7.8 Hamlet PG-13 Drama 242 [u'Kenneth Branagh', u'Julie Christie', u'Dere...
In [60]:
 
 
 
 
 
# incorrect: using the '|' operator would show movies that are either long or dramas (or both)
movies[(movies.duration >=200) | (movies.genre == 'Drama')].head()
 
 
Out[60]:
  star_rating title content_rating genre duration actors_list
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
7 8.9 The Lord of the Rings: The Return of the King PG-13 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
9 8.9 Fight Club R Drama 139 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
13 8.8 Forrest Gump PG-13 Drama 142 [u'Tom Hanks', u'Robin Wright', u'Gary Sinise']
 

Goal: Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'

In [61]:
 
 
 
 
 
 
# use the '|' operator to specify that a row can match any of the three criteria
movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].head(10)
# or equivalently, use the 'isin' method
movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].head(10)
 
 
Out[61]:
  star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....
5 8.9 12 Angry Men NOT RATED Drama 96 [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
9 8.9 Fight Club R Drama 139 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11 8.8 Inception PG-13 Action 148 [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
12 8.8 Star Wars: Episode V - The Empire Strikes Back PG Action 124 [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...
13 8.8 Forrest Gump PG-13 Drama 142 [u'Tom Hanks', u'Robin Wright', u'Gary Sinise']
 

Documentation for isin

[Back to top]

 

10. Your pandas questions answered! (video)

 

Question: When reading from a file, how do I read in only a subset of the columns?

In [62]:
 
 
 
 
 
url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"  # define the URL
ufo = pd.read_csv(url3)  # open the CSV file with read_csv
ufo.columns
 
 
Out[62]:
Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')
In [63]:
 
 
 
 
 
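# specify which columns to include by name
ufo = pd.read_csv(url3, usecols=['City', 'State'])
# or specify columns by position (note: positions 0 and 4 are 'City' and 'Time', so this selects different columns than the line above)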
ufo = pd.read_csv(url3, usecols=[0, 4])
ufo.columns
 
 
Out[63]:
Index(['City', 'Time'], dtype='object')
 

Question: When reading from a file, how do I read in only a subset of the rows?

In [64]:
 
 
 
 
 
# specify how many rows to read (here, the first 3)
ufo = pd.read_csv(url3, nrows=3)
ufo
 
 
Out[64]:
  City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
 

Documentation for read_csv
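
The usecols and nrows parameters can be tried on an in-memory sample. The sketch below assumes a made-up `raw` string laid out like ufo.csv; note that 'State' sits at position 3 in this layout, which is why positions [0, 3] match the name-based selection.

```python
import io
import pandas as pd

# Hypothetical three-row sample with the same header as ufo.csv
raw = (
    "City,Colors Reported,Shape Reported,State,Time\n"
    "Ithaca,,TRIANGLE,NY,6/1/1930 22:00\n"
    "Willingboro,,OTHER,NJ,6/30/1930 20:00\n"
    "Holyoke,,OVAL,CO,2/15/1931 14:00\n"
)

# Select columns by name ...
by_name = pd.read_csv(io.StringIO(raw), usecols=['City', 'State'])

# ... or by zero-based position (City is 0, State is 3)
by_position = pd.read_csv(io.StringIO(raw), usecols=[0, 3])

# Read only the first two data rows
first_two = pd.read_csv(io.StringIO(raw), nrows=2)

print(list(by_position.columns), first_two.shape)
```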

 

Question: How do I iterate through a Series?

In [65]:
 
 
 
 
 
# a Series is directly iterable (like a list)
for c in ufo.City:
    print(c)
 
 
 
Ithaca
Willingboro
Holyoke
 

Question: How do I iterate through a DataFrame?

In [66]:
 
 
 
 
 
# various methods are available to iterate through a DataFrame
for index, row in ufo.iterrows():
    print(index, row.City, row.State)
 
 
 
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO
 

Documentation for iterrows
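
iterrows is one of several iteration methods. As an illustration only, the sketch below compares it with itertuples, which is not covered above; the two-row `df` is made up for the example.

```python
import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({'City': ['Ithaca', 'Holyoke'], 'State': ['NY', 'CO']})

# iterrows yields (index label, row as a Series)
rows = []
for index, row in df.iterrows():
    rows.append((index, row['City'], row['State']))

# itertuples yields one namedtuple per row; the index label is in .Index
tuples = []
for t in df.itertuples():
    tuples.append((t.Index, t.City, t.State))

print(rows)
print(tuples)
```

Both loops visit the same values. itertuples avoids building a Series for every row, which tends to make it the lighter option on large DataFrames, though attribute access only works for column names that are valid Python identifiers.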

 

Question: How do I drop all non-numeric columns from a DataFrame?

In [67]:
 
 
 
 
 
# read a dataset of alcohol consumption into a DataFrame, and check the data types
url7= 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
drinks = pd.read_csv(url7)
drinks.dtypes
 
 
Out[67]:
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
In [68]:
 
 
 
 
 
# only include numeric columns in the DataFrame
import numpy as np
drinks.select_dtypes(include=[np.number]).dtypes
 
 
Out[68]:
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
dtype: object
 

Documentation for select_dtypes
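
A minimal sketch of select_dtypes on a made-up two-row `df`. It uses the string 'number' in place of `np.number`, which pandas also accepts, and shows the complementary `exclude` parameter:

```python
import pandas as pd

# Hypothetical miniature of the drinks data
df = pd.DataFrame({'country': ['Albania', 'Algeria'],
                   'beer_servings': [89, 25],
                   'total_litres': [4.9, 0.7]})

# Keep only integer and float columns
numeric = df.select_dtypes(include='number')

# Or drop them and keep everything else
non_numeric = df.select_dtypes(exclude='number')

print(list(numeric.columns), list(non_numeric.columns))
```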

 

Question: How do I know whether I should pass an argument as a string or a list?

In [69]:
 
 
 
 
 
# describe all of the numeric columns
drinks.describe()
 
 
Out[69]:
  beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
count 193.000000 193.000000 193.000000 193.000000
mean 106.160622 80.994819 49.450777 4.717098
std 101.143103 88.284312 79.697598 3.773298
min 0.000000 0.000000 0.000000 0.000000
25% 20.000000 4.000000 1.000000 1.300000
50% 76.000000 56.000000 8.000000 4.200000
75% 188.000000 128.000000 59.000000 7.200000
max 376.000000 438.000000 370.000000 14.400000
In [70]:
 
 
 
 
 
# pass the string 'all' to describe all columns
drinks.describe(include='all')
 
 
Out[70]:
  country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
count 193 193.000000 193.000000 193.000000 193.000000 193
unique 193 NaN NaN NaN NaN 6
top Marshall Islands NaN NaN NaN NaN Africa
freq 1 NaN NaN NaN NaN 53
mean NaN 106.160622 80.994819 49.450777 4.717098 NaN
std NaN 101.143103 88.284312 79.697598 3.773298 NaN
min NaN 0.000000 0.000000 0.000000 0.000000 NaN
25% NaN 20.000000 4.000000 1.000000 1.300000 NaN
50% NaN 76.000000 56.000000 8.000000 4.200000 NaN
75% NaN 188.000000 128.000000 59.000000 7.200000 NaN
max NaN 376.000000 438.000000 370.000000 14.400000 NaN
In [71]:
 
 
 
 
 
# pass a list of data types to only describe certain types
drinks.describe(include=['object', 'float64'])
 
 
Out[71]:
  country total_litres_of_pure_alcohol continent
count 193 193.000000 193
unique 193 NaN 6
top Marshall Islands NaN Africa
freq 1 NaN 53
mean NaN 4.717098 NaN
std NaN 3.773298 NaN
min NaN 0.000000 NaN
25% NaN 1.300000 NaN
50% NaN 4.200000 NaN
75% NaN 7.200000 NaN
max NaN 14.400000 NaN
In [72]:
 
 
 
 
 
# pass a list even if you only want to describe a single data type
drinks.describe(include=['object'])
 
 
Out[72]:
  country continent
count 193 193
unique 193 6
top Marshall Islands Africa
freq 1 53
 

Documentation for describe

[Back to top]

 

11. How do I use the "axis" parameter in pandas? (video)

In [73]:
 
 
 
 
 
url7= 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
drinks = pd.read_csv(url7)
drinks.head()
 
 
Out[73]:
  country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
2 Algeria 25 0 14 0.7 Africa
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
In [74]:
 
 
 
 
 
# drop a column (temporarily)
drinks.drop('continent', axis=1).head()
 
 
Out[74]:
  country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
0 Afghanistan 0 0 0 0.0
1 Albania 89 132 54 4.9
2 Algeria 25 0 14 0.7
3 Andorra 245 138 312 12.4
4 Angola 217 57 45 5.9
 

Documentation for drop

In [75]:
 
 
 
 
 
# drop a row (temporarily)
drinks.drop(2, axis=0).head()
 
 
Out[75]:
  country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 Asia
1 Albania 89 132 54 4.9 Europe
3 Andorra 245 138 312 12.4 Europe
4 Angola 217 57 45 5.9 Africa
5 Antigua & Barbuda 102 128 45 4.9 North America
 

When referring to rows or columns with the axis parameter:

  • axis 0 refers to rows
  • axis 1 refers to columns

In [76]:
 
 
 
 
 
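# calculate the mean of each numeric column (pandas 2.0 and later need numeric_only=True here, because the DataFrame also has string columns)
drinks.mean()
# or equivalently, specify the axis explicitly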
drinks.mean(axis=0)
 
 
Out[76]:
beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64
 

Documentation for mean

In [77]:
 
 
 
 
 
# calculate the mean of each row
drinks.mean(axis=1).head()
 
 
Out[77]:
0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64
 

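When performing a mathematical operation with the axis parameter:

  • axis 0 means the operation should "move down" the row axis
  • axis 1 means the operation should "move across" the column axis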
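
The direction of an aggregation is easiest to see on numbers small enough to check by hand. A sketch assuming a made-up all-numeric `df`:

```python
import pandas as pd

# Hypothetical all-numeric frame
df = pd.DataFrame({'beer': [0, 90, 30], 'wine': [0, 50, 10]})

# axis=0 moves down the rows, leaving one value per column
col_means = df.mean(axis=0)

# axis=1 moves across the columns, leaving one value per row
row_means = df.mean(axis=1)

print(list(col_means))  # [40.0, 20.0]
print(list(row_means))  # [0.0, 70.0, 20.0]
```

In other words, the axis named in the call is the one that gets collapsed: axis=0 collapses the rows and keeps the column labels, and axis=1 does the opposite.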
In [78]:
 
 
 
 
 
# 'index' is an alias for axis 0
drinks.mean(axis='index')
 
 
Out[78]:
beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64
In [79]:
 
 
 
 
 
# 'columns' is an alias for axis 1
drinks.mean(axis='columns').head()
 
 
Out[79]:
0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64
 

12. How do I use string methods in pandas? (video)

In [80]:
 
 
 
 
 
url1 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/chipotle.tsv" 
# define the URL
orders = pd.read_table(url1)  # open it with read_table()
orders.head()
 
 
Out[80]:
  order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
In [81]:
 
 
 
 
 
# normal way to access string methods in Python
'hello'.upper()
 
 
Out[81]:
'HELLO'
In [82]:
 
 
 
 
 
# string methods for a pandas Series are accessed via 'str'
orders.item_name.str.upper().head()
 
 
Out[82]:
0             CHIPS AND FRESH TOMATO SALSA
1                                     IZZE
2                         NANTUCKET NECTAR
3    CHIPS AND TOMATILLO-GREEN CHILI SALSA
4                             CHICKEN BOWL
Name: item_name, dtype: object
In [83]:
 
 
 
 
 
# the string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()
 
 
Out[83]:
0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool
In [84]:
 
 
 
 
 
# use the boolean Series to filter the DataFrame
orders[orders.item_name.str.contains('Chicken')].head()
 
 
Out[84]:
  order_id quantity item_name choice_description item_price
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
11 6 1 Chicken Crispy Tacos [Roasted Chili Corn Salsa, [Fajita Vegetables,... $8.75
12 6 1 Chicken Soft Tacos [Roasted Chili Corn Salsa, [Rice, Black Beans,... $8.75
13 7 1 Chicken Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... $11.25
In [85]:
 
 
 
 
 
# string methods can be chained together
orders.choice_description.str.replace('[', '').str.replace(']', '').head()
 
 
Out[85]:
0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object
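In [86]:
# many pandas string methods support regular expressions
# (regex=True is required in pandas 2.0+, where patterns are literal by default)
orders.choice_description.str.replace(r'[\[\]]', '', regex=True).head()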
Out[86]:
0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object
 

String handling section of the pandas API reference
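
The default value of the `regex` argument of `str.replace` changed in pandas 2.0 (from True to False), so passing it explicitly is probably the safer habit. The following is a minimal sketch on two hypothetical strings written in the style of choice_description; it compares literal and regex replacement.

```python
import pandas as pd

# Hypothetical sample values in the style of choice_description
s = pd.Series(['[Clementine]', '[Fresh Tomato Salsa, [Rice, Beans]]'])

# Literal replacement: each call removes one specific character
literal = s.str.replace('[', '', regex=False).str.replace(']', '', regex=False)

# Regex replacement: the character class [\[\]] matches either bracket in one pass
pattern = s.str.replace(r'[\[\]]', '', regex=True)

print(literal.tolist())
print(pattern.tolist())
```

Both approaches should give the same cleaned strings here; the regex form simply does it in a single call.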

[Back to top]

 

13. How do I change the data type of a pandas Series? (video)

In [87]:
url7 = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
drinks = pd.read_csv(url7)
drinks.head()
Out[87]:
   country      beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0  Afghanistan  0              0                0              0.0                           Asia
1  Albania      89             132              54             4.9                           Europe
2  Algeria      25             0                14             0.7                           Africa
3  Andorra      245            138              312            12.4                          Europe
4  Angola       217            57               45             5.9                           Africa
In [88]:
# examine the data type of each Series
drinks.dtypes
Out[88]:
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
In [89]:
# change the data type of an existing Series
drinks['beer_servings'] = drinks.beer_servings.astype(float)
drinks.dtypes
Out[89]:
country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
 

Documentation for astype

In [90]:
# alternatively, change the data type of a Series while reading in the file
drinks = pd.read_csv(url7, dtype={'beer_servings': float})
drinks.dtypes
Out[90]:
country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
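
The two approaches above (convert after reading, or convert while reading) can be tried without a network connection. The sketch below uses a hypothetical three-row CSV string as a stand-in for drinks.csv; the variable names `after` and `during` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for drinks.csv
csv_text = (
    "country,beer_servings,continent\n"
    "Albania,89,Europe\n"
    "Algeria,25,Africa\n"
    "Angola,217,Africa\n"
)

# Option 1: read first, then convert the existing Series
after = pd.read_csv(io.StringIO(csv_text))
after['beer_servings'] = after.beer_servings.astype(float)

# Option 2: request the data type while reading
during = pd.read_csv(io.StringIO(csv_text), dtype={'beer_servings': float})

print(after.dtypes)
print(during.dtypes)
```

Either route should end with beer_servings stored as float64; the second avoids a separate conversion step.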
In [91]:
orders = pd.read_table(url1)
orders.head()
Out[91]:
   order_id  quantity  item_name                              choice_description                                 item_price
0  1         1         Chips and Fresh Tomato Salsa           NaN                                                $2.39
1  1         1         Izze                                   [Clementine]                                       $3.39
2  1         1         Nantucket Nectar                       [Apple]                                            $3.39
3  1         1         Chips and Tomatillo-Green Chili Salsa  NaN                                                $2.39
4  2         2         Chicken Bowl                           [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98
In [92]:
# examine the data type of each Series
orders.dtypes
Out[92]:
order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object
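In [93]:
# convert the price strings to numbers in order to do math
# (regex=False so that '$' is treated as a literal character, not a regex anchor)
orders.item_price.str.replace('$', '', regex=False).astype(float).mean()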
Out[93]:
7.464335785374397
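
`astype(float)` raises an error as soon as one value cannot be parsed. When the data may be messy, `pd.to_numeric` with `errors='coerce'` is one possible alternative. The sketch below uses hypothetical price strings (the last one is deliberately malformed); it is not taken from the orders file.

```python
import pandas as pd

# Hypothetical price strings; the last entry is deliberately malformed
prices = pd.Series(['$2.39', '$3.39', '$16.98', 'n/a'])

# remove the literal dollar sign
stripped = prices.str.replace('$', '', regex=False)

# astype(float) would raise ValueError on 'n/a';
# errors='coerce' turns unparseable entries into NaN instead
numeric = pd.to_numeric(stripped, errors='coerce')

# mean() skips NaN by default
print(numeric.mean())
```

The trade-off is that bad values are silently turned into NaN, so it is worth checking `numeric.isna().sum()` afterwards.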
In [94]:
# the string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()
Out[94]:
0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool
In [95]:
# convert a boolean Series to integers (False = 0, True = 1)
orders.item_name.str.contains('Chicken').astype(int).head()
Out[95]:
0    0
1    0
2    0
3    0
4    1
Name: item_name, dtype: int32
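
Because True counts as 1 and False as 0, a boolean Series can also be summed or averaged directly, without an explicit conversion. A small sketch on hypothetical item names:

```python
import pandas as pd

# Hypothetical item names (not the real orders data)
items = pd.Series(['Chicken Bowl', 'Izze', 'Chicken Soft Tacos', 'Chips'])
is_chicken = items.str.contains('Chicken')

n_chicken = is_chicken.sum()       # number of True values
share_chicken = is_chicken.mean()  # fraction of True values

print(n_chicken, share_chicken)
```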
 

14. When should I use a "groupby" in pandas? (video)

In [96]:
drinks = pd.read_csv(url7)
drinks.head()
Out[96]:
   country      beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0  Afghanistan  0              0                0              0.0                           Asia
1  Albania      89             132              54             4.9                           Europe
2  Algeria      25             0                14             0.7                           Africa
3  Andorra      245            138              312            12.4                          Europe
4  Angola       217            57               45             5.9                           Africa
In [97]:
# calculate the mean beer servings across the entire dataset
drinks.beer_servings.mean()
Out[97]:
106.16062176165804
In [98]:
# calculate the mean beer servings just for countries in Africa
drinks[drinks.continent=='Africa'].beer_servings.mean()
Out[98]:
61.471698113207545
In [99]:
# calculate the mean beer servings for each continent
drinks.groupby('continent').beer_servings.mean()
Out[99]:
continent
Africa            61.471698
Asia              37.045455
Europe           193.777778
North America    145.434783
Oceania           89.687500
South America    175.083333
Name: beer_servings, dtype: float64
 

Documentation for groupby

In [100]:
# other aggregation functions (such as 'max') can also be used with groupby
drinks.groupby('continent').beer_servings.max()
Out[100]:
continent
Africa           376
Asia             247
Europe           361
North America    285
Oceania          306
South America    333
Name: beer_servings, dtype: int64
In [101]:
# multiple aggregation functions can be applied simultaneously
drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])
Out[101]:
continent      count  mean        min  max
Africa         53     61.471698   0    376
Asia           44     37.045455   0    247
Europe         45     193.777778  0    361
North America  23     145.434783  1    285
Oceania        16     89.687500   0    306
South America  12     175.083333  93   333
 

Documentation for agg
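
When several columns need different aggregations, `agg` also accepts keyword arguments of the form `new_name=(column, function)` (named aggregation). The sketch below uses a hypothetical four-row frame; the output column names `beer_mean`, `beer_max` and `wine_mean` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the drinks data
toy = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Europe', 'Europe'],
    'beer_servings': [25, 217, 89, 245],
    'wine_servings': [14, 45, 54, 312],
})

# each keyword names an output column and maps it to (input column, function)
summary = toy.groupby('continent').agg(
    beer_mean=('beer_servings', 'mean'),
    beer_max=('beer_servings', 'max'),
    wine_mean=('wine_servings', 'mean'),
)
print(summary)
```

Compared with passing a list of function names, this form gives control over the resulting column labels.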

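In [102]:
# without specifying a column, the mean is calculated for all numeric columns
# (numeric_only=True is needed in pandas 2.0+, which no longer drops text columns silently)
drinks.groupby('continent').mean(numeric_only=True)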
Out[102]:
continent      beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
Africa         61.471698      16.339623        16.264151      3.007547
Asia           37.045455      60.840909        9.068182       2.170455
Europe         193.777778     132.555556       142.222222     8.617778
North America  145.434783     165.739130       24.521739      5.995652
Oceania        89.687500      58.437500        35.625000      3.381250
South America  175.083333     114.750000       62.416667      6.308333
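
How text columns such as country are treated by a grouped `mean()` depends on the pandas version, so it can help to state explicitly which columns should be aggregated. A minimal sketch on a hypothetical four-row frame, showing two ways to do so:

```python
import pandas as pd

# Hypothetical miniature of the drinks data, including a text column
toy = pd.DataFrame({
    'country': ['Algeria', 'Angola', 'Albania', 'Andorra'],
    'continent': ['Africa', 'Africa', 'Europe', 'Europe'],
    'beer_servings': [25, 217, 89, 245],
    'wine_servings': [14, 45, 54, 312],
})

# Option 1: restrict the aggregation to numeric columns
all_numeric = toy.groupby('continent').mean(numeric_only=True)

# Option 2: name the columns to aggregate
chosen = toy.groupby('continent')[['beer_servings', 'wine_servings']].mean()

print(all_numeric)
print(chosen)
```

The second form is more verbose but makes the result independent of which other columns happen to be in the DataFrame.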
In [103]:
# allow plots to appear in the Jupyter notebook
%matplotlib inline
In [104]:
# side-by-side bar plot of the DataFrame directly above
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')
Out[104]:
<matplotlib.axes._subplots.AxesSubplot at 0x296edf23b70>
 
 

Documentation for plot

[Back to top]

 

15. How do I explore a pandas Series? (video)

In [105]:
# read a dataset of top-rated IMDb movies into a DataFrame
url4 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv"
movies = pd.read_csv(url4)
movies.head()
Out[105]:
   star_rating  title                     content_rating  genre   duration  actors_list
0  9.3          The Shawshank Redemption  R               Crime   142       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1  9.2          The Godfather             R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
2  9.1          The Godfather: Part II    R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3  9.0          The Dark Knight           PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4  8.9          Pulp Fiction              R               Crime   154       [u'John Travolta', u'Uma Thurman', u'Samuel L....
In [106]:
# examine the data types
movies.dtypes
Out[106]:
star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object
 

Exploring a non-numeric Series:

In [107]:
# count the non-null values, the unique values, and the frequency of the most common value
movies.genre.describe()
Out[107]:
count       979
unique       16
top       Drama
freq        278
Name: genre, dtype: object
 

Documentation for describe

In [108]:
# count how many times each value in the Series occurs
movies.genre.value_counts()
Out[108]:
Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Sci-Fi         5
Thriller       5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64
 

Documentation for value_counts

In [109]:
# display proportions (fractions of the total) instead of raw counts
movies.genre.value_counts(normalize=True)
Out[109]:
Drama        0.283963
Comedy       0.159346
Action       0.138917
Crime        0.126660
Biography    0.078652
Adventure    0.076609
Animation    0.063330
Horror       0.029622
Mystery      0.016343
Western      0.009193
Sci-Fi       0.005107
Thriller     0.005107
Film-Noir    0.003064
Family       0.002043
Fantasy      0.001021
History      0.001021
Name: genre, dtype: float64
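
Note that `normalize=True` returns fractions that sum to 1, not percentages, and that missing values are left out of the count unless `dropna=False` is passed. A small sketch on a hypothetical genre list (one entry deliberately missing):

```python
import pandas as pd

# Hypothetical genre values; the last entry is missing on purpose
genres = pd.Series(['Drama', 'Comedy', 'Drama', 'Action', 'Drama', None])

counts = genres.value_counts()                    # missing values are excluded by default
shares = genres.value_counts(normalize=True)      # fractions of the non-missing total
with_missing = genres.value_counts(dropna=False)  # missing values get their own row

print(shares * 100)  # multiply by 100 for percentages
```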
In [110]:
# 'value_counts' outputs a Series
type(movies.genre.value_counts())
Out[110]:
pandas.core.series.Series
In [111]:
# so Series methods can be used on the result
movies.genre.value_counts().head()
Out[111]:
Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Name: genre, dtype: int64
In [112]:
# display the unique values in the Series
movies.genre.unique()
Out[112]:
array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',
       'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',
       'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)
In [113]:
# count the number of unique values in the Series
movies.genre.nunique()
Out[113]:
16
 

Documentation for unique and nunique

In [114]:
# compute a cross-tabulation of two Series
pd.crosstab(movies.genre, movies.content_rating)
Out[114]:
content_rating  APPROVED  G   GP  NC-17  NOT RATED  PASSED  PG  PG-13  R    TV-MA  UNRATED  X
genre
Action          3         1   1   0      4          1       11  44     67   0      3        0
Adventure       3         2   0   0      5          1       21  23     17   0      2        0
Animation       3         20  0   0      3          0       25  5      5    0      1        0
Biography       1         2   1   0      1          0       6   29     36   0      0        0
Comedy          9         2   1   1      16         3       23  23     73   0      4        1
Crime           6         0   0   1      7          1       6   4      87   0      11       1
Drama           12        3   0   4      24         1       25  55     143  1      9        1
Family          0         1   0   0      0          0       1   0      0    0      0        0
Fantasy         0         0   0   0      0          0       0   0      1    0      0        0
Film-Noir       1         0   0   0      1          0       0   0      0    0      1        0
History         0         0   0   0      0          0       0   0      0    0      1        0
Horror          2         0   0   1      1          0       1   2      16   0      5        1
Mystery         4         1   0   0      1          0       1   2      6    0      1        0
Sci-Fi          1         0   0   0      0          0       0   1      3    0      0        0
Thriller        1         0   0   0      0          0       1   0      3    0      0        0
Western         1         0   0   0      2          0       2   1      3    0      0        0
 

Documentation for crosstab
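
`pd.crosstab` can also add row and column totals through its `margins` argument. The sketch below uses a hypothetical five-row frame rather than the movies data, so the counts are easy to check by hand.

```python
import pandas as pd

# Hypothetical miniature of the movies data
toy = pd.DataFrame({
    'genre': ['Crime', 'Crime', 'Action', 'Drama', 'Drama'],
    'content_rating': ['R', 'R', 'PG-13', 'R', 'PG-13'],
})

# margins=True appends an 'All' row and an 'All' column holding the totals
table = pd.crosstab(toy.genre, toy.content_rating, margins=True)
print(table)
```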

 

Exploring a numeric Series:

In [115]:
# calculate various summary statistics
movies.duration.describe()
Out[115]:
count    979.000000
mean     120.979571
std       26.218010
min       64.000000
25%      102.000000
50%      117.000000
75%      134.000000
max      242.000000
Name: duration, dtype: float64
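
The result of `describe()` is itself a Series, so individual statistics can be looked up by label. A brief sketch on five hypothetical durations (chosen so the quartiles are easy to verify):

```python
import pandas as pd

# Hypothetical durations in minutes
durations = pd.Series([64, 102, 117, 134, 242])

summary = durations.describe()

# individual statistics are available by their label
median = summary['50%']                      # same value as durations.median()
spread = summary['max'] - summary['min']     # range of the data

print(median, spread)
```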
In [116]:
# many statistics are implemented as Series methods
movies.duration.mean()
Out[116]:
120.97957099080695
 

Documentation for mean

In [117]:
# 'value_counts' is primarily useful for categorical data, not numerical data
movies.duration.value_counts().head()
Out[117]:
112    23
113    22
102    20
101    20
129    19
Name: duration, dtype: int64
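
One way to make `value_counts` more informative for a numeric column is to group the values into ranges first, for example with `pd.cut`. The bin edges and labels below are arbitrary choices for illustration, applied to hypothetical durations.

```python
import pandas as pd

# Hypothetical durations in minutes
durations = pd.Series([64, 95, 102, 117, 134, 180, 242])

# Arbitrary bin edges; each bin includes its right edge, e.g. (90, 120]
bins = [0, 90, 120, 150, 300]
labels = ['<=90', '91-120', '121-150', '>150']

binned = pd.cut(durations, bins=bins, labels=labels)

# sort_index keeps the bins in their natural order instead of by frequency
counts = binned.value_counts().sort_index()
print(counts)
```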
In [118]:
# allow plots to appear in the Jupyter notebook
%matplotlib inline
In [119]:
# histogram of the 'duration' Series (shows the distribution of a numerical variable)
movies.duration.plot(kind='hist')
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x296ee26ba58>
 
In [120]:
# bar plot of the 'value_counts' for the 'genre' Series
movies.genre.value_counts().plot(kind='bar')
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x296ee2ccba8>
 
 

Documentation for plot

[Back to top]

 

 

Original source: https://www.cnblogs.com/romannista/p/10659805.html
