Input and output can generally be divided into a few broad categories (a short usage sketch of the last two functions follows the table):
Function | Description |
---|---|
read_csv | Load delimited data from a file, URL, or file-like object; the default delimiter is a comma |
read_table | Load delimited data from a file, URL, or file-like object; the default delimiter is the tab character ('\t') |
read_fwf | Read data in fixed-width column format (that is, with no delimiters) |
read_clipboard | Read data from the clipboard; very useful when converting tables from web pages |
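read_csv and read_table are demonstrated at length in the rest of this section; read_fwf and read_clipboard are not, so here is a minimal, hedged sketch of how they might be called (the file name ex_fwf.txt and the column widths are made up for illustration):

```python
import pandas as pd

# Hypothetical fixed-width file; the widths list says how many characters
# each column occupies, since there are no delimiters to split on.
fwf_df = pd.read_fwf('ex_fwf.txt', widths=[10, 8, 8])

# read_clipboard parses whatever tabular text is currently on the clipboard,
# e.g. a table copied out of a web page.
clip_df = pd.read_clipboard()
```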
Let's start with a small CSV file, ex1.csv, and read it into a DataFrame:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df = pd.read_csv('ex1.csv')
df
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
We could also have used read_table, specifying the delimiter explicitly:
pd.read_table('ex1.csv', sep=',')
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Not all files have a header row, however. Consider this one:
!type ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
To read this file, you have two options. You can allow pandas to assign default column names, or you can specify the names yourself:
pd.read_csv('ex2.csv', header=None)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
pd.read_csv('ex2.csv', names=['a','b','c','d','message'])
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Suppose you want the message column to be the index of the returned DataFrame. You can either indicate that you want the column at index 4, or pass the name 'message' with the index_col argument:
names=['a','b','c','d','message']
pd.read_csv('ex2.csv', names=names, index_col='message')
a | b | c | d | |
---|---|---|---|---|
message | ||||
hello | 1 | 2 | 3 | 4 |
world | 5 | 6 | 7 | 8 |
foo | 9 | 10 | 11 | 12 |
To form a hierarchical index from multiple columns, pass a list of column numbers or names:
!type csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv('csv_mindex.csv',index_col=['key1','key2'])
parsed
value1 | value2 | ||
---|---|---|---|
key1 | key2 | ||
one | a | 1 | 2 |
b | 3 | 4 | |
c | 5 | 6 | |
d | 7 | 8 | |
two | a | 9 | 10 |
b | 11 | 12 | |
c | 13 | 14 | |
d | 15 | 16 |
In some cases, a table might not use a fixed delimiter between fields. When that happens, you can pass a regular expression as the delimiter for read_table. In the following file the fields are separated by a variable amount of whitespace, which can be expressed by the regular expression \s+.
list(open('ex3.txt'))
[' A B C\n', 'aaa -0.264438 -1.026059 -0.619500\n', 'bbb 0.927272 0.302904 -0.032399\n', 'ccc -0.264273 -0.386314 -0.217601\n', 'ddd -0.871858 -0.348382 1.100491\n']
The result:
result = pd.read_table('ex3.txt', sep='\s+')
Because there was one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame's index.
result
A | B | C | |
---|---|---|---|
aaa | -0.264438 | -1.026059 | -0.619500 |
bbb | 0.927272 | 0.302904 | -0.032399 |
ccc | -0.264273 | -0.386314 | -0.217601 |
ddd | -0.871858 | -0.348382 | 1.100491 |
In the next example, skiprows is used to skip the first, third, and fourth rows of the file:
!type ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('ex4.csv', skiprows=[0,2,3])
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas recognizes a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:
!type ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv('ex5.csv')
result
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
pd.isnull(result)
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | True |
1 | False | False | False | True | False | False |
2 | False | False | False | False | False | False |
The na_values option can take a list or set of strings to consider as missing values:
result = pd.read_csv('ex5.csv', na_values=['NULL'])
result
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | foo |
Different NA sentinels can be specified for each column in a dict:
sentinels = {'message':['foo','NA'],'something':['two']}
pd.read_csv('ex5.csv', na_values=sentinels)
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3.0 | 4 | NaN |
1 | NaN | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11.0 | 12 | NaN |
Some additional read_csv/read_table arguments are listed below (a hedged usage sketch follows the table):
Argument | Description |
---|---|
parse_dates | Attempt to parse the data to datetime; False by default. If True, attempts to parse all columns. Otherwise, you can specify a list of column numbers or names to parse. If an element of the list is itself a list or tuple, the listed columns are combined and then parsed as a date (useful when, for example, the date and time are split across two columns) |
keep_date_col | If columns are joined to parse a date, keep the joined columns; default False |
converters | Dict mapping column numbers or names to functions. For example, {'foo': f} applies the function f to all values in the 'foo' column |
dayfirst | Treat ambiguous dates as international format (e.g., 7/6/2012 -> June 7, 2012); default False |
date_parser | Function to use to parse dates |
nrows | Number of rows to read from the beginning of the file |
iterator | Return a TextFileReader object for reading the file piecemeal |
chunksize | For iteration, the size of file chunks |
skip_footer | Number of lines to ignore at the end of the file |
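Several of these arguments (parse_dates, keep_date_col, converters, skip_footer) are not demonstrated elsewhere in this section, so here is a hedged sketch of how they might be combined. The file name ex_dates.csv, its column names, and the chosen options are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical file whose 'date' and 'time' columns together form a timestamp.
df = pd.read_csv(
    'ex_dates.csv',
    parse_dates=[['date', 'time']],   # combine the two columns and parse them as one datetime
    keep_date_col=True,               # also keep the original 'date' and 'time' columns
    converters={'value': float},      # apply float() to every entry in the 'value' column
    skipfooter=1,                     # ignore the last line of the file
    engine='python',                  # newer pandas spells it skipfooter and requires the python engine
)
```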
When processing very large files, you may want to read only a small piece of a file or iterate through it in smaller chunks.
result = pd.read_csv('ex6.csv')
result
one | two | three | four | key | |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
5 | 1.817480 | 0.742273 | 0.419395 | -2.251035 | Q |
6 | -0.776764 | 0.935518 | -0.332872 | -1.875641 | U |
7 | -0.913135 | 1.530624 | -0.572657 | 0.477252 | K |
8 | 0.358480 | -0.497572 | -0.367016 | 0.507702 | S |
9 | -1.740877 | -1.160417 | -1.637830 | 2.172201 | G |
10 | 0.240564 | -0.328249 | 1.252155 | 1.072796 | 8 |
11 | 0.764018 | 1.165476 | -0.639544 | 1.495258 | R |
12 | 0.571035 | -0.310537 | 0.582437 | -0.298765 | 1 |
13 | 2.317658 | 0.430710 | -1.334216 | 0.199679 | P |
14 | 1.547771 | -1.119753 | -2.277634 | 0.329586 | J |
15 | -1.310608 | 0.401719 | -1.000987 | 1.156708 | E |
16 | -0.088496 | 0.634712 | 0.153324 | 0.415335 | B |
17 | -0.018663 | -0.247487 | -1.446522 | 0.750938 | A |
18 | -0.070127 | -1.579097 | 0.120892 | 0.671432 | F |
19 | -0.194678 | -0.492039 | 2.359605 | 0.319810 | H |
20 | -0.248618 | 0.868707 | -0.492226 | -0.717959 | W |
21 | -1.091549 | -0.867110 | -0.647760 | -0.832562 | C |
22 | 0.641404 | -0.138822 | -0.621963 | -0.284839 | C |
23 | 1.216408 | 0.992687 | 0.165162 | -0.069619 | V |
24 | -0.564474 | 0.792832 | 0.747053 | 0.571675 | I |
25 | 1.759879 | -0.515666 | -0.230481 | 1.362317 | S |
26 | 0.126266 | 0.309281 | 0.382820 | -0.239199 | L |
27 | 1.334360 | -0.100152 | -0.840731 | -0.643967 | 6 |
28 | -0.737620 | 0.278087 | -0.053235 | -0.950972 | J |
29 | -1.148486 | -0.986292 | -0.144963 | 0.124362 | Y |
... | ... | ... | ... | ... | ... |
9970 | 0.633495 | -0.186524 | 0.927627 | 0.143164 | 4 |
9971 | 0.308636 | -0.112857 | 0.762842 | -1.072977 | 1 |
9972 | -1.627051 | -0.978151 | 0.154745 | -1.229037 | Z |
9973 | 0.314847 | 0.097989 | 0.199608 | 0.955193 | P |
9974 | 1.666907 | 0.992005 | 0.496128 | -0.686391 | S |
9975 | 0.010603 | 0.708540 | -1.258711 | 0.226541 | K |
9976 | 0.118693 | -0.714455 | -0.501342 | -0.254764 | K |
9977 | 0.302616 | -2.011527 | -0.628085 | 0.768827 | H |
9978 | -0.098572 | 1.769086 | -0.215027 | -0.053076 | A |
9979 | -0.019058 | 1.964994 | 0.738538 | -0.883776 | F |
9980 | -0.595349 | 0.001781 | -1.423355 | -1.458477 | M |
9981 | 1.392170 | -1.396560 | -1.425306 | -0.847535 | H |
9982 | -0.896029 | -0.152287 | 1.924483 | 0.365184 | 6 |
9983 | -2.274642 | -0.901874 | 1.500352 | 0.996541 | N |
9984 | -0.301898 | 1.019906 | 1.102160 | 2.624526 | I |
9985 | -2.548389 | -0.585374 | 1.496201 | -0.718815 | D |
9986 | -0.064588 | 0.759292 | -1.568415 | -0.420933 | E |
9987 | -0.143365 | -1.111760 | -1.815581 | 0.435274 | 2 |
9988 | -0.070412 | -1.055921 | 0.338017 | -0.440763 | X |
9989 | 0.649148 | 0.994273 | -1.384227 | 0.485120 | Q |
9990 | -0.370769 | 0.404356 | -1.051628 | -1.050899 | 8 |
9991 | -0.409980 | 0.155627 | -0.818990 | 1.277350 | W |
9992 | 0.301214 | -1.111203 | 0.668258 | 0.671922 | A |
9993 | 1.821117 | 0.416445 | 0.173874 | 0.505118 | X |
9994 | 0.068804 | 1.322759 | 0.802346 | 0.223618 | H |
9995 | 2.311896 | -0.417070 | -1.409599 | -0.515821 | L |
9996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | E |
9997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | K |
9998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | G |
9999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | 0 |
10000 rows × 5 columns
If you want to read only a small number of rows, specify that with nrows:
pd.read_csv('ex6.csv', nrows=5)
one | two | three | four | key | |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
To read a file in pieces, specify a chunksize as a number of rows:
chunker = pd.read_csv('ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x1e6ff1e1e48>
The TextFileReader object returned by read_csv lets you iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column, like so:
chunker = pd.read_csv('ex6.csv', chunksize=1000)
tot = Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)  # the original text used tot.order(ascending=False); order() no longer exists in newer pandas
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
TextFileReader also has a get_chunk method that lets you read pieces of an arbitrary size.
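As a quick, hedged illustration of get_chunk, reusing ex6.csv from above:

```python
import pandas as pd

# iterator=True also returns a TextFileReader, without fixing a chunk size up front
reader = pd.read_csv('ex6.csv', iterator=True)
first_50 = reader.get_chunk(50)  # read the next 50 rows as a DataFrame
next_10 = reader.get_chunk(10)   # then the following 10 rows
```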
Data can also be exported to a delimited format. Let's consider one of the CSV files read above. Using DataFrame's to_csv method, we can write the data out to a comma-separated file:
data = pd.read_csv('ex5.csv')
data.to_csv('out.csv')
!type out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
# Other delimiters can be used as well:
data.to_csv('out.csv', sep='|')
!type out.csv
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
# Writing to sys.stdout prints the text result to the console:
import sys
data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
# Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:
data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
# With no other options specified, both the row and column labels are written. Both can be disabled:
data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
# You can also write only a subset of the columns, and in an order of your choosing:
data.to_csv(sys.stdout, index=False, columns=['a','b','c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
# Series also has a to_csv method:
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7), index=dates)
ts.to_csv('tseries.csv')
!type tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
With a bit of wrangling (no header row, first column as index), you can read a CSV version of a Series with read_csv, but there is also a from_csv method that makes it slightly more convenient:
Series.from_csv('tseries.csv',parse_dates=True)
D:\Anaconda3\lib\site-packages\pandas\core\series.py:2890: FutureWarning: from_csv is deprecated. Please use read_csv(...) instead. Note that some of the default arguments are different, so please refer to the documentation for from_csv when changing your function calls
  infer_datetime_format=infer_datetime_format)
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64
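Since from_csv is deprecated, as the warning above notes, a roughly equivalent read_csv call might look like the following sketch (the result's dtype may differ slightly from what from_csv returned):

```python
import pandas as pd

# Read the headerless file, use the first column as the index, parse it as dates,
# and take the single remaining column as a Series.
ts = pd.read_csv('tseries.csv', header=None, index_col=0, parse_dates=True).iloc[:, 0]
```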
Most forms of tabular data stored on disk can be loaded with functions like pandas.read_table. In some cases, however, some manual processing may be necessary. It's not uncommon to receive a file with one or more malformed lines that trip up read_table. To illustrate the basic tools, consider this small CSV file:
!type ex7.csv
"a","b","c" "1","2","3" "1","2","3","4"
For any file with a single-character delimiter, you can use Python's built-in csv module. To use it, pass any open file or file-like object to csv.reader:
import csv

f = open('ex7.csv')
reader = csv.reader(f)
Iterating through the reader yields a list of values for each line, with all quote characters removed:
for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']
From there, it's up to you to do the wrangling necessary to put the data in the form you need. For example:
lines = list(csv.reader(open('ex7.csv')))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
CSV files come in many different flavors. Defining a new format with a different delimiter, string quoting convention, or line terminator is done by defining a simple subclass of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = 0  # 0 == csv.QUOTE_MINIMAL

reader = csv.reader(f, dialect=my_dialect)
Individual CSV dialect parameters can also be given as keyword arguments to csv.reader without having to define a subclass:
reader = csv.reader(f,delimiter='|')
The possible options (attributes of csv.Dialect) and what they do are summarized in Table 6-3; a short sketch using some of them follows the csv.writer example below.
Argument | Description |
---|---|
delimiter | One-character string used to separate fields; defaults to ',' |
lineterminator | Line terminator used for writing; defaults to '\r\n'. The reader ignores this and recognizes cross-platform line terminators |
quotechar | Quote character for fields containing special characters (such as the delimiter); default is '"' |
quoting | Quoting convention. Options include csv.QUOTE_ALL (quote all fields), csv.QUOTE_MINIMAL (only fields containing special characters such as the delimiter), csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting) |
skipinitialspace | Ignore whitespace after each delimiter; default False |
doublequote | How to handle the quoting character inside a field; if True, it is doubled |
escapechar | String used to escape the delimiter if quoting is set to csv.QUOTE_NONE; disabled by default |
To write delimited files manually, you can use csv.writer. It accepts an open, writable file object and the same dialect and format options as csv.reader:
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
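To see how the quoting options in Table 6-3 behave when writing, here is a small hedged sketch; the file name my_quote_demo.csv and the sample rows are made up for this example:

```python
import csv

rows = [['label', 'value'], ['a;b', 1], ['plain', 2]]

with open('my_quote_demo.csv', 'w', newline='') as f:
    writer = csv.writer(
        f,
        delimiter=';',
        quotechar='"',
        quoting=csv.QUOTE_NONNUMERIC,  # quote every field that is not a number
    )
    for row in rows:
        writer.writerow(row)

# Resulting file contents:
# "label";"value"
# "a;b";1
# "plain";2
```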
JSON (JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV. Here is an example:
obj = """ {"name":"Wes", "places_lived":["United States","Spain","Germany"], "pet":null, "siblings":[{"name":"Scott","age":25,"pet":"Zuko"}, {"name":"Katie","age":33,"pet":"Cosco"}] } """
JSON is very nearly valid Python code, apart from its null value and a few other nuances (such as disallowing trailing commas at the ends of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and null. All of the keys in an object must be strings. Many Python libraries can read and write JSON data. I'll use json, since it is built into the Python standard library. To convert a JSON string to Python form, use json.loads:
import json

result = json.loads(obj)
result
{'name': 'Wes', 'pet': None, 'places_lived': ['United States', 'Spain', 'Germany'], 'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'}, {'age': 33, 'name': 'Katie', 'pet': 'Cosco'}]}
# json.dumps, on the other hand, converts a Python object back to JSON:
asjson = json.dumps(result)
How you convert a JSON object or list of objects to a DataFrame or some other data structure for analysis is up to you. Conveniently, you can pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields:
siblings = DataFrame(result['siblings'],columns=['name','age'])
siblings
name | age | |
---|---|---|
0 | Scott | 25 |
1 | Katie | 33 |
https://finance.yahoo.com/quote/AAPL/options?ltr=1
Python has many libraries for reading and writing data in HTML and XML formats. lxml is one that can parse large files efficiently and reliably. Take the Yahoo page linked above as an example:
from lxml.html import parse
from urllib.request import urlopen

parsed = parse(urlopen('https://finance.yahoo.com/quote/AAPL/options?ltr=1&guccounter=1.html'))
doc = parsed.getroot()
Using this document root object doc, you can extract all HTML tags of a particular type, such as the table tags containing the data of interest. As a simple example, suppose you wanted to get a list of every URL linked to in the document; links are a tags in HTML. Use the document root's findall method along with an XPath (a means of expressing "queries" on the document):
links = doc.findall('.//a')
print('Find out links :', len(links))
links[15:20]
Find out links : 127
[<Element a at 0x1e680279548>, <Element a at 0x1e680279598>, <Element a at 0x1e6802795e8>, <Element a at 0x1e680279638>, <Element a at 0x1e680279688>]
But these are objects representing HTML elements; to get the URL and link text you have to use each element's get method (for the URL) and text_content method (for the display text):
lnk = links[28]
lnk.get('href')
'/quote/AAPL/options?strike=187.5&straddle=false'
lnk.text_content()
'187.50'
Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[:]
['https://finance.yahoo.com/', 'https://yahoo.uservoice.com/forums/439018', 'https://mail.yahoo.com/?.intl=us&.lang=en-US&.partner=none&.src=finance', '/quote/AAPL?p=AAPL', '/quote/AAPL/key-statistics?p=AAPL', '/quote/AAPL/history?p=AAPL', '/quote/AAPL/profile?p=AAPL', '/quote/AAPL/financials?p=AAPL', '/quote/AAPL/analysis?p=AAPL', '/quote/AAPL/options?p=AAPL', '/quote/AAPL/holders?p=AAPL', '/quote/AAPL/sustainability?p=AAPL', '/quote/AAPL/options?ltr=1&guccounter=1.html&straddle=true', '/quote/AAPL190524C00167500?p=AAPL190524C00167500', '/quote/AAPL/options?strike=167.5&straddle=false', '/quote/AAPL190524C00172500?p=AAPL190524C00172500', '/quote/AAPL/options?strike=172.5&straddle=false', '/quote/AAPL190524C00175000?p=AAPL190524C00175000', '/quote/AAPL/options?strike=175&straddle=false', '/quote/AAPL190524C00177500?p=AAPL190524C00177500', '/quote/AAPL/options?strike=177.5&straddle=false', '/quote/AAPL190524C00180000?p=AAPL190524C00180000', '/quote/AAPL/options?strike=180&straddle=false', '/quote/AAPL190524C00182500?p=AAPL190524C00182500', '/quote/AAPL/options?strike=182.5&straddle=false', '/quote/AAPL190524C00185000?p=AAPL190524C00185000', '/quote/AAPL/options?strike=185&straddle=false', '/quote/AAPL190524C00187500?p=AAPL190524C00187500', '/quote/AAPL/options?strike=187.5&straddle=false', '/quote/AAPL190524C00190000?p=AAPL190524C00190000', '/quote/AAPL/options?strike=190&straddle=false', '/quote/AAPL190524C00192500?p=AAPL190524C00192500', '/quote/AAPL/options?strike=192.5&straddle=false', '/quote/AAPL190524C00195000?p=AAPL190524C00195000', '/quote/AAPL/options?strike=195&straddle=false', '/quote/AAPL190524C00197500?p=AAPL190524C00197500', '/quote/AAPL/options?strike=197.5&straddle=false', '/quote/AAPL190524C00200000?p=AAPL190524C00200000', '/quote/AAPL/options?strike=200&straddle=false', '/quote/AAPL190524C00202500?p=AAPL190524C00202500', '/quote/AAPL/options?strike=202.5&straddle=false', '/quote/AAPL190524C00205000?p=AAPL190524C00205000', '/quote/AAPL/options?strike=205&straddle=false', '/quote/AAPL190524C00207500?p=AAPL190524C00207500', '/quote/AAPL/options?strike=207.5&straddle=false', '/quote/AAPL190524C00210000?p=AAPL190524C00210000', '/quote/AAPL/options?strike=210&straddle=false', '/quote/AAPL190524C00212500?p=AAPL190524C00212500', '/quote/AAPL/options?strike=212.5&straddle=false', '/quote/AAPL190524C00215000?p=AAPL190524C00215000', '/quote/AAPL/options?strike=215&straddle=false', '/quote/AAPL190524C00217500?p=AAPL190524C00217500', '/quote/AAPL/options?strike=217.5&straddle=false', '/quote/AAPL190524C00220000?p=AAPL190524C00220000', '/quote/AAPL/options?strike=220&straddle=false', '/quote/AAPL190524C00222500?p=AAPL190524C00222500', '/quote/AAPL/options?strike=222.5&straddle=false', '/quote/AAPL190524C00225000?p=AAPL190524C00225000', '/quote/AAPL/options?strike=225&straddle=false', '/quote/AAPL190524C00230000?p=AAPL190524C00230000', '/quote/AAPL/options?strike=230&straddle=false', '/quote/AAPL190524C00232500?p=AAPL190524C00232500', '/quote/AAPL/options?strike=232.5&straddle=false', '/quote/AAPL190524C00235000?p=AAPL190524C00235000', '/quote/AAPL/options?strike=235&straddle=false', '/quote/AAPL190524C00237500?p=AAPL190524C00237500', '/quote/AAPL/options?strike=237.5&straddle=false', '/quote/AAPL190524C00242500?p=AAPL190524C00242500', '/quote/AAPL/options?strike=242.5&straddle=false', '/quote/AAPL190524P00150000?p=AAPL190524P00150000', '/quote/AAPL/options?strike=150&straddle=false', '/quote/AAPL190524P00160000?p=AAPL190524P00160000', '/quote/AAPL/options?strike=160&straddle=false', 
'/quote/AAPL190524P00165000?p=AAPL190524P00165000', '/quote/AAPL/options?strike=165&straddle=false', '/quote/AAPL190524P00172500?p=AAPL190524P00172500', '/quote/AAPL/options?strike=172.5&straddle=false', '/quote/AAPL190524P00175000?p=AAPL190524P00175000', '/quote/AAPL/options?strike=175&straddle=false', '/quote/AAPL190524P00177500?p=AAPL190524P00177500', '/quote/AAPL/options?strike=177.5&straddle=false', '/quote/AAPL190524P00180000?p=AAPL190524P00180000', '/quote/AAPL/options?strike=180&straddle=false', '/quote/AAPL190524P00182500?p=AAPL190524P00182500', '/quote/AAPL/options?strike=182.5&straddle=false', '/quote/AAPL190524P00185000?p=AAPL190524P00185000', '/quote/AAPL/options?strike=185&straddle=false', '/quote/AAPL190524P00187500?p=AAPL190524P00187500', '/quote/AAPL/options?strike=187.5&straddle=false', '/quote/AAPL190524P00190000?p=AAPL190524P00190000', '/quote/AAPL/options?strike=190&straddle=false', '/quote/AAPL190524P00192500?p=AAPL190524P00192500', '/quote/AAPL/options?strike=192.5&straddle=false', '/quote/AAPL190524P00195000?p=AAPL190524P00195000', '/quote/AAPL/options?strike=195&straddle=false', '/quote/AAPL190524P00197500?p=AAPL190524P00197500', '/quote/AAPL/options?strike=197.5&straddle=false', '/quote/AAPL190524P00200000?p=AAPL190524P00200000', '/quote/AAPL/options?strike=200&straddle=false', '/quote/AAPL190524P00202500?p=AAPL190524P00202500', '/quote/AAPL/options?strike=202.5&straddle=false', '/quote/AAPL190524P00227500?p=AAPL190524P00227500', '/quote/AAPL/options?strike=227.5&straddle=false', '/quote/AAPL190524P00230000?p=AAPL190524P00230000', '/quote/AAPL/options?strike=230&straddle=false', '/quote/AAPL190524P00250000?p=AAPL190524P00250000', '/quote/AAPL/options?strike=250&straddle=false', 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US&id=SLN2310', 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US', 'https://yahoo.uservoice.com/forums/382977', 'http://info.yahoo.com/privacy/us/yahoo/', 'http://info.yahoo.com/relevantads/', 'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html', 'https://finance.yahoo.com/sitemap/', 'http://twitter.com/YahooFinance', 'http://facebook.com/yahoofinance', 'http://yahoofinance.tumblr.com', '/', '/watchlists', '/portfolios', '/screener', '/calendar', '/industries', '/videos/', '/news/', '/personal-finance', '/tech']
Now, finding the right tables in the document can be a matter of trial and error. Some websites make it easier by giving the table of interest an id attribute. I determined that these were the two tables containing the call data and the put data, respectively:
tables = doc.findall('.//table')
tables  # only two table elements here
[<Element table at 0x1e68027ae58>, <Element table at 0x1e68027a9a8>]
calls = tables[0]  # AAPL calls data
puts = tables[1]   # AAPL puts data
rows = calls.findall('.//tr')  # each table has a header row followed by the data rows
len(rows)
29
For the header row as well as the data rows, we want to extract the text from each cell; in the case of the header these are th cells, while for the data rows they are td cells:
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]
Thus, we get:
_unpack(rows[0],kind='th')
['Contract Name', 'Last Trade Date', 'Strike', 'Last Price', 'Bid', 'Ask', 'Change', '% Change', 'Volume', 'Open Interest', 'Implied Volatility']
_unpack(rows[1],kind='td')
['AAPL190524C00167500', '2019-05-14 3:49PM EDT', '167.50', '21.90', '20.65', '21.90', '0.00', '-', '5', '5', '57.52%']
Now, to combine all of these steps and convert the data into a DataFrame. Since the numerical data is still in string format, we want to convert some, but perhaps not all, of the columns to floating-point format. You could do this by hand, but, luckily, pandas has a TextParser class that can do the automatic type conversion:
from pandas.io.parsers import TextParser

def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()
Finally, I invoke this parsing function on the two lxml table objects and get the final DataFrames:
call_data = parse_options_data(calls)
put_data = parse_options_data(puts)
call_data[:10]
Contract Name | Last Trade Date | Strike | Last Price | Bid | Ask | Change | % Change | Volume | Open Interest | Implied Volatility | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAPL190524C00167500 | 2019-05-14 3:49PM EDT | 167.5 | 21.90 | 20.65 | 21.90 | 0.00 | - | 5 | 5 | 57.52% |
1 | AAPL190524C00172500 | 2019-05-17 3:50PM EDT | 172.5 | 16.60 | 16.60 | 17.00 | -2.35 | -12.40% | 204 | 142 | 48.88% |
2 | AAPL190524C00175000 | 2019-05-17 3:44PM EDT | 175.0 | 14.55 | 14.20 | 14.60 | -1.30 | -8.20% | 138 | 99 | 45.17% |
3 | AAPL190524C00177500 | 2019-05-17 3:50PM EDT | 177.5 | 11.90 | 11.85 | 12.50 | -1.80 | -13.14% | 120 | 234 | 46.00% |
4 | AAPL190524C00180000 | 2019-05-17 3:51PM EDT | 180.0 | 9.70 | 9.65 | 10.00 | -0.60 | -5.83% | 1,550 | 411 | 39.09% |
5 | AAPL190524C00182500 | 2019-05-17 3:39PM EDT | 182.5 | 7.68 | 7.50 | 7.85 | -0.47 | -5.77% | 272 | 406 | 36.40% |
6 | AAPL190524C00185000 | 2019-05-17 3:59PM EDT | 185.0 | 5.90 | 5.85 | 6.00 | -0.50 | -7.81% | 3,184 | 1,001 | 35.40% |
7 | AAPL190524C00187500 | 2019-05-17 3:59PM EDT | 187.5 | 4.05 | 4.00 | 4.10 | -0.59 | -12.72% | 6,159 | 1,645 | 31.69% |
8 | AAPL190524C00190000 | 2019-05-17 3:59PM EDT | 190.0 | 2.69 | 2.59 | 2.67 | -0.36 | -11.80% | 19,710 | 4,230 | 30.03% |
9 | AAPL190524C00192500 | 2019-05-17 3:59PM EDT | 192.5 | 1.60 | 1.56 | 1.61 | -0.34 | -17.53% | 11,506 | 4,427 | 28.91% |
XML (Extensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata. The files used here actually come from a large XML document.
The New York Metropolitan Transportation Authority (MTA) publishes a number of data series about its bus and train services (http://www.mta.info/developers/download.html). Here we'll look at the performance data contained in a set of XML files. Each train or bus service has its own file (such as Performance_MNR.xml for the Metro-North Railroad), in which each XML record is one month of data. Note that the URL and file names mentioned above have changed somewhat since the original text was written; what is shown below is actually the content of ./Performance_MNRR.xml, and the rest of this walkthrough is based on that file.
<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>28345</INDICATOR_SEQ>
    <PARENT_SEQ>55526</PARENT_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <INDICATOR_NAME>Hudson Line - OTP</INDICATOR_NAME>
    <DESCRIPTION>Percent of commuter trains that arrive at their destinations within 5 minutes and 59 seconds of the scheduled time.</DESCRIPTION>
    <CATEGORY>Service Indicators</CATEGORY>
    <FREQUENCY>M</FREQUENCY>
    <DESIRED_CHANGE>U</DESIRED_CHANGE>
    <INDICATOR_UNIT>%</INDICATOR_UNIT>
    <DECIMAL_PLACES>1</DECIMAL_PLACES>
    <YEAR>
      <PERIOD_YEAR>2008</PERIOD_YEAR>
      <MONTH>
        <PERIOD_MONTH>1</PERIOD_MONTH>
        <MONTHLYVALUES>
          <YTD_TARGET>98.00</YTD_TARGET>
Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:
from lxml import objectify
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

path = "Performance_MNRR.xml"
parsed = objectify.parse(open(path))
root = parsed.getroot()
root.INDICATOR returns a generator yielding each INDICATOR XML element. For each record, we can populate a dict of tag names to data values (excluding a few tags):
data = []
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.text
    data.append(el_data)
Lastly, convert this list of dicts into a DataFrame:
perf = DataFrame(data)
perf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 7 columns):
AGENCY_NAME       13 non-null object
CATEGORY          13 non-null object
DESCRIPTION       13 non-null object
FREQUENCY         13 non-null object
INDICATOR_NAME    13 non-null object
INDICATOR_UNIT    13 non-null object
YEAR              0 non-null object
dtypes: object(7)
memory usage: 808.0+ bytes
XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:
from io import StringIO

tag = '<a href="http://www.baidu.com">BaiDu</a>'
root = objectify.parse(StringIO(tag)).getroot()
You can now access any of the fields in the tag (like href) or the link text:
root
<Element a at 0x137f2861e88>
root.get('href')
'http://www.baidu.com'
root.text
'BaiDu'