Python for Data Analysis: Study Notes (2)

Input and output typically fall into a few main categories:

  • Reading text files
  • Other efficient on-disk storage formats
  • Loading data from databases
  • Interacting with network resources via web APIs

Reading Data in Text Format

Function        Description
read_csv        Load delimited data from a file, URL, or file-like object; the default delimiter is a comma
read_table      Load delimited data from a file, URL, or file-like object; the default delimiter is the tab character ('\t')
read_fwf        Read data in fixed-width column format (that is, with no delimiters)
read_clipboard  Read data from the clipboard; useful when converting tables from web pages
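
As a quick illustration of the last two functions, here is a minimal sketch; the file fixed_width.txt, its column widths, and the column names are invented for this example, and read_clipboard is shown commented out since it needs a table on the clipboard:

import pandas as pd

# hypothetical fixed-width file with three columns of 8, 10, and 6 characters
df_fwf = pd.read_fwf('fixed_width.txt', widths=[8, 10, 6],
                     names=['id', 'name', 'score'])
# df_clip = pd.read_clipboard()  # parses a table copied to the clipboard, e.g. from a web page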

Let's start with a small CSV file, ex1.csv, and read it into a DataFrame:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame


df = pd.read_csv('ex1.csv')
df
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

We could also have used read_table and specified the delimiter:

pd.read_table('ex1.csv', sep=',')
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

A file will not always have a header row, though. Consider this one:

!type ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

There are two options for reading this file: you can let pandas assign default column names, or you can specify the names yourself:

pd.read_csv('ex2.csv', header=None)
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
pd.read_csv('ex2.csv', names=['a','b','c','d','message'])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Suppose you wanted the message column to be the index of the returned DataFrame. You can either indicate that you want the column at index 4, or pass the name 'message', using the index_col argument:

names=['a','b','c','d','message']
pd.read_csv('ex2.csv', names=names, index_col='message')
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12
!type csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
parsed = pd.read_csv('csv_mindex.csv',index_col=['key1','key2'])
parsed
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

In some cases, a table might not have a fixed delimiter between fields. If so, you can pass a regular expression as the delimiter for read_table. In the following file, the fields are separated by a variable number of whitespace characters, which can be expressed by the regular expression \s+.

list(open('ex3.txt'))
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

The result:

result = pd.read_table('ex3.txt', sep='\s+')

Because there is one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame's index in this special case.

result
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

In this next example, we skip the first, third, and fourth rows of the file:

!type ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv('ex4.csv', skiprows=[0,2,3])
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Handling Missing Values

Handling missing values is an important part of the file parsing process. Missing data is usually either not present (an empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:

!type ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
result = pd.read_csv('ex5.csv')
result
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo
pd.isnull(result)
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

The na_values option can take a list or set of strings to consider as missing values:

result = pd.read_csv('ex5.csv', na_values=['NULL'])
result
something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

Different NA sentinels can be specified for each column by passing a dict:

sentinels = {'message':['foo','NA'],'something':['two']}
pd.read_csv('ex5.csv', na_values=sentinels)
something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN
Argument       Description
parse_dates    Attempt to parse the data as dates; False by default. If True, attempt to parse all columns. Otherwise, specify a list of column numbers or names to parse. If an element of the list is itself a list or tuple, those columns are combined and then parsed as a date (useful, for example, when the date and time are split across two columns)
keep_date_col  If columns are joined to parse a date, keep the original joined columns; False by default
converters     Dict mapping column numbers or names to functions. For example, {'foo': f} applies the function f to every value in the 'foo' column
dayfirst       When parsing potentially ambiguous dates, treat them as international format (e.g., 7/6/2012 -> June 7, 2012); False by default
date_parser    Function to use to parse dates
nrows          Number of rows to read from the beginning of the file
iterator       Return a TextFileReader object for reading the file piecemeal
chunksize      For iteration, the size of the file chunks
skip_footer    Number of lines to ignore at the end of the file
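
A hedged sketch combining a few of these options; the file ex_dates.csv and its date, time, and value columns are hypothetical:

# parse_dates given as a dict combines the 'date' and 'time' columns into a single
# 'timestamp' column; keep_date_col keeps the original columns; converters applies
# float() to every value in the 'value' column
parsed_dates = pd.read_csv('ex_dates.csv',
                           parse_dates={'timestamp': ['date', 'time']},
                           keep_date_col=True,
                           converters={'value': float})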

Reading Text Files in Pieces

When processing very large files, you may want to read in only a small piece of a file or iterate through the file in smaller chunks.

result = pd.read_csv('ex6.csv')
result
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
5 1.817480 0.742273 0.419395 -2.251035 Q
6 -0.776764 0.935518 -0.332872 -1.875641 U
7 -0.913135 1.530624 -0.572657 0.477252 K
8 0.358480 -0.497572 -0.367016 0.507702 S
9 -1.740877 -1.160417 -1.637830 2.172201 G
10 0.240564 -0.328249 1.252155 1.072796 8
11 0.764018 1.165476 -0.639544 1.495258 R
12 0.571035 -0.310537 0.582437 -0.298765 1
13 2.317658 0.430710 -1.334216 0.199679 P
14 1.547771 -1.119753 -2.277634 0.329586 J
15 -1.310608 0.401719 -1.000987 1.156708 E
16 -0.088496 0.634712 0.153324 0.415335 B
17 -0.018663 -0.247487 -1.446522 0.750938 A
18 -0.070127 -1.579097 0.120892 0.671432 F
19 -0.194678 -0.492039 2.359605 0.319810 H
20 -0.248618 0.868707 -0.492226 -0.717959 W
21 -1.091549 -0.867110 -0.647760 -0.832562 C
22 0.641404 -0.138822 -0.621963 -0.284839 C
23 1.216408 0.992687 0.165162 -0.069619 V
24 -0.564474 0.792832 0.747053 0.571675 I
25 1.759879 -0.515666 -0.230481 1.362317 S
26 0.126266 0.309281 0.382820 -0.239199 L
27 1.334360 -0.100152 -0.840731 -0.643967 6
28 -0.737620 0.278087 -0.053235 -0.950972 J
29 -1.148486 -0.986292 -0.144963 0.124362 Y
... ... ... ... ... ...
9970 0.633495 -0.186524 0.927627 0.143164 4
9971 0.308636 -0.112857 0.762842 -1.072977 1
9972 -1.627051 -0.978151 0.154745 -1.229037 Z
9973 0.314847 0.097989 0.199608 0.955193 P
9974 1.666907 0.992005 0.496128 -0.686391 S
9975 0.010603 0.708540 -1.258711 0.226541 K
9976 0.118693 -0.714455 -0.501342 -0.254764 K
9977 0.302616 -2.011527 -0.628085 0.768827 H
9978 -0.098572 1.769086 -0.215027 -0.053076 A
9979 -0.019058 1.964994 0.738538 -0.883776 F
9980 -0.595349 0.001781 -1.423355 -1.458477 M
9981 1.392170 -1.396560 -1.425306 -0.847535 H
9982 -0.896029 -0.152287 1.924483 0.365184 6
9983 -2.274642 -0.901874 1.500352 0.996541 N
9984 -0.301898 1.019906 1.102160 2.624526 I
9985 -2.548389 -0.585374 1.496201 -0.718815 D
9986 -0.064588 0.759292 -1.568415 -0.420933 E
9987 -0.143365 -1.111760 -1.815581 0.435274 2
9988 -0.070412 -1.055921 0.338017 -0.440763 X
9989 0.649148 0.994273 -1.384227 0.485120 Q
9990 -0.370769 0.404356 -1.051628 -1.050899 8
9991 -0.409980 0.155627 -0.818990 1.277350 W
9992 0.301214 -1.111203 0.668258 0.671922 A
9993 1.821117 0.416445 0.173874 0.505118 X
9994 0.068804 1.322759 0.802346 0.223618 H
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0

10000 rows × 5 columns

If you want to read only a small number of rows, specify that with nrows:

pd.read_csv('ex6.csv', nrows=5)
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

To read a file in pieces, specify a chunksize as a number of rows:

chunker = pd.read_csv('ex6.csv', chunksize=1000)
chunker
<pandas.io.parsers.TextFileReader at 0x1e6ff1e1e48>

The TextFileReader object returned by read_csv lets you iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column, as follows:

chunker = pd.read_csv('ex6.csv', chunksize=1000)
tot = Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(),fill_value=0)
tot = tot.sort_values(ascending=False)  # the book uses tot.order(ascending=False); order() no longer exists in newer pandas
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

TextFileReader also has a get_chunk method that lets you read pieces of an arbitrary size.
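
For instance, a quick sketch using ex6.csv again:

chunker = pd.read_csv('ex6.csv', iterator=True)
piece = chunker.get_chunk(10)  # read only the next 10 rows as a DataFrame
len(piece)
10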

Writing Data to Text Format

Data can also be exported to a delimited format. Let's look again at one of the CSV files read above:

data = pd.read_csv('ex5.csv')
data.to_csv('out.csv')
!type out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
#other delimiters can be used as well
data.to_csv('out.csv',sep='|')
!type out.csv
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
#write directly to the console (sys.stdout)
import sys
data.to_csv(sys.stdout,sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
#missing values appear as empty strings in the output; you may want to denote them with some other sentinel:
data.to_csv(sys.stdout,na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
#with no other options specified, both the row and column labels are written; both can be disabled:
data.to_csv(sys.stdout, index=False,header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
#you can also write only a subset of the columns, in an order of your choosing:
data.to_csv(sys.stdout, index=False, columns=['a','b','c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
#Series also has a to_csv method:
dates = pd.date_range('1/1/2000', periods=7)
ts = Series(np.arange(7),index=dates)
ts.to_csv('tseries.csv')
!type tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

Although it only takes a bit of wrangling (no header row, first column as the index) to read a CSV version of a Series with read_csv, there is also a from_csv convenience method; as the warning below shows, it is deprecated in newer pandas:

Series.from_csv('tseries.csv',parse_dates=True)
D:\Anaconda3\lib\site-packages\pandas\core\series.py:2890: FutureWarning: from_csv is deprecated. Please use read_csv(...) instead. Note that some of the default arguments are different, so please refer to the documentation for from_csv when changing your function calls
  infer_datetime_format=infer_datetime_format)

2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64
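
Since from_csv is deprecated, a roughly equivalent read_csv call would be the following sketch (squeeze=True turns the single data column into a Series):

pd.read_csv('tseries.csv', header=None, index_col=0, parse_dates=True, squeeze=True)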

Working with Delimited Formats Manually

Most forms of tabular data stored on disk can be loaded with functions like pandas.read_table. In some cases, however, some manual processing may be necessary. It's not uncommon to receive a file with one or more malformed lines that trip up read_table. To illustrate the basic tools, consider this small CSV file:

!type ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"

For any file with a single-character delimiter, you can use Python's built-in csv module. Pass any open file or file-like object to csv.reader:

import csv
f = open('ex7.csv')

reader = csv.reader(f)

Iterating over the reader yields a list of values for each line, with any quote characters removed:

for line in reader:
    print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']

From there, it's up to you to do whatever wrangling is necessary to put the data in the form you need:

lines =list(csv.reader(open('ex7.csv')))
header, values = lines[0],lines[1:]
data_dict = {h:v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
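
If a DataFrame is what you are after, that dict of column tuples can be passed straight to the DataFrame constructor:

DataFrame(data_dict)
   a  b  c
0  1  2  3
1  1  2  3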

CSV files come in many different flavors. Defining a new format with its own delimiter, string quoting convention, line terminator, and so on is done by defining a simple subclass of csv.Dialect:

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
    
reader = csv.reader(f, dialect=my_dialect)

Individual CSV dialect parameters can also be given as keyword arguments to csv.reader without having to define a subclass:

reader = csv.reader(f,delimiter='|')

The available csv.Dialect options and what they do are listed in Table 6-3.

Table 6-3. CSV dialect options

Argument          Description
delimiter         One-character string used to separate fields; defaults to ','
lineterminator    Line terminator for writing; defaults to '\r\n'. The reader ignores this and recognizes cross-platform line terminators
quotechar         Quote character for fields with special characters (such as a delimiter); default is '"'
quoting           Quoting convention. Options include csv.QUOTE_ALL (quote all fields), csv.QUOTE_MINIMAL (only quote fields with special characters such as the delimiter), csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting)
skipinitialspace  Ignore whitespace after each delimiter; defaults to False
doublequote       How to handle the quoting character inside a field; if True, it is doubled
escapechar        String used to escape the delimiter if quoting is set to csv.QUOTE_NONE; disabled by default

To write delimited files manually, you can use csv.writer. It accepts an open, writable file object along with the same dialect and format options as csv.reader:

with open('mydata.csv','w') as f:
    writer = csv.writer(f,dialect = my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('4','5','6'))
    writer.writerow(('7','8','9'))
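
To read the file back with the same conventions, the dialect can be reused; a quick check of the file written above:

with open('mydata.csv') as f:
    for row in csv.reader(f, dialect=my_dialect):
        print(row)
['one', 'two', 'three']
['1', '2', '3']
['4', '5', '6']
['7', '8', '9']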

JSON Data

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV. Here is an example:

obj = """
{"name":"Wes",
"places_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
                {"name":"Katie","age":33,"pet":"Cosco"}]
}
"""

JSON is very nearly valid Python code, the main exceptions being its null value and a few other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data. I'll use json here, as it is built into the Python standard library. To convert a JSON string to Python form, use json.loads:

import json
result = json.loads(obj)
result
{'name': 'Wes',
 'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 25, 'name': 'Scott', 'pet': 'Zuko'},
  {'age': 33, 'name': 'Katie', 'pet': 'Cosco'}]}
#json.dumps, on the other hand, converts a Python object back to JSON:
asjson = json.dumps(result)

How you convert a JSON object or list of objects to a DataFrame or some other data structure for analysis is up to you. Conveniently, you can pass a list of JSON objects (dicts) to the DataFrame constructor and select a subset of the data fields:

siblings = DataFrame(result['siblings'],columns=['name','age'])
siblings
name age
0 Scott 25
1 Katie 33

https://finance.yahoo.com/quote/AAPL/options?ltr=1

XML and HTML: Web Scraping

Python has many libraries for reading and writing data in the HTML and XML formats. lxml is one of them, and it parses large files efficiently and reliably. Let's apply it to the Yahoo! Finance options page linked above:

from lxml.html import parse
from urllib.request import urlopen

parsed = parse(urlopen('https://finance.yahoo.com/quote/AAPL/options?ltr=1&guccounter=1.html'))

doc = parsed.getroot()

Using this doc object, we can extract all HTML tags of a particular type, such as the table tags containing the data of interest. Say we wanted to get a list of every URL linked to in the document; links are a tags in HTML. We use the document root's findall method along with an XPath (a means of expressing a "query" against the document):

links = doc.findall('.//a')
print('Number of links:', len(links))
links[15:20]
Number of links: 127

[<Element a at 0x1e680279548>,
 <Element a at 0x1e680279598>,
 <Element a at 0x1e6802795e8>,
 <Element a at 0x1e680279638>,
 <Element a at 0x1e680279688>]

But these are objects representing HTML elements; to get the URL and link text you have to use each element's get method (for the URL) and text_content method (for the display text):

lnk = links[28]
lnk.get('href')
'/quote/AAPL/options?strike=187.5&straddle=false'
lnk.text_content()
'187.50'

Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:

urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[:]
['https://finance.yahoo.com/',
 'https://yahoo.uservoice.com/forums/439018',
 'https://mail.yahoo.com/?.intl=us&.lang=en-US&.partner=none&.src=finance',
 '/quote/AAPL?p=AAPL',
 '/quote/AAPL/key-statistics?p=AAPL',
 '/quote/AAPL/history?p=AAPL',
 '/quote/AAPL/profile?p=AAPL',
 '/quote/AAPL/financials?p=AAPL',
 '/quote/AAPL/analysis?p=AAPL',
 '/quote/AAPL/options?p=AAPL',
 '/quote/AAPL/holders?p=AAPL',
 '/quote/AAPL/sustainability?p=AAPL',
 '/quote/AAPL/options?ltr=1&guccounter=1.html&straddle=true',
 '/quote/AAPL190524C00167500?p=AAPL190524C00167500',
 '/quote/AAPL/options?strike=167.5&straddle=false',
 '/quote/AAPL190524C00172500?p=AAPL190524C00172500',
 '/quote/AAPL/options?strike=172.5&straddle=false',
 '/quote/AAPL190524C00175000?p=AAPL190524C00175000',
 '/quote/AAPL/options?strike=175&straddle=false',
 '/quote/AAPL190524C00177500?p=AAPL190524C00177500',
 '/quote/AAPL/options?strike=177.5&straddle=false',
 '/quote/AAPL190524C00180000?p=AAPL190524C00180000',
 '/quote/AAPL/options?strike=180&straddle=false',
 '/quote/AAPL190524C00182500?p=AAPL190524C00182500',
 '/quote/AAPL/options?strike=182.5&straddle=false',
 '/quote/AAPL190524C00185000?p=AAPL190524C00185000',
 '/quote/AAPL/options?strike=185&straddle=false',
 '/quote/AAPL190524C00187500?p=AAPL190524C00187500',
 '/quote/AAPL/options?strike=187.5&straddle=false',
 '/quote/AAPL190524C00190000?p=AAPL190524C00190000',
 '/quote/AAPL/options?strike=190&straddle=false',
 '/quote/AAPL190524C00192500?p=AAPL190524C00192500',
 '/quote/AAPL/options?strike=192.5&straddle=false',
 '/quote/AAPL190524C00195000?p=AAPL190524C00195000',
 '/quote/AAPL/options?strike=195&straddle=false',
 '/quote/AAPL190524C00197500?p=AAPL190524C00197500',
 '/quote/AAPL/options?strike=197.5&straddle=false',
 '/quote/AAPL190524C00200000?p=AAPL190524C00200000',
 '/quote/AAPL/options?strike=200&straddle=false',
 '/quote/AAPL190524C00202500?p=AAPL190524C00202500',
 '/quote/AAPL/options?strike=202.5&straddle=false',
 '/quote/AAPL190524C00205000?p=AAPL190524C00205000',
 '/quote/AAPL/options?strike=205&straddle=false',
 '/quote/AAPL190524C00207500?p=AAPL190524C00207500',
 '/quote/AAPL/options?strike=207.5&straddle=false',
 '/quote/AAPL190524C00210000?p=AAPL190524C00210000',
 '/quote/AAPL/options?strike=210&straddle=false',
 '/quote/AAPL190524C00212500?p=AAPL190524C00212500',
 '/quote/AAPL/options?strike=212.5&straddle=false',
 '/quote/AAPL190524C00215000?p=AAPL190524C00215000',
 '/quote/AAPL/options?strike=215&straddle=false',
 '/quote/AAPL190524C00217500?p=AAPL190524C00217500',
 '/quote/AAPL/options?strike=217.5&straddle=false',
 '/quote/AAPL190524C00220000?p=AAPL190524C00220000',
 '/quote/AAPL/options?strike=220&straddle=false',
 '/quote/AAPL190524C00222500?p=AAPL190524C00222500',
 '/quote/AAPL/options?strike=222.5&straddle=false',
 '/quote/AAPL190524C00225000?p=AAPL190524C00225000',
 '/quote/AAPL/options?strike=225&straddle=false',
 '/quote/AAPL190524C00230000?p=AAPL190524C00230000',
 '/quote/AAPL/options?strike=230&straddle=false',
 '/quote/AAPL190524C00232500?p=AAPL190524C00232500',
 '/quote/AAPL/options?strike=232.5&straddle=false',
 '/quote/AAPL190524C00235000?p=AAPL190524C00235000',
 '/quote/AAPL/options?strike=235&straddle=false',
 '/quote/AAPL190524C00237500?p=AAPL190524C00237500',
 '/quote/AAPL/options?strike=237.5&straddle=false',
 '/quote/AAPL190524C00242500?p=AAPL190524C00242500',
 '/quote/AAPL/options?strike=242.5&straddle=false',
 '/quote/AAPL190524P00150000?p=AAPL190524P00150000',
 '/quote/AAPL/options?strike=150&straddle=false',
 '/quote/AAPL190524P00160000?p=AAPL190524P00160000',
 '/quote/AAPL/options?strike=160&straddle=false',
 '/quote/AAPL190524P00165000?p=AAPL190524P00165000',
 '/quote/AAPL/options?strike=165&straddle=false',
 '/quote/AAPL190524P00172500?p=AAPL190524P00172500',
 '/quote/AAPL/options?strike=172.5&straddle=false',
 '/quote/AAPL190524P00175000?p=AAPL190524P00175000',
 '/quote/AAPL/options?strike=175&straddle=false',
 '/quote/AAPL190524P00177500?p=AAPL190524P00177500',
 '/quote/AAPL/options?strike=177.5&straddle=false',
 '/quote/AAPL190524P00180000?p=AAPL190524P00180000',
 '/quote/AAPL/options?strike=180&straddle=false',
 '/quote/AAPL190524P00182500?p=AAPL190524P00182500',
 '/quote/AAPL/options?strike=182.5&straddle=false',
 '/quote/AAPL190524P00185000?p=AAPL190524P00185000',
 '/quote/AAPL/options?strike=185&straddle=false',
 '/quote/AAPL190524P00187500?p=AAPL190524P00187500',
 '/quote/AAPL/options?strike=187.5&straddle=false',
 '/quote/AAPL190524P00190000?p=AAPL190524P00190000',
 '/quote/AAPL/options?strike=190&straddle=false',
 '/quote/AAPL190524P00192500?p=AAPL190524P00192500',
 '/quote/AAPL/options?strike=192.5&straddle=false',
 '/quote/AAPL190524P00195000?p=AAPL190524P00195000',
 '/quote/AAPL/options?strike=195&straddle=false',
 '/quote/AAPL190524P00197500?p=AAPL190524P00197500',
 '/quote/AAPL/options?strike=197.5&straddle=false',
 '/quote/AAPL190524P00200000?p=AAPL190524P00200000',
 '/quote/AAPL/options?strike=200&straddle=false',
 '/quote/AAPL190524P00202500?p=AAPL190524P00202500',
 '/quote/AAPL/options?strike=202.5&straddle=false',
 '/quote/AAPL190524P00227500?p=AAPL190524P00227500',
 '/quote/AAPL/options?strike=227.5&straddle=false',
 '/quote/AAPL190524P00230000?p=AAPL190524P00230000',
 '/quote/AAPL/options?strike=230&straddle=false',
 '/quote/AAPL190524P00250000?p=AAPL190524P00250000',
 '/quote/AAPL/options?strike=250&straddle=false',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US&id=SLN2310',
 'https://help.yahoo.com/kb/index?page=content&y=PROD_FIN_DESK&locale=en_US',
 'https://yahoo.uservoice.com/forums/382977',
 'http://info.yahoo.com/privacy/us/yahoo/',
 'http://info.yahoo.com/relevantads/',
 'http://info.yahoo.com/legal/us/yahoo/utos/utos-173.html',
 'https://finance.yahoo.com/sitemap/',
 'http://twitter.com/YahooFinance',
 'http://facebook.com/yahoofinance',
 'http://yahoofinance.tumblr.com',
 '/',
 '/watchlists',
 '/portfolios',
 '/screener',
 '/calendar',
 '/industries',
 '/videos/',
 '/news/',
 '/personal-finance',
 '/tech']

Now, finding the right tables in the document can be a matter of trial and error; some websites make it easier by giving the table of interest an id attribute. I determined that these are the two tables containing the call data and the put data, respectively:

tables = doc.findall('.//table')
tables #only two table elements in the document
[<Element table at 0x1e68027ae58>, <Element table at 0x1e68027a9a8>]
calls = tables[0] #AAPL calls data
puts = tables[1]  #AAPL puts data
rows = calls.findall('.//tr') #each table has a header row followed by the data rows
len(rows)
29

For the header row as well as the data rows, we want to extract the text from each cell; in the header these are th cells, and in the data rows they are td cells:

def _unpack(row,kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

With that, we get:

_unpack(rows[0],kind='th')
['Contract Name',
 'Last Trade Date',
 'Strike',
 'Last Price',
 'Bid',
 'Ask',
 'Change',
 '% Change',
 'Volume',
 'Open Interest',
 'Implied Volatility']
_unpack(rows[1],kind='td')
['AAPL190524C00167500',
 '2019-05-14 3:49PM EDT',
 '167.50',
 '21.90',
 '20.65',
 '21.90',
 '0.00',
 '-',
 '5',
 '5',
 '57.52%']

Now, to combine all of these steps and turn this data into a DataFrame: since the numerical data is still in string format, we want to convert some, but perhaps not all, of the columns to floating-point format. You could do this by hand, but pandas happens to have a TextParser class that can be used for automatic type conversion:

from pandas.io.parsers import TextParser

def parse_options_data(table):
    rows =table.findall('.//tr')
    header = _unpack(rows[0],kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()

Finally, we invoke this parsing function on the two lxml table objects and obtain the final DataFrames:

call_data = parse_options_data(calls)
put_data = parse_options_data(puts)
call_data[:10]
Contract Name Last Trade Date Strike Last Price Bid Ask Change % Change Volume Open Interest Implied Volatility
0 AAPL190524C00167500 2019-05-14 3:49PM EDT 167.5 21.90 20.65 21.90 0.00 - 5 5 57.52%
1 AAPL190524C00172500 2019-05-17 3:50PM EDT 172.5 16.60 16.60 17.00 -2.35 -12.40% 204 142 48.88%
2 AAPL190524C00175000 2019-05-17 3:44PM EDT 175.0 14.55 14.20 14.60 -1.30 -8.20% 138 99 45.17%
3 AAPL190524C00177500 2019-05-17 3:50PM EDT 177.5 11.90 11.85 12.50 -1.80 -13.14% 120 234 46.00%
4 AAPL190524C00180000 2019-05-17 3:51PM EDT 180.0 9.70 9.65 10.00 -0.60 -5.83% 1,550 411 39.09%
5 AAPL190524C00182500 2019-05-17 3:39PM EDT 182.5 7.68 7.50 7.85 -0.47 -5.77% 272 406 36.40%
6 AAPL190524C00185000 2019-05-17 3:59PM EDT 185.0 5.90 5.85 6.00 -0.50 -7.81% 3,184 1,001 35.40%
7 AAPL190524C00187500 2019-05-17 3:59PM EDT 187.5 4.05 4.00 4.10 -0.59 -12.72% 6,159 1,645 31.69%
8 AAPL190524C00190000 2019-05-17 3:59PM EDT 190.0 2.69 2.59 2.67 -0.36 -11.80% 19,710 4,230 30.03%
9 AAPL190524C00192500 2019-05-17 3:59PM EDT 192.5 1.60 1.56 1.61 -0.34 -17.53% 11,506 4,427 28.91%

XML (eXtensible Markup Language) is another common structured data format that supports hierarchical, nested data with metadata. The files used in the book are actually taken from one large XML document.

The New York Metropolitan Transportation Authority (MTA) publishes a number of data series about its bus and train services (http://www.mta.info/developers/download.html ). Here we'll look at the performance data, which is contained in a set of XML files. Each train or bus service has its own file (for example, Performance_MNR.xml for the Metro-North Railroad), where each XML record is one month of data. Note that the URL and files just mentioned have changed somewhat since the book was written; what follows is actually the content of './Performance_MNRR.xml', and the rest of this walkthrough is based on that file.

<?xml version="1.0" encoding="UTF-8" standalone="true"?>
<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>28345</INDICATOR_SEQ>
    <PARENT_SEQ>55526</PARENT_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <INDICATOR_NAME>Hudson Line - OTP</INDICATOR_NAME>
    <DESCRIPTION>Percent of commuter trains that arrive at their destinations within 5 minutes and 59 seconds of the scheduled time.</DESCRIPTION>
    <CATEGORY>Service Indicators</CATEGORY>
    <FREQUENCY>M</FREQUENCY>
    <DESIRED_CHANGE>U</DESIRED_CHANGE>
    <INDICATOR_UNIT>%</INDICATOR_UNIT>
    <DECIMAL_PLACES>1</DECIMAL_PLACES>
    <YEAR>
      <PERIOD_YEAR>2008</PERIOD_YEAR>
      <MONTH>
        <PERIOD_MONTH>1</PERIOD_MONTH>
        <MONTHLYVALUES>
          <YTD_TARGET>98.00</YTD_TARGET>

We parse the file with lxml.objectify and get a reference to the root node of the XML document with getroot:

from lxml import objectify
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

path="Performance_MNRR.xml"
parsed = objectify.parse(open(path))
root = parsed.getroot()

root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values, excluding a few of the tags:

data = []

skip_fields = ['PARENT_SEQ','INDICATOR_SEQ','DESIRED_CHANGE','DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.text
    data.append(el_data)

Finally, convert this list of dicts into a DataFrame:

perf = DataFrame(data)
perf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 7 columns):
AGENCY_NAME       13 non-null object
CATEGORY          13 non-null object
DESCRIPTION       13 non-null object
FREQUENCY         13 non-null object
INDICATOR_NAME    13 non-null object
INDICATOR_UNIT    13 non-null object
YEAR              0 non-null object
dtypes: object(7)
memory usage: 808.0+ bytes

XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:

from io import StringIO

tag = '<a href="http://www.baidu.com">BaiDu</a>'

root = objectify.parse(StringIO(tag)).getroot()

You can now access any of the fields (like href) in the tag, as well as the link text:

root
<Element a at 0x137f2861e88>
root.get('href')
'http://www.baidu.com'
root.text
'BaiDu'