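1. Reading and writing CSV data
(1) Processing CSV data with the csv library
import csv
with open('./stock.csv', newline='') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)   # the first row holds the column names
    for row in f_csv:
        print(row)          # process row
Each row is a list, so fields have to be accessed by index, e.g. row[0] and row[1].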
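(2) Consider converting each row to a named tuple so that fields can be accessed by name.
import csv
from collections import namedtuple
with open('./stock.csv', newline='') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    Row = namedtuple('Row', headers)   # headers must be valid identifiers
    for r in f_csv:
        row = Row(*r)
        print(row)                     # process row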
(3) Reading rows as dictionaries
import csv
with open('./stock.csv', newline='') as f:
    f_csv = csv.DictReader(f)   # keys come from the header row
    for row in f_csv:
        print(row)              # process row
Writing CSV data:
import csv
headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [
    ('AA', '39.48', '6/11/2007', '9:34am', '-0.18', '428900'),
    ('BB', '48.54', '8/25/2001', '19:57am', '-0.44', '142800'),
    ('CC', '92.13', '3/18/1886', '3:11am', '-0.67', '126700'),
    ('DD', '79.25', '2/05/1999', '8:22am', '-0.27', '110000'),
]
# newline='' stops the csv module's own line endings from being translated again
with open('stock2.csv', 'w', newline='') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)
If the data is a sequence of dictionaries, it can be written like this:
import csv
headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [
    {'Symbol': 'AA', 'Price': 39.48, 'Date': '6/11/2007',
     'Time': '9:34am', 'Change': -0.18, 'Volume': 428900}
]
with open('stock2.csv', 'w', newline='') as f:
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(rows)
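A quick way to check what the writer produces is to round-trip the data through an in-memory buffer. The following is only a minimal sketch: the shortened column list and the use of io.StringIO instead of a real file are assumptions made for illustration.

```python
import csv
import io

headers = ['Symbol', 'Price', 'Volume']
rows = [
    {'Symbol': 'AA', 'Price': 39.48, 'Volume': 428900},
    {'Symbol': 'BB', 'Price': 48.54, 'Volume': 142800},
]

# Write into an in-memory text buffer instead of a file on disk.
buf = io.StringIO(newline='')
writer = csv.DictWriter(buf, fieldnames=headers)
writer.writeheader()
writer.writerows(rows)

# Rewind and read the same text back with DictReader.
buf.seek(0)
read_back = list(csv.DictReader(buf))
print(read_back[0])
```

Note that every value comes back as a string: the csv module does not remember that Price was a float when it was written.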
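If the header row contains characters that are not valid in identifiers, they need to be converted first.
import csv
import re
with open('./stock.csv', newline='') as f:
    f_csv = csv.reader(f)
    # replace anything that is not a letter or underscore with '_'
    headers = [re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv)]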
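To see why the substitution matters, here is a small sketch that combines it with the named-tuple approach. The sample header names ('Stock Symbol', 'Last-Price') are invented for the example and are not taken from stock.csv.

```python
import csv
import io
import re
from collections import namedtuple

# Hypothetical CSV text whose headers contain a space and a hyphen.
raw = 'Stock Symbol,Last-Price,Volume\nAA,39.48,428900\n'

f_csv = csv.reader(io.StringIO(raw))
# Replace every character that is not a letter or underscore with '_'.
headers = [re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv)]
Row = namedtuple('Row', headers)
records = [Row(*r) for r in f_csv]

print(headers)
print(records[0].Last_Price)
```

Without the re.sub() step, namedtuple() would reject 'Stock Symbol' and 'Last-Price' as field names.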
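While reading, some fields can be converted to types other than string.
import csv
import re
col_type = [str, float, str, str, float, str]   # one converter per column
with open('./stock.csv', newline='') as f:
    f_csv = csv.reader(f)
    headers = [re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv)]
    for row in f_csv:
        # apply each column's converter to the matching value
        row = tuple(convert(value) for convert, value in zip(col_type, row))
        print(row)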
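Converting selected fields when the rows are read as dictionaries:
import csv
field_type = [
    ('Price', float),
    ('Change', float),
    ('Volume', int),
]
with open('./stock.csv', newline='') as f:
    for row in csv.DictReader(f):
        # overwrite only the listed fields with their converted values
        row.update((key, convert(row[key])) for key, convert in field_type)
        print(row)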
2. Reading and writing JSON data
(1) String form: json.dumps() and json.loads()
(2) File form: json.dump() and json.load()
(3) For readable output, use the pprint() function or pass the indent argument to json.dumps()
>>> from urllib.request import urlopen
>>> pprint(json_resp)
>>> print(json.dumps(data, indent=4))
(4) Decoding into an OrderedDict when loading
>>> from collections import OrderedDict
>>> data = json.loads(s, object_pairs_hook=OrderedDict)
(5) Turning a JSON dictionary into a Python object
class JSONObject:
    def __init__(self, d):
        self.__dict__ = d   # expose the dictionary keys as attributes

>>> data = json.loads(s, object_hook=JSONObject)
>>> data.name
(6) To serialize class instances, supply a function that takes an instance and returns a dictionary that can be serialized.
def serialize_instance(obj):
    # start with the class name, then add the instance attributes
    d = {'__classname__': type(obj).__name__}
    d.update(vars(obj))
    return d

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# maps class names back to classes when decoding
classes = {'Point': Point}

def unserialize_instance(d):
    clsname = d.pop('__classname__', None)
    if clsname:
        cls = classes[clsname]
        obj = cls.__new__(cls)   # create the instance without calling __init__
        for key, value in d.items():
            setattr(obj, key, value)
        return obj
    return d

>>> p = Point(2, 3)
>>> s = json.dumps(p, default=serialize_instance)
>>> a = json.loads(s, object_hook=unserialize_instance)
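The string and file forms from items (1) to (3) can be sketched in a few lines. The sample dictionary and the io.StringIO buffer standing in for a real file are assumptions made for illustration.

```python
import io
import json

data = {'name': 'ACME', 'shares': 100, 'price': 542.23}

# String form: dumps() encodes to a str, loads() decodes it again.
s = json.dumps(data)
from_string = json.loads(s)

# File form: dump() and load() work on any file-like object.
buf = io.StringIO()
json.dump(data, buf)
buf.seek(0)
from_file = json.load(buf)

# indent spreads the output over several lines for readability.
pretty = json.dumps(data, indent=4, sort_keys=True)
print(pretty)
```

Both round trips give back a dictionary equal to the original; indent only changes the layout of the encoded text, not its content.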
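3. Parsing simple XML documents
The xml.etree.ElementTree.parse() function parses an entire XML document into a document object.
After that, the find(), iterfind() and findtext() methods can be used to query specific XML elements.
(1) When specifying tags, keep the overall structure of the document in mind. Every lookup is performed relative to a starting element.
(2) A call such as doc.iterfind('channel/item') finds all item elements under the channel element; doc represents the top of the document.
(3) Subsequent calls to item.findtext() are then performed relative to the item element that was found.
(4) Every element represented by the ElementTree module has a few essential attributes and methods: the tag attribute holds the tag name, the text attribute holds the enclosed text, and the get() method extracts attribute values.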
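The four points above can be sketched on a small in-memory document. The feed content is invented for the example, and fromstring() is used in place of parse() so the snippet needs no file; as a result doc here is the top-level element itself rather than an ElementTree object.

```python
from xml.etree.ElementTree import fromstring

xml_text = '''<rss>
  <channel>
    <title>Example feed</title>
    <item id="1"><title>First</title></item>
    <item id="2"><title>Second</title></item>
  </channel>
</rss>'''

doc = fromstring(xml_text)   # doc is the top-level <rss> element

titles = []
for item in doc.iterfind('channel/item'):
    # Lookups on item are relative to that <item> element.
    titles.append((item.get('id'), item.findtext('title')))

print(doc.tag, titles)
```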
4. Parsing large XML files incrementally
When extracting data from a large XML document, use iterators and generators.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2007 (http://www.altova.com) -->
<breakfast_menu>
  <food>
    <food>
      <name>Belgian Waffles</name>
      <price>$5.95</price>
      <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
      <calories>650</calories>
    </food>
    <food>
      <name>Strawberry Belgian Waffles</name>
      <price>$7.95</price>
      <description>light Belgian waffles covered with strawberries and whipped cream</description>
      <calories>900</calories>
    </food>
    <food>
      <name>Berry-Berry Belgian Waffles</name>
      <price>$8.95</price>
      <description>light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
      <calories>900</calories>
    </food>
    <food>
      <name>French Toast</name>
      <price>$4.50</price>
      <description>thick slices made from our homemade sourdough bread</description>
      <calories>600</calories>
    </food>
    <food>
      <name>Homestyle Breakfast</name>
      <price>$6.95</price>
      <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
      <calories>950</calories>
    </food>
  </food>
</breakfast_menu>
The iterparse() function allows an XML document to be processed incrementally;
the iterator yields tuples of the form (event, elem), where event is one of the requested events and elem is the corresponding XML element.
from xml.etree.ElementTree import iterparse

def parse_and_remove(filename, path):
    tag_stack = []
    elem_stack = []
    path_parts = path.split('/')
    doc = iterparse(filename, ('start', 'end'))
    next(doc)   # skip the root element
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)   # detach from the parent
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass

data = parse_and_remove('./simple.xml', 'food/food')
for i in data:
    print('>>>>>>>>>>>>>>>>', i, '<<<<<<<<<<<<<<<')
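The line elem_stack[-2].remove(elem) removes each element that was just yielded from its parent.
Assuming no other references to it remain, the element is then destroyed and the memory it occupied can be reclaimed.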
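To make the (event, elem) stream concrete, here is a small sketch that runs iterparse() over an in-memory document. The tiny menu is invented for the example, and elem.clear() is a simpler stand-in for the parent-removal step described above: it empties the processed element but leaves the empty shell attached to its parent.

```python
import io
from xml.etree.ElementTree import iterparse

xml_bytes = (b'<menu><food><name>Waffles</name></food>'
             b'<food><name>Toast</name></food></menu>')

names = []
events = []
for event, elem in iterparse(io.BytesIO(xml_bytes), events=('start', 'end')):
    events.append((event, elem.tag))
    if event == 'end' and elem.tag == 'food':
        # The element is complete at its 'end' event, so it is safe to read.
        names.append(elem.findtext('name'))
        elem.clear()   # drop the children and text of the processed element

print(names)
print(events[:3])
```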
5. Converting a dictionary to XML
The xml.etree.ElementTree library can also be used to create XML documents.
from xml.etree.ElementTree import Element

def dict_to_xml(tag, d):
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)     # one child element per dictionary key
        child.text = str(val)
        elem.append(child)
    return elem

>>> s = {'name': 'GGGG', 'share': '100', 'price': '23.44'}
>>> e = dict_to_xml('stock', s)
>>> e
<Element 'stock' at 0x000001CAFEE80F98>
The result of the conversion is an Element instance. The tostring() function in xml.etree.ElementTree converts it to a byte string.
>>> from xml.etree.ElementTree import tostring
>>> tostring(e)
b'<stock><name>GGGG</name><price>23.44</price><share>100</share></stock>'
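To attach attributes to an element, use its set() method.
If the order of the elements matters, note that plain dictionaries only preserve insertion order from Python 3.7 onward; on earlier versions, build the data as an OrderedDict instead of a regular dict.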
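As a minimal sketch of set(), the snippet below builds a small element by hand. The attribute name '_id' and its value are assumptions chosen for the example.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

elem = Element('stock')
elem.set('_id', '1234')            # attach an attribute to <stock>
name = SubElement(elem, 'name')    # create a child and append it in one step
name.text = 'GGGG'

xml_bytes = tostring(elem)
print(xml_bytes)
```

The attribute can be read back with elem.get('_id').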
When creating XML, it is tempting to build it with nothing but strings:
def str_to_xml(tag, d):
    parts = ['<{}>'.format(tag)]
    for key, val in d.items():
        parts.append('<{0}>{1}</{0}>'.format(key, val))
    parts.append('</{}>'.format(tag))
    return ''.join(parts)

>>> print(str_to_xml('stock', s))
<stock><name>GGGG</name><price>23.44</price><share>100</share></stock>
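The difference shows up when the dictionary contains special characters, e.g. { 'name' : '<spam>' }. The Element-based version escapes them, while the string-based version produces broken markup:
>>> d = {'name': '<spam>'}
>>> tostring(dict_to_xml('stock', d))
b'<stock><name>&lt;spam&gt;</name></stock>'
>>> str_to_xml('stock', d)
'<stock><name><spam></name></stock>'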
If the characters have to be escaped by hand, the escape() and unescape() functions in xml.sax.saxutils can be used.
>>> from xml.sax.saxutils import escape, unescape
>>> escape('<spam>')
'&lt;spam&gt;'
>>> unescape(_)
'<spam>'
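One way to make the string-based approach safe is to escape each value before formatting it. The helper name str_to_xml_escaped is hypothetical; this is a sketch of the idea, not a replacement for a real XML library.

```python
from xml.sax.saxutils import escape, unescape

def str_to_xml_escaped(tag, d):
    """Build XML text from a dict, escaping &, < and > in the values."""
    parts = ['<{}>'.format(tag)]
    for key, val in d.items():
        parts.append('<{0}>{1}</{0}>'.format(key, escape(str(val))))
    parts.append('</{}>'.format(tag))
    return ''.join(parts)

result = str_to_xml_escaped('stock', {'name': '<spam> & eggs'})
print(result)

# unescape() reverses the substitution on the text content.
original = unescape('&lt;spam&gt; &amp; eggs')
```

Only the values are escaped here; the keys are assumed to be valid tag names already.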
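6. Parsing, modifying and rewriting XML
Modifying the structure of an XML document mostly means operating on parent elements: to remove an element, call the remove() method of its direct parent.
To insert or append new elements, use the parent's insert() and append() methods. Elements also support indexing and slicing, such as element[i] or element[i:j].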
>>> from xml.etree.ElementTree import parse, Element
>>> doc = parse('simple.xml')
>>> root = doc.getroot()
>>> root.remove(root.find('sri'))
>>> list(root).index(root.find('nm'))   # getchildren() was removed in Python 3.9
1
>>> e = Element('spam')
>>> e.text = 'This is a test'
>>> root.insert(2, e)   # insert right after <nm>...</nm>
>>> doc.write('new_simple.xml', xml_declaration=True)
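The same sequence of steps can be tried without any file on disk. The sample document below is invented for the example (only the tag names sri and nm come from the session above), and fromstring()/tostring() stand in for parse() and write().

```python
from xml.etree.ElementTree import Element, fromstring, tostring

root = fromstring(
    '<stop><id>14791</id><nm>Clark</nm><sri>x</sri><pre>5 MIN</pre></stop>'
)

root.remove(root.find('sri'))             # removal goes through the direct parent
pos = list(root).index(root.find('nm'))   # position of <nm> among the children

e = Element('spam')
e.text = 'This is a test'
root.insert(pos + 1, e)                   # place <spam> right after <nm>

result = tostring(root)
tags = [child.tag for child in root[1:3]]   # slicing returns a list of elements
print(result)
```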
7. Parsing XML documents with namespaces
Parsing XML documents that contain namespaces is tedious. The XMLNamespace class below only simplifies the process a little:
it lets later operations use short namespace names instead of fully qualified URIs.
class XMLNamespace:
    def __init__(self, **kwargs):
        self.namespace = {}
        for name, uri in kwargs.items():
            self.register(name, uri)

    def register(self, name, uri):
        self.namespace[name] = '{' + uri + '}'

    def __call__(self, path):
        # expand {name} placeholders into '{uri}' prefixes
        return path.format_map(self.namespace)

>>> ns = XMLNamespace(html='http://www.w3.org/1999/xhtml')
>>> doc.findtext(ns('content/{html}html/{html}head/{html}title'))
'Hello World'
If the text being parsed uses advanced XML features beyond namespaces, it is better to use the lxml library.
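ElementTree's lookup methods also accept a prefix-to-URI mapping through their namespaces argument, which covers the same need without a helper class. The document below is invented for the example; only the XHTML namespace URI comes from the text above.

```python
from xml.etree.ElementTree import fromstring

xml_text = '''<top>
  <content>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>Hello World</title></head>
    </html>
  </content>
</top>'''

doc = fromstring(xml_text)

# Short prefixes are mapped to full URIs by the namespaces dictionary.
ns = {'html': 'http://www.w3.org/1999/xhtml'}
title = doc.findtext('content/html:html/html:head/html:title', namespaces=ns)

# The equivalent fully qualified path, spelled out by hand.
q = '{http://www.w3.org/1999/xhtml}'
same_title = doc.findtext('content/{0}html/{0}head/{0}title'.format(q))
print(title)
```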
9. Encoding and decoding hexadecimal digits
(1) Byte string => string of hexadecimal digits
>>> import binascii
>>> s = b'hello'
>>> h = binascii.b2a_hex(s)
>>> h
b'68656c6c6f'
(2) String of hexadecimal digits => byte string
>>> binascii.a2b_hex(h)
b'hello'
(3) With the base64 module: byte string => string of hexadecimal digits
>>> import base64
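>>> h = base64.b16encode(s)
>>> h
b'68656C6C6F'
(4) With the base64 module: string of hexadecimal digits => byte string
>>> s = base64.b16decode(h)
>>> s
b'hello'
base64.b16encode() always produces uppercase digits and base64.b16decode() accepts only uppercase digits by default, whereas the binascii module handles either case.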
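The case sensitivity can be checked directly. This is a small sketch; the variable names are chosen for the example, and casefold=True is the documented way to let b16decode() accept lowercase input.

```python
import base64
import binascii

raw = b'hello'
lower_hex = binascii.b2a_hex(raw)   # lowercase digits
upper_hex = base64.b16encode(raw)   # uppercase digits

# binascii accepts either case.
from_upper = binascii.a2b_hex(upper_hex)

# base64.b16decode rejects lowercase input unless casefold=True is passed.
try:
    base64.b16decode(lower_hex)
    strict_ok = True
except binascii.Error:
    strict_ok = False
folded = base64.b16decode(lower_hex, casefold=True)

# The results are bytes; decode them to get a text string.
hex_text = lower_hex.decode('ascii')
print(hex_text, strict_ok)
```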
10. Base64 encoding and decoding
Base64 is used to encode and decode binary data.
(1) Encoding a byte string
>>> import base64
>>> s = b'hello'
>>> a = base64.b64encode(s)
>>> a
b'aGVsbG8='
(2) Decoding
>>> base64.b64decode(a)
b'hello'
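Both functions work on bytes and return bytes. When the encoded data has to be embedded in text (for example in JSON), it needs an extra decode step, as in this minimal sketch; the variable names are chosen for the example.

```python
import base64

raw = b'hello'
encoded = base64.b64encode(raw)       # bytes in, bytes out
decoded = base64.b64decode(encoded)

# Convert the encoded bytes to a text string for use inside text formats.
as_text = encoded.decode('ascii')

# b64decode also accepts an ASCII str directly.
from_text = base64.b64decode(as_text)
print(as_text)
```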