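This post follows up on an earlier data-matching experiment I ran in Python. It collects the CSV read and write operations used there, along with conversions between CSV files and common Python data structures such as dicts and lists.
(Python Version 2.7)
Converting a list to a CSV file
Converting a list of nested dicts to a CSV file
The most basic conversion: write each element of a list to its own row of a CSV file.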
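    import csv

    def list2csv(list, file):
        # Write each list element as its own single-column, fully quoted row.
        # The with block makes sure the file is closed after writing.
        with open(file, 'wb') as f:
            wr = csv.writer(f, quoting=csv.QUOTE_ALL)
            for word in list:
                wr.writerow([word])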
This is the typical CSV case. The first row of a CSV file is usually a header naming each field, and every following row holds the values for those fields. When reading, each row is usually stored as a dict (the keys are the header fields, the values come from the row), for example:

    my_list = [
        {'players.vis_name': 'Khazri', 'players.role': 'Midfielder',
         'players.country': 'Tunisia', 'players.last_name': 'Khazri',
         'players.player_id': '989', 'players.first_name': 'Wahbi',
         'players.date_of_birth': '08/02/1991', 'players.team': 'Bordeaux'},
        {'players.vis_name': 'Khazri', 'players.role': 'Midfielder',
         'players.country': 'Tunisia', 'players.last_name': 'Khazri',
         'players.player_id': '989', 'players.first_name': 'Wahbi',
         'players.date_of_birth': '08/02/1991', 'players.team': 'Sunderland'},
        {'players.vis_name': 'Lewis Baker', 'players.role': 'Midfielder',
         'players.country': 'England', 'players.last_name': 'Baker',
         'players.player_id': '9574', 'players.first_name': 'Lewis',
         'players.date_of_birth': '25/04/1995', 'players.team': 'Vitesse'},
    ]

All of these dicts end up nested in one list. What follows is the reverse process: writing such a list of dicts back out as a CSV file.
    # write nested list of dict to csv
    def nestedlist2csv(list, out_file):
        with open(out_file, 'wb') as f:
            w = csv.writer(f)
            # take the header from the first dict so it is written automatically
            fieldnames = list[0].keys()
            w.writerow(fieldnames)
            for row in list:
                # look values up by field name so every row follows the header order
                w.writerow([row[k] for k in fieldnames])
Note that fieldnames carries the keys, that is, the field names written as the first row.
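As an alternative to writing keys and values by hand, the standard library's csv.DictWriter looks each value up by its field name, so the column order never depends on dict ordering. The following is only a minimal sketch, not the toolkit's code. It assumes Python 3 (hence text mode with newline='' in place of 'wb'), and the function name nestedlist2csv_dw, the shortened sample data, and the temporary file path are made up for illustration.

```python
import csv
import os
import tempfile


def nestedlist2csv_dw(rows, out_file):
    """Write a list of dicts to out_file, taking the header from the first dict."""
    fieldnames = list(rows[0].keys())
    # Python 3: text mode with newline='' replaces the 'wb' mode used in Python 2.
    with open(out_file, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()     # first row: the field names
        writer.writerows(rows)   # each value is looked up by its field name


# Hypothetical sample data with only two of the fields.
sample = [
    {'players.player_id': '989', 'players.team': 'Bordeaux'},
    {'players.player_id': '9574', 'players.team': 'Vitesse'},
]
out_path = os.path.join(tempfile.mkdtemp(), 'players.csv')
nestedlist2csv_dw(sample, out_path)

# Read the file back to check what was written.
with open(out_path, newline='') as f:
    written = list(csv.reader(f))
print(written)
```

If some dicts may lack a field, DictWriter fills the gap with its restval argument (an empty string by default) instead of raising an error.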
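Converting a CSV file to a dict
First row is keys, the remaining rows are values
Each row is a key-value record
Converting a CSV file to a two-level dict
Converting a dict to a CSV file
First row is keys, the remaining rows are values
Each row is a key-value record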
For the common case where the first row holds the field names and the remaining rows hold the values:
    # convert csv file to dict
    # @params:
    # key/value: the column of original csv file to set as the key and value of dict
    def csv2dict(in_file, key, value):
        new_dict = {}
        with open(in_file, 'rb') as f:
            reader = csv.reader(f, delimiter=',')
            fieldnames = next(reader)
            reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=',')
            for row in reader:
                new_dict[row[key]] = row[value]
        return new_dict
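In new_dict[row[key]] = row[value], 'key' and 'value' are field names from the first row of the CSV file. Note that this assumes a simple CSV file in which the column chosen as the key is unique. Otherwise, converting the CSV directly to a dict overwrites entries with a repeated key and loses data. If the key column in the raw data contains duplicates, build a dict of lists instead, with each value being a list. See the code in the dict-of-lists part of this post.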
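The overwrite problem is easy to see on a small in-memory file. This is only an illustrative sketch under stated assumptions: it targets Python 3, and the column names and values are invented sample data.

```python
import csv
import io

# Hypothetical data: the 'season' column repeats, so it is not a unique key.
raw = "season,move to\n2015,Bordeaux\n2015,Sunderland\n2016,Vitesse\n"

# Plain dict: a later row silently overwrites an earlier one with the same key.
plain = {}
for row in csv.DictReader(io.StringIO(raw)):
    plain[row['season']] = row['move to']

# Dict of lists: every value for a repeated key is kept.
grouped = {}
for row in csv.DictReader(io.StringIO(raw)):
    grouped.setdefault(row['season'], []).append(row['move to'])

print(plain)    # {'2015': 'Sunderland', '2016': 'Vitesse'}
print(grouped)  # {'2015': ['Bordeaux', 'Sunderland'], '2016': ['Vitesse']}
```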
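For the special case where every row is a key-value pair:
Here the first column is taken as the key of the dict and the second column as the value. Adjust this as needed.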
    # convert csv file to dict(key-value pairs each row)
    def row_csv2dict(csv_file):
        dict_club = {}
        with open(csv_file) as f:
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                dict_club[row[0]] = row[1]
        return dict_club
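When every row has exactly two columns, the same result can be obtained by handing the reader straight to dict(). A minimal sketch, assuming Python 3; the club and country data is invented for illustration.

```python
import csv
import io

# Hypothetical two-column data: club name, country.
raw = "Bordeaux,France\nSunderland,England\nVitesse,Netherlands\n"

# dict() accepts any iterable of two-item sequences, so each row becomes one entry.
# A row with more or fewer than two fields raises ValueError.
dict_club = dict(csv.reader(io.StringIO(raw)))

print(dict_club['Vitesse'])
```

The explicit loop is still the better choice when the key and value are not the first two columns, since the indices can simply be changed there.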
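[Update]
Building a dict whose values are lists. This mainly suits cases where the values of some CSV columns should be collected under the value of another column,
in other words data that does not fit a plain dict because one key maps to more than one value.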
    # build a dict of list like {key:[...element of lst_inner_value...]}
    # key is certain column name of csv file
    # the lst_inner_value is a list of specific column name of csv file
    def build_list_dict(source_file, key, lst_inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                for element in lst_inner_value:
                    new_dict.setdefault(row[key], []).append(row[element])
        return new_dict

    # sample:
    # test_club=build_list_dict('test_info.csv','season',['move from','move to'])
    # print test_club
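This one is usually for special purposes. It structures the CSV file further by taking the values of one column (field) as keys and turning the remaining key-value pairs into sub-dicts that serve as values. It is typically used in matching, where filtering on the outer key first builds a hierarchy that improves accuracy.
For example, suppose the CSV file has the following records (shown as a table):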
id | name | age | country |
---|---|---|---|
1 | danny | 21 | China |
2 | Lancelot | 22 | America |
... | ... | ... | ... |
After the two-level conversion (assuming country and name as the two levels), the resulting dict is:

    dct = {'China': {'danny': {'id': '1', 'age': '21'}},
           'America': {'Lancelot': {'id': '2', 'age': '22'}}}

The code is as follows:
    # build specific nested dict from csv files(country->name)
    def build_level2_dict(source_file):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                # fetch the sub-dict for this country, or start a new one
                item = new_dict.get(row['country'], dict())
                item[row['name']] = {k: row[k] for k in ('id', 'age')}
                new_dict[row['country']] = item
        return new_dict
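Once the two-level dict exists, the hierarchical filtering mentioned earlier comes down to chained lookups. The sketch below is only an illustration: the helper name lookup is hypothetical, and the dict is the country and name example typed in by hand rather than loaded from a file.

```python
# The two-level dict from the example, written out literally.
dct = {
    'China': {'danny': {'id': '1', 'age': '21'}},
    'America': {'Lancelot': {'id': '2', 'age': '22'}},
}


def lookup(nested, country, name):
    """Return the record for (country, name), or None if either level is missing."""
    # The empty dict default lets the second get() run even for an unknown country.
    return nested.get(country, {}).get(name)


print(lookup(dct, 'China', 'danny'))     # {'id': '1', 'age': '21'}
print(lookup(dct, 'China', 'Lancelot'))  # None
print(lookup(dct, 'France', 'danny'))    # None
```

Filtering by country first means a name is only compared against records from the same country, which is what narrows the match.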
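[Update]
A more flexible way to build the two-level dict: instead of editing the function body, pass the key and value columns in as arguments. Two variants follow; use whichever fits.
In the first variant, the key at each level and the innermost value are each a column chosen by the caller: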
    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # inner_key:the inner level key of nested dict
    # inner_value:set the inner value for the inner key
    def build_level2_dict2(source_file, outer_key, inner_key, inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                item[row[inner_key]] = row[inner_value]
                new_dict[row[outer_key]] = item
        return new_dict
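In the second variant, the keys of the first and second levels are specified, and the remaining key-value pairs of each row are stored as the innermost value: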
    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # inner_key:the inner level key of nested dict,and rest key-value will be store as the value of inner key
    def build_level2_dict(source_file, outer_key, inner_key):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            reader = csv.reader(csv_file, delimiter=',')
            fieldnames = next(reader)
            # the columns left over after removing the two key columns
            inner_keyset = fieldnames
            inner_keyset.remove(outer_key)
            inner_keyset.remove(inner_key)
            # rewind so that DictReader sees the header row again
            csv_file.seek(0)
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                item[row[inner_key]] = {k: row[k] for k in inner_keyset}
                new_dict[row[outer_key]] = item
        return new_dict
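There is another way to build a two-level dict, using the pop() method. Personally I find it less intuitive than the approach above, but here it is: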
    from collections import defaultdict

    def build_dict(source_file):
        projects = defaultdict(dict)
        # if there is no header within the csv file you need to set the header
        # and utilize fieldnames parameter in csv.DictReader method
        # headers = ['id', 'name', 'age', 'country']
        with open(source_file, 'rb') as fp:
            reader = csv.DictReader(fp, dialect='excel', skipinitialspace=True)
            for rowdict in reader:
                # extra fields beyond the header are collected under the key None
                if None in rowdict:
                    del rowdict[None]
                # pop() removes the two key columns, leaving the rest as the value
                nationality = rowdict.pop("country")
                name = rowdict.pop("name")
                projects[nationality][name] = rowdict
        return dict(projects)
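[Update]
A few more ways to build a two-level dict, for CSV files that do not map cleanly onto a plain dict because some keys have several values. The values are then kept in lists inside the dict, or each group of key-value pairs is kept as one element of a list.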
    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct
    # {outer_key:[{pairs of lst_inner_value}]}
    def build_level2_dict3(source_file, outer_key, lst_inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                new_dict.setdefault(row[outer_key], []).append({k: row[k] for k in lst_inner_value})
        return new_dict

    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct
    # {outer_key:{key of lst_inner_value:[...value of lst_inner_value...]}}
    def build_level2_dict4(source_file, outer_key, lst_inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                # print row
                item = new_dict.get(row[outer_key], dict())
                # item.setdefault('move from',[]).append(row['move from'])
                # item.setdefault('move to', []).append(row['move to'])
                for element in lst_inner_value:
                    item.setdefault(element, []).append(row[element])
                new_dict[row[outer_key]] = item
        return new_dict

    # build specific nested dict from csv files
    # @params:
    # source_file
    # outer_key:the outer level key of nested dict
    # lst_inner_key: a single column name whose value becomes the inner key
    # lst_inner_value: a single column name,for circumstance that the inner value of the same lst_inner_key are not distinct
    # {outer_key:{lst_inner_key:[...lst_inner_value...]}}
    def build_list_dict2(source_file, outer_key, lst_inner_key, lst_inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                # print row
                item = new_dict.get(row[outer_key], dict())
                item.setdefault(row[lst_inner_key], []).append(row[lst_inner_value])
                new_dict[row[outer_key]] = item
        return new_dict

    # dct=build_list_dict2('test_info.csv','season','move from','move to')
Similarly, three-level or even deeper dicts can be built from a CSV file. The approach is the same as above, so only the code is given:
    # build specific nested dict from csv files
    # a dict like {outer_key:{inner_key1:{inner_key2:{rest_key:rest_value...}}}}
    # the params are extract from the csv column name as you like
    def build_level3_dict(source_file, outer_key, inner_key1, inner_key2):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            reader = csv.reader(csv_file, delimiter=',')
            fieldnames = next(reader)
            inner_keyset = fieldnames
            inner_keyset.remove(outer_key)
            inner_keyset.remove(inner_key1)
            inner_keyset.remove(inner_key2)
            csv_file.seek(0)
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                sub_item = item.get(row[inner_key1], dict())
                sub_item[row[inner_key2]] = {k: row[k] for k in inner_keyset}
                item[row[inner_key1]] = sub_item
                new_dict[row[outer_key]] = item
        return new_dict

    # build specific nested dict from csv files
    # a dict like {outer_key:{inner_key1:{inner_key2:inner_value}}}
    # the params are extract from the csv column name as you like
    def build_level3_dict2(source_file, outer_key, inner_key1, inner_key2, inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                sub_item = item.get(row[inner_key1], dict())
                sub_item[row[inner_key2]] = row[inner_value]
                item[row[inner_key1]] = sub_item
                new_dict[row[outer_key]] = item
        return new_dict
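Again there are two variants for different needs: one stores the remaining key-value pairs unchanged as the innermost value, the other keeps only the column that is needed.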
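The two-level and three-level functions differ only in how many key columns they walk through, so the pattern can be written once for any depth. The following is a sketch of that generalization, not code from the toolkit. It assumes Python 3 and takes already parsed rows instead of a file name; the function name build_nested_dict and the sample data are hypothetical.

```python
import csv
import io


def build_nested_dict(rows, key_columns):
    """Nest rows by key_columns in order; the remaining columns form the innermost dict."""
    result = {}
    for row in rows:
        level = result
        # Walk down one level per key column except the last, creating dicts as needed.
        for column in key_columns[:-1]:
            level = level.setdefault(row[column], {})
        # The last key column maps to whatever columns were not used as keys.
        rest = {k: v for k, v in row.items() if k not in key_columns}
        level[row[key_columns[-1]]] = rest
    return result


# Hypothetical sample with the columns id, name, age, country.
raw = "id,name,age,country\n1,danny,21,China\n2,Lancelot,22,America\n"
rows = list(csv.DictReader(io.StringIO(raw)))

two_level = build_nested_dict(rows, ['country', 'name'])
three_level = build_nested_dict(rows, ['country', 'age', 'name'])
print(two_level)
print(three_level)
```

As with the fixed-depth versions, a later row with the same full key path replaces the earlier one.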
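There is one more special case: when the innermost value should be a list holding several elements for the same key rather than a single element, only the innermost assignment needs to change: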
    # build specific nested dict from csv files
    # a dict like {outer_key:{inner_key1:{inner_key2:[inner_value]}}}
    # for multiple inner_value with the same inner_key2,thus gather them in a list
    # the params are extract from the csv column name as you like
    def build_level3_dict3(source_file, outer_key, inner_key1, inner_key2, inner_value):
        new_dict = {}
        with open(source_file, 'rb') as csv_file:
            data = csv.DictReader(csv_file, delimiter=",")
            for row in data:
                item = new_dict.get(row[outer_key], dict())
                sub_item = item.get(row[inner_key1], dict())
                sub_item.setdefault(row[inner_key2], []).append(row[inner_value])
                item[row[inner_key1]] = sub_item
                new_dict[row[outer_key]] = item
        return new_dict
The core of it is this line: sub_item.setdefault(row[inner_key2], []).append(row[inner_value])
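To see what setdefault does in that line, it helps to try it on a flat dict and compare it with collections.defaultdict, which gives the same grouping. This is an illustrative sketch only; the pairs are made-up data.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs in which the key 'a' occurs twice.
pairs = [('a', 1), ('b', 2), ('a', 3)]

# setdefault returns the list already stored under the key, or stores and
# returns the given empty list when the key is new; append then extends it.
with_setdefault = {}
for key, value in pairs:
    with_setdefault.setdefault(key, []).append(value)

# defaultdict(list) creates the empty list on first access instead.
with_defaultdict = defaultdict(list)
for key, value in pairs:
    with_defaultdict[key].append(value)

print(with_setdefault)         # {'a': [1, 3], 'b': [2]}
print(dict(with_defaultdict))  # {'a': [1, 3], 'b': [2]}
```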
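Each row is a key-value record
First row is keys, the remaining rows are values
Writing out a dict of lists
This is the reverse of the CSV-to-dict conversion described earlier. It is simple, so here is the code directly: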
    def dict2csv(dict, file):
        with open(file, 'wb') as f:
            w = csv.writer(f)
            # write each key/value pair on a separate row
            w.writerows(dict.items())

    def dict2csv(dict, file):
        with open(file, 'wb') as f:
            w = csv.writer(f)
            # write all keys on one row and all values on the next
            w.writerow(dict.keys())
            w.writerow(dict.values())
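This one is actually not very common. The reverse is more usual: loading a regular CSV file into a dict of lists (a dict whose keys come from the first row of the CSV file and whose values are lists built, column by column, from the remaining rows). If you do need to save such a structure as a CSV file, it can be done as follows: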
    import csv
    import pandas as pd
    from collections import OrderedDict

    dct = OrderedDict()
    dct['a'] = [1, 2, 3, 4]
    dct['b'] = [5, 6, 7, 8]
    dct['c'] = [9, 10, 11, 12]

    header = dct.keys()
    # turn the dict of lists into a list of dicts, one per output row
    rows = pd.DataFrame(dct).to_dict('records')

    with open('outTest.csv', 'wb') as f:
        f.write(','.join(header))
        f.write('\n')
        for data in rows:
            f.write(",".join(str(data[h]) for h in header))
            f.write('\n')
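Three packages are imported here. Besides csv, the usual package for reading and writing CSV files (this snippet itself writes the lines with f.write), OrderedDict keeps the columns in their original order in the output, and pandas handles the intermediate step of turning a dict of lists into a list of dicts. For example: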
    [('a', [1, 2, 3, 4]), ('b', [5, 6, 7, 8]), ('c', [9, 10, 11, 12])]
    to
    [{'a': 1, 'c': 9, 'b': 5}, {'a': 2, 'c': 10, 'b': 6}, {'a': 3, 'c': 11, 'b': 7}, {'a': 4, 'c': 12, 'b': 8}]
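If pulling in pandas only for this step feels heavy, the same dict-of-lists to list-of-dicts conversion can be sketched with the built-in zip. This is an alternative swapped in for illustration, not what the code above does, and it assumes all the lists have the same length (zip silently stops at the shortest one).

```python
from collections import OrderedDict

dct = OrderedDict()
dct['a'] = [1, 2, 3, 4]
dct['b'] = [5, 6, 7, 8]
dct['c'] = [9, 10, 11, 12]

# zip(*dct.values()) regroups the columns into rows: (1, 5, 9), (2, 6, 10), ...
# Pairing each row with the keys rebuilds one dict per record.
rows = [dict(zip(dct.keys(), values)) for values in zip(*dct.values())]

print(rows[0])  # {'a': 1, 'b': 5, 'c': 9}
```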
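This part deals with CSV files that use unusual delimiters. In general it matters little which delimiter a CSV file uses as long as it is consistent (the operations above all assume ',' throughout). A file whose header row is separated by ',' while the value rows that follow are separated by ';' has to be read a little differently. Typically each row is converted to a dict before further processing. The code is as follows: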
    def func(id_list, input_file, output_file):
        id_set = set(id_list)  # build the lookup set once, not once per row
        with open(input_file, 'rb') as f:
            # if the delimiter for header is ',' while ';' for rows
            reader = csv.reader(f, delimiter=',')
            fieldnames = next(reader)
            reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=';')
            rows = [row for row in reader if row['players.player_id'] in id_set]
            # operation on rows...
Change the delimiters to suit your data.
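The two-delimiter technique can be tried without a file on disk. The sketch below is only an illustration under stated assumptions: it uses Python 3 and io.StringIO in place of a real file, and the two-column sample content is invented.

```python
import csv
import io

# Hypothetical file content: comma-separated header, semicolon-separated rows.
raw = (
    "players.player_id,players.team\n"
    "989;Bordeaux\n"
    "9574;Vitesse\n"
)
f = io.StringIO(raw)

# Read only the header line with ',' as the delimiter.
fieldnames = next(csv.reader(f, delimiter=','))
# The remaining lines are then parsed with ';' and mapped onto those names.
reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=';')

wanted = {'989'}  # ids to keep, held in a set for fast membership tests
rows = [row for row in reader if row['players.player_id'] in wanted]
print(rows)
```

Passing fieldnames explicitly matters here: without it, DictReader would try to read its own header from the first ';' row.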
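Those are roughly the CSV problems I ran into during the experiment. Most of them can be solved by searching stackoverflow or asking a question there; the people there are very helpful. A later post will summarize other data processing from the experiment, such as format conversion, deduplication, and duplicate detection.
Finally, the source code is published on GitHub under csv_toolkit. Feel free to play around with it.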
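Changelog
1. 2016-12-22: Made the two-level dict construction more flexible.
2. 2016-12-24 14:55:30: Added construction of three-level dicts.
3. 2017-01-09 11:26:59: The innermost value can now hold a list of elements from a specified column.
4. 2017-01-16 10:29:44: Added construction of dicts of lists, and of special two-level dicts that keep multiple values for the same key.
5. 2017-02-09 10:54:41: Added a new two-level dict-of-lists construction.
6. 2017-02-10 11:18:01: Improved the code for building a dict from a simple CSV file.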