Data analysis frequently involves reading and writing data. Pandas implements APIs for many IO operations; here is a brief overview.
Format type | Data description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | clipboard | read_clipboard | to_clipboard |
binary | Excel | read_excel | to_excel |
binary | HDF5 | read_hdf | to_hdf |
binary | Feather | read_feather | to_feather |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google BigQuery | read_gbq | to_gbq |
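Most of these readers and writers mirror each other, so a round trip is a one-liner in each direction. A minimal sketch (the file name is hypothetical):

```python
import pandas as pd

# Write a small frame out and read it back (CSV round trip)
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_csv("demo.csv", index=False)  # writer: to_csv
df2 = pd.read_csv("demo.csv")       # reader: read_csv
assert df.equals(df2)
```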
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
pd.read_excel(io, sheetname=0, header=0, skiprows=None, index_col=None, names=None, parse_cols=None, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, dtype=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)
Detailed explanation of the important parameters:
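As a quick, hedged sketch of how a few of these parameters combine in practice (the file, sheet and column names are hypothetical; newer pandas spells the read_excel argument sheet_name rather than the sheetname shown in the signature above):

```python
import pandas as pd

# read_csv: select columns, parse a date column, treat 'NULL' as NA
df = pd.read_csv(
    "data.csv",
    sep=",",
    usecols=["date", "value"],
    parse_dates=["date"],
    na_values=["NULL"],
    encoding="utf-8",
)

# read_excel: first sheet, skip one title row, force a dtype
xl = pd.read_excel("data.xlsx", sheet_name=0, skiprows=1, dtype={"value": float})
```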
pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, tupleize_cols=None, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True)
Parameter details:
**io**: A URL, a file-like object, or a raw string containing HTML. Note that lxml only accepts the http, ftp and file URL protocols. If you have a URL that starts with `'https'`, you might try removing the `'s'`.
**match**: The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to '.+' (match any non-empty string), which returns all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.
**flavor**: The parsing engine to use. 'bs4' and 'html5lib' are synonymous with each other; they are both there for backwards compatibility. The default of `None` tries to use `lxml` to parse, and if that fails it falls back on `bs4` + `html5lib`.
**header**: The row (or list of rows for a `MultiIndex`) to use to make the column headers.
**index_col**: The column (or list of columns) to use to create the index.
**skiprows**: 0-based. Number of rows to skip after parsing the column integer. If a sequence of integers or a slice is given, will skip the rows indexed by that sequence. Note that a single-element sequence means 'skip the nth row' whereas an integer means 'skip n rows'.
**attrs**: A dictionary of attributes used to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup, but they must be valid HTML table attributes to work correctly. For example, `attrs = {'id': 'table'}` is a valid attribute dictionary because the 'id' attribute is valid for any HTML tag, whereas `attrs = {'asdf': 'table'}` is not, because 'asdf' is not a valid HTML attribute even if it is a valid XML attribute. Valid HTML 4.01 table attributes are listed in the HTML 4.01 spec; the working draft of the HTML 5 spec contains the latest information on table attributes for the modern web.
**parse_dates**: boolean, list of ints or names, list of lists, or dict, default False.
- boolean: if True, try parsing the index.
- list of ints or names: e.g. [1, 2, 3] parses columns 1, 2 and 3 each as a separate date column.
- list of lists: e.g. [[1, 3]] combines columns 1 and 3 and parses them as a single date column.
- dict: e.g. {'foo': [1, 3]} parses columns 1 and 3 as a date and names the result 'foo'.

If a column or index contains an unparseable date, the entire column or index is returned unaltered as object dtype. For non-standard datetime parsing, use `pd.to_datetime` after `pd.read_csv`. Note: a fast path exists for ISO 8601-formatted dates. Each form is sketched below.
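A sketch of the four forms, shown with read_csv against a hypothetical file whose date parts are split across the first three columns:

```python
import pandas as pd

pd.read_csv("dates.csv", parse_dates=True)                 # boolean: try parsing the index
pd.read_csv("dates.csv", parse_dates=["year"])             # list: parse each named column separately
pd.read_csv("dates.csv", parse_dates=[[0, 1, 2]])          # list of lists: combine columns 0, 1, 2 into one date
pd.read_csv("dates.csv", parse_dates={"when": [0, 1, 2]})  # dict: same, but name the result 'when'
```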
**tupleize_cols**: If `False`, try to parse multiple header rows into a `MultiIndex`; otherwise return raw tuples. Defaults to `False`. Deprecated since version 0.21.0: this argument will be removed and multiple header rows will always be converted to a MultiIndex.
**thousands**: Separator to use to parse thousands. Defaults to `','`.
**encoding**: The encoding used to decode the web page. Defaults to `None`, which preserves the previous encoding behavior and depends on the underlying parser library (e.g., the parser library will try to use the encoding provided by the document).
**decimal**: Character to recognize as the decimal point (e.g. use ',' for European data). Defaults to '.'. New in version 0.19.0.
**converters**: Dict of functions for converting values in certain columns. Keys can either be integers or column labels; values are functions that take one input argument, the cell (not column) content, and return the transformed content. New in version 0.19.0.
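For example, a converters dict might strip whitespace from one column and coerce another to int (the file and column names are hypothetical):

```python
import pandas as pd

conv = {
    "name": str.strip,                     # strip whitespace from each cell
    "count": lambda cell: int(cell or 0),  # empty cells become 0
}
df = pd.read_csv("raw.csv", converters=conv)
```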
**na_values**: Custom NA values. New in version 0.19.0.
**keep_default_na**: If na_values are specified and keep_default_na is False, the default NaN values are overridden; otherwise they are appended to. New in version 0.19.0.
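Putting these parameters together, a minimal read_html call might look like this (the URL and table id are hypothetical):

```python
import pandas as pd

# read_html returns a list of DataFrames, one per matching <table>
tables = pd.read_html(
    "http://example.com/page.html",
    match="Price",               # keep only tables whose text matches this regex
    attrs={"id": "data-table"},  # narrow down by HTML attributes
    header=0,                    # first row holds the column names
    thousands=",",
    encoding="utf-8",
)
df = tables[0]
```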
A complete, real-world example: download a drug-name database page by page, parse each saved page with read_html, and accumulate the results into one CSV:

```python
# -*- coding: utf-8 -*-
"""
@Datetime: 2018/11/11
@Author: Zhang Yafei
"""
import os
from multiprocessing import Pool

import pandas
import requests

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
HTML_DIR = os.path.join(BASE_DIR, '藥品商品名通用名稱數據庫')

if not os.path.exists(HTML_DIR):
    os.mkdir(HTML_DIR)

name_list = []
if os.path.exists('drug_name.csv'):
    data = pandas.read_csv('drug_name.csv', encoding='utf-8')

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '248',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'JSESSIONID=0000ixyj6Mwe6Be4heuHcvtSW4C:-1; Hm_lvt_3849dadba32c9735c8c87ef59de6783c=1541937281; Hm_lpvt_3849dadba32c9735c8c87ef59de6783c=1541940406',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'http://pharm.ncmi.cn',
    'Referer': 'http://pharm.ncmi.cn/dataContent/dataSearch.do?did=27',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}


def spider(page):
    """Download one result page and save the raw HTML to disk."""
    adverse_url = 'http://pharm.ncmi.cn/dataContent/dataSearch.do?did=27'
    form_data = {
        'method': 'list',
        'did': 27,
        'ec_i': 'ec',
        'ec_crd': 15,
        'ec_p': page,
        'ec_rd': 15,
        'ec_pd': page,
    }
    response = requests.post(url=adverse_url, headers=header, data=form_data)
    # Save into HTML_DIR so get_response() can find the file later
    filename = os.path.join(HTML_DIR, '{}.html'.format(page))
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(response.text)
    print(filename, 'downloaded')


def get_response(page):
    """Read one saved HTML page back from disk."""
    file = os.path.join(HTML_DIR, '{}.html')
    with open(file.format(page), 'r', encoding='utf-8') as f:
        return f.read()


def parse(page):
    """Extract the results table with read_html and append it to the CSV."""
    response = get_response(page)
    result = pandas.read_html(response, attrs={'id': 'ec_table'})[0]
    data = result.iloc[:, :5]
    data.columns = ['序號', '批准文號', '藥品中文名稱', '藥品商品名稱', '生產單位']
    if page == 1:
        data.to_csv('drug_name.csv', mode='w', encoding='utf_8_sig', index=False)
    else:
        data.to_csv('drug_name.csv', mode='a', encoding='utf_8_sig', header=False, index=False)
    print('page {} saved'.format(page))


def get_unparse_data():
    """Return the page numbers not yet present in the CSV."""
    if os.path.exists('drug_name.csv'):
        pages = data['序號']
        pages = list(set(range(1, 492)) - set(pages.values))
    else:
        pages = list(range(1, 492))
    return pages


def download():
    pool = Pool()
    pool.map(spider, list(range(1, 492)))
    pool.close()
    pool.join()


def write_to_csv():
    pages = get_unparse_data()
    print(pages)
    list(map(parse, pages))


def new_data(chinese_name):
    """Join all trade names that share one Chinese generic name."""
    trade_name = '/'.join(set(data[data.藥品中文名稱 == chinese_name].藥品商品名稱))
    name_list.append(trade_name)


def read_from_csv():
    name = data['藥品中文名稱'].values
    print(len(name))
    chinese_name = list(set(data['藥品中文名稱'].values))
    list(map(new_data, chinese_name))
    df_data = {'藥品中文名稱': chinese_name, '藥品商品名稱': name_list}
    new_dataframe = pandas.DataFrame(df_data)
    new_dataframe.to_csv('unique_chinese_name.csv', mode='w', encoding='utf_8_sig', index=False)
    return new_dataframe


def main():
    # download()
    # write_to_csv()
    return read_from_csv()


if __name__ == '__main__':
    drugname_dataframe = main()
```
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
Effect: reads a SQL query or database table into a DataFrame.
This function is a convenience wrapper around read_sql_table and read_sql_query (for backward compatibility). It delegates to the specific function depending on the input provided: a SQL query is routed to read_sql_query, while a database table name is routed to read_sql_table. Note that the delegated function may have more specific notes about its functionality that are not listed here.
Parameter details:
**sql**: SQL query to be executed, or a table name.
**con**: SQLAlchemy connectable (engine/connection) or DBAPI2 connection (fallback mode). Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
**index_col**: Column(s) to set as the index (MultiIndex).
**coerce_float**: Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point; useful for SQL result sets.
**params**: List of parameters to pass to the execute method. The syntax used to pass parameters is database-driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249's paramstyle, is supported. E.g. psycopg2 uses %(name)s, so use params={'name': 'value'}.

**parse_dates**: List of column names to parse as dates. Can also be a dict of {column_name: format string}, where the format string is strftime-compatible when parsing string times, or one of (D, s, ns, ms, us) when parsing integer timestamps; or a dict of {column_name: arg dict}, where arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases that have no native datetime support, such as SQLite.
**columns**: List of column names to select from the SQL table (only used when reading a table).
**chunksize**: If specified, return an iterator where chunksize is the number of rows to include in each chunk.
Usage example:
```python
import pymysql
import pandas as pd

con = pymysql.connect(host="127.0.0.1", user="root", password="password", db="world")
# Read via SQL
data_sql = pd.read_sql("your SQL query", con)
# Save to CSV
data_sql.to_csv("test.csv")
```
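Building on that, a hedged sketch of params and chunksize together (the table, column and handler names are hypothetical; pymysql uses the %(name)s paramstyle):

```python
import pandas as pd

query = "SELECT * FROM city WHERE CountryCode = %(code)s"
for chunk in pd.read_sql(query, con, params={"code": "CHN"}, chunksize=1000):
    # chunk is a DataFrame of at most 1000 rows
    process(chunk)  # hypothetical per-chunk handler
```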
pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)
Effect: reads a SQL database table into a DataFrame.
Given a table name and a SQLAlchemy connectable, returns a DataFrame. This function does not support DBAPI connections.
Parameter details:
**table_name**: Name of the SQL table in the database.
**con**: SQLAlchemy connectable (or database string URI). SQLite DBAPI connection mode is not supported.
**schema**: Name of the SQL schema in the database to query (if the database flavor supports this). Uses the default schema if None (the default).
**index_col**: Column(s) to set as the index (MultiIndex).
**coerce_float**: Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. Can result in loss of precision.
**parse_dates**: List of column names to parse as dates. Can also be a dict of {column_name: format string}, where the format string is strftime-compatible when parsing string times, or one of (D, s, ns, ms, us) when parsing integer timestamps; or a dict of {column_name: arg dict}, where arg dict corresponds to the keyword arguments of pandas.to_datetime(). Especially useful with databases that have no native datetime support, such as SQLite.
**columns**: List of column names to select from the SQL table.
**chunksize**: If specified, returns an iterator where chunksize is the number of rows to include in each chunk.
Usage example:
```python
import pandas as pd
import pymysql  # MySQL driver used by the mysql+pymysql URL below
from sqlalchemy import create_engine

con = create_engine('mysql+pymysql://user_name:password@127.0.0.1:3306/database_name')
data = pd.read_sql_table("table_name", con)
data.to_csv("table_name.csv")
```
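A sketch of the columns and index_col parameters on top of the same connection (the column names are hypothetical):

```python
# Read only selected columns and use one of them as the index
data = pd.read_sql_table(
    "table_name", con,
    columns=["id", "name", "age"],
    index_col="id",
)
```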
DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', line_terminator=None, chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')
Parameter details:
**path_or_buf**: string or file handle, default None. File path or object; if None is provided, the result is returned as a string.

**sep**: character, default ','. Field delimiter for the output file.

**na_rep**: string, default ''. Missing data representation.

**float_format**: string, default None. Format string for floating point numbers.

**columns**: sequence, optional. Columns to write.

**header**: boolean or list of strings, default True. Write out the column names; if a list of strings is given, it is assumed to be aliases for the column names.

**index**: boolean, default True. Write row names (the index).

**index_label**: string or sequence, or False, default None. Column label for the index column(s) if desired. If None is given and header and index are True, the index names are used. A sequence should be given if the DataFrame uses a MultiIndex. If False, do not print the index field; using index_label=False makes the file easier to import in R.

**mode**: str. Python write mode, default 'w'.

**encoding**: string, optional. A string representing the encoding to use in the output file; defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.

**compression**: string, optional. A string representing the compression to use in the output file; allowed values are 'gzip', 'bz2' and 'xz'. Only used when the first argument is a filename.

**line_terminator**: string, default '\n'. The newline character or character sequence to use in the output file.

**quoting**: optional constant from the csv module. Defaults to csv.QUOTE_MINIMAL. If float_format is set, floats are converted to strings, so csv.QUOTE_NONNUMERIC will treat them as non-numeric.

**quotechar**: string (length 1), default '"'. Character used to quote fields.

**doublequote**: boolean, default True. Controls the quoting of quotechar inside a field.

**escapechar**: string (length 1), default None. Character used to escape sep and quotechar when appropriate.

**chunksize**: int or None. Rows to write at a time.

**tupleize_cols**: boolean, default False. Deprecated since version 0.21.0: this argument will be removed and MultiIndex columns will always be written as separate rows in the CSV file. Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format where each MultiIndex column is a row in the CSV (if False).
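A sketch that exercises the common parameters above (the file name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, None]})
df.to_csv(
    "out.csv",
    sep=",",
    na_rep="NA",          # write missing values as 'NA'
    float_format="%.2f",  # two decimal places
    columns=["name", "score"],
    index=False,          # don't write the row index
    encoding="utf-8",
)
```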
DataFrame.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True, freeze_panes=None)
經常使用參數解析
```python
writer = pd.ExcelWriter('data/excel.xlsx')
df.to_excel(writer, sheet_name='user', index=False)
writer.save()
```
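The same ExcelWriter can collect several sheets before saving; a sketch, where df_users and df_orders are hypothetical DataFrames:

```python
writer = pd.ExcelWriter('data/excel.xlsx')
df_users.to_excel(writer, sheet_name='user', index=False)    # hypothetical frame
df_orders.to_excel(writer, sheet_name='order', index=False)  # hypothetical frame
writer.save()  # both sheets land in one workbook
```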
Addendum: fixing the output column order.
```python
data = pd.DataFrame(data=data_list)
# Fix the output order of the columns
data = data.loc[:, columns]
```
```python
import pandas as pd

data = [
    {"name": "張三", "age": 18, "city": "北京"},
    {"name": "李四", "age": 19, "city": "上海"},
    {"name": "王五", "age": 20, "city": "廣州"},
    {"name": "趙六", "age": 21, "city": "深圳"},
    {"name": "孫七", "age": 22, "city": "武漢"},
]
df = pd.DataFrame(data, columns=["name", "age", "city"])
df
```
```python
from sqlalchemy import create_engine

table_name = "user"
engine = create_engine(
    "mysql+pymysql://root:0000@127.0.0.1:3306/db_test?charset=utf8",
    max_overflow=0,   # max connections allowed beyond pool_size
    pool_size=5,      # connection pool size
    pool_timeout=30,  # max seconds to wait for a pooled connection before raising
    pool_recycle=-1   # how often to recycle (reset) pooled connections
)
conn = engine.connect()
df.to_sql(table_name, conn, if_exists='append', index=False)
```
Notes:
1. The library used here is sqlalchemy; the official documentation notes that to_sql is supported through sqlalchemy.
Documentation:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
2. Use your own database configuration; db_flag is the database type and changes with your setup. Before saving data, create the database fields first.
3. engine_config is the database connection configuration.
4. create_engine creates a connection object from that database connection configuration.
5. if_exists='append' appends the data.
6. index=False means the DataFrame's row index is not saved, so the DataFrame's three columns line up one-to-one with the database's three fields and the save succeeds. Without it the data effectively has four columns, which no longer matches MySQL's three columns, and an error is raised.
- One small question: what if we want to save each record as it arrives during iteration, rather than building the whole DataFrame first and saving at the end? The if_exists parameter mentioned above supports appending, which achieves exactly this; saving to CSV has a comparable append option as well (see the official documentation). A sketch follows below.
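A sketch of that row-at-a-time idea, reusing the engine connection from above; records is a hypothetical iterable of dicts:

```python
import pandas as pd

for i, record in enumerate(records):  # records: hypothetical source of dicts
    row = pd.DataFrame([record])
    # SQL: append each row to the existing table
    row.to_sql(table_name, conn, if_exists='append', index=False)
    # CSV: write the header only once, then keep appending
    row.to_csv('rows.csv', mode='a', header=(i == 0), index=False)
```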