pandas

Date: 2019-11-10

Tags: pandas

Opening Excel and CSV files

Contents of the Excel file

Logitude	Latitude	station	freq	propotion
119.552376	30.2841038	hangzhou	10663	1
119.7004773	30.202115	hangzhou	10663	1
119.7660918	30.1929452	hangzhou	10663	1
119.77734	30.3170072	hangzhou	10663	1
119.2505493	30.1789208	hangzhou	10663	1
119.2055565	30.16166	hangzhou	10663	1
119.1099468	29.8293896	hangzhou	10663	1
120.1191603	30.1114958	hangzhou	10663	1
120.1722768	30.0505436	hangzhou	10663	1
120.1872744	30.0246524	hangzhou	10663	1
119.8310814	30.4248872	hangzhou	10663	1
120.1785258	30.2188364	hangzhou	10663	1
119.8579521	29.9615426	hangzhou	10663	1
118.5425376	29.4329306	hangzhou	10663	1
119.727348	29.9750276	hangzhou	10663	1
119.9448132	30.4103234	hangzhou	10663	1
119.7061014	30.2387942	hangzhou	10663	1
119.7798396	29.8067348	hangzhou	10663	1

 pd.read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None, names=None, parse_cols=None, parse_dates=False, date_parser=None, na_values=None, thousands=None, convert_float=True, has_index_names=None, converters=None, dtype=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)
 '''
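 The main parameters of this function are io, sheetname, header, names, and encoding.
 io: the Excel file; it can be a file path, a URL, a file-like object, or an xlrd workbook;
 sheetname (renamed sheet_name in pandas 0.21 and later, which is the spelling used in the examples below): the sheet to return; it can be a string (sheet name), an integer (sheet index), a list of strings and integers (returns a dict of the form {'key': 'sheet'}), or None (returns a dict of all sheets);
 header: the header of the table; it can be an int or a list of ints giving the row number(s) to use as column names;
 names: the column names to use, given as an array-like object.
 encoding: keyword argument specifying the encoding used for reading.
 The function returns a pandas DataFrame or a dict of DataFrames; the usual DataFrame operations can then be used to read the data.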
 '''

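Some common operations when opening Excel files
import pandas as pd

df = pd.read_excel('小區表.xlsx')  # open the file and get a DataFrame; if no row index is specified, a default integer index is added
print(df.head())  # show the first rows (5 by default); tail() shows the last rows in the same way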

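df = pd.read_excel('小區表.xlsx', sheet_name=0, header=1, index_col=0)
# index_col sets the row index; the default is None, and 0 uses the first column as the row index
# header sets which row supplies the column names; the default is 0 (the first row)
# sheet_name selects the sheet; it can also be the sheet's name
print(df.head())

Reading a CSV file works much the same way as reading an Excel file:

pd.read_csv('filename.csv')  # note that a sheet cannot be specified here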
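
To try the reading options without a file on disk, the following is a minimal sketch that feeds a few rows of the sample table to read_csv through an in-memory buffer. The variable names (raw, sample, indexed) and the use of io.StringIO are assumptions made for illustration; a real script would pass a file path instead.

```python
import io
import pandas as pd

# A few rows of the sample table as tab-separated text (stands in for a file on disk).
raw = (
    "Logitude\tLatitude\tstation\tfreq\tpropotion\n"
    "119.552376\t30.2841038\thangzhou\t10663\t1\n"
    "119.7004773\t30.202115\thangzhou\t10663\t1\n"
    "119.7660918\t30.1929452\thangzhou\t10663\t1\n"
)

# sep sets the delimiter; header=0 takes the first line as the column names.
sample = pd.read_csv(io.StringIO(raw), sep="\t", header=0)
print(sample.shape)
print(sample.dtypes)

# index_col=0 uses the first column as the row index instead of adding 0..n-1.
indexed = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)
print(indexed.index.name)
```

The header and index_col arguments behave the same way in read_excel, so the sketch also shows what those two options do there.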

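Using some functions

1. Getting the row index and column index

print(df.columns)  # all column labels
print(df.columns[:3])  # slicing is supported
print(df.index)  # all row labels; slicing is supported as well

# Output: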
Index(['Logitude', 'Latitude', 'station', 'freq', 'propotion'], dtype='object')
Index(['Logitude', 'Latitude', 'station'], dtype='object')
RangeIndex(start=0, stop=18, step=1)

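2. Selecting data

print(df.values)  # all rows of data as a 2-D NumPy array, without the row index
print(df['station'])  # one column, selected by name
print(df[:3])  # integer slicing: the first three rows

# df.loc[x, y]: x selects rows and y selects columns. x can be a slice; slicing with integers works here only because the default integer index is in use. A DataFrame is returned.
print(df.loc[:2])  # first 3 rows (labels 0, 1, 2), all columns
print(df.loc[:2, ['propotion', 'station']])  # first 3 rows, several columns
print(df.loc[[1, 2, 4], ['propotion', 'station']])  # rows and columns chosen by list
print(df.loc[df['Latitude'] == 30.202115])  # filter rows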



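print(df.iloc[2])  # the third row
print(df.iloc[3:5, 1:3])  # by slice: rows at positions 3 and 4, columns at positions 1 and 2
print(df.iloc[[1, 3, 5], [2, 4]])  # rows and columns chosen by list
print(df.iloc[1, 1])  # a single cell
# The difference between loc and iloc: loc slices or selects by row and column labels, while iloc slices or selects by integer position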
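
The difference between loc and iloc is easier to see when the row labels are not integers. The sketch below uses a small made-up frame (demo, with hypothetical labels a to d) rather than the article's data.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
demo = pd.DataFrame(
    {"freq": [10, 20, 30, 40], "propotion": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

by_label = demo.loc["a":"c"]    # label slice: the end label is included
by_position = demo.iloc[0:2]    # position slice: the end position is excluded
print(len(by_label), len(by_position))

cell_by_label = demo.loc["b", "freq"]   # row label, column label
cell_by_pos = demo.iloc[1, 0]           # row position, column position
print(cell_by_label, cell_by_pos)
```

This is also why df.loc[:2] in the examples above returns three rows: with the default integer index, 2 is a label, and label slices include their end point.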

3. Filtering data

print(df[df.Logitude > 120])  # rows where Logitude is greater than 120; <, <=, >=, != and == work the same way
print(df.loc[df.Logitude > 120])  # same filter through loc
print(df.loc[df.Logitude > 120, 'Latitude'])  # Latitude column of the rows where Logitude is greater than 120
df2 = df.copy()  # to keep the original unchanged, work on a copy
print(df.loc[df.Logitude.isin([119.1099468, 12345])])  # isin tests membership, much like Python's built-in in operator
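
Filters can also be combined. The following sketch assumes a small made-up frame (pts); the station names other than hangzhou are hypothetical and only there to give the filters something to separate.

```python
import pandas as pd

# Hypothetical data: same column names as the article, invented rows.
pts = pd.DataFrame({
    "Logitude": [119.55, 120.11, 120.17, 118.54],
    "Latitude": [30.28, 30.11, 30.05, 29.43],
    "station": ["hangzhou", "hangzhou", "ningbo", "quzhou"],
})

# Each condition needs its own parentheses; combine with & (and), | (or), ~ (not).
east_hz = pts[(pts["Logitude"] > 120) & (pts["station"] == "hangzhou")]
not_hz = pts[~pts["station"].isin(["hangzhou"])]

print(east_hz)
print(not_hz)
```

Python's own and / or keywords do not work on whole columns, which is why the element-wise operators are used here.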

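4. Modifying data

df2 = df.copy()
df2['freq'] = 10713  # set a whole column
# Note: recent pandas versions may warn or raise when a string is written into a numeric column
df2.iloc[:3, 1:3] = 'test'  # modify a block by position (.at accepts only a single row/column pair, so iloc is used for slices)
df2.loc[:3, ['freq', 'propotion']] = 'one'  # modify a block by label
df.loc[:1] = 5  # loc can also assign to whole rows
df2.iat[1, 3] = 'columns'  # modify a single cell by position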

# Modify the rows selected by a filter
df.loc[(df["propotion"] >= 1) & (df["propotion"] < 3), 'station'] = 'beijing'  # select rows with 1 <= propotion < 3 and set their station value to 'beijing'


df.columns = df.columns.str.replace(' ', '')  # remove spaces from column names
df['station'] = df['station'].str.upper()  # upper-case the station column; all str methods are available

# Rename columns (keys that match no existing column are ignored)
df.rename(columns={'total_bill': 'total', 'tip': 'pit', 'sex': 'xes'}, inplace=True)
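
As a compact recap of the accessors used for modification, here is a minimal sketch on a made-up frame (orig and work are hypothetical names, and the city names are arbitrary). It sticks to string values in a string column so that no dtype conversion is involved.

```python
import pandas as pd

orig = pd.DataFrame({
    "station": ["hangzhou", "hangzhou", "hangzhou"],
    "propotion": [1, 2, 3],
})
work = orig.copy()  # changes to work do not touch orig

work.at[0, "station"] = "shanghai"   # one cell, addressed by label
work.iat[1, 0] = "nanjing"           # one cell, addressed by position
work.loc[work["propotion"] >= 3, "station"] = "beijing"  # every row matching the mask

print(work["station"].tolist())
print(orig["station"].tolist())
```

at and iat are meant for single cells; loc and iloc handle slices, lists, and boolean masks.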

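Inserting data

# Insert a column
df['D'] = 'd'  # create a new column directly; it is added at the far right
df.insert(0, 'A', 'a')  # use insert to add a column at any other position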

# Insert a row
# Add a row at the end
insertRow = pd.DataFrame([[0] * len(df.columns)], columns=df.columns)  # build the row to insert as a DataFrame
df = pd.concat([df, insertRow], ignore_index=True)  # DataFrame.append was removed in pandas 2.0; pass ignore_index=True, otherwise the index is not renumbered
print(df)

# Insert a row in the middle by splitting the DataFrame
insertRow = pd.DataFrame([[0] * len(df.columns)], columns=df.columns)  # build the row to insert as a DataFrame
above = df.loc[:2]
below = df.loc[3:]

# Join the pieces with pd.concat (older pandas could also chain above.append(...).append(...)); note ignore_index=True
newdata = pd.concat([above, insertRow, below], ignore_index=True)
print(newdata)
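
The split-and-join approach can be wrapped in a small function. The sketch below is one possible way to do it; insert_row is a hypothetical helper, not a pandas function, and it splits by position with iloc so that it does not depend on the index labels.

```python
import pandas as pd

def insert_row(frame, position, values):
    """Return a new DataFrame with `values` inserted before row `position`.

    Hypothetical helper for illustration; `values` must have one entry per column.
    """
    new_row = pd.DataFrame([values], columns=frame.columns)
    above = frame.iloc[:position]
    below = frame.iloc[position:]
    return pd.concat([above, new_row, below], ignore_index=True)

base = pd.DataFrame({"freq": [1, 2, 3, 4], "propotion": [10, 20, 30, 40]})
result = insert_row(base, 2, [0, 0])
print(result)
```

The input frame is left unchanged; the function returns a new DataFrame with a renumbered index.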

Handling empty rows, empty columns, and empty cells in the data

To make this concrete, we now modify the data as follows:

Logitude	Latitude	station		asd	freq	propotion
						
119.8579521	29.9615426	hangzhou			10663	1
						
118.5425376	29.4329306	hangzhou			10663	1

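Now we need to remove the empty rows and empty columns from the data above.

# Remove empty rows and empty columns
print(df)
del_empty_lines = df.dropna(how='all')  # drop empty rows; 'all' drops a row only when every value in it is missing
del_empty_columns = df.dropna(axis=1, how='all')  # axis=1 works on columns; drop a column only when it is entirely empty

print(del_empty_lines)
print(del_empty_columns)

# Drop a row if any value in it is missing; likewise for columns
print(df)
del_empty_lines = df.dropna(how='any')  # 'any' (the default) drops a row as soon as it has one missing value
del_empty_columns = df.dropna(axis=1, how='any')  # axis=1 works on columns; 'any' drops a column as soon as it has one empty cell

print(del_empty_lines)
print(del_empty_columns)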


# Fill in missing data
df = df.fillna(False)  # replace missing (NaN) values with a fill value, here False


# Check which values are missing
pd.isna(df)
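
The effect of how='all' versus how='any' can be checked on a tiny frame. This is a sketch with made-up data (gaps is a hypothetical name) that mimics the modified table: one fully empty row and one fully empty column.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({
    "station": ["hangzhou", np.nan, "hangzhou"],
    "asd": [np.nan, np.nan, np.nan],
    "freq": [10663, np.nan, np.nan],
})

rows_all = gaps.dropna(how="all")          # drops only row 1, which is entirely empty
cols_all = gaps.dropna(axis=1, how="all")  # drops only column 'asd'
rows_any = gaps.dropna(how="any")          # drops every row, because 'asd' is always missing
filled = gaps.fillna({"freq": 0})          # a dict fills each named column with its own value
missing_count = int(pd.isna(gaps).sum().sum())

print(len(rows_all), list(cols_all.columns), len(rows_any), missing_count)
```

Dropping the empty column first and then the empty rows is usually the order that keeps the most data.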

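Checking whether a DataFrame, column, or row is empty

Note:
In pandas we cannot use if df: pass to test for emptiness; it raises a ValueError because the truth value of a DataFrame is ambiguous.

The correct approach is to use any, all, or empty.

# If we only have NaNs in our DataFrame, it is not considered empty! We will need to drop the NaNs to make the DataFrame empty
print(df['asd'].dropna().empty)  # check whether a column or DataFrame is empty; False means it is not empty. The NaN values are dropped before the check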

DataFrame.any(axis=None, bool_only=None, skipna=None, level=None, **kwargs)
Parameters:	
axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
Returns:	
any : Series or DataFrame (if level specified)
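
The points above about emptiness can be sketched on a frame that holds nothing but NaN. The names used here (nan_only, raised, and so on) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

nan_only = pd.DataFrame({"asd": [np.nan, np.nan]})

# Using a DataFrame in a boolean context is ambiguous and raises ValueError.
try:
    if nan_only:
        pass
    raised = False
except ValueError:
    raised = True

is_empty = nan_only.empty                       # False: two rows exist, even though both are NaN
is_empty_after_drop = nan_only.dropna().empty   # True: no rows remain
all_missing = nan_only["asd"].isna().all()      # True: every value is missing
any_present = nan_only["asd"].notna().any()     # False: no value is present

print(raised, is_empty, is_empty_after_drop, all_missing, any_present)
```

Which check to use depends on the question: empty asks whether there are any rows or columns at all, while isna().all() asks whether the existing cells hold data.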

Some commonly used functions

Function	Description
count	Number of non-NA observations
sum	Sum of values
mean	Mean of values
mad	Mean absolute deviation
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute Value
prod	Product of values
std	Bessel-corrected sample standard deviation
var	Unbiased variance
sem	Standard error of the mean
skew	Sample skewness (3rd moment)
kurt	Sample kurtosis (4th moment)
quantile	Sample quantile (value at %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum

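df["total"] = df["propotion"] + df["freq"]  # add a column at the end holding propotion + freq

# Operating on columns is straightforward
print(df["propotion"].sum())  # sum of the propotion column; min, mean and others are used the same way
print(df[["propotion", "asd"]].sum())  # sum of each of several columns, first form
print(df.loc[:, ["propotion", "asd"]].sum())  # sum of each of several columns, second form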

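# Adding the result as a final row takes a few extra steps
sum_row = df[["propotion", "asd"]].sum()
df_sum = pd.DataFrame(data=sum_row).T  # build a DataFrame and transpose it so the sums form one row
print(df_sum)
df_sum = df_sum.reindex(columns=df.columns)  # add the missing columns (filled with NaN)

df_final = pd.concat([df, df_sum], ignore_index=True)  # the columns now match, so append the row (DataFrame.append was removed in pandas 2.0)

print(df_final)  # the DataFrame with the totals in its last row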
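
A variation on the totals row, sketched on made-up data: instead of reindexing and leaving the text column as NaN, the row is built from a dict with a label in the station column. The label 'TOTAL' and the names data, totals, and with_total are assumptions for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "station": ["hangzhou", "hangzhou"],
    "freq": [10663, 10663],
    "propotion": [1, 1],
})

totals = data[["freq", "propotion"]].sum()  # Series indexed by column name

# Build a one-row frame with the same columns; the text column gets a label.
total_row = pd.DataFrame(
    [{"station": "TOTAL", **totals.to_dict()}],
    columns=data.columns,
)
with_total = pd.concat([data, total_row], ignore_index=True)
print(with_total)
```

Because the new row has a value in every column, the numeric columns keep their integer dtype after the concatenation.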

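Grouping

print(df.groupby('station'))  # a DataFrameGroupBy object; it is iterable, and aggregations such as sum() can be applied to the groups

for st in df.groupby('station'):  # group by station
    print(st[0], st[1])  # st is a tuple: st[0] is the group key (the station value), st[1] is the DataFrame for that group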
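
Grouping is usually followed by an aggregation. The sketch below uses made-up data (obs) with a second, hypothetical station so that there is more than one group.

```python
import pandas as pd

obs = pd.DataFrame({
    "station": ["hangzhou", "hangzhou", "ningbo"],
    "freq": [10663, 10663, 500],
})

# Iterating yields (group key, sub-DataFrame) pairs, which can be unpacked directly.
for name, group in obs.groupby("station"):
    print(name, len(group))

freq_by_station = obs.groupby("station")["freq"].sum()                 # one value per group
summary = obs.groupby("station")["freq"].agg(["count", "mean"])       # several statistics at once
print(freq_by_station)
print(summary)
```

The result of an aggregation is indexed by the group key, so individual groups can be looked up with loc.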

Writing to Excel

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True, freeze_panes=None)
'''
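The main parameter of this function is excel_writer.
excel_writer: the target Excel file; it can be a file path or an ExcelWriter object;
sheet_name: name of the sheet to write to, a string, default 'Sheet1';
na_rep: representation of missing values, a string;
header: whether to write the header row, a bool or list of strings, default True;
index: whether to write the row index, a bool, default True;
encoding: the encoding used for writing, a string.
'''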


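with pd.ExcelWriter('output1.xlsx') as writer:  # create an ExcelWriter; leaving the with block saves the file (writer.save() was removed in pandas 2.0)
    for st in df.groupby('station'):  # group by station
        # st is a tuple: st[0] is the group name, st[1] is the group's data as a DataFrame
        # index=False means the row numbers are not written
        st[1].to_excel(writer, sheet_name=st[0], index=False)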


Writing a CSV file is simpler
df.to_csv('filename.csv')  # no ExcelWriter and no sheet name needed
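
To see what to_csv produces without creating a file, here is a minimal sketch that writes to an in-memory buffer and reads the text back. The buffer and the names out, text, and restored are assumptions for illustration; passing a path works the same way.

```python
import io
import pandas as pd

out = pd.DataFrame({"station": ["hangzhou", "hangzhou"], "freq": [10663, 10663]})

buffer = io.StringIO()
out.to_csv(buffer, index=False)  # index=False leaves out the row numbers
text = buffer.getvalue()
print(text)

# Reading the text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(text))
print(restored)
```

Without index=False, the row index is written as an extra unnamed first column, which then reappears as a data column when the file is read back.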

new_match_old['PRB大於80%的次數'] = new_match_old.loc[:, 'prb>80']  # copy a column (the new column name means "number of times PRB exceeded 80%")
