When a data file runs to millions of rows, set chunksize to process it in batches.
Case study: analysing data from the US presidential election.
Read in the large dataset:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df1 = pd.read_csv("./usa_election.csv", low_memory=False)
df1.shape
Result: (536041, 16)  # the file has 536,041 rows
Concatenate the data with itself to build a larger dataset:
df = pd.concat([df1, df1, df1, df1])
df.shape
Result: (2144164, 16)
%%time
ret = df.to_csv("./hehe.csv", index=False)
ret
The %%time magic above measures how long the write takes. Now read the large file back in batches:
ret = pd.read_csv("./hehe.csv", low_memory=False, chunksize=500000)
# low_memory=False parses the file internally as a single block (avoids chunked type guessing);
# chunksize=500000 makes read_csv return an iterator that yields 500,000 rows per batch
ret
What comes back is not a DataFrame but a reader object: <pandas.io.parsers.TextFileReader at 0x122f30f0>
Loop over it to read the batches out:
for x in ret:
    print(type(x))
Result:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
The data can then be processed batch by batch, for instance by accumulating an aggregate across batches, as in the sketch below.
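A minimal sketch of per-batch processing; the column name contb_receipt_amt (the contribution amount) is an assumption based on the FEC data layout and does not appear in the code above:
# Re-create the reader (the loop above already exhausted it),
# then aggregate one batch at a time instead of loading everything at once.
reader = pd.read_csv("./hehe.csv", low_memory=False, chunksize=500000)
total = 0.0
for chunk in reader:                           # each chunk is a plain DataFrame
    total += chunk["contb_receipt_amt"].sum()  # assumed column name
print(total)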
Next, convert the string dates in the contb_receipt_dt column into proper datetime values. Before the conversion the column holds strings in DD-MON-YY form (e.g. "25-APR-11"); afterwards it holds datetime64 values. The conversion:
months = {"JAN": "1", "FEB": "2", "MAR": "3", "APR": "4", "MAY": "5", "JUN": "6",
          "JUL": "7", "AUG": "8", "SEP": "9", "OCT": "10", "NOV": "11", "DEC": "12"}

def conver(x):
    day, month, year = x.split("-")                        # split e.g. "25-APR-11" into parts
    return "20" + year + "-" + months[month] + "-" + day   # reassemble as "2011-4-25"

df1["contb_receipt_dt"] = df1["contb_receipt_dt"].map(conver)
df1["contb_receipt_dt"] = pd.to_datetime(df1["contb_receipt_dt"])  # convert to datetime64
df1["contb_receipt_dt"]
The cumulative-sum operation:
# cumulative sum
a = np.arange(101)  # generate an array of the integers 0 through 100
display(a)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100])
b = a.cumsum()  # cumulative sum via cumsum(): b[i] is the sum of a[0..i]
ree = DataFrame(b, columns=["num"])
ree["num"].plot()  # plot the cumulative-sum column