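Processing Large Text Data with pandas

When a data file has millions of rows, set chunksize so that the data is processed in batches.

Example: analyzing US presidential election data

Reading the large data set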

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df1 = pd.read_csv("./usa_election.csv", low_memory=False)
df1.shape

Result: (536041, 16)    # the data set has 536041 rows

Concatenate the data with itself to build an even larger data set

df = pd.concat([df1, df1, df1, df1])
df.shape

Result: (2144164, 16)

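%%time
ret = df.to_csv("./hehe.csv", index=False)

ret

This writes df to a file and measures how long the write takes. Because to_csv returns None when it is given a file path, ret displays nothing.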
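The %%time cell magic only works inside IPython or Jupyter. A minimal sketch of the same measurement in a plain Python script is given here, assuming a small stand-in DataFrame (demo) and a temporary output path, both of which are hypothetical and not part of the original example:

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in data; usa_election.csv is not assumed to exist here.
demo = pd.DataFrame({"id": np.arange(1000), "amount": np.arange(1000) * 0.5})

# Write into a fresh temporary directory so nothing is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "demo.csv")

start = time.perf_counter()
result = demo.to_csv(out_path, index=False)  # returns None when a path is given
elapsed = time.perf_counter() - start

print(result)
print(f"wrote {len(demo)} rows in {elapsed:.4f} s")
```

time.perf_counter is used rather than time.time because it is intended for measuring short intervals.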

 

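ret = pd.read_csv("./hehe.csv", low_memory=False, chunksize=500000)

# Read the large file back in. low_memory=False tells the parser not to process the file internally in chunks, which avoids mixed-type columns at the cost of more memory while parsing. chunksize sets how many rows are returned per batch, so the file is handed back in batches instead of all at once.

ret

The object returned is a reader, not a DataFrame: <pandas.io.parsers.TextFileReader at 0x122f30f0>

Loop over the reader to get the data out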

for x in ret:
    print(type(x))

Result:

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
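Each batch is an ordinary DataFrame (five of them here, since 2144164 rows are read 500000 at a time), so the data can then be processed batch by batch.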
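A common way to process the batches is to compute a partial result per chunk and combine the partial results at the end. The sketch below illustrates this on a tiny in-memory CSV. The column names cand and amt and the data itself are hypothetical, chosen only to keep the example self-contained:

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the large file on disk.
csv_text = "cand,amt\nA,10\nB,20\nA,30\nB,40\nA,50\n"

partials = []
row_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    row_count += len(chunk)
    # Per-chunk sum for each candidate; only this small Series is kept.
    partials.append(chunk.groupby("cand")["amt"].sum())

# Combine the per-chunk sums into overall totals.
totals = pd.concat(partials).groupby(level=0).sum()

print(row_count)
print(totals.to_dict())
```

Only one chunk is held in memory at a time, while the list of partial sums stays small because it has one entry per candidate per chunk.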


# Convert the date column from str to a datetime type

Processing steps:

months = {"JAN": "1", "FEB": "2", "MAR": "3", "APR": "4", "MAY": "5", "JUN": "6",
          "JUL": "7", "AUG": "8", "SEP": "9", "OCT": "10", "NOV": "11", "DEC": "12"}

def conver(x):
    day, month, year = x.split("-")   # split the string on "-"
    # Reassemble as year-month-day; the two-digit year is assumed to be 20xx
    datatime = "20" + year + "-" + months[month] + "-" + day
    return datatime

df1["contb_receipt_dt"] = df1["contb_receipt_dt"].map(conver)
df1["contb_receipt_dt"] = pd.to_datetime(df1["contb_receipt_dt"])   # convert to datetime
df1["contb_receipt_dt"]
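As an alternative to the manual month lookup, pd.to_datetime can parse this layout directly when it is given an explicit format string. The sketch below assumes the values look like 20-JUN-11 (day, abbreviated English month name, two-digit year). That layout is inferred from the conver function rather than shown in the article, and the sample values are hypothetical. It also assumes month names are parsed under an English locale:

```python
import pandas as pd

# Hypothetical sample values in the assumed DD-MON-YY layout.
raw = pd.Series(["20-JUN-11", "05-JAN-12", "31-DEC-11"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(raw, format="%d-%b-%y")

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

Passing format avoids a Python-level function call per row, which tends to matter on a column with hundreds of thousands of values.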

 

Cumulative sum

# Cumulative sum
a = np.arange(101)             # the integers 0 through 100 (a fixed range, not random data)
display(a)

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100])


b = a.cumsum()                      # compute the cumulative sum with cumsum()
ree = DataFrame(b, columns=["num"])
ree["num"].plot()                   # plot the cumulative-sum column
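A quick way to check the cumulative sum without looking at the plot is to compare it with the closed form 0 + 1 + ... + k = k(k + 1)/2. This is a small sanity-check sketch, not part of the original analysis:

```python
import numpy as np

a = np.arange(101)      # the integers 0 through 100
b = a.cumsum()          # b[k] is the sum of a[0] through a[k]

# Closed form for 0 + 1 + ... + k, evaluated for every k at once.
expected = a * (a + 1) // 2

print(b[-1])                        # final value: sum of 0 through 100
print(np.array_equal(b, expected))
```

The final element is 5050, and the curve drawn by plot() is the parabola k(k + 1)/2.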
