3-15 Techniques for processing large datasets

In [1]:
import pandas as pd

gl=pd.read_csv('./Titanic_Data-master/Titanic_Data-master/train.csv')
gl.head()
Out[1]:
 
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
In [2]:
gl.shape  # check the dimensions (rows, columns)
Out[2]:
(891, 12)
 

1. Inspect basic information about the data

In [3]:
gl.info(memory_usage='deep')  # basic information, including the true memory footprint
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 318.5 KB
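
The `'deep'` setting matters mainly for object columns. As a minimal sketch on a hypothetical toy frame (the names `toy`, `shallow` and `deep` are not part of the Titanic data), the shallow count only sees the pointers stored in an object column, while the deep count also includes the Python strings they refer to:

```python
import pandas as pd

# Hypothetical toy frame: one int64 column and one object (string) column.
toy = pd.DataFrame({
    "n": pd.Series(range(1000), dtype="int64"),
    "s": pd.Series(["passenger"] * 1000, dtype=object),
})

shallow = toy.memory_usage(deep=False)  # object column: only the stored pointers are counted
deep = toy.memory_usage(deep=True)      # object column: the string objects are counted too

print(shallow)
print(deep)
```

The numeric column reports the same size either way; only the object column grows under the deep count.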
 

2. Check the memory used by each data type

In [4]:
for dtype in ['float64','int64','object']:
    selected_dtype=gl.select_dtypes(include=[dtype])  # select the columns of this data type
    mean_usage_b=selected_dtype.memory_usage(deep=True).mean()  # mean memory per column for this type, in bytes
    mean_usage_mb=mean_usage_b/1024**2  # bytes to megabytes
    print('Average memory usage',dtype,mean_usage_mb)
 
Average memory usage float64 0.004572550455729167
Average memory usage int64 0.005685170491536458
Average memory usage object 0.043910980224609375
In [5]:
import numpy as np
int_types=['uint8','int8','int16','int32','int64']
for it in int_types:
    print(np.iinfo(it))  # np.iinfo reports the value range of each integer type
 
Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------

Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
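
These ranges are what decide how far a column can be shrunk. The sketch below is only an illustration: `smallest_int_dtype` is a hypothetical helper, not a NumPy or pandas function, and it simply returns the first candidate type whose range covers the given minimum and maximum.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Hypothetical helper: first candidate dtype whose range covers [lo, hi]."""
    candidates = ["uint8", "int8", "uint16", "int16",
                  "uint32", "int32", "uint64", "int64"]
    for name in candidates:
        info = np.iinfo(name)
        if info.min <= lo and hi <= info.max:
            return np.dtype(name)
    raise ValueError("range does not fit in a 64-bit integer")

print(smallest_int_dtype(1, 891))    # a PassengerId-like range
print(smallest_int_dtype(0, 1))      # a Survived-like range
print(smallest_int_dtype(-5, 100))   # negative values rule out unsigned types
```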

 

3. Reduce memory usage by converting data types

In [6]:
def menu_usage(pandas_obj):
    # isinstance() checks whether an object is of a given type, similar to type()
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b=pandas_obj.memory_usage(deep=True).sum()  # total memory across all columns
    else:
        usage_b=pandas_obj.memory_usage(deep=True)  # a Series returns a single number
    usage_mb=usage_b/1024**2  # bytes to megabytes
    return '{:03.2f}MB'.format(usage_mb)  # format with two decimal places

gl_int=gl.select_dtypes(include=['int64'])  # select the int64 columns
# pd.to_numeric converts each column; downcast='unsigned' picks the smallest unsigned integer type that fits
coverter_int=gl_int.apply(pd.to_numeric,downcast='unsigned')
print(menu_usage(gl_int))
print(menu_usage(coverter_int))
 
0.03MB
0.01MB
In [7]:
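gl_float=gl.select_dtypes(include=['float64'])  # select the float64 columns
# downcast='float' converts to the smallest float type that fits (float32).
# The outputs recorded below and in step 4 came from a run that passed gl_int here by mistake,
# which is why the second figure below is larger than the first.
coverter_float=gl_float.apply(pd.to_numeric,downcast='float')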
print(menu_usage(gl_float))
print(menu_usage(coverter_float))
 
0.01MB
0.02MB
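
To see what `downcast` does to an individual column, here is a small sketch on hypothetical Series (`ints`, `big_ints` and `floats` are made-up values, not taken from the Titanic file). It prints the dtype that `pd.to_numeric` settles on in each case:

```python
import pandas as pd

# Hypothetical sample columns, all starting at 64-bit width.
ints = pd.Series([0, 1, 3, 250], dtype="int64")          # fits in one unsigned byte
big_ints = pd.Series([1, 500, 891], dtype="int64")       # needs two bytes
floats = pd.Series([7.25, 71.2833, 8.05], dtype="float64")

small_ints = pd.to_numeric(ints, downcast="unsigned")
small_big_ints = pd.to_numeric(big_ints, downcast="unsigned")
small_floats = pd.to_numeric(floats, downcast="float")

print(small_ints.dtype, small_big_ints.dtype, small_floats.dtype)
```

The integer width chosen depends on the largest value in the column, so columns of the same original dtype can end up with different types after downcasting.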
 

4. Apply the converted data types to the whole table

In [8]:
optimized_gl=gl.copy()

optimized_gl[coverter_int.columns]=coverter_int
optimized_gl[coverter_float.columns]=coverter_float
print(menu_usage(gl))
print(menu_usage(optimized_gl))
 
0.31MB
0.29MB
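
The select, convert and write-back steps can also be wrapped in one function. This is a sketch under assumptions: `downcast_numeric` is a hypothetical helper and `toy` is made-up data, and the approach only handles int64 and float64 columns, leaving everything else untouched.

```python
import pandas as pd

def downcast_numeric(df):
    """Hypothetical helper: return a copy with int64/float64 columns downcast where possible."""
    out = df.copy()
    for col in out.select_dtypes(include=["int64"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="unsigned")
    for col in out.select_dtypes(include=["float64"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Hypothetical toy frame standing in for a real table.
toy = pd.DataFrame({
    "id": pd.Series(range(1, 1001), dtype="int64"),
    "flag": pd.Series([0, 1] * 500, dtype="int64"),
    "fare": pd.Series([7.25, 71.2833] * 500, dtype="float64"),
    "name": pd.Series(["x"] * 1000, dtype=object),
})

small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```

Working on a copy keeps the original frame available for comparison, at the cost of briefly holding both in memory.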
 

5. describe(): summary statistics for each label (column)

In [9]:
gl_obj=gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
Out[9]:
 
        Name                    Sex   Ticket  Cabin        Embarked
count   891                     891   891     204          889
unique  891                     2     681     147          3
top     Turkula, Mrs. (Hedwig)  male  347082  C23 C25 C27  S
freq    1                       577   7       4            644
 

6. Store repeated labels only once by converting to the category type

In [10]:
dow=gl_obj.Sex  # pull out the column for one label
dow.head()
Out[10]:
0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object
 

7. Reduce memory usage by converting object => category

In [11]:
dow_cat=dow.astype('category')  # convert the object column above to the category type
dow_cat.head()
Out[11]:
0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: category
Categories (2, object): [female, male]
In [12]:
dow_cat.head(10).cat.codes  # the integer code stored for each row's category
Out[12]:
0    1
1    0
2    0
3    0
4    1
5    1
6    1
7    1
8    0
9    0
dtype: int8
In [13]:
print(menu_usage(dow))  # object type
print(menu_usage(dow_cat))  # category type: the memory used has dropped
 
0.05MB
0.00MB
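
The saving comes from how a categorical column is laid out: each distinct label is stored once, and every row holds only a small integer code. A minimal sketch on a hypothetical column (`labels` is made-up data shaped like the Sex column):

```python
import pandas as pd

# Hypothetical column with two distinct labels repeated many times.
labels = pd.Series(["male", "female", "female", "male"] * 250, dtype=object)
as_cat = labels.astype("category")

print(as_cat.cat.categories.tolist())  # the distinct labels, stored once
print(as_cat.cat.codes.dtype)          # one small integer code per row
print(labels.memory_usage(deep=True), as_cat.memory_usage(deep=True))
```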
 

8. Measure the memory of the whole table after converting to the category type

In [14]:
converted_obj=pd.DataFrame()  # start from an empty DataFrame

for col in gl_obj.columns:
    num_unique_values=len(gl_obj[col].unique())
    num_total_values=len(gl_obj[col])
    # convert only the columns where fewer than half of the values are unique
    if num_unique_values/num_total_values<0.5:
        converted_obj[col]=gl_obj[col].astype('category')  # astype changes the dtype
    else:
        converted_obj[col]=gl_obj[col]
        
In [15]:
print(menu_usage(gl_obj))
print(menu_usage(converted_obj))
 
0.26MB
0.14MB
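
The uniqueness check exists because category only pays off when labels repeat. The sketch below uses two hypothetical columns (`unique_col` and `repeated_col` are made-up data) to compare memory before and after conversion. When every value is distinct, the categorical form has to store all the labels plus a code per row, so it should come out no smaller than the object column.

```python
import pandas as pd

n = 1000
# Hypothetical columns: one with all-distinct values, one with only three labels.
unique_col = pd.Series([f"ticket-{i}" for i in range(n)], dtype=object)
repeated_col = pd.Series(["S", "C", "Q", "S"] * (n // 4), dtype=object)

for name, col in [("unique", unique_col), ("repeated", repeated_col)]:
    ratio = col.nunique() / len(col)                        # share of distinct values
    before = col.memory_usage(deep=True)                    # bytes as object
    after = col.astype("category").memory_usage(deep=True)  # bytes as category
    print(name, ratio, before, after)
```

The 0.5 cutoff used in the loop above is a rule of thumb rather than a fixed boundary; the exact break-even point depends on the length of the strings.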