Reading notes on "Python for Data Analysis": Chapter 7, Data Wrangling: Clean, Transform, Merge, Reshape (Part 3)

http://www.cnblogs.com/batteryhp/p/5046433.html

 

5. Example: the USDA food database

Below is a concrete example; the examples are the most important part of the book.

#-*- encoding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
import re
import json

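# Load the 30 MB+ data file
db = json.load(open('E:\\foods-2011-10-03.json'))
#print len(db)
#print type(db)  # db is a list; each entry is a dict holding all the data for one food
#print db[0]  # this entry is very long
#print db[0].keys()
# 'nutrients' is one of the keys; its value is a (long) list of dicts describing the food's nutrients
#print db[0]['nutrients'][0]
# Turn the nutrients of the first food into a DataFrame
nutrients = DataFrame(db[0]['nutrients'])  # a list of dicts converts directly to a DataFrame
#print nutrients.head()
#print type(db[0]['nutrients'])
info_keys = ['description','group','id','manufacturer']
info = DataFrame(db,columns = info_keys)
#print info
# Look at the distribution of food groups
#print pd.value_counts(info.group)
# To analyze all of the nutrient data, the nutrients of every food must be combined into one big table; this takes several steps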
nutrients = []

for rec in db:
    fnuts = DataFrame(rec['nutrients'])
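    fnuts['id'] = rec['id']  # broadcast the scalar id to every row
    nutrients.append(fnuts)
nutrients = pd.concat(nutrients,ignore_index = True) # concatenate the list of frames row-wise, like R's rbind

# Drop duplicates, an important step in data processing
print nutrients.duplicated().sum()
nutrients = nutrients.drop_duplicates()
# nutrients and info share column names ('description', 'group'), so rename the columns of info
# Note how the renaming is specified with a dict
col_mapping = {'description':'food',
'group':'fgroup'}
# rename returns a new object; copy = False avoids copying the underlying data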
info = info.rename(columns = col_mapping,copy = False)
#print info.columns #查看一下列名
col_mapping = {'description':'nutrient','group':'nutgroup'}
nutrients = nutrients.rename(columns = col_mapping,copy = False)
#print nutrients.columns 
# With that done, the two DataFrames need to be merged
print nutrients.ix[:10,:]
#print info.id
ndata = pd.merge(nutrients,info,on = 'id',how = 'outer')
print ndata
print ndata.ix[3000]
# Note the neat approach below: the median value for each (nutrient, food group) pair
result = ndata.groupby(['nutrient','fgroup'])['value'].quantile(0.5)
print result
result['Zinc, Zn'].order().plot(kind = 'barh')
plt.show()
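# With a little cleverness (as the author says more than once), you can find which food is richest in each nutrient
by_nuttriend = ndata.groupby(['nutgroup','nutrient'])
print by_nuttriend.head()
# Note how the row holding the maximum value is picked out of each group
get_maximum = lambda x:x.xs(x.value.idxmax())
get_minimum = lambda x:x.xs(x.value.idxmin())
max_foods = by_nuttriend.apply(get_maximum)[['value','food']]
# Shorten the food names to 50 characters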
max_foods.food = max_foods.food.str[:50]
print max_foods.head()
print max_foods.ix['Amino Acids']['food']
>>>
14179
                       nutrient     nutgroup units    value    id
0                       Protein  Composition     g    25.18  1008
1             Total lipid (fat)  Composition     g    29.20  1008
2   Carbohydrate, by difference  Composition     g     3.06  1008
3                           Ash        Other     g     3.28  1008
4                        Energy       Energy  kcal   376.00  1008
5                         Water  Composition     g    39.28  1008
6                        Energy       Energy    kJ  1573.00  1008
7          Fiber, total dietary  Composition     g     0.00  1008
8                   Calcium, Ca     Elements    mg   673.00  1008
9                      Iron, Fe     Elements    mg     0.64  1008
10                Magnesium, Mg     Elements    mg    22.00  1008
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns:
nutrient        375176  non-null values
nutgroup        375176  non-null values
units           375176  non-null values
value           375176  non-null values
id              375176  non-null values
food            375176  non-null values
fgroup          375176  non-null values
manufacturer    293054  non-null values
dtypes: float64(1), int64(1), object(6)
nutrient                 Glycine
nutgroup             Amino Acids
units                          g
value                      0.073
id                          1077
food            Spearmint, fresh
fgroup          Spices and Herbs
manufacturer                   
Name: 3000
nutrient          fgroup                          
Adjusted Protein  Sweets                               12.900
                  Vegetables and Vegetable Products     2.180
Alanine           Baby Foods                            0.085
                  Baked Products                        0.248
                  Beef Products                         1.550
                  Beverages                             0.003
                  Breakfast Cereals                     0.311
                  Cereal Grains and Pasta               0.373
                  Dairy and Egg Products                0.271
                  Ethnic Foods                          1.290
                  Fast Foods                            0.514
                  Fats and Oils                         0.000
                  Finfish and Shellfish Products        1.218
                  Fruits and Fruit Juices               0.027
                  Lamb, Veal, and Game Products         1.408
...
Zinc, Zn  Finfish and Shellfish Products       0.67
          Fruits and Fruit Juices              0.10
          Lamb, Veal, and Game Products        3.94
          Legumes and Legume Products          1.14
          Meals, Entrees, and Sidedishes       0.63
          Nut and Seed Products                3.29
          Pork Products                        2.32
          Poultry Products                     2.50
          Restaurant Foods                     0.80
          Sausages and Luncheon Meats          2.13
          Snacks                               1.47
          Soups, Sauces, and Gravies           0.20
          Spices and Herbs                     2.75
          Sweets                               0.36
          Vegetables and Vegetable Products    0.33
Length: 2246
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 467 entries, (u'Amino Acids', u'Alanine', 48) to (u'Vitamins', u'Vitamin K (phylloquinone)', 395)
Data columns:
nutrient        467  non-null values
nutgroup        467  non-null values
units           467  non-null values
value           467  non-null values
id              467  non-null values
food            467  non-null values
fgroup          467  non-null values
manufacturer    444  non-null values
dtypes: float64(1), int64(1), object(6)
                            value                                          food
nutgroup    nutrient                                                          
Amino Acids Alanine         8.009             Gelatins, dry powder, unsweetened
            Arginine        7.436                  Seeds, sesame flour, low-fat
            Aspartic acid  10.203                           Soy protein isolate
            Cystine         1.307  Seeds, cottonseed flour, low fat (glandless)
            Glutamic acid  17.452                           Soy protein isolate
nutrient
Alanine                           Gelatins, dry powder, unsweetened
Arginine                               Seeds, sesame flour, low-fat
Aspartic acid                                   Soy protein isolate
Cystine                Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                   Soy protein isolate
Glycine                           Gelatins, dry powder, unsweetened
Histidine                Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline    KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine            Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine                    Fish, cod, Atlantic, dried and salted
Phenylalanine     Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline                           Gelatins, dry powder, unsweetened
Serine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan         Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine          Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine            Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food
[Finished in 14.1s]
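The script above targets Python 2 and an early pandas release; `Series.order` and `.ix` were removed from later pandas versions. As a minimal sketch of the same steps (stack the per-food nutrient tables, drop duplicates, rename, merge, take group medians) under Python 3 and a recent pandas, the following uses a tiny made-up `db` in place of the USDA file. The records and values in it are illustrative assumptions, not data from the database.

```python
import pandas as pd

# Hypothetical stand-in for the USDA JSON: a list of dicts, one per food.
db = [
    {"id": 1, "description": "Cheese", "group": "Dairy", "manufacturer": "",
     "nutrients": [
         {"description": "Protein", "group": "Composition", "units": "g", "value": 25.0},
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 3.0},
         # exact duplicate of the previous row, to exercise drop_duplicates
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 3.0},
     ]},
    {"id": 2, "description": "Milk", "group": "Dairy", "manufacturer": "",
     "nutrients": [
         {"description": "Protein", "group": "Composition", "units": "g", "value": 3.0},
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 1.0},
     ]},
    {"id": 3, "description": "Beef", "group": "Beef Products", "manufacturer": "",
     "nutrients": [
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 5.0},
     ]},
]

# Food-level table; rename the columns that clash with the nutrient table.
info = pd.DataFrame(db, columns=["description", "group", "id", "manufacturer"])
info = info.rename(columns={"description": "food", "group": "fgroup"})

# One nutrient frame per food, tagged with the food id, then stacked row-wise.
frames = []
for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]  # scalar is broadcast to every row
    frames.append(fnuts)
nutrients = pd.concat(frames, ignore_index=True)

n_dupes = int(nutrients.duplicated().sum())
nutrients = nutrients.drop_duplicates()
nutrients = nutrients.rename(columns={"description": "nutrient", "group": "nutgroup"})

# Join the two tables on the food id and take the median per (nutrient, food group).
ndata = pd.merge(nutrients, info, on="id", how="outer")
medians = ndata.groupby(["nutrient", "fgroup"])["value"].quantile(0.5)

# sort_values() takes the place of the old Series.order()
zinc = medians["Zinc, Zn"].sort_values()
print(zinc)
```

Calling `zinc.plot(kind="barh")` followed by `plt.show()` would then draw the bar chart as in the original script, assuming matplotlib is installed.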

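[Figure: horizontal bar chart of the median zinc value by food group, produced by the plot call in the script]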
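The `get_maximum` lambda in the script picks, within each group, the row whose `value` is largest. A minimal sketch of the same idea for a recent pandas follows, using `idxmax` on the grouped column and `.loc` in place of `xs` and `.ix`. The small `ndata` frame here is made up for illustration and its foods and values are assumptions.

```python
import pandas as pd

# Hypothetical miniature of the merged table from the script.
ndata = pd.DataFrame({
    "nutgroup": ["Amino Acids", "Amino Acids", "Amino Acids", "Elements", "Elements"],
    "nutrient": ["Alanine", "Alanine", "Glycine", "Zinc, Zn", "Zinc, Zn"],
    "value": [8.0, 1.5, 0.07, 0.3, 5.0],
    "food": ["Gelatin", "Beef", "Spearmint", "Carrot", "Oyster"],
})

# For each (nutgroup, nutrient) pair, the row label of the largest value.
idx = ndata.groupby(["nutgroup", "nutrient"])["value"].idxmax()

# Pull those rows, index them by group keys, and truncate long food names.
max_foods = (
    ndata.loc[idx]
    .set_index(["nutgroup", "nutrient"])[["value", "food"]]
    .assign(food=lambda d: d["food"].str[:50])
)

# Foods richest in each amino acid (counterpart of max_foods.ix['Amino Acids']['food']).
amino = max_foods.xs("Amino Acids", level="nutgroup")["food"]
print(amino)
```

Swapping `idxmax` for `idxmin` gives the counterpart of `get_minimum`.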

 