http://www.cnblogs.com/batteryhp/p/5046433.html
5. Example: the USDA Food Database
Below is a concrete example; the examples are the most important part of the book.
# -*- encoding: utf-8 -*-
# Written for Python 2 and an older pandas release (print statements, .ix, Series.order())
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
import re
import json

# Load the 30 MB+ data file
db = json.load(open('E:\\foods-2011-10-03.json'))
# print len(db)
# print type(db)
# db is a list; each entry is a dict holding all the data for one food
# print db[0]  # this entry is very long
# print db[0].keys()
# 'nutrients' is one of the keys; its value is a (long) list of dicts describing the food's nutrients
# print db[0]['nutrients'][0]

# Turn the nutrients of the first food into a DataFrame
nutrients = DataFrame(db[0]['nutrients'])  # a list of dicts converts directly to a DataFrame
# print nutrients.head()
# print type(db[0]['nutrients'])
info_keys = ['description', 'group', 'id', 'manufacturer']
info = DataFrame(db, columns=info_keys)
# print info
# Look at the distribution of food groups
# print pd.value_counts(info.group)

# To analyze all of the nutrient data, the nutrients of every food have to be
# combined into one big table. This takes a few steps.
nutrients = []
for rec in db:
    fnuts = DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']  # the scalar id is broadcast to every row
    nutrients.append(fnuts)
# Concatenate the list of frames; this is like rbind in R (rows are stacked)
nutrients = pd.concat(nutrients, ignore_index=True)

# Remove duplicates, an important step in data cleaning
print nutrients.duplicated().sum()
nutrients = nutrients.drop_duplicates()

# nutrients and info share column names, so rename the columns of info first
# Note how the mapping is written
col_mapping = {'description': 'food',
               'group': 'fgroup'}
# rename returns a copy by default; pass copy=False to avoid copying the data
info = info.rename(columns=col_mapping, copy=False)
# print info.columns  # check the column names
col_mapping = {'description': 'nutrient', 'group': 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
# print nutrients.columns

# With that done, the two DataFrames obviously need to be merged
print nutrients.ix[:10, :]
# print info.id
ndata = pd.merge(nutrients, info, on='id', how='outer')
print ndata
print ndata.ix[3000]

# Note the neat approach below: the median value per nutrient and food group
result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
print result
result['Zinc, Zn'].order().plot(kind='barh')
plt.show()

# With a little thought (the author has said this more than once...), you can
# find which food is richest in each nutrient
by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])
print by_nutrient.head()
# Note how the row with the maximum value is picked out
get_maximum = lambda x: x.xs(x.value.idxmax())
get_minimum = lambda x: x.xs(x.value.idxmin())
max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]
# Shorten the food names a bit
max_foods.food = max_foods.food.str[:50]
print max_foods.head()
print max_foods.ix['Amino Acids']['food']
>>>
14179
                       nutrient     nutgroup  units    value    id
0                       Protein  Composition      g    25.18  1008
1             Total lipid (fat)  Composition      g    29.20  1008
2   Carbohydrate, by difference  Composition      g     3.06  1008
3                           Ash        Other      g     3.28  1008
4                        Energy       Energy   kcal   376.00  1008
5                         Water  Composition      g    39.28  1008
6                        Energy       Energy     kJ  1573.00  1008
7          Fiber, total dietary  Composition      g     0.00  1008
8                   Calcium, Ca     Elements     mg   673.00  1008
9                      Iron, Fe     Elements     mg     0.64  1008
10                Magnesium, Mg     Elements     mg    22.00  1008
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns:
nutrient 375176 non-null values
nutgroup 375176 non-null values
units 375176 non-null values
value 375176 non-null values
id 375176 non-null values
food 375176 non-null values
fgroup 375176 non-null values
manufacturer 293054 non-null values
dtypes: float64(1), int64(1), object(6)
nutrient Glycine
nutgroup Amino Acids
units g
value 0.073
id 1077
food Spearmint, fresh
fgroup Spices and Herbs
manufacturer
Name: 3000
nutrient          fgroup
Adjusted Protein  Sweets                             12.900
                  Vegetables and Vegetable Products  2.180
Alanine           Baby Foods                         0.085
                  Baked Products                     0.248
                  Beef Products                      1.550
                  Beverages                          0.003
                  Breakfast Cereals                  0.311
                  Cereal Grains and Pasta            0.373
                  Dairy and Egg Products             0.271
                  Ethnic Foods                       1.290
                  Fast Foods                         0.514
                  Fats and Oils                      0.000
                  Finfish and Shellfish Products     1.218
                  Fruits and Fruit Juices            0.027
                  Lamb, Veal, and Game Products      1.408
...
Zinc, Zn          Finfish and Shellfish Products     0.67
                  Fruits and Fruit Juices            0.10
                  Lamb, Veal, and Game Products      3.94
                  Legumes and Legume Products        1.14
                  Meals, Entrees, and Sidedishes     0.63
                  Nut and Seed Products              3.29
                  Pork Products                      2.32
                  Poultry Products                   2.50
                  Restaurant Foods                   0.80
                  Sausages and Luncheon Meats        2.13
                  Snacks                             1.47
                  Soups, Sauces, and Gravies         0.20
                  Spices and Herbs                   2.75
                  Sweets                             0.36
                  Vegetables and Vegetable Products  0.33
Length: 2246
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 467 entries, (u'Amino Acids', u'Alanine', 48) to (u'Vitamins', u'Vitamin K (phylloquinone)', 395)
Data columns:
nutrient 467 non-null values
nutgroup 467 non-null values
units 467 non-null values
value 467 non-null values
id 467 non-null values
food 467 non-null values
fgroup 467 non-null values
manufacturer 444 non-null values
dtypes: float64(1), int64(1), object(6)
                             value  food
nutgroup     nutrient
Amino Acids  Alanine         8.009  Gelatins, dry powder, unsweetened
             Arginine        7.436  Seeds, sesame flour, low-fat
             Aspartic acid  10.203  Soy protein isolate
             Cystine         1.307  Seeds, cottonseed flour, low fat (glandless)
             Glutamic acid  17.452  Soy protein isolate
nutrient
Alanine         Gelatins, dry powder, unsweetened
Arginine        Seeds, sesame flour, low-fat
Aspartic acid   Soy protein isolate
Cystine         Seeds, cottonseed flour, low fat (glandless)
Glutamic acid   Soy protein isolate
Glycine         Gelatins, dry powder, unsweetened
Histidine       Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline  KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine      Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine          Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine      Fish, cod, Atlantic, dried and salted
Phenylalanine   Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline         Gelatins, dry powder, unsweetened
Serine          Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine       Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan      Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine          Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food
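The listing above targets Python 2 and an old pandas release; `.ix` and `Series.order()` have since been removed from pandas, with `.loc`/`.iloc` and `sort_values()` as the replacements. As a minimal sketch of the same flatten, deduplicate, and merge steps on a current pandas, the example below swaps the real JSON file for a tiny hand-made `db` list. The two food records and all of their values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the USDA JSON file: a list of food
# records, each with a nested list of nutrient dicts (values are made up).
db = [
    {"id": 1, "description": "Cheese", "group": "Dairy", "manufacturer": "",
     "nutrients": [
         {"description": "Protein", "group": "Composition", "units": "g", "value": 25.0},
         {"description": "Protein", "group": "Composition", "units": "g", "value": 25.0},
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 3.0},
     ]},
    {"id": 2, "description": "Apple", "group": "Fruits", "manufacturer": "",
     "nutrients": [
         {"description": "Protein", "group": "Composition", "units": "g", "value": 0.3},
         {"description": "Zinc, Zn", "group": "Elements", "units": "mg", "value": 0.04},
     ]},
]

# One row per food; rename the columns that clash with the nutrient table.
info = pd.DataFrame(db, columns=["description", "group", "id", "manufacturer"])
info = info.rename(columns={"description": "food", "group": "fgroup"})

# One frame per food, tagged with the food id, then stacked row-wise.
frames = []
for rec in db:
    fnuts = pd.DataFrame(rec["nutrients"])
    fnuts["id"] = rec["id"]  # the scalar id is broadcast to every row
    frames.append(fnuts)
nutrients = pd.concat(frames, ignore_index=True)

# Count and drop exact duplicate rows before merging.
n_duplicates = int(nutrients.duplicated().sum())
nutrients = nutrients.drop_duplicates()
nutrients = nutrients.rename(columns={"description": "nutrient", "group": "nutgroup"})

# Attach the food-level columns to every nutrient row.
ndata = pd.merge(nutrients, info, on="id", how="outer")
print(n_duplicates)
print(ndata.loc[:, ["nutrient", "value", "food"]])
```

Renaming before the merge keeps the result free of the `_x`/`_y` suffixes that pandas would otherwise add to the clashing `description` and `group` columns.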
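The two analysis steps, the median value per nutrient and food group and the richest food per nutrient, can also be sketched on a current pandas. This is only an illustration under stated assumptions: the small `ndata` frame below is invented, and the maximum is picked with a grouped `idxmax()` followed by `.loc`, which is a swapped-in alternative to the `apply` with `xs` used in the listing, not the original author's method.

```python
import pandas as pd

# Hypothetical merged table with the same column names as ndata in the post
# (all rows and values are made up for illustration).
ndata = pd.DataFrame({
    "nutgroup": ["Elements", "Elements", "Elements", "Composition", "Composition"],
    "nutrient": ["Zinc, Zn", "Zinc, Zn", "Zinc, Zn", "Protein", "Protein"],
    "fgroup": ["Beef", "Beef", "Fruits", "Beef", "Fruits"],
    "food": ["Steak", "Liver", "Apple", "Steak", "Apple"],
    "value": [4.0, 6.0, 0.04, 25.0, 0.3],
})

# Median value for every (nutrient, food group) pair.
medians = ndata.groupby(["nutrient", "fgroup"])["value"].quantile(0.5)

# Selecting one nutrient leaves a Series indexed by food group;
# sort_values() replaces the removed Series.order().
zinc_sorted = medians["Zinc, Zn"].sort_values()
print(zinc_sorted)

# Row label of the largest value inside each (nutgroup, nutrient) group,
# then pull those rows out of the original frame.
idx = ndata.groupby(["nutgroup", "nutrient"])["value"].idxmax()
max_foods = (
    ndata.loc[idx, ["nutgroup", "nutrient", "value", "food"]]
    .set_index(["nutgroup", "nutrient"])
)
print(max_foods)
```

Calling `zinc_sorted.plot(kind="barh")` would give the same kind of horizontal bar chart as in the listing, provided matplotlib is installed.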