Original article; feel free to share! http://my.oschina.net/u/2306127/blog/613875
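Air pollution has been severe lately, and I also wanted to practice what I have learned about writing Orange add-ons and processing data, so I am developing an add-on that fetches and analyzes AQI data. Here is what it looks like so far; pretty cool, right? [Once it is more complete I will share the source code. For now I am holding it back so that it does not mislead anyone; get in touch if you are interested.]
Along the way I also noticed an important trend: within the whole North China Plain, Beijing's air quality is the best at almost any time!
This post mainly describes the process. The conclusion so far is only a preliminary observation; supporting charts will follow in later work.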
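Below are the problems I ran into and how I handled them. Some issues are still open, in case an expert out there can solve them:
The data comes from http://aqicn.org. I use the requests library to fetch it; the library is powerful, and in particular it lets you set custom headers. Without custom headers, the site's anti-scraping measures return only stale data, so the latest readings cannot be obtained. The code is as follows: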
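    import json
    import re

    import requests

    # Get AQI data from the web for a region.
    # geturl() and gethead() are helper functions defined elsewhere in the add-on.
    def getaqidata(left, right, bottom, top):
        aqi_url = geturl(left, right, bottom, top)
        aqi = requests.get(aqi_url, headers=gethead())
        raqi = aqi.text
        # Pull the JSON array out of the surrounding JavaScript call.
        raqi2 = re.search(r'\[\{.*\}\]', raqi)
        cities = json.loads(raqi2.group(0))
        return cities

To find the exact headers, open Firefox's developer tools, select "Network", and pick the data request from the request list to see all of its messages. Then choose "Raw headers" to copy the headers, put them into the gethead() function, and return them as a dictionary. Then call: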
aqi = requests.get(aqi_url,headers=gethead())
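As a rough sketch of what gethead() might look like, the helper below turns raw header text pasted from the browser into a dictionary. The name parse_raw_headers and all header values shown are placeholders for illustration, not the headers the site actually requires.

```python
def parse_raw_headers(raw):
    """Convert 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip any line that has no colon
            headers[name.strip()] = value.strip()
    return headers


# Placeholder values (assumption): paste the real "raw headers" text here.
RAW_HEADERS = """
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0
Accept: */*
Referer: http://aqicn.org/map/
"""


def gethead():
    """Return the request headers as a dictionary."""
    return parse_raw_headers(RAW_HEADERS)


print(gethead()["Referer"])
```

Only the first colon on each line is treated as the separator, so header values that contain URLs stay intact.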
The response is a JSON string, but it is wrapped in a JavaScript function call, as shown here:
    mapShowLevel2Makers([
      {"lat":"38.871","lon":"115.521","aqi":"112",
       "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800,
       "city":"City Monitoring Station, Baoding",
       "img":"_c_az8khNSs3Uf7J_7tN1s57uaNIH4uezJz7b2v189UwA",
       "pol":"pm25","tz":"+0800","idx":781,"x":668},
      {"lat":"38.896","lon":"115.522","aqi":"93",
       "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800,
       "city":"Huadian II, Baoding",
       "img":"_AR8A4P9DTjpIZWJlaS_kv53lrprluIIv5Y2O55S15LqM5Yy6",
       "pol":"pm25","tz":"+0800","idx":783,"x":670},
      ...
      {"lat":"40.152","lon":"118.311","aqi":"48",
       "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800,
       "city":"Qianxi EPA, Tangshan",
       "img":"_ASUA2v9DTjpIZWJlaS_llJDlsbHluIIv6L-B6KW_546v5L-d5bGAKCop",
       "pol":"pm25","tz":"+0800","idx":823,"x":4640}], [7.8,0]);
A regular expression extracts the data, which is then loaded into cities.
raqi2 = re.search(r'\[\{.*\}\]',raqi)
The extracted cities content looks like this:
    [
      {"lat":"38.871","lon":"115.521","aqi":"112",
       "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800,
       "city":"City Monitoring Station, Baoding",
       "img":"_c_az8khNSs3Uf7J_7tN1s57uaNIH4uezJz7b2v189UwA",
       "pol":"pm25","tz":"+0800","idx":781,"x":668},
      {"lat":"38.896","lon":"115.522","aqi":"93",
       "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800,
       "city":"Huadian II, Baoding",
       "img":"_AR8A4P9DTjpIZWJlaS_kv53lrprluIIv5Y2O55S15LqM5Yy6",
       "pol":"pm25","tz":"+0800","idx":783,"x":670},
      ...
      {"lat":"40.152","lon":"118.311","aqi":"48",
       "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800,
       "city":"Qianxi EPA, Tangshan",
       "img":"_ASUA2v9DTjpIZWJlaS_llJDlsbHluIIv6L-B6KW_546v5L-d5bGAKCop",
       "pol":"pm25","tz":"+0800","idx":823,"x":4640}]
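The extraction step can be tried offline on a trimmed copy of the response shown above. This is only a sketch: extract_cities is a hypothetical helper name, the sample keeps just a few fields, and re.DOTALL plus the None check are defensive additions in case the real response spans several lines or has a different shape.

```python
import json
import re


def extract_cities(payload):
    """Return the station list embedded in a mapShowLevel2Makers(...) payload."""
    # Greedy match from the first "[{" to the last "}]" in the text.
    match = re.search(r'\[\{.*\}\]', payload, re.DOTALL)
    if match is None:
        return []  # unexpected response: nothing to parse
    return json.loads(match.group(0))


# Trimmed sample based on the response shown in the article.
sample = ('mapShowLevel2Makers([{"lat":"38.871","lon":"115.521","aqi":"112",'
          '"city":"City Monitoring Station, Baoding"},'
          '{"lat":"40.152","lon":"118.311","aqi":"48",'
          '"city":"Qianxi EPA, Tangshan"}], [7.8,0]);')

cities = extract_cities(sample)
print(len(cities), cities[0]["aqi"])
```

The trailing `[7.8,0]` is left out of the match because it contains no `}]`, so the greedy pattern stops at the end of the station array.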
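cities is a standard list of dict objects, one per monitoring station, each holding a number of key-value pairs.
cities can be accessed with the usual JSON operations or as an ordinary Python list.
pandas offers a rich set of data manipulation functions, and it can convert the cities structure above directly into a pandas.DataFrame.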
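For instance, plain list and dict operations are enough to find the worst reading or to filter by city. The sketch below uses a hand-copied subset of the stations above; note that every value arrives as a string, so the AQI has to be converted before comparing. The aqi_value helper is an illustrative name, and its handling of non-numeric values is a precaution rather than something the article reports.

```python
cities = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112",
     "city": "City Monitoring Station, Baoding"},
    {"lat": "38.896", "lon": "115.522", "aqi": "93",
     "city": "Huadian II, Baoding"},
    {"lat": "40.152", "lon": "118.311", "aqi": "48",
     "city": "Qianxi EPA, Tangshan"},
]


def aqi_value(station):
    """Return the AQI as an int, or None if the field is missing or not numeric."""
    try:
        return int(station["aqi"])
    except (KeyError, ValueError):
        return None


# Keep only stations with a usable reading, then pick the highest AQI.
valid = [c for c in cities if aqi_value(c) is not None]
worst = max(valid, key=aqi_value)

# Filter by the city part of the "station, city" name.
baoding = [c["city"] for c in cities if c["city"].endswith("Baoding")]

print(worst["city"], aqi_value(worst))
print(baoding)
```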
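    import pandas as pd
    df = pd.DataFrame(cities)

You can also use pandas.DataFrame.to_csv() to save the data to a CSV file, or save it directly as an Excel spreadsheet, and then... there is a lot you can do with it.
GeoPandas adds a geometry field that stores geometric objects. The lon/lat fields of the pandas.DataFrame can be converted to point objects, but saving to shp then fails; it works once the text fields are removed. (Looking at the data, I found pinyin and similar characters, which may have been treated as illegal characters because they were not handled.) For now I have a workaround: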
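A small sketch of the DataFrame and CSV step, assuming pandas is installed and using a hand-copied subset of the stations. Because the fields arrive as strings, the numeric columns are converted explicitly; writing to an in-memory buffer instead of a file is just a convenience for the example.

```python
import io

import pandas as pd

cities = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112",
     "city": "City Monitoring Station, Baoding"},
    {"lat": "38.896", "lon": "115.522", "aqi": "93",
     "city": "Huadian II, Baoding"},
    {"lat": "40.152", "lon": "118.311", "aqi": "48",
     "city": "Qianxi EPA, Tangshan"},
]

df = pd.DataFrame(cities)

# All values are strings at this point; convert the numeric columns.
# errors="coerce" turns anything unparsable into NaN instead of raising.
for col in ("lat", "lon", "aqi"):
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Write CSV to an in-memory buffer; pass a file name instead to save to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(df.sort_values("aqi", ascending=False))
```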
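    import pandas as pd
    from geopandas import GeoDataFrame, GeoSeries
    from shapely.geometry import Point

    def aqi2geopandas(cities):
        df = pd.DataFrame(cities)
        ps = []  # point geometries built from lon/lat
        ns = []  # short names: the last comma-separated part of "city"
        for index, row in df.iterrows():
            print(index, ':', row['lat'], '-', row['lon'])
            ps.append(Point(float(row['lon']), float(row['lat'])))
            addr = row["city"].split(",")
            if len(addr) >= 1:
                ns.append(addr[-1])
            else:
                ns.append("noname")
        gs = GeoSeries(ps, crs={'init': 'epsg:4326', 'no_defs': True})
        geodf = GeoDataFrame({'id': df["x"], 'name': ns,
                              'lon': df["lon"], 'lat': df["lat"],
                              'aqi': df["aqi"], 'utime': df["utime"], 'tz': df["tz"],
                              'geometry': gs})
        return geodf

If a direct conversion were possible, the code above could be simplified considerably. The priority was to get the data first; the code itself can be studied and optimized later.

    # Get the GeoPandas object.
    gdf = aqi2geopandas(cities)
    # fshp is the name of the file to save.
    gdf.to_file(fshp)

I ran into some problems at this step. The main one is that text values could not be added when constructing an Orange.data.Table object; I do not know how some of the APIs are meant to be used and did not fully understand them from reading the source code, so I will look into this later. For now I save the data to a .tab file and read it back in. I have tested this and it works, but it requires a temporary file, so performance will suffer somewhat.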
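One possible shape for the more direct conversion is sketched below. It assumes a reasonably recent GeoPandas that provides points_from_xy and accepts a CRS string such as "EPSG:4326"; neither is taken from the article, which targets an older release. Whether to_file() then accepts the text column will depend on the installed versions, so the sketch stops at building the frame.

```python
import geopandas as gpd
import pandas as pd

cities = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112", "x": 668,
     "city": "City Monitoring Station, Baoding"},
    {"lat": "40.152", "lon": "118.311", "aqi": "48", "x": 4640,
     "city": "Qianxi EPA, Tangshan"},
]

df = pd.DataFrame(cities)
df["lon"] = pd.to_numeric(df["lon"])
df["lat"] = pd.to_numeric(df["lat"])
# Short name: the last comma-separated part of "city", without padding spaces.
df["name"] = df["city"].str.split(",").str[-1].str.strip()

# Build the point geometry for all rows at once instead of looping.
geodf = gpd.GeoDataFrame(
    df[["x", "name", "lon", "lat", "aqi"]],
    geometry=gpd.points_from_xy(df["lon"], df["lat"]),
    crs="EPSG:4326",
)

print(geodf)
```

Calling geodf.to_file(fshp) would then write the shapefile as in the article.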
    def reformcity_tab(i, city):
        # Build one tab-separated row: ID, lat, lon, AQI, full station name,
        # up to three address parts (last part first), update time, time zone.
        rinfo = str(i+1) + "\t"
        rinfo = rinfo + city["lat"] + "\t"
        rinfo = rinfo + city["lon"] + "\t"
        rinfo = rinfo + city["aqi"] + "\t"
        rinfo = rinfo + city["city"] + "\t"
        addr = city["city"].split(",")
        if len(addr) == 0:
            rinfo = rinfo + "\t-\t-\t-\t"
        if len(addr) == 1:
            rinfo = rinfo + addr[0] + "\t-\t-\t"
        if len(addr) == 2:
            rinfo = rinfo + addr[1] + "\t" + addr[0] + "\t-\t"
        if len(addr) >= 3:
            rinfo = rinfo + addr[2] + "\t" + addr[1] + "\t" + addr[0] + "\t"
        rinfo = rinfo + city["utime"] + "\t"
        rinfo = rinfo + city["tz"]
        #print("$",rinfo)
        return rinfo

    def writecityname_tab(cities, Filename):
        print("#Write to File:", Filename, "...")
        with open(Filename, 'w') as f:
            # Three header lines: column names, column types, column flags.
            f.write("ID\tLatitude\tLongitude\tAQI\tNAME\tPROV\tCONT\tSTA\tUTIME\tTZ" + "\n")
            f.write("discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete" + "\n")
            f.write(" \t \t \t \t \t \t \t \t" + "\n")
            for i, city in enumerate(cities):
                try:
                    rinfo = reformcity_tab(int(city["x"]), city)
                    f.write(rinfo + "\n")
                    #print(city)
                except Exception as err:
                    # Skip stations whose record cannot be formatted.
                    print("#ERROR: ", err)
                    continue
        print("#Write AQI to Orange.data.Table Finished.")

Then read the .tab file back in:

    # ftable is the file name saved above; it must be exactly the same.
    self.table = Orange.data.Table(ftable)

At this point I can fetch AQI data from the web for a specified region and convert it into an Orange.data.Table, a pandas.DataFrame, and a geopandas.GeoDataFrame. The GeoDataFrame can be written to a shp file with GeoDataFrame.to_file(fname), which can then be opened in various GIS packages and in data analysis software such as R for further analysis and mapping. I opened it in QGIS without any problems.
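The write-then-read round trip can be sketched with the standard library alone, using a temporary directory so that the intermediate file is cleaned up automatically. The helper names, the reduced column set, and the sample rows are illustrative assumptions; the three header lines follow the same pattern as the article's code. The Orange call is left as a comment so that the sketch runs without Orange installed.

```python
import os
import tempfile

COLUMNS = ["ID", "Latitude", "Longitude", "AQI", "NAME"]


def city_row(city):
    """Build one tab-separated row; NAME is the last comma-separated part."""
    name = city["city"].split(",")[-1].strip()
    fields = [str(city["x"]), city["lat"], city["lon"], city["aqi"], name]
    return "\t".join(fields)


def write_tab(cities, path):
    """Write a .tab file with three header lines followed by the data rows."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\t".join(COLUMNS) + "\n")                      # column names
        f.write("\t".join(["discrete"] * len(COLUMNS)) + "\n")  # types, as in the article
        f.write("\t".join([" "] * len(COLUMNS)) + "\n")         # flags, left blank
        for city in cities:
            f.write(city_row(city) + "\n")


cities = [
    {"x": 668, "lat": "38.871", "lon": "115.521", "aqi": "112",
     "city": "City Monitoring Station, Baoding"},
    {"x": 4640, "lat": "40.152", "lon": "118.311", "aqi": "48",
     "city": "Qianxi EPA, Tangshan"},
]

with tempfile.TemporaryDirectory() as tmpdir:
    ftable = os.path.join(tmpdir, "aqi.tab")
    write_tab(cities, ftable)
    # Orange.data.Table(ftable) would be called here, while the file still exists.
    with open(ftable, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines[0])
print(lines[3])
```

Joining a list of fields with "\t" avoids the long chain of string concatenations and keeps the number of columns in each row easy to check against the header.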