Following the previous chapters, in which the front-end 3D globe display was implemented, this chapter builds a crawler for the epidemic data.
The data is crawled with a Python crawler.
1 Python 3.7
2 Scrapy 2.0.1
3 Selenium
4 chromedriver: choose the version that matches your browser, http://npm.taobao.org/mirrors/chromedriver/
5 PhantomJS
6 MySQL / MongoDB, accessed via pymysql / pymongo
The most up-to-date epidemic data comes from the National Health Commission website, but since this example is only a demo exercise, we crawl QQ's epidemic-map page instead. Reference link:
https://news.qq.com/zt2020/page/feiyan.htm#/global?ct=United%20States&nojump=1
Install the environment and packages listed in section 1.
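For example, the Python packages can be installed with pip (versions here simply match the list above; chromedriver and PhantomJS are separate binaries that need to be downloaded and put on your PATH):

pip install scrapy==2.0.1 selenium pymysql pymongo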
Use the scrapy command to create a crawler project; it creates a project folder with a standard template. Run:
F:
cd F:\mygithub\VDDataServer
scrapy startproject COVID19
Create a spider file:
cd COVID19
scrapy genspider Covid19Spider news.qq.com
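The generated spider is roughly the standard Scrapy template shown below (the exact contents depend on the Scrapy version); its name attribute is what scrapy crawl refers to later:

# COVID19/spiders/Covid19Spider.py, roughly as generated
import scrapy


class Covid19spiderSpider(scrapy.Spider):
    name = 'Covid19Spider'
    allowed_domains = ['news.qq.com']
    start_urls = ['http://news.qq.com/']

    def parse(self, response):
        pass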
That completes the crawler project scaffold; the project structure is as follows.
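A freshly generated Scrapy project looks roughly like this (assuming the default template):

COVID19/
    scrapy.cfg
    COVID19/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            Covid19Spider.py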
Open the epidemic-map page https://news.qq.com/zt2020/page/feiyan.htm#/global?ct=United%20States&nojump=1. It shows both domestic (China) and overseas figures; here we only crawl the list data.
Overseas data: the list sits under the element with id foreignWraper.
Domestic data: the list sits under the element with id listWraper.
The data is already rendered into the HTML, so we crawl it with Scrapy + Selenium + PhantomJS.
From the page you can see there are six data fields per row.
Open the crawler project, find the items.py file generated by the Scrapy template, and define an item class. Its fields are, in order: the collection/table name, region name, parent region name, new cases, current cases, cumulative cases, cured, and deaths.
import scrapy


class Covid19Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = table = 'covid19'
    name = scrapy.Field()
    parent = scrapy.Field()
    new = scrapy.Field()
    now = scrapy.Field()
    total = scrapy.Field()
    cure = scrapy.Field()
    death = scrapy.Field()
Switch to Covid19Spider.py under the spiders folder (it needs to import Selector from scrapy and the Covid19Item defined above).
Modify the start_requests method, and use the meta dict to pass an identifier for each requested page.
def start_requests(self):
    # list of pages to crawl
    urls = [
        "https://news.qq.com/zt2020/page/feiyan.htm#/?ct=United%20States&nojump=1",
        "https://news.qq.com/zt2020/page/feiyan.htm#/global?ct=United%20States&nojump=1"
    ]
    # loop over the pages and send the requests
    for i in range(len(urls)):
        if i == 0:
            # run the parser for the China page
            yield scrapy.Request(urls[i], callback=self.parse_China,
                                 meta={'page': i}, dont_filter=True)
        else:
            # run the parser for the overseas page
            yield scrapy.Request(urls[i], callback=self.parse_Outsee,
                                 meta={'page': i}, dont_filter=True)
Add the parser method for domestic (China) data:
# China epidemic data
def parse_China(self, response):
    provinces = response.xpath(
        '//*[@id="listWraper"]/table[2]/tbody').extract()
    for prn in provinces:
        item = Covid19Item()
        prnNode = Selector(text=prn)
        item['name'] = prnNode.xpath(
            '//tr[1]/th/p[1]/span//text()').extract_first().replace('區', '')
        item['parent'] = ''
        item['new'] = prnNode.xpath(
            '//tr[1]/td[2]/p[2]//text()').extract_first()
        item['now'] = prnNode.xpath(
            '//tr[1]/td[1]/p[1]//text()').extract_first()
        item['total'] = prnNode.xpath(
            '//tr[1]/td[2]/p[1]//text()').extract_first()
        item['cure'] = prnNode.xpath(
            '//tr[1]/td[3]/p[1]//text()').extract_first()
        item['death'] = prnNode.xpath(
            '//tr[1]/td[4]/p[1]//text()').extract_first()
        cityNodes = prnNode.xpath('//*[@class="city"]').extract()
        for city in cityNodes:
            cityItem = Covid19Item()
            cityNode = Selector(text=city)
            cityItem['name'] = cityNode.xpath(
                '//th/span//text()').extract_first().replace('區', '')
            cityItem['parent'] = item['name']
            cityItem['new'] = ''
            cityItem['now'] = cityNode.xpath(
                '//td[1]//text()').extract_first()
            cityItem['total'] = cityNode.xpath(
                '//td[2]//text()').extract_first()
            cityItem['cure'] = cityNode.xpath(
                '//td[3]//text()').extract_first()
            cityItem['death'] = cityNode.xpath(
                '//td[4]//text()').extract_first()
            yield cityItem
        yield item
Add the parser method for overseas data:
# Overseas epidemic data
def parse_Outsee(self, response):
    countries = response.xpath(
        '//*[@id="foreignWraper"]/table/tbody').extract()
    for country in countries:
        countryNode = Selector(text=country)
        item = Covid19Item()
        item['name'] = countryNode.xpath(
            '//tr/th/span//text()').extract_first()
        item['parent'] = ''
        item['new'] = countryNode.xpath(
            '//tr/td[1]//text()').extract_first()
        item['now'] = ''
        item['total'] = countryNode.xpath(
            '//tr/td[2]//text()').extract_first()
        item['cure'] = countryNode.xpath(
            '//tr/td[3]//text()').extract_first()
        item['death'] = countryNode.xpath(
            '//tr/td[4]//text()').extract_first()
        yield item
Page requests go through a downloader middleware: the middleware drives Selenium + PhantomJS to load the page and returns an HtmlResponse for the spider to parse.
Modify middlewares.py and add a SeleniumMiddelware class; it uses the meta page flag set earlier to decide which element to wait for.
from scrapy import signals
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.http import HtmlResponse
from logging import getLogger
from time import sleep


class SeleniumMiddelware():
    def __init__(self, timeout=None, service_args=[]):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.PhantomJS(service_args=service_args)
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        self.browser.close()

    def process_request(self, request, spider):
        self.logger.debug('PhantomJs is Starting')
        page = request.meta.get('page', 1)
        try:
            # visit the URL
            self.browser.get(request.url)
            # wait for the element we want to crawl to load
            if page == 0:
                self.wait.until(EC.presence_of_element_located(
                    (By.CSS_SELECTOR, '#listWraper')))
            else:
                self.wait.until(EC.presence_of_element_located(
                    (By.CSS_SELECTOR, '#foreignWraper')))
            # sleep(2)
            return HtmlResponse(url=request.url, body=self.browser.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                   service_args=crawler.settings.get('PHANTOMJS_SERVICE_ARGS'))
Two settings are referenced here: the Selenium timeout and the PhantomJS service arguments. Add them to settings.py:
SELENIUM_TIMEOUT = 20
PHANTOMJS_SERVICE_ARGS = ['--load-images=false', '--disk-cache=true']
Enable the downloader middleware:
DOWNLOADER_MIDDLEWARES = {
    'COVID19.middlewares.SeleniumMiddelware': 543,
}
Pipelines define how scraped items are handled; this is where the data is stored. Define two classes: one for MongoDB storage and one for MySQL.
Define the MongoDB pipeline class:
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
Define the MySQL pipeline class:
import pymysql


class MySqlPipeLine(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DB'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT')
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user,
                                  password=self.password, database=self.database,
                                  charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        # insert, or update the existing row when the primary key already exists
        sql = 'insert into {table}({keys}) values ({values}) on duplicate key update'.format(
            table=item.table, keys=keys, values=values)
        update = ','.join([" {key}=%s".format(key=key) for key in data])
        sql += update
        try:
            if self.cursor.execute(sql, tuple(data.values()) * 2):
                print('successful')
            self.db.commit()
        except pymysql.MySQLError as e:
            print(e)
            self.db.rollback()
        return item
Both pipelines read their database settings from the configuration.
Add the settings in settings.py:
MONGO_URI = 'localhost'
MONGO_DB = 'COVID19'
MYSQL_HOST = 'localhost'
MYSQL_DB = 'covid19'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306
Enable the item pipelines:
ITEM_PIPELINES = {
    'COVID19.pipelines.MongoPipeline': 300,
    'COVID19.pipelines.MySqlPipeLine': 300
}
Create a covid19 database:
CREATE DATABASE covid19
Create the data tables: covid19 for the epidemic data and dic_lnglat as a region longitude/latitude dictionary.
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for covid19
-- ----------------------------
DROP TABLE IF EXISTS `covid19`;
CREATE TABLE `covid19` (
  `name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `parent` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `new` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `now` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `total` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `cure` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `death` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`name`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

-- ----------------------------
-- Table structure for dic_lnglat
-- ----------------------------
DROP TABLE IF EXISTS `dic_lnglat`;
CREATE TABLE `dic_lnglat` (
  `name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `lng` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `lat` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `type` int(0) NULL DEFAULT NULL,
  PRIMARY KEY (`name`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;
The crawler only obtains the epidemic figures from this page; to display the data on the 3D globe VDEarth we also need the longitude and latitude of each region.
Two kinds of records are crawled: domestic ones, covering provinces and the cities and districts under them, and overseas ones, which only carry a country name.
For domestic records we use each city's coordinates directly; for overseas ones we use the coordinates of each country's capital. I already had this data from earlier work; if you don't, the references below may help.
Domestic city coordinates: see http://www.javashuo.com/article/p-tpumegjz-hy.html
I couldn't find a ready-made list of foreign capital coordinates, so here is one I compiled:
globe = {
    "阿富汗": [69.11,34.28], "阿爾巴尼亞": [19.49,41.18], "阿爾及利亞": [3.08,36.42], "美屬薩摩亞": [-170.43,-14.16],
    "安道爾": [1.32,42.31], "安哥拉": [13.15,-8.50], "安提瓜和巴布達": [-61.48,17.20], "阿根廷": [-60.00,-36.30],
    "亞美尼亞": [44.31,40.10], "阿魯巴": [-70.02,12.32], "澳大利亞": [149.08,-35.15], "奧地利": [16.22,48.12],
    "阿塞拜疆": [49.56,40.29], "巴哈馬": [-77.20,25.05], "巴林": [50.30,26.10], "孟加拉國": [90.26,23.43],
    "巴巴多斯": [-59.30,13.05], "白俄羅斯": [27.30,53.52], "比利時": [4.21,50.51], "伯利茲": [-88.30,17.18],
    "貝寧": [2.42,6.23], "不丹": [89.45,27.31], "玻利維亞": [-68.10,-16.20], "波斯尼亞和黑塞哥維那": [18.26,43.52],
    "博茨瓦納": [25.57,-24.45], "巴西": [-47.55,-15.47], "英屬維爾京羣島": [-64.37,18.27], "文萊": [115.00,4.52],
    "保加利亞": [23.20,42.45], "布基納法索": [-1.30,12.15], "布隆迪": [29.18,-3.16], "柬埔寨": [104.55,11.33],
    "喀麥隆": [11.35,3.50], "加拿大": [-75.42,45.27], "佛得角": [-23.34,15.02], "開曼羣島": [-81.24,19.20],
    "中非共和國": [18.35,4.23], "乍得": [14.59,12.10], "智利": [-70.40,-33.24], "中國": [116.20,39.55],
    "哥倫比亞": [-74.00,4.34], "科摩羅": [43.16,-11.40], "剛果": [15.12,-4.09], "哥斯達黎加": [-84.02,9.55],
    "科特迪瓦": [-5.17,6.49], "克羅地亞": [15.58,45.50], "古巴": [-82.22,23.08], "塞浦路斯": [33.25,35.10],
    "捷克共和國": [14.22,50.05], "朝鮮": [125.30,39.09], "剛果(扎伊爾)": [15.15,-4.20], "丹麥": [12.34,55.41],
    "吉布提": [42.20,11.08], "多米尼加": [-61.24,15.20], "多米尼加共和國": [-69.59,18.30], "東帝汶": [125.34,-8.29],
    "厄瓜多爾": [-78.35,-0.15], "埃及": [31.14,30.01], "薩爾瓦多": [-89.10,13.40], "赤道幾內亞": [8.50,3.45],
    "厄立特里亞": [38.55,15.19], "愛沙尼亞": [24.48,59.22], "埃塞俄比亞": [38.42,9.02], "福克蘭羣島(馬爾維納斯羣島)": [-59.51,-51.40],
    "法羅羣島": [-6.56,62.05], "斐濟": [178.30,-18.06], "芬蘭": [25.03,60.15], "法國": [2.20,48.50],
    "法屬圭亞那": [-52.18,5.05], "法屬波利尼西亞": [-149.34,-17.32], "加蓬": [9.26,0.25], "岡比亞": [-16.40,13.28],
    "格魯吉亞": [44.50,41.43], "德國": [13.25,52.30], "加納": [-0.06,5.35], "希臘": [23.46,37.58],
    "格陵蘭": [-51.35,64.10], "瓜德羅普島": [-61.44,16.00], "危地馬拉": [-90.22,14.40], "根西島": [-2.33,49.26],
    "幾內亞": [-13.49,9.29], "幾內亞比紹": [-15.45,11.45], "圭亞那": [-58.12,6.50], "海地": [-72.20,18.40],
    "赫德島和麥當勞羣島": [74.00,-53.00], "洪都拉斯": [-87.14,14.05], "匈牙利": [19.05,47.29], "冰島": [-21.57,64.10],
    "印度": [77.13,28.37], "印度尼西亞": [106.49,-6.09], "伊朗": [51.30,35.44], "伊拉克": [44.30,33.20],
    "愛爾蘭": [-6.15,53.21], "以色列": [35.12,31.47], "意大利": [12.29,41.54], "牙買加": [-76.50,18.00],
    "約旦": [35.52,31.57], "哈薩克斯坦": [71.30,51.10], "肯尼亞": [36.48,-1.17], "基里巴斯": [173.00,1.30],
    "科威特": [48.00,29.30], "吉爾吉斯斯坦": [74.46,42.54], "老撾": [102.36,17.58], "拉脫維亞": [24.08,56.53],
    "黎巴嫩": [35.31,33.53], "萊索托": [27.30,-29.18], "利比里亞": [-10.47,6.18], "阿拉伯利比亞民衆國": [13.07,32.49],
    "列支敦士登": [9.31,47.08], "立陶宛": [25.19,54.38], "盧森堡": [6.09,49.37], "馬達加斯加": [47.31,-18.55],
    "馬拉維": [33.48,-14.00], "馬來西亞": [101.41,3.09], "馬爾代夫": [73.28,4.00], "馬裏": [-7.55,12.34],
    "馬耳他": [14.31,35.54], "馬提尼克島": [-61.02,14.36], "毛里塔尼亞": [57.30,-20.10], "馬約特島": [45.14,-12.48],
    "墨西哥": [-99.10,19.20], "密克羅尼西亞(聯邦) ": [158.09,6.55], "摩爾多瓦共和國": [28.50,47.02], "莫桑比克": [32.32,-25.58],
    "緬甸": [96.20,16.45], "納米比亞": [17.04,-22.35], "尼泊爾": [85.20,27.45], "荷蘭": [04.54,52.23],
    "荷屬安的列斯": [-69.00,12.05], "新喀里多尼亞": [166.30,-22.17], "新西蘭": [174.46,-41.19], "尼加拉瓜": [-86.20,12.06],
    "尼日爾": [2.06,13.27], "尼日利亞": [7.32,9.05], "諾福克島": [168.43,-45.20], "北馬裏亞納羣島": [145.45,15.12],
    "挪威": [10.45,59.55], "阿曼": [58.36,23.37], "巴基斯坦": [73.10,33.40], "帕勞": [134.28,7.20],
    "巴拿馬": [-79.25,9.00], "巴布亞新幾內亞": [147.08,-9.24], "巴拉圭": [-57.30,-25.10], "祕魯": [-77.00,-12.00],
    "菲律賓": [121.03,14.40], "波蘭": [21.00,52.13], "葡萄牙": [-9.10,38.42], "波多黎各": [-66.07,18.28],
    "卡塔爾": [51.35,25.15], "韓國": [126.58,37.31], "羅馬尼亞": [26.10,44.27], "俄羅斯": [37.35,55.45],
    "盧旺達": [30.04,-1.59], "聖基茨和尼維斯": [-62.43,17.17], "聖盧西亞": [-60.58,14.02], "聖皮埃爾和密克隆": [-56.12,46.46],
    "聖文森特和格林納丁斯": [-61.10,13.10], "薩摩亞": [-171.50,-13.50], "聖馬力諾": [12.30,43.55], "聖多美和普林西比": [6.39,0.10],
    "沙特阿拉伯": [46.42,24.41], "塞內加爾": [-17.29,14.34], "塞拉利昂": [-13.17,8.30], "斯洛伐克": [17.07,48.10],
    "斯洛文尼亞": [14.33,46.04], "所羅門羣島": [159.57,-9.27], "索馬里": [45.25,2.02], "比勒陀利亞": [28.12,-25.44],
    "西班牙": [-3.45,40.25], "蘇丹": [32.35,15.31], "蘇里南": [-55.10,5.50], "斯威士蘭": [31.06,-26.18],
    "瑞典": [18.03,59.20], "瑞士": [7.28,46.57], "阿拉伯敘利亞共和國": [36.18,33.30], "塔吉克斯坦": [68.48,38.33],
    "泰國": [100.35,13.45], "馬其頓": [21.26,42.01], "多哥": [1.20,6.09], "湯加": [-174.00,-21.10],
    "突尼斯": [10.11,36.50], "土耳其": [32.54,39.57], "土庫曼斯坦": [57.50,38.00], "圖瓦盧": [179.13,-8.31],
    "烏干達": [32.30,0.20], "烏克蘭": [30.28,50.30], "阿聯酋": [54.22,24.28], "英國": [-0.05,51.36],
    "坦桑尼亞": [35.45,-6.08], "美國": [-77.02,39.91], "美屬維爾京羣島": [-64.56,18.21], "烏拉圭": [-56.11,-34.50],
    "烏茲別克斯坦": [69.10,41.20], "瓦努阿圖": [168.18,-17.45], "委內瑞拉": [-66.55,10.30], "越南": [105.55,21.05],
    "南斯拉夫": [20.37,44.50], "贊比亞": [28.16,-15.28], "津巴布韋": [31.02,-17.43]
}
You can add a small Python step that writes this dictionary into the MySQL dic_lnglat table, as sketched below.
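Here is a minimal sketch of that sync step, assuming the globe dict above and the MySQL settings used earlier; the sync_globe_to_mysql helper name and the type=1 convention for overseas capitals are made up for illustration:

import pymysql


def sync_globe_to_mysql(globe):
    # write each capital's coordinates into the dic_lnglat dictionary table
    db = pymysql.connect(host='localhost', user='root', password='123456',
                         database='covid19', charset='utf8', port=3306)
    cursor = db.cursor()
    sql = ('insert into dic_lnglat(name, lng, lat, type) values (%s, %s, %s, %s) '
           'on duplicate key update lng=values(lng), lat=values(lat), type=values(type)')
    try:
        for name, (lng, lat) in globe.items():
            # type=1 marks an overseas capital here (assumed convention)
            cursor.execute(sql, (name.strip(), str(lng), str(lat), 1))
        db.commit()
    except pymysql.MySQLError as e:
        print(e)
        db.rollback()
    finally:
        db.close()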
Run the scrapy command to start the crawler:
scrapy crawl covid19spider
You can watch the crawler's log output in the console; then open the database to check the results.
The pipeline section above used both MySQL and Mongo, so the data is written to both; in practice, picking one of them is enough.
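If you only want one backend, keep just that pipeline in ITEM_PIPELINES, for example MySQL only:

ITEM_PIPELINES = {
    'COVID19.pipelines.MySqlPipeLine': 300
}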
The crawler works, but it has to be started manually with a command each time. Add a running.py file that invokes the crawler on a schedule.
# -*- coding: utf-8 -*-
from multiprocessing import Process
from scrapy import cmdline
import time
import logging
import os

# configuration: spider name and run interval in seconds
confs = [
    {
        "spider_name": "covid19spider",
        "frequency": 10,
    },
]


def start_spider(spider_name, frequency):
    # run this script from the Scrapy project root (where scrapy.cfg lives)
    args = ["scrapy", "crawl", spider_name]
    while True:
        start = time.time()
        p = Process(target=cmdline.execute, args=(args,))
        p.start()
        p.join()
        logging.debug("### use time: %s" % (time.time() - start))
        time.sleep(frequency)


if __name__ == '__main__':
    for conf in confs:
        process = Process(target=start_spider,
                          args=(conf["spider_name"], conf["frequency"]))
        process.start()
        time.sleep(86400)
With this, the crawler fetches data on a schedule; other scheduling mechanisms would work too, but I won't go into them here.
With that, the epidemic data crawler is complete.
Related links
Epidemic 3D globe from scratch - 3D epidemic globe VDEarth - 1 - Introduction
Epidemic 3D globe from scratch - 3D epidemic globe VDEarth - 2 - Building the front-end code
Epidemic 3D globe from scratch - 3D epidemic globe VDEarth - 3 - Implementing the 3D globe component (1)
Epidemic 3D globe from scratch - 3D epidemic globe VDEarth - 4 - Implementing the 3D globe component (2)