Learn to scrape data with the requests module in ten minutes: starting from scraping epidemic data

Date: 2021-01-22


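While building an epidemic data visualization I needed to scrape some data. In Python, the tools most commonly used for scraping are requests and urllib; of the two, requests is quicker and more convenient to use, and its code is easier to read.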

Installation

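pip install requests

Because the scraped data is most likely in JSON format, we will also use the json module to parse it. json is part of the Python standard library, so it does not need to be installed with pip.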

Quick start

To make the requests module easier to pick up, this article skips the introductory material and explains things directly through working code.

Using API data directly

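OK, suppose we want to do a visual analysis of the 2019-nCoV epidemic data. Pulling the data straight from platforms such as DXY (丁香園) or Baidu's epidemic page would involve regular expressions and other fairly complex processing, so the easiest route is to see whether some interface serves the data. Luckily, both Baidu and Tencent provide data APIs. We take the Baidu API as an example: opening the API address directly shows that it returns a dictionary (a JSON object).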

OK, that is the data we want, and two lines of code are enough to fetch it:

import requests

# Request the API, then parse the JSON response body into a dict
data = requests.get("https://service-nxxl1y2s-1252957949.gz.apigw.tencentcs.com/release/newpneumonia")
data = data.json()
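The two-line fetch above works when everything goes well. A slightly more defensive variant might look like the sketch below. The helper name `fetch_json`, the optional `session` argument, and the 10-second timeout are illustrative assumptions, not something the original article uses.

```python
import requests


def fetch_json(url, session=None, timeout=10):
    """Fetch a URL and return its JSON body as Python objects.

    `fetch_json`, `session`, and `timeout=10` are hypothetical choices
    made for this sketch.
    """
    http = session or requests
    # Without a timeout, a stalled server can block the call indefinitely
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return response.json()
```

Calling `fetch_json(...)` with the API address used in this article should return the same dictionary as the two-line version, as long as the endpoint is still online.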

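The data we pulled down is exactly what we need. Next we only have to pick the required fields out of the dictionary one by one, and then the visualization can begin.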

data = data['data']['conf']['component'][0]['caseList']
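Chained indexing like this raises an exception as soon as one level is missing, which can happen if the API changes its layout. The sketch below wraps the same lookup in a small helper. The name `extract_case_list` and the sample payload (including its `area` and `confirmed` fields) are made up for illustration and only mimic the nesting shown above.

```python
def extract_case_list(payload):
    """Walk data -> conf -> component[0] -> caseList, returning [] if absent."""
    try:
        return payload["data"]["conf"]["component"][0]["caseList"]
    except (KeyError, IndexError, TypeError):
        # The API layout may change; fail soft rather than crash
        return []


# Hypothetical payload that only mimics the nesting used in the article;
# the "area" and "confirmed" field names and values are placeholders.
sample = {"data": {"conf": {"component": [{"caseList": [
    {"area": "Anhui", "confirmed": "0"},
]}]}}}

cases = extract_case_list(sample)
print(len(cases))              # 1
print(extract_case_list({}))   # []
```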

This is the most basic use of requests: asking a site for data directly is a GET request. There are of course several other request methods as well:

requests.post(url)
requests.put(url)
requests.delete(url)
requests.head(url)
requests.options(url)

A GET request can also carry parameters, passed like this:

url = 'http://httpbin.org/get'
data = {
    'name':'zhangsan',
    'age':'25'
}
response = requests.get(url,params=data)
print(response.url)
print(response.text)
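To see what `params` does without touching the network, the request can be built and prepared but not sent. This is a minimal sketch using `requests.Request(...).prepare()`; the original article sends the request directly instead.

```python
import requests

params = {"name": "zhangsan", "age": "25"}

# Build the request without sending it, then inspect the final URL
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()
print(prepared.url)  # http://httpbin.org/get?name=zhangsan&age=25
```

The dictionary is URL-encoded and appended to the address as a query string, which is the same value `response.url` reports after a real request.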

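response.text is the response body decoded to a str using response.encoding, which requests guesses from the response headers; when that guess is wrong, the text comes out garbled. response.content is the raw bytes, which is what you want for downloading binary files such as videos; to read it as text you have to decode it, typically as UTF-8. Either response.content.decode("utf-8") or setting response.encoding = "utf-8" before reading response.text avoids the garbling, provided the page really is UTF-8 encoded.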
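The difference between bytes and text can be shown without any HTTP request. In the sketch below, `raw` stands in for `response.content`, and decoding it with the wrong codec reproduces the kind of garbling described above. The sample string and the choice of Latin-1 as the "wrong" codec are assumptions for illustration.

```python
text = "安徽省"
raw = text.encode("utf-8")     # plays the role of response.content (bytes)

good = raw.decode("utf-8")     # the correct codec restores the text
bad = raw.decode("latin-1")    # a wrong codec yields mojibake, not an error

print(good == text)            # True
print(bad == text)             # False
print(len(raw), len(bad))      # 9 9
```

Each of the three characters takes three bytes in UTF-8, so the wrongly decoded string ends up with nine meaningless characters instead of three.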

Next, headers. Headers carry metadata about the request, and some sites insist that a request come with particular headers, otherwise they return an error.

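So we looked on GitHub for simpler pages built by others whose data meets our needs, and scrape those. Headers matter on other sites too. For example, suppose we want to scrape Zhihu and request it directly, as above: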

url = 'https://www.zhihu.com/'
response = requests.get(url)
response.encoding = "utf-8"
print(response.text)

we get a 400 error:

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>

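HTTP 400 means "Bad Request": the server could not understand the request, or refused to process it, because it regards the request as malformed. It does not mean the page was deleted or blocked (a missing page would be 404). Here the likely cause is that the request lacks a browser-style header, so it should be sent as follows: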

url = 'https://www.zhihu.com/'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
}
response = requests.get(url,headers=headers)
print(response.text)
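One way to see why the bare request may be rejected is to compare the User-Agent that requests sends by default with the browser string supplied above. The sketch below prepares the request without sending it. That the site filters on User-Agent is an assumption inferred from the behaviour described in this article.

```python
import requests

# The User-Agent that requests sends when none is supplied
default_ua = requests.utils.default_user_agent()
print(default_ua.startswith("python-requests/"))  # True

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
}
# Prepare (but do not send) the request to inspect the outgoing header
prepared = requests.Request("GET", "https://www.zhihu.com/", headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

The default string identifies the client as a Python script, which a site can easily tell apart from a browser.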

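and the Zhihu home page is fetched successfully. If you want to send some data along with the request, use POST to submit it to the URL; passing a dictionary is equivalent to submitting the fields of an HTML form: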

url = 'http://httpbin.org/post'
data = {
    'name':'jack',
    'age':'23'
    }
response = requests.post(url,data=data)
print(response.text)

Result:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "23",
    "name": "jack"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "16",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e3e9f1c-b6bdd9f63ad5a5f5bbce5f7b"
  },
  "json": null,
  "origin": "60.169.239.171",
  "url": "http://httpbin.org/post"
}
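The `form`, `Content-Type`, and `Content-Length` entries in this result follow from how requests encodes the `data` dictionary. A minimal offline sketch, again using `requests.Request(...).prepare()` rather than sending anything:

```python
import requests

form = {"name": "jack", "age": "23"}
# Prepare (but do not send) the POST to inspect what would go on the wire
prepared = requests.Request("POST", "http://httpbin.org/post", data=form).prepare()

print(prepared.body)                       # name=jack&age=23
print(prepared.headers["Content-Type"])    # application/x-www-form-urlencoded
print(prepared.headers["Content-Length"])  # 16
```

The encoded body is 16 characters long, which matches the `"Content-Length": "16"` that httpbin echoes back.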

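That was a digression; back to scraping epidemic data. So far we have assumed an API was available, that is, a URL that serves the data directly. What if no API exists for the information we want? Say we want news about Anhui Province: neither of the two APIs provides it directly, but the page https://yiqing.ahusmart.com/ does display news about Anhui.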

Press F12, switch to the Network tab, and refresh the page. Then check the requests listed there; the preview of a few of them looks like news.

Checking their content more closely, it is exactly the Anhui news we are after.

Now go back to the Headers tab and look at the request URL.

OK, that is the one. Using the same method as before, send a request to this URL to pull down the news about Anhui:

res = requests.get("https://yiqing.ahusmart.com/news/%E5%AE%89%E5%BE%BD%E7%9C%81")
data = res.json()
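The `%E5%AE%89...` part of this URL is just the percent-encoded UTF-8 form of the province name. The sketch below decodes it and rebuilds the URL with the standard library. The helper `build_news_url` is hypothetical, and whether the endpoint accepts other province names is an assumption that the article does not confirm.

```python
from urllib.parse import quote, unquote

encoded = "%E5%AE%89%E5%BE%BD%E7%9C%81"
province = unquote(encoded)
print(province)  # 安徽省


def build_news_url(name, base="https://yiqing.ahusmart.com/news/"):
    """Hypothetical helper: percent-encode a province name onto the base URL."""
    return base + quote(name)


print(build_news_url("安徽省") == "https://yiqing.ahusmart.com/news/" + encoded)  # True
```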

For most everyday sites, GET or POST is all you need; more complicated cases are left for a later post.

Appendix: notes on some status codes:

100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),
# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0
# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),
# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authe
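
The names listed above are available as attributes of `requests.codes`, so code can compare against a readable name instead of a bare number. A minimal sketch; the helper `describe` is hypothetical and only covers the handful of codes mentioned in this article.

```python
import requests

print(requests.codes.ok)              # 200
print(requests.codes.bad_request)     # 400
print(requests.codes["not_found"])    # 404


def describe(status_code):
    """Hypothetical helper mapping a few status codes to short labels."""
    if status_code == requests.codes.ok:
        return "success"
    if status_code == requests.codes.bad_request:
        return "bad request: check headers and parameters"
    if status_code == requests.codes.not_found:
        return "page not found"
    return "other"


print(describe(400))  # bad request: check headers and parameters
```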