Today's overview:
Requests is an Apache2-licensed HTTP library written in Python. It is a high-level wrapper around Python's built-in modules, which makes issuing network requests from Python far more pleasant; with Requests you can easily do anything a browser can do.
Example without parameters:
# GET request without parameters
import requests

data = requests.get("http://www.sina.com.cn/")
print(data.url)
print(data.text)
Example with parameters:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)
print(ret.url)
print(ret.text)
Send a GET request to https://github.com/timeline.json; everything related to the request and the response is encapsulated in the returned object (data).
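As a minimal sketch of what that response object exposes (status_code, headers, url and text are standard attributes of requests' Response object; the specific endpoint does not matter):

import requests

data = requests.get("https://github.com/timeline.json")
print(data.status_code)   # HTTP status code of the response
print(data.headers)       # response headers, dict-like
print(data.url)           # final URL after any redirects
print(data.text)          # response body as text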
Basic POST example:
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
data = requests.post("http://httpbin.org/post", data=payload)
print(data.text)
Example: sending request headers and data
# -*- coding:utf-8 -*-
# !/usr/bin/python
import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

data = requests.post(url, data=json.dumps(payload), headers=headers)
print(data.text)
print(data.cookies)
requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the methods above are built on top of this one
requests.request(method, url, **kwargs)
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

        >>> import requests
        >>> req = requests.request('GET', 'http://httpbin.org/get')
        <Response [200]>
    """

Parameter list for requests.request
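A short sketch exercising a few of the optional parameters listed above (the values shown are only illustrative; the proxy line is commented out because the address is a placeholder):

import requests

resp = requests.request(
    "GET",
    "http://httpbin.org/get",
    params={"q": "python"},           # sent in the query string
    headers={"User-Agent": "demo"},   # custom request header
    timeout=(3, 10),                  # (connect timeout, read timeout)
    allow_redirects=True,
    verify=True,                      # verify the SSL certificate (the default)
    # proxies={"http": "http://127.0.0.1:8888"},  # placeholder proxy address
)
print(resp.status_code, resp.url)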
More documentation on the requests module: http://cn.python-requests.org/zh_CN/latest/api
# -*- coding:utf-8 -*-
# !/usr/bin/python
from bs4 import BeautifulSoup
import requests

# plain HTTP request
response = requests.get("http://www.autohome.com.cn/news/")
response.encoding = 'gbk'

soup = BeautifulSoup(response.text, "html.parser")
tag = soup.find(name="div", attrs={"id": "auto-channel-lazyload-article"})
li_list = tag.find_all("li")  # [tag object, tag object, ...]
for li in li_list:
    h3 = li.find(name="h3")
    if not h3:
        continue
    print(h3.text, li.find(name="a").get("href"))

"""
售13.59-18.59萬元 別克新款威朗上市 //www.autohome.com.cn/news/201710/908038.html#pvareaid=102624
售11.99-14.69萬元 別克閱朗正式上市 //www.autohome.com.cn/news/201710/908029.html#pvareaid=102624
售14.49-16.69萬元 別克GL6正式上市 //www.autohome.com.cn/news/201710/908024.html#pvareaid=102624
售10.99-14.39萬元 別克新款英朗上市 //www.autohome.com.cn/news/201710/908023.html#pvareaid=102624
中型SUV/1.6T動力 中華V7申報圖曝光 //www.autohome.com.cn/news/201710/908128.html#pvareaid=102624
拉低門檻 奔馳C級或換裝全新1.3T發動機 //www.autohome.com.cn/news/201710/908114.html#pvareaid=102624
外觀造型硬朗 昌河全新SUV申報圖曝光 //www.autohome.com.cn/news/201710/908111.html#pvareaid=102624
將於年內正式投產 捷豹XEL實車曝光 //www.autohome.com.cn/news/201710/908101.html#pvareaid=102624
與海外版一致 英菲尼迪新款Q50L申報圖 //www.autohome.com.cn/news/201710/908108.html#pvareaid=102624
或11月上市/兩種動力 榮威RX3實車到店 //www.autohome.com.cn/news/201710/908106.html#pvareaid=102624
更年輕 北汽新能源EC180/200推定製套裝 //www.autohome.com.cn/news/201710/908107.html#pvareaid=102624
即將「復活」 別克全新凱越申報圖曝光 //www.autohome.com.cn/news/201710/908105.html#pvareaid=102624
內飾面目一新 全新牧馬人產品手冊曝光 //www.autohome.com.cn/news/201710/908102.html#pvareaid=102624
售16.78-17.98萬元 長安CS95榮耀版上市 //www.autohome.com.cn/news/201710/908103.html#pvareaid=102624
售9.98-18.68萬 2018款榮威RX5上市 //www.autohome.com.cn/news/201710/908094.html#pvareaid=102624
"""
BeautifulSoup is a module that takes an HTML or XML string, parses it, and then lets you use the methods it provides to locate specific elements quickly, which makes finding a given element in an HTML or XML document straightforward.
Installing the BeautifulSoup module on Windows: pip install BeautifulSoup4
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story總共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")

# find the first <a> tag
tag1 = soup.find(name='a')
# find all <a> tags
tag2 = soup.find_all(name='a')
# find the tag with id=link2
tag3 = soup.select('#link2')
1. name: the tag's name
# tag = soup.find('a')
# name = tag.name      # get
# print(name)
# tag.name = 'span'    # set
# print(soup)
2. attrs: the tag's attributes
# tag = soup.find('a')
# attrs = tag.attrs            # get
# print(attrs)
# tag.attrs = {'ik': 123}      # set
# tag.attrs['id'] = 'iiiii'    # set
# print(soup)
3. children: all direct child tags
# body = soup.find('body')
# v = body.children
4. descendants: all descendant tags (children, grandchildren, and so on)
# body = soup.find('body')
# v = body.descendants
5. clear: remove all of a tag's children (the tag itself is kept)
# tag = soup.find('body')
# tag.clear()
# print(soup)
6. decompose: recursively remove the tag and all of its children
# body = soup.find('body')
# body.decompose()
# print(soup)
7. extract: recursively remove the tag and its children, and return the removed tag
# body = soup.find('body')
# v = body.extract()
# print(soup)
8. decode: convert to a string (including the current tag); decode_contents (excluding the current tag)
# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)
9. encode: convert to bytes (including the current tag); encode_contents (excluding the current tag)
# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)
10. find: get the first matching tag
# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)
11. find_all: get all matching tags
# tags = soup.find_all('a')
# print(tags)

# tags = soup.find_all('a', limit=1)
# print(tags)

# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags)

# ####### lists #######
# v = soup.find_all(name=['a', 'div'])
# print(v)

# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)

# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))

# v = soup.find_all(id=['link1', 'link2'])
# print(v)

# v = soup.find_all(href=['link1', 'link2'])
# print(v)

# ####### regular expressions #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)

# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)

# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)

# ####### filtering with a function #######
# def func(tag):
#     return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)

# ## get: read a tag attribute
# tag = soup.find('a')
# v = tag.get('id')
# print(v)
12. has_attr: check whether the tag has a given attribute
# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)
13. get_text: get the text content inside the tag
# tag = soup.find('a')
# v = tag.get_text()
# print(v)
14. index: find a tag's index position within another tag
# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v)

# tag = soup.find('body')
# for i, v in enumerate(tag):
#     print(i, v)
15. is_empty_element: whether the tag is an empty (void) or self-closing tag, i.e. one of: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'
# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)
16. Tags related to the current tag
# soup.next
# soup.next_element
# soup.next_elements
# soup.next_sibling
# soup.next_siblings

# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings

# tag.parent
# tag.parents
17. Searching for a tag's related tags
# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)

# These take the same parameters as find_all
18. select, select_one: CSS selectors
1 soup.select("title") 2 3 soup.select("p nth-of-type(3)") 4 5 soup.select("body a") 6 7 soup.select("html head title") 8 9 tag = soup.select("span,a") 10 11 soup.select("head > title") 12 13 soup.select("p > a") 14 15 soup.select("p > a:nth-of-type(2)") 16 17 soup.select("p > #link1") 18 19 soup.select("body > a") 20 21 soup.select("#link1 ~ .sister") 22 23 soup.select("#link1 + .sister") 24 25 soup.select(".sister") 26 27 soup.select("[class~=sister]") 28 29 soup.select("#link1") 30 31 soup.select("a#link2") 32 33 soup.select('a[href]') 34 35 soup.select('a[href="http://example.com/elsie"]') 36 37 soup.select('a[href^="http://example.com/"]') 38 39 soup.select('a[href$="tillie"]') 40 41 soup.select('a[href*=".com/el"]') 42 43 44 from bs4.element import Tag 45 46 def default_candidate_generator(tag): 47 for child in tag.descendants: 48 if not isinstance(child, Tag): 49 continue 50 if not child.has_attr('href'): 51 continue 52 yield child 53 54 tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator) 55 print(type(tags), tags) 56 57 from bs4.element import Tag 58 def default_candidate_generator(tag): 59 for child in tag.descendants: 60 if not isinstance(child, Tag): 61 continue 62 if not child.has_attr('href'): 63 continue 64 yield child 65 66 tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1) 67 print(type(tags), tags)
19. A tag's content
# tag = soup.find('span')
# print(tag.string)           # get
# tag.string = 'new content'  # set
# print(soup)

# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)

# tag = soup.find('body')
# v = tag.stripped_strings    # recursively get the text of all tags inside
# print(v)
20. append: append a tag inside the current tag
# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一個新來的'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)
21. insert: insert a tag at a given position inside the current tag
# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一個新來的'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)
22. insert_after, insert_before: insert after or before the current tag
# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一個新來的'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)
23. replace_with: replace the current tag with the given tag
# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一個新來的'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)
24. Creating relationships between tags
# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)
25. wrap: wrap the current tag inside the given tag
# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = '我是一個新來的'
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)

# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)
26. unwrap: remove the current tag but keep its contents
# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)
Automated GitHub login
# -*- coding:utf-8 -*-
# !/usr/bin/python
from bs4 import BeautifulSoup
import requests

# 1. get the CSRF token and the cookie from the login page
r1 = requests.get(url='https://github.com/login')
s1 = BeautifulSoup(r1.text, 'html.parser')
val = s1.find(attrs={'name': 'authenticity_token'}).get('value')
# the cookie is returned to you here
r1_cookie_dict = r1.cookies.get_dict()

# 2. send the user credentials
r2 = requests.post(
    url='https://github.com/session',
    data={
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': val,
        'login': 'xxx',
        'password': 'xxx'
    },
    cookies=r1_cookie_dict
)
r2_cookie_dict = r2.cookies.get_dict()
print(r1_cookie_dict)
print(r2_cookie_dict)

all_cookies = {}
all_cookies.update(r1_cookie_dict)
all_cookies.update(r2_cookie_dict)

# 3. for GitHub, the cookies obtained after submitting the token are enough
r3 = requests.get('https://github.com/settings/emails', cookies=r2_cookie_dict)
print(r3.text)
Log in to Chouti and upvote automatically
# -*- coding:utf-8 -*-
# !/usr/bin/python
from bs4 import BeautifulSoup
import requests

# 1. an initial GET: Chouti issues the session cookie here
r1 = requests.get(url='http://dig.chouti.com/')
r1_cookies_dict = r1.cookies.get_dict()

# 2. log in; this only authorizes the cookie obtained above
r2 = requests.post(
    url='http://dig.chouti.com/login',
    data={
        'phone': 'xxx',
        'password': 'xxx',
        'oneMonth': 1
    },
    cookies=r1_cookies_dict
)
r2_cookies_dict = r2.cookies.get_dict()
print(r1_cookies_dict)
print(r2_cookies_dict)

all_cookies = {}
all_cookies.update(r1_cookies_dict)
all_cookies.update(r2_cookies_dict)

# 3. vote using the cookie from the first GET request
r3 = requests.post('http://dig.chouti.com/link/vote?linksId=14708906', cookies=r1_cookies_dict)
print(r3.text)
Note: some login pages do not issue a cookie at login time; you have to send a GET first to receive the cookie, and the login request merely authorizes it. In that case you keep using the cookie from the initial GET in later requests and do not need to carry the cookie from the second (login) response.
Polling: the client sends Ajax requests to the server at a fixed interval; the server responds immediately upon receiving a request and closes the connection (a client-side sketch follows this list).
Advantage: the back-end code is easy to write.
Disadvantage: most requests return nothing useful, wasting bandwidth and server resources.
Examples: suitable for small applications.
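A minimal client-side polling loop, sketched with requests (the URL and the 2-second interval are placeholder assumptions, not part of the original text):

import time
import requests

def poll(url, interval=2):
    while True:
        resp = requests.get(url, timeout=5)   # server answers immediately and closes the connection
        if resp.ok and resp.text:
            print(resp.text)                  # handle whatever the server returned
        time.sleep(interval)                  # wait, then ask again (idle polls are wasted requests)

# poll("http://example.com/api/messages")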
Long polling: the client sends an Ajax request to the server; the server holds the connection open and only returns a response (and closes the connection) once there is a new message. After processing the response, the client immediately sends a new request. The server also sets a timeout; when it expires, the server closes the connection and the client requests again so the server can hold the new connection (a client-side sketch follows this list).
Advantage: when there are no messages, requests are not sent frequently.
Disadvantage: holding connections open consumes server resources.
Examples: WebQQ, web-based Hi, Facebook IM.
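A minimal long-polling client, sketched with requests (the URL and the 60-second window are placeholder assumptions):

import requests

def long_poll(url):
    while True:
        try:
            # the server holds this request until a new message arrives
            # or its own timeout fires; we allow up to 60 seconds
            resp = requests.get(url, timeout=60)
            if resp.ok and resp.text:
                print(resp.text)              # a message arrived, process it
        except requests.exceptions.Timeout:
            pass                              # no message within the window, just reconnect
        # loop around and re-issue the request so the server can hold it again

# long_poll("http://example.com/api/long-poll")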
In addition, a distinction is made between long connections and socket connections:
Long connection: embed a hidden iframe in the page and set its src to a long-lived request; the server can then push data to the client continuously.
Advantage: messages arrive instantly, and no useless requests are sent.
Disadvantage: maintaining a long connection increases server overhead.
Example: Gmail chat.
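Outside the browser there is no iframe, but the same idea of the server pushing over one open connection can be sketched with requests' streaming mode. This is only an analogue, not the hidden-iframe technique itself, and the URL is a placeholder:

# consume a long-lived, streamed HTTP response chunk by chunk
import requests

with requests.get("http://example.com/stream", stream=True) as resp:
    for line in resp.iter_lines():        # yields data as the server pushes it
        if line:
            print(line.decode("utf-8"))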