Teacher Peiqi's lecture was genuinely good. After this lesson I can fetch information from some websites in a simple way. Before, I could only scrape static pages and had no idea how to get resources from sites that require logging in; now I can fetch resources from sites that need a login, and I am much more fluent with the requests module. Most importantly, I learned a lot about how to analyze a web page when writing a crawler.
Baidu Baike: a web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules.
With Python you can quickly write a crawler to fetch the resources at a given URL. The requests and bs4 modules alone are enough to scrape a great many resources.
The two most commonly used requests methods are get and post.
Since most URL accesses on the web use these two methods, they cover the majority of network resources.
Their main parameters are as follows:
url: the link of the resource to fetch.
headers: the request headers. Many sites deploy anti-crawler measures, so well-disguised headers keep the site from telling that a machine is visiting.
json: include when the request needs to carry a JSON body.
data: include when the request needs to carry form data; the username and password for a login are usually sent in data.
cookies: used to identify the user. Scraping a static site doesn't need it, but sites that require login do.
params: query parameters; URLs carrying things like id=1&user=starry can have them written into params instead.
timeout: how long to wait; the request is aborted if no resource arrives within this time.
allow_redirects: some URLs redirect to another URL; set this to False to keep the redirect from being followed.
proxies: set a proxy.
These are the main parameters in everyday use.
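As a minimal sketch of how several of these parameters fit together (the URL and every value below are placeholders, not from the lesson), a requests.Request can be prepared without actually being sent, which makes it easy to see how params is encoded into the query string:

```python
import requests

# Build a request without sending it; URL, params and cookie values
# here are illustrative placeholders.
req = requests.Request(
    "GET",
    "https://example.com/search",
    headers={"User-Agent": "Mozilla/5.0"},  # disguised client header
    params={"id": 1, "user": "starry"},     # becomes ?id=1&user=starry
    cookies={"session": "xxx"},             # identity cookie
)
prepared = req.prepare()
print(prepared.url)  # https://example.com/search?id=1&user=starry
```

For a real request you would pass the same keyword arguments (plus timeout, allow_redirects, proxies as needed) straight to requests.get.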
bs4 parses the content fetched by requests, making it faster and more convenient to locate what you want.
When the text returned by requests is HTML, parse it with bs4: soup = BeautifulSoup(html, "html.parser")
Commonly used soup methods:
find: returns the first element matching the criteria; the most used arguments are name and attrs.
find_all: returns all matches.
get: reads an attribute of a tag.
A commonly used property is children.
1. First, visit https://dig.chouti.com to obtain a cookie.
r1 = requests.get(
    url="https://dig.chouti.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    }
)
r1_cookie_dict = r1.cookies.get_dict()
2. Visit https://dig.chouti.com/login, carrying the account, password, and that cookie.
response_login = requests.post(
    url='https://dig.chouti.com/login',
    data={
        "phone": "xxx",
        "password": "xxx",
        "oneMonth": "1"
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=r1_cookie_dict
)
3. Find the URL used to upvote a news item; requesting that URL with the cookie performs the vote.
rst = requests.post(
    url="https://dig.chouti.com/link/vote?linksId=20639606",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=r1_cookie_dict
)
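The three steps above can also be sketched with requests.Session, which stores and resends cookies automatically so the cookie dict does not have to be threaded through by hand. This is an untested sketch using the same placeholder credentials and link id:

```python
import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"

def vote(link_id):
    # 1. the first visit picks up the anonymous cookie
    session.get("https://dig.chouti.com/")
    # 2. log in; the session's cookie jar is updated in place
    session.post("https://dig.chouti.com/login",
                 data={"phone": "xxx", "password": "xxx", "oneMonth": "1"})
    # 3. vote while carrying every cookie collected so far
    return session.post(f"https://dig.chouti.com/link/vote?linksId={link_id}")
```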
1. To deter automated access, Lagou requires two extra header values, X-Anit-Forge-Code and X-Anit-Forge-Token; both can be scraped from the Lagou login page.
r1 = requests.get(
    url='https://passport.lagou.com/login/login.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
        'Host': 'passport.lagou.com'
    }
)
all_cookies.update(r1.cookies.get_dict())

X_Anit_Forge_Token = re.findall(r"X_Anti_Forge_Token = '(.*?)';", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall(r"X_Anti_Forge_Code = '(.*?)';", r1.text, re.S)[0]
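The regular expressions above can be checked against a literal snippet of the kind of JavaScript the login page embeds (the token values here are invented):

```python
import re

# Fabricated sample of the script fragment the login page contains.
page = """
<script>
    window.X_Anti_Forge_Token = 'abc123';
    window.X_Anti_Forge_Code = '456789';
</script>
"""
token = re.findall(r"X_Anti_Forge_Token = '(.*?)';", page, re.S)[0]
code = re.findall(r"X_Anti_Forge_Code = '(.*?)';", page, re.S)[0]
print(token, code)  # abc123 456789
```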
2. Then POST to the login URL, carrying the account, password, and the two values above. The headers need to be fairly complete here.
r2 = requests.post(
    url="https://passport.lagou.com/login/login.json",
    headers={
        'Host': 'passport.lagou.com',
        "Referer": "https://passport.lagou.com/login/login.html",
        "X-Anit-Forge-Code": X_Anti_Forge_Code,
        "X-Anit-Forge-Token": X_Anit_Forge_Token,
        "X-Requested-With": "XMLHttpRequest",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    },
    data={
        "isValidate": "true",
        "username": "xxx",
        "password": "xxx",
        "request_form_verifyCode": "",
        "submit": ""
    },
    cookies=r1.cookies.get_dict()
)
all_cookies.update(r2.cookies.get_dict())
3. Even with the cookies from a successful login, you still may not be able to modify profile information: the user also has to be granted authorization. Visiting https://passport.lagou.com/grantServiceTicket/grant.html obtains it.
r3 = requests.get(
    url='https://passport.lagou.com/grantServiceTicket/grant.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r3.cookies.get_dict())
4. The grant step redirects through a series of URLs, and we need to collect the cookies from every one of them. Each next URL is in the Location header of the previous response.
r4 = requests.get(
    url=r3.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r4.cookies.get_dict())

print('r5', r4.headers['Location'])
r5 = requests.get(
    url=r4.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r5.cookies.get_dict())

print('r6', r5.headers['Location'])
r6 = requests.get(
    url=r5.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())

print('r7', r6.headers['Location'])
r7 = requests.get(
    url=r6.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r7.cookies.get_dict())
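Those four nearly identical blocks can be folded into a loop. The sketch below is a generalized version (the helper name and max_hops limit are my own, not from the post): it keeps requesting whatever Location the previous hop returned, merging cookies at every step, until a response carries no Location header.

```python
import requests

def follow_redirects(resp, cookies, max_hops=10):
    """Manually walk a redirect chain, merging the cookies set at every
    hop. requests can follow redirects itself with allow_redirects=True,
    but doing it by hand keeps every intermediate Set-Cookie."""
    headers = {"User-Agent": "Mozilla/5.0"}
    for _ in range(max_hops):
        location = resp.headers.get("Location")
        if not location:          # no further redirect: chain is done
            break
        resp = requests.get(location, headers=headers,
                            allow_redirects=False, cookies=cookies)
        cookies.update(resp.cookies.get_dict())
    return resp, cookies
```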
5. The last step is modifying the profile. The URL used for that requires submitCode and submitToken; analysis shows both can be read from the JSON returned by https://gate.lagou.com/v1/neirong/account/users/0/.
Fetching submitCode and submitToken:
r6 = requests.get(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        'X-L-REQ-HEADER': "{deviceType:1}",
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
    },
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())
r6_json = r6.json()
Modifying the profile:
r7 = requests.put(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
        'X-Anit-Forge-Code': r6_json['submitCode'],
        'X-Anit-Forge-Token': r6_json['submitToken'],
        'X-L-REQ-HEADER': "{deviceType:1}",
    },
    cookies=all_cookies,
    json={"userName": "Starry", "sex": "MALE", "portrait": "images/myresume/default_headpic.png",
          "positionName": '...', "introduce": '....'}
)
The complete code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
@author: Starry
@file: 5.修改我的信息.py
@time: 2018/7/5 11:43
'''

import re
import requests


all_cookies = {}
######################### 1. Fetch the login page #########################
r1 = requests.get(
    url='https://passport.lagou.com/login/login.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
        'Host': 'passport.lagou.com'
    }
)
all_cookies.update(r1.cookies.get_dict())

X_Anit_Forge_Token = re.findall(r"X_Anti_Forge_Token = '(.*?)';", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall(r"X_Anti_Forge_Code = '(.*?)';", r1.text, re.S)[0]
print(X_Anit_Forge_Token, X_Anti_Forge_Code)
######################### 2. Log in with username and password #########################
r2 = requests.post(
    url="https://passport.lagou.com/login/login.json",
    headers={
        'Host': 'passport.lagou.com',
        "Referer": "https://passport.lagou.com/login/login.html",
        "X-Anit-Forge-Code": X_Anti_Forge_Code,
        "X-Anit-Forge-Token": X_Anit_Forge_Token,
        "X-Requested-With": "XMLHttpRequest",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    },
    data={
        "isValidate": "true",
        "username": "xxx",
        "password": "xxx",
        "request_form_verifyCode": "",
        "submit": ""
    },
    cookies=r1.cookies.get_dict()
)
all_cookies.update(r2.cookies.get_dict())

######################### 3. User authorization #########################
r3 = requests.get(
    url='https://passport.lagou.com/grantServiceTicket/grant.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r3.cookies.get_dict())

######################### 4. Follow the authentication redirects #########################
print('r4', r3.headers['Location'])
r4 = requests.get(
    url=r3.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r4.cookies.get_dict())

print('r5', r4.headers['Location'])
r5 = requests.get(
    url=r4.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r5.cookies.get_dict())

print('r6', r5.headers['Location'])
r6 = requests.get(
    url=r5.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())

print('r7', r6.headers['Location'])
r7 = requests.get(
    url=r6.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r7.cookies.get_dict())

######################### 5. View the personal page #########################

r5 = requests.get(
    url='https://www.lagou.com/resume/myresume.html',
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=all_cookies
)

print("李港" in r5.text)


######################### 6. Fetch submitCode and submitToken #########################
r6 = requests.get(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        'X-L-REQ-HEADER': "{deviceType:1}",
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
    },
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())
r6_json = r6.json()
print(r6.json())
# ######################### 7. Modify the profile #########################
r7 = requests.put(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
        'X-Anit-Forge-Code': r6_json['submitCode'],
        'X-Anit-Forge-Token': r6_json['submitToken'],
        'X-L-REQ-HEADER': "{deviceType:1}",
    },
    cookies=all_cookies,
    json={"userName": "Starry", "sex": "MALE", "portrait": "images/myresume/default_headpic.png",
          "positionName": '...', "introduce": '....'}
)
print(r7.text)
It runs as follows (output screenshot not preserved here).
Source code (logging in to GitHub and reading the account's profile information):
#-*- coding:utf-8 -*-

'''
@author: Starry
@file: github_login.py
@time: 2018/6/22 16:22
'''

import requests
from bs4 import BeautifulSoup
import bs4


headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Host": "github.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
}

def login(user, password):
    '''
    Log in to GitHub automatically with the given account and password,
    and return the account's cookies.
    :param user: login account
    :param password: login password
    :return: the cookies
    '''
    r1 = requests.get(
        url="https://github.com/login",
        headers=headers
    )
    cookies_dict = r1.cookies.get_dict()
    soup = BeautifulSoup(r1.text, "html.parser")
    authenticity_token = soup.find('input', attrs={"name": "authenticity_token"}).get('value')
    # print(authenticity_token)
    res = requests.post(
        url="https://github.com/session",
        data={
            "commit": "Sign in",
            "utf8": "✓",
            "authenticity_token": authenticity_token,
            "login": user,
            "password": password
        },
        headers=headers,
        cookies=cookies_dict
    )
    cookies_dict.update(res.cookies.get_dict())
    return cookies_dict

def View_Information(cookies_dict):
    '''
    Display the GitHub account's profile information.
    :param cookies_dict: the account's cookies
    :return:
    '''
    url_setting_profile = "https://github.com/settings/profile"
    html = requests.get(
        url=url_setting_profile,
        cookies=cookies_dict,
        headers=headers
    )
    soup = BeautifulSoup(html.text, "html.parser")
    user_name = soup.find('input', attrs={'id': 'user_profile_name'}).get('value')
    user_email = []
    select = soup.find('select', attrs={'class': 'form-select select'})
    for index, child in enumerate(select):
        if index == 0:
            continue
        if isinstance(child, bs4.element.Tag):
            user_email.append(child.get('value'))
    Bio = soup.find('textarea', attrs={'id': 'user_profile_bio'}).text
    URL = soup.find('input', attrs={'id': 'user_profile_blog'}).get('value')
    Company = soup.find('input', attrs={'id': 'user_profile_company'}).get('value')
    Location = soup.find('input', attrs={'id': 'user_profile_location'}).get('value')
    print('''
    User name: {0}
    Bound email addresses: {1}
    Bio: {2}
    Homepage: {3}
    Company: {4}
    Location: {5}
    '''.format(user_name, user_email, Bio, URL, Company, Location))
    html = requests.get(
        url="https://github.com/settings/repositories",
        cookies=cookies_dict,
        headers=headers
    )
    html.encoding = html.apparent_encoding
    soup = BeautifulSoup(html.text, "html.parser")
    div = soup.find(name='div', attrs={"class": "listgroup"})
    # print(div)
    tplt = "{0:^20}\t{1:^20}\t{2:^20}"
    print(tplt.format("Repository", "Link", "Size"))
    for child in div.children:
        if isinstance(child, bs4.element.Tag):
            a = child.find('a', attrs={"class": "mr-1"})
            small = child.find('small')
            print(tplt.format(a.text, a.get("href"), small.text))

if __name__ == '__main__':
    user = 'xxx'
    password = 'xxx'
    cookies_dict = login(user, password)
    View_Information(cookies_dict)