路飛學城 Python Crawler Bootcamp - Chapter 1

1 Reflections

Instructor 沛奇 really teaches well. After this lesson I can fetch basic information from websites. Before, I could only scrape static pages and had no idea how to get resources from sites that require a login; now I can fetch resources from login-protected sites as well, and I am much more comfortable with the requests module. Most importantly, I learned a lot about how to analyze a web page when writing a crawler.

2 What Is a Web Crawler

  Baidu Baike: a web crawler (also called a web spider or web robot, and in the FOAF community often a "web chaser") is a program or script that automatically crawls information from the World Wide Web according to certain rules.

  With Python you can quickly write a crawler to fetch resources at a given URL. Just the two modules requests and bs4 are enough to scrape a great many resources.
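A minimal sketch of such a crawler (the URL below is only a placeholder, not something from the course): requests downloads the page and bs4 parses it.

import requests
from bs4 import BeautifulSoup

# Download a page; example.com is a placeholder URL.
response = requests.get("https://example.com", timeout=10)

# Parse the returned HTML and print the page title.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)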

3 requests

  The two most commonly used methods in requests are get and post.

  Since most URL access on the web uses these two methods, they cover the majority of web resources.

  The main parameters of these two methods are as follows:

    url: the link of the URL resource you want to fetch.

    headers: the request headers. Many sites have anti-crawler measures, so disguising the headers keeps the site from detecting that a machine is making the request.

    json: include this when the request needs to carry a JSON body.

    data: include this when the request needs to carry form data; the username and password for logging in to a site usually go here.

    cookies: used to identify the user. Not needed when scraping static sites, but required for sites that need a login.

    params: query parameters. Some URLs carry things like id=1&user=starry, which can be written into params.

    timeout: the request timeout; if no resource is received within this time, the request is aborted.

    allow_redirects: some URLs redirect to another URL; setting this to False prevents the redirect from being followed.

    proxies: set a proxy.

  These are the main parameters you will use.
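A small sketch showing how these parameters fit together (the URL, credentials, cookie, and proxy below are made-up placeholders, not values from the course):

import requests

# GET with query parameters, a disguised User-Agent, cookies, a timeout,
# and redirects disabled; proxies is commented out but shows the shape.
r = requests.get(
    url="https://example.com/items",               # placeholder URL
    params={"id": 1, "user": "starry"},            # becomes ?id=1&user=starry
    headers={"User-Agent": "Mozilla/5.0"},
    cookies={"sessionid": "xxx"},                  # placeholder cookie
    timeout=10,
    allow_redirects=False,
    # proxies={"https": "http://127.0.0.1:8888"},  # optional proxy
)

# POST with a form body (data); a JSON body would go in json=... instead.
r2 = requests.post(
    url="https://example.com/login",               # placeholder URL
    data={"username": "xxx", "password": "xxx"},
)
print(r.status_code, r2.status_code)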

4 bs4

bs4 parses the content fetched by requests, making it faster and more convenient to find what you need.

When the text returned by requests is HTML, parse it with bs4: soup = BeautifulSoup(html, "html.parser")

The commonly used methods on soup are:

find: returns the first element matching the given criteria; the commonly used arguments are name and attrs.

find_all: returns all matching elements.

get: reads an attribute from a tag.

A commonly used property is children.
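A short sketch of these calls against a made-up HTML snippet:

from bs4 import BeautifulSoup

html = '''
<div class="listgroup">
    <a class="mr-1" href="/repo1">repo1</a>
    <a class="mr-1" href="/repo2">repo2</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# find: first tag matching name/attrs; get: read one of its attributes.
first_link = soup.find("a", attrs={"class": "mr-1"})
print(first_link.text, first_link.get("href"))

# find_all: every matching tag.
for a in soup.find_all("a"):
    print(a.get("href"))

# children: iterate over a tag's direct children (includes text nodes).
div = soup.find("div", attrs={"class": "listgroup"})
for child in div.children:
    print(child)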

5 Logging in to Chouti (dig.chouti.com) and Upvoting Automatically

1. First, request https://dig.chouti.com to obtain a cookie.

r1 = requests.get(
    url="https://dig.chouti.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    }
)
r1_cookie_dict = r1.cookies.get_dict()

2. Request https://dig.chouti.com/login, carrying the account, password, and the cookie.

response_login = requests.post(
    url='https://dig.chouti.com/login',
    data={
        "phone": "xxx",
        "password": "xxx",
        "oneMonth": "1"
    },
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=r1_cookie_dict
)

3. Find the URL used to upvote a given news item; requesting that URL with the cookie attached completes the upvote.

rst = requests.post(
    url="https://dig.chouti.com/link/vote?linksId=20639606",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=r1_cookie_dict
)
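The same three steps could also be written with requests.Session, which stores and resends cookies automatically. This is only a sketch of that alternative, not the course's code (the phone, password, and linksId are the same placeholders as above):

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# 1. Visit the homepage so the session picks up the initial cookies.
session.get("https://dig.chouti.com/")

# 2. Log in; the session sends the stored cookies and keeps any new ones.
session.post(
    "https://dig.chouti.com/login",
    data={"phone": "xxx", "password": "xxx", "oneMonth": "1"},
)

# 3. Upvote with the accumulated cookies.
rst = session.post("https://dig.chouti.com/link/vote?linksId=20639606")
print(rst.text)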

6 Logging in to Lagou Automatically and Modifying Profile Information

1. As an anti-scraping measure, Lagou requires the headers X-Anit-Forge-Code and X-Anit-Forge-Token; both values can be obtained from the Lagou login page.

r1 = requests.get(
    url='https://passport.lagou.com/login/login.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
        'Host': 'passport.lagou.com'
    }
)
all_cookies.update(r1.cookies.get_dict())

X_Anit_Forge_Token = re.findall(r"X_Anti_Forge_Token = '(.*?)';", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall(r"X_Anti_Forge_Code = '(.*?)';", r1.text, re.S)[0]

2. Then request the login URL, passing the account, password, and the two values above. The headers need to be fairly complete here.

r2 = requests.post(
    url="https://passport.lagou.com/login/login.json",
    headers={
        'Host': 'passport.lagou.com',
        "Referer": "https://passport.lagou.com/login/login.html",
        "X-Anit-Forge-Code": X_Anti_Forge_Code,
        "X-Anit-Forge-Token": X_Anit_Forge_Token,
        "X-Requested-With": "XMLHttpRequest",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    },
    data={
        "isValidate": "true",
        "username": "xxx",
        "password": "xxx",
        "request_form_verifyCode": "",
        "submit": ""
    },
    cookies=r1.cookies.get_dict()
)
all_cookies.update(r2.cookies.get_dict())

3. Of course, even with the cookies from a successful login you cannot necessarily modify your information yet; user authorization is also required. Requesting https://passport.lagou.com/grantServiceTicket/grant.html obtains it.

r3 = requests.get(
    url='https://passport.lagou.com/grantServiceTicket/grant.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r3.cookies.get_dict())

4. User authorization redirects through a series of URLs, and we need to collect the cookies from every redirected page. Each redirect target is in the Location header of the previous response (a loop-based alternative is sketched after the code below).

r4 = requests.get(
    url=r3.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r4.cookies.get_dict())

print('r5', r4.headers['Location'])
r5 = requests.get(
    url=r4.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r5.cookies.get_dict())

print('r6', r5.headers['Location'])
r6 = requests.get(
    url=r5.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())

print('r7', r6.headers['Location'])
r7 = requests.get(
    url=r6.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r7.cookies.get_dict())
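Since these four redirect-following requests are identical apart from the URL, the chain can also be walked with a loop that keeps reading the Location header until there is none left. This is only a sketch under that assumption, continuing from r3 and all_cookies above:

import requests

# Follow the authorization redirect chain, collecting cookies along the way.
location = r3.headers.get('Location')
while location:
    resp = requests.get(
        url=location,
        headers={'User-Agent': 'Mozilla/5.0'},
        allow_redirects=False,
        cookies=all_cookies,
    )
    all_cookies.update(resp.cookies.get_dict())
    location = resp.headers.get('Location')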

5. Next comes modifying the information. The URL used to modify the profile requires two values, submitCode and submitToken; analysis shows that both can be obtained from the response of https://gate.lagou.com/v1/neirong/account/users/0/.

Get submitCode and submitToken:

r6 = requests.get(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        'X-L-REQ-HEADER': "{deviceType:1}",
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
    },
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())
r6_json = r6.json()

Modify the information:

r7 = requests.put(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
        'X-Anit-Forge-Code': r6_json['submitCode'],
        'X-Anit-Forge-Token': r6_json['submitToken'],
        'X-L-REQ-HEADER': "{deviceType:1}",
    },
    cookies=all_cookies,
    json={"userName": "Starry", "sex": "MALE", "portrait": "images/myresume/default_headpic.png",
          "positionName": '...', "introduce": '....'}
)

The complete code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
@author: Starry
@file: 5.修改我的信息.py
@time: 2018/7/5 11:43
'''

import re
import requests


all_cookies = {}
######################### 1. Fetch the login page ###################
r1 = requests.get(
    url='https://passport.lagou.com/login/login.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
        'Host': 'passport.lagou.com'
    }
)
all_cookies.update(r1.cookies.get_dict())

X_Anit_Forge_Token = re.findall(r"X_Anti_Forge_Token = '(.*?)';", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall(r"X_Anti_Forge_Code = '(.*?)';", r1.text, re.S)[0]
print(X_Anit_Forge_Token, X_Anti_Forge_Code)
################# 2. Log in with username and password ################
r2 = requests.post(
    url="https://passport.lagou.com/login/login.json",
    headers={
        'Host': 'passport.lagou.com',
        "Referer": "https://passport.lagou.com/login/login.html",
        "X-Anit-Forge-Code": X_Anti_Forge_Code,
        "X-Anit-Forge-Token": X_Anit_Forge_Token,
        "X-Requested-With": "XMLHttpRequest",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36',
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    },
    data={
        "isValidate": "true",
        "username": "xxx",
        "password": "xxx",
        "request_form_verifyCode": "",
        "submit": ""
    },
    cookies=r1.cookies.get_dict()
)
all_cookies.update(r2.cookies.get_dict())

################### 3. User authorization #########################
r3 = requests.get(
    url='https://passport.lagou.com/grantServiceTicket/grant.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r3.cookies.get_dict())

################### 4. User authentication: follow the redirect chain #########################
print('r4', r3.headers['Location'])
r4 = requests.get(
    url=r3.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r4.cookies.get_dict())

print('r5', r4.headers['Location'])
r5 = requests.get(
    url=r4.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r5.cookies.get_dict())

print('r6', r5.headers['Location'])
r6 = requests.get(
    url=r5.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())

print('r7', r6.headers['Location'])
r7 = requests.get(
    url=r6.headers['Location'],
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36'
    },
    allow_redirects=False,
    cookies=all_cookies
)
all_cookies.update(r7.cookies.get_dict())

################### 5. View the profile page #########################

r5 = requests.get(
    url='https://www.lagou.com/resume/myresume.html',
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    },
    cookies=all_cookies
)

# Check that the logged-in profile page contains the expected name.
print("李港" in r5.text)


################### 6. Fetch submitCode and submitToken #########################
r6 = requests.get(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36",
        'X-L-REQ-HEADER': "{deviceType:1}",
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
    },
    cookies=all_cookies
)
all_cookies.update(r6.cookies.get_dict())
r6_json = r6.json()
print(r6.json())
# ################################ 7. Modify profile information ##########################
r7 = requests.put(
    url='https://gate.lagou.com/v1/neirong/account/users/0/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Origin': 'https://account.lagou.com',
        'Host': 'gate.lagou.com',
        'X-Anit-Forge-Code': r6_json['submitCode'],
        'X-Anit-Forge-Token': r6_json['submitToken'],
        'X-L-REQ-HEADER': "{deviceType:1}",
    },
    cookies=all_cookies,
    json={"userName": "Starry", "sex": "MALE", "portrait": "images/myresume/default_headpic.png",
          "positionName": '...', "introduce": '....'}
)
print(r7.text)

7 Assignment: Log in to GitHub Automatically and Fetch Profile Information


Source code:

# -*- coding: utf-8 -*-

'''
@author: Starry
@file: github_login.py
@time: 2018/6/22 16:22
'''

import requests
from bs4 import BeautifulSoup
import bs4


headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Host": "github.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
}


def login(user, password):
    '''
    Log in to GitHub automatically with the given account and password, and return the account's cookies.
    :param user: login account
    :param password: login password
    :return: the cookies
    '''
    r1 = requests.get(
        url="https://github.com/login",
        headers=headers
    )
    cookies_dict = r1.cookies.get_dict()
    soup = BeautifulSoup(r1.text, "html.parser")
    # The login form carries a CSRF token that must be posted back.
    authenticity_token = soup.find('input', attrs={"name": "authenticity_token"}).get('value')
    # print(authenticity_token)
    res = requests.post(
        url="https://github.com/session",
        data={
            "commit": "Sign in",
            "utf8": "",
            "authenticity_token": authenticity_token,
            "login": user,
            "password": password
        },
        headers=headers,
        cookies=cookies_dict
    )
    cookies_dict.update(res.cookies.get_dict())
    return cookies_dict


def View_Information(cookies_dict):
    '''
    Display the GitHub account's profile information.
    :param cookies_dict: the account's cookies
    :return:
    '''
    url_setting_profile = "https://github.com/settings/profile"
    html = requests.get(
        url=url_setting_profile,
        cookies=cookies_dict,
        headers=headers
    )
    soup = BeautifulSoup(html.text, "html.parser")
    user_name = soup.find('input', attrs={'id': 'user_profile_name'}).get('value')
    user_email = []
    select = soup.find('select', attrs={'class': 'form-select select'})
    for index, child in enumerate(select):
        if index == 0:
            continue
        if isinstance(child, bs4.element.Tag):
            user_email.append(child.get('value'))
    Bio = soup.find('textarea', attrs={'id': 'user_profile_bio'}).text
    URL = soup.find('input', attrs={'id': 'user_profile_blog'}).get('value')
    Company = soup.find('input', attrs={'id': 'user_profile_company'}).get('value')
    Location = soup.find('input', attrs={'id': 'user_profile_location'}).get('value')
    print('''
    Username: {0}
    Linked email addresses: {1}
    Bio: {2}
    Homepage: {3}
    Company: {4}
    Location: {5}
    '''.format(user_name, user_email, Bio, URL, Company, Location))
    html = requests.get(
        url="https://github.com/settings/repositories",
        cookies=cookies_dict,
        headers=headers
    )
    html.encoding = html.apparent_encoding
    soup = BeautifulSoup(html.text, "html.parser")
    div = soup.find(name='div', attrs={"class": "listgroup"})
    # print(div)
    tplt = "{0:^20}\t{1:^20}\t{2:^20}"
    print(tplt.format("Repository", "Link", "Size"))
    for child in div.children:
        if isinstance(child, bs4.element.Tag):
            a = child.find('a', attrs={"class": "mr-1"})
            small = child.find('small')
            print(tplt.format(a.text, a.get("href"), small.text))


if __name__ == '__main__':
    user = 'xxx'
    password = 'xxx'
    cookies_dict = login(user, password)
    View_Information(cookies_dict)