Day 03

Date 2019-12-11

Tags: day


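Review of yesterday:
    I. Scraping the Douban Movie Top 250
        1. Fetch the movie pages
        2. Parse and extract the movie information
        3. Save the data

    II. The Selenium request library
        Drives a browser to send requests to the target site and collect the response data.
        - No need to analyze complicated communication flows
        - Executes JS code
        - Retrieves dynamically loaded data


    III. Using Selenium
        driver = webdriver.Chrome()  # launch the browser through the driver
        # implicit wait goes here
        driver.get('URL')  # send a request to a site
        # explicit wait goes here
        driver.close()

    IV. Selectors
        element: find one
        elements: find many

        by_id
        by_class_name
        by_name
        by_link_text
        by_partial_link_text
        by_css_selector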

Today's topics:
    I. The rest of Selenium
    II. The BeautifulSoup4 parsing library

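    I. The rest of Selenium
        1. Interacting with elements:
            - Click and clear
                click
                clear

            - ActionChains
                An action-chain object; the driver must be passed to it.
                It performs a series of preset actions in order.

            - Switching into an iframe
                driver.switch_to.frame('iframeResult')

            - Executing JS code
                execute_script()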
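    II. The BeautifulSoup4 parsing library (+ re module > selenium)
        BS4

        1. What is BeautifulSoup?
            bs4 is a parsing library that uses a parser of your choice to extract the data you want.

        2. Why use bs4?
            Its concise syntax makes it quick to extract the content you need.

        3. Available parsers
            - lxml
            - html.parser

        4. Installation and usage
            - Traversing the document tree
            - Searching the document tree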
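
Choosing between the two parsers listed above could look like the sketch below. It assumes lxml may not be installed and falls back to Python's built-in html.parser; `make_soup` is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is a separate install; html.parser ships with Python
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p class="story">Once upon a time</p>')
print(soup.p.text)
```

Either parser gives the same result on well-formed markup like this; they can differ in how they repair broken HTML.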



Supplementary notes:

    Data formats:

    JSON:
        {
            "name": "tank"
        }

    XML:
        <name>tank</name>

    HTML:
        <html></html>
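
As a rough illustration of how the JSON and XML samples above are read in Python, here is a minimal sketch using only the standard library. The variable names are made up for this example.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"name": "tank"}'
xml_text = '<name>tank</name>'

# JSON maps onto Python dicts and lists
data = json.loads(json_text)
print(data['name'])

# XML is a tree of elements; the root element here is <name>
root = ET.fromstring(xml_text)
print(root.tag, root.text)
```

HTML is usually too loosely formed for a strict XML parser, which is one reason a tolerant library such as bs4 is used for it.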

Generators: yield a value (each yield hands one value out through the generator)

    def f():
        # return 1
        yield 1
        yield 2
        yield 3

    g = f()
    print(g)  # prints the generator object, not the values

    for line in g:
        print(line)
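
A small sketch of what "lazy" means for the generator above: the function body does not run until a value is requested, and a generator can only be consumed once. The extra `print('started')` is added here purely for illustration.

```python
def f():
    print('started')
    yield 1
    yield 2
    yield 3


g = f()            # nothing runs yet, not even the print
first = next(g)    # runs the body up to the first yield
rest = list(g)     # drains the remaining values
again = list(g)    # the generator is now exhausted

print(first, rest, again)
```

This matters later in the notes, where bs4 attributes such as `.descendants` return generators that must be wrapped in `list()` to be printed.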

01❤Element interaction

from selenium import webdriver  # drives the browser
from selenium.webdriver.common.keys import Keys  # keyboard keys
import time

# Note: the find_element_by_* helpers used throughout these notes are the Selenium 3 API.
driver = webdriver.Chrome(r'E:\Python驅動瀏覽器\chromedriver.exe')

try:
    driver.implicitly_wait(10)

    driver.get('https://www.jd.com/')

    input1 = driver.find_element_by_id('key')
    input1.send_keys('劍網3')
    search_button = driver.find_element_by_class_name('button')
    search_button.click()

    time.sleep(1)

    # clear the search box and search again
    input2 = driver.find_element_by_class_name('text')
    input2.clear()
    input2.send_keys('劍網3花蘿')
    input2.send_keys(Keys.ENTER)

    time.sleep(10)

finally:
    driver.close()


02❤Completing a slider captcha automatically

from selenium import webdriver
from selenium.webdriver import ActionChains
import time

driver = webdriver.Chrome(r'E:\Python驅動瀏覽器\chromedriver.exe')
try:
    driver.implicitly_wait(10)
    driver.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
    time.sleep(5)

    driver.switch_to.frame('iframeResult')
    time.sleep(1)

    # source block, id: draggable
    source = driver.find_element_by_id('draggable')
    # target block, id: droppable
    target = driver.find_element_by_id('droppable')

    # print(source.size)      # size
    # print(source.tag_name)  # tag name
    # print(source.text)      # text
    # print(source.location)  # coordinates: x and y

    # horizontal distance the slider has to travel
    distance = target.location['x'] - source.location['x']
    # press and hold the source block
    ActionChains(driver).click_and_hold(source).perform()
    # approach 2: move a little at a time
    s = 0
    while s < distance:
        # build a new action chain and move 2px on each pass
        ActionChains(driver).move_by_offset(xoffset=2, yoffset=0).perform()
        s += 2
        time.sleep(0.1)
    # release the source block
    ActionChains(driver).release().perform()
    time.sleep(10)

finally:
    driver.close()
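
The loop above always moves in 2px steps, so an odd distance overshoots by one pixel. One possible refinement, sketched here with a hypothetical helper named `step_offsets`, is a generator that yields step sizes adding up to exactly the distance.

```python
def step_offsets(distance, step=2):
    """Yield per-step x offsets that add up to exactly `distance`."""
    moved = 0
    while moved < distance:
        delta = min(step, distance - moved)
        yield delta
        moved += delta


print(list(step_offsets(7)))

# Possible use inside the script above (driver, distance and ActionChains as defined there):
# for dx in step_offsets(distance):
#     ActionChains(driver).move_by_offset(xoffset=dx, yoffset=0).perform()
#     time.sleep(0.1)
```

Keeping the arithmetic in a plain function means it can be checked without opening a browser.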


03❤Showing a custom alert on the home page

from selenium import webdriver  # web driver
import time

driver = webdriver.Chrome(r'E:\Python驅動瀏覽器\chromedriver.exe')
try:
    driver.implicitly_wait(10)
    driver.get('https://www.baidu.com/')
    driver.execute_script(
        '''
        alert("花間天下第一")
        '''
    )
    time.sleep(10)
finally:
    driver.close()


04❤Simulating the browser's back and forward buttons

# simulate the browser's back and forward buttons
import time
from selenium import webdriver

browser = webdriver.Chrome(r'E:\Python驅動瀏覽器\chromedriver.exe')
browser.get('https://www.baidu.com')
browser.get('https://www.taobao.com')
browser.get('http://www.sina.com.cn/')

browser.back()
time.sleep(10)
browser.forward()
browser.close()


05❤Automatically scraping information for a given JD product search

'''
Basic version: plain and simple
'''
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard keys
import time

driver = webdriver.Chrome(r'E:\Python驅動瀏覽器\chromedriver.exe')
try:
    driver.implicitly_wait(10)
    driver.get('https://www.jd.com/')

    input1 = driver.find_element_by_id('key')
    input1.send_keys('劍網3花蘿')
    input1.send_keys(Keys.ENTER)

    time.sleep(5)
    num = 1
    good_list = driver.find_elements_by_class_name('gl-item')
    for good in good_list:
        # product name
        good_name = good.find_element_by_css_selector('.p-name em').text
        # product link
        good_url = good.find_element_by_css_selector('.p-name a').get_attribute('href')
        # product price
        good_price = good.find_element_by_class_name('p-price').text
        # product reviews
        good_commit = good.find_element_by_class_name('p-commit').text

        good_content = f'''
        No. {num}
        Product name: {good_name}
        Product link: {good_url}
        Product price: {good_price}
        Product reviews: {good_commit}
        '''
        print(good_content)
        with open('jd.txt', 'a', encoding='utf-8') as f:
            f.write(good_content)
        num += 1
finally:
    driver.close()


'''
Intermediate version: adds scrolling down and going to the next page
'''
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(r'E:\Python驅動瀏覽器\chromedriver.exe')

num = 1

try:
    driver.implicitly_wait(10)
    # send a request to JD
    driver.get('https://www.jd.com/')

    # type the search keyword into the JD home page search box and press Enter
    input_tag = driver.find_element_by_id('key')
    input_tag.send_keys('劍網3炮太')
    input_tag.send_keys(Keys.ENTER)

    time.sleep(5)

    # scroll down 5000px
    js_code = '''
        window.scrollTo(0, 5000)
    '''

    driver.execute_script(js_code)

    # wait 5 seconds for the product data to load
    time.sleep(5)

    good_list = driver.find_elements_by_class_name('gl-item')
    for good in good_list:
        # product name
        good_name = good.find_element_by_css_selector('.p-name em').text
        # product link
        good_url = good.find_element_by_css_selector('.p-name a').get_attribute('href')
        # product price
        good_price = good.find_element_by_class_name('p-price').text
        # product reviews
        good_commit = good.find_element_by_class_name('p-commit').text

        good_content = f'''
        num: {num}
        Product name: {good_name}
        Product link: {good_url}
        Product price: {good_price}
        Product reviews: {good_commit}
        '''
        print(good_content)

        with open('jd.txt', 'a', encoding='utf-8') as f:
            f.write(good_content)
        num += 1

    # find the next-page button and click it
    next_tag = driver.find_element_by_class_name('pn-next')
    next_tag.click()

    time.sleep(10)

finally:
    driver.close()
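
The script above scrolls a fixed 5000px, which may or may not reach the end of a lazily loaded list. A possible alternative, sketched here under the assumption that the page grows as it is scrolled, keeps scrolling until the page height stops changing. `scroll_to_bottom` is a hypothetical helper; it only relies on the driver's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached."""
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(pause)  # give lazily loaded items time to arrive
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height
    return last_height
```

In the script above it would replace the `js_code` block and the fixed `time.sleep(5)`, for example as `scroll_to_bottom(driver, pause=2)`.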


'''
Berserk version: loads every matching product
'''
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys


def get_good(driver, num=1):
    time.sleep(5)

    # scroll down 5000px
    js_code = '''
        window.scrollTo(0, 5000)
    '''
    driver.execute_script(js_code)

    # wait 5 seconds for the product data to load
    time.sleep(5)
    good_list = driver.find_elements_by_class_name('gl-item')
    for good in good_list:
        # product name
        good_name = good.find_element_by_css_selector('.p-name em').text
        # product link
        good_url = good.find_element_by_css_selector('.p-name a').get_attribute('href')
        # product price
        good_price = good.find_element_by_class_name('p-price').text
        # product reviews
        good_commit = good.find_element_by_class_name('p-commit').text

        good_content = f'''
        num: {num}
        Product name: {good_name}
        Product link: {good_url}
        Product price: {good_price}
        Product reviews: {good_commit}
        \n
        '''
        print(good_content)
        with open('jd.txt', 'a', encoding='utf-8') as f:
            f.write(good_content)
        num += 1

    print('Product information written successfully!')

    # find the next-page button and click it; stop when there is none
    try:
        next_tag = driver.find_element_by_class_name('pn-next')
    except NoSuchElementException:
        return
    next_tag.click()

    time.sleep(5)
    # call the function recursively, carrying the running counter along
    get_good(driver, num)


if __name__ == '__main__':
    driver = webdriver.Chrome(r'E:\Python驅動瀏覽器\chromedriver.exe')
    try:
        driver.implicitly_wait(10)
        # send a request to JD
        driver.get('https://www.jd.com/')
        # type the search keyword into the JD home page search box and press Enter
        input_tag = driver.find_element_by_id('key')
        input_tag.send_keys('劍網3傘蘿')
        input_tag.send_keys(Keys.ENTER)

        # call the function that collects the product information
        get_good(driver)

    finally:
        # the browser is closed once, here, rather than inside get_good
        driver.close()


06❤Installing and using bs4

Install with pip3 install ***

'''
Install the parser:
pip install lxml
Install the parsing library:
pip install bs4

Note: if the source text contains line breaks, they are counted as nodes too. (a gotcha)
'''

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
# Python's built-in parser
# soup = BeautifulSoup(html_doc, 'html.parser')

# calling bs4 returns a soup object
# first argument: the text to parse
# second argument: the parser
soup = BeautifulSoup(html_doc, 'lxml')

# missing html tags are completed automatically
print(soup)

# the bs4 type
print(type(soup))
# pretty-print the html tags
html = soup.prettify()
print(html)


07❤bs4: traversing the document tree

from bs4 import BeautifulSoup

# Note: if the source text contains line breaks, they are counted as nodes too. (a gotcha)
html_doc = """
<html><head><title>The Dormouse's story</title></head><body><p class="sister"><b>$37</b></p><p class="story" id="p">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" >Elsie</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p>
"""


# first argument: the text to parse
# second argument: the parser
soup = BeautifulSoup(html_doc, 'lxml')
# Traversing the document tree
# 1. Select a tag directly *****
# (returns an object)
print(soup.html)
print(type(soup.html))
print(soup.a)  # the first a tag
print(soup.p)  # the first p tag

# 2. Get a tag's name
print(soup.a.name)  # the name of the a tag

# 3. Get a tag's attributes *****
print(soup.a.attrs)  # all attributes of the a tag
print(soup.a.attrs['href'])  # the href attribute of the a tag

# 4. Get a tag's text content *****
print(soup.p.text)  # $37

# 5. Nested tag selection
print(soup.p.b)  # the b tag inside the first p tag
print(soup.p.b.text)  # the text inside the b tag

# 6. Children and descendants
# children
print(soup.p.children)  # all children of the first p tag; returns an iterator
print(list(soup.p.children))  # convert to a list

# descendants
print(soup.body.descendants)  # all descendants of the body tag; returns a generator
print(list(soup.body.descendants))  # convert to a list

# everything inside the first p tag, returned as a list
print(soup.p.contents)

# 7. Parent and ancestors
# parent
print(soup.a.parent)  # the parent of the first a tag

# ancestors (parent, parent's parent, and so on)
print(list(soup.a.parents))  # ancestors of the first a tag; .parents returns a generator

# 8. Siblings
print(soup.a)
# next sibling
print(soup.a.next_sibling)

# all following siblings; returns a generator
print(soup.a.next_siblings)
print(list(soup.a.next_siblings))

# previous sibling
print(soup.a.previous_sibling)
# all preceding siblings; returns a generator
print(list(soup.a.previous_siblings))
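
The line-break gotcha mentioned in the notes can be seen in a small sketch like the following. The two markup strings are invented for this example, and html.parser is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

compact = '<p><b>one</b><i>two</i></p>'
spaced = '<p>\n<b>one</b>\n<i>two</i>\n</p>'

compact_children = list(BeautifulSoup(compact, 'html.parser').p.children)
spaced_children = list(BeautifulSoup(spaced, 'html.parser').p.children)

# the line breaks in `spaced` show up as extra text nodes
print(len(compact_children))
print(len(spaced_children))

# keep only real tags when walking children or siblings
tags_only = [child for child in spaced_children if isinstance(child, Tag)]
print([tag.name for tag in tags_only])
```

This is why `next_sibling` on pretty-printed HTML often returns a newline string instead of the next tag.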


08❤bs4: searching the document tree

'''
Tag lookup and attribute lookup:

    Tags:
        - String filter: whole-string match
            name   match by tag name
            attrs  match by attribute
            text   match by text

        - Regex filter
            match with the re module

        - List filter
            match any item in the list

        - bool filter
            True matches

        - Function filter
            for lookups that combine attributes that must be present with attributes that must be absent

    Attributes:
        - class_
        - id
'''

import re
# `soup` is the BeautifulSoup object built in section 07

# name
# use the re module to match nodes whose tag name contains "a"
a = soup.find(name=re.compile('a'))
print(a)
a_s = soup.find_all(name=re.compile('a'))
print(a_s)

# attrs
a = soup.find(attrs={"id": re.compile('link')})
print(a)

'''3. List filter'''
# match any item in the list
print(soup.find(name=['a', 'p', 'html', re.compile('a')]))
print(soup.find_all(name=['a', 'p', 'html', re.compile('a')]))


'''4. bool filter'''
# True matches
print(soup.find(name=True, attrs={"id": True}))

'''5. Function filter'''
# for lookups that combine attributes that must be present with attributes that must be absent

def have_id_not_class(tag):
    # print(tag.name)
    if tag.name == 'p' and tag.has_attr("id") and not tag.has_attr("class"):
        return tag

# print(soup.find_all(name=function_object))
print(soup.find_all(name=have_id_not_class))


''' Supplementary notes:'''
# id
a = soup.find(id='link2')
print(a)

# class
p = soup.find(class_='sister')
print(p)
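
The outline above lists a string filter (name, attrs, text), but the code starts at the regex filter. A minimal sketch of the string filter might look like this. The markup is a shortened copy of the sample document, and `string=` is the name recent bs4 releases use for the argument the notes call `text`.

```python
from bs4 import BeautifulSoup

html_doc = '''
<p class="story" id="p">Once upon a time
<a href="http://example.com/elsie" class="sister">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# name: exact tag name
first_a = soup.find(name='a')
# attrs: exact attribute value
lacie = soup.find(attrs={'id': 'link2'})
# text: exact text of the tag
elsie = soup.find(name='a', string='Elsie')

print(first_a['href'])
print(lacie.text)
print(elsie['href'])
```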


July 3 homework [work in progress]: only loads the first app, not all of them

'''
Today's homework:
    1. Organize the class notes
    2. Write a blog post
    3. Scrape app data from Wandoujia
        spider_method:
            requests + bs4
                or
            selenium

        url:
            https://www.wandoujia.com/category/6001

        data:
            name, detail page url, download count, app size
            app_name, detail_url, download_num, app_size
'''
import requests
from bs4 import BeautifulSoup
import time
response = requests.get('https://www.wandoujia.com/category/6001')
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'html.parser')

app_list = soup.find(attrs={'class': 'app-desc'})
app_url_name_list = soup.find(name='a', attrs={'class': 'name'})

# read the title from the a tag
app_name = app_url_name_list['title']
# get the url
detail_url = app_url_name_list.attrs.get('href')
# get the download count
download_num = soup.find(attrs={'class': 'install-count'}).text
# get the app size
app_size = soup.find(attrs={'class': 'dot'}).next_sibling.next_sibling.text
app_content = f'''
❀=================  Game info  ==================❀
Game name: {app_name}
Detail page url: {detail_url}
Download count: {download_num}
App size: {app_size}
❀=============  Game info loaded  ==============❀
'''
print(app_content)
with open('wdj.txt', 'a', encoding='utf-8') as f:
    f.write(app_content)
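
The script only returns the first app because every lookup uses `soup.find`. One way to cover all of them, sketched here on invented markup rather than the live page, is to collect every `app-desc` block with `find_all` and search inside each one. `sample_html` and `parse_apps` are hypothetical; the class names are the ones used in the homework script, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names used in the homework script.
sample_html = '''
<ul>
  <li><div class="app-desc">
    <a class="name" href="https://example.com/app/1" title="Game One">Game One</a>
    <span class="install-count">100,000 installs</span>
    <span class="dot">.</span>
    <span>25MB</span>
  </div></li>
  <li><div class="app-desc">
    <a class="name" href="https://example.com/app/2" title="Game Two">Game Two</a>
    <span class="install-count">2,000 installs</span>
    <span class="dot">.</span>
    <span>80MB</span>
  </div></li>
</ul>
'''


def parse_apps(html):
    """Return one dict per app-desc block found in the html."""
    soup = BeautifulSoup(html, 'html.parser')
    apps = []
    for card in soup.find_all(attrs={'class': 'app-desc'}):
        link = card.find(name='a', attrs={'class': 'name'})
        dot = card.find(attrs={'class': 'dot'})
        apps.append({
            'app_name': link['title'],
            'detail_url': link.attrs.get('href'),
            'download_num': card.find(attrs={'class': 'install-count'}).text,
            # find_next_sibling skips the whitespace text nodes between tags
            'app_size': dot.find_next_sibling('span').text,
        })
    return apps


for app in parse_apps(sample_html):
    print(app)
```

With the live site, `parse_apps(response.text)` would take the place of the single-item lookups, assuming the page really uses this structure.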
