Python Web Scraping Tutorial (Scraping Baidu in 16 Lines of Code)

Date: 2020-07-07


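I have been learning Python recently, but I never managed to get my head around regular expressions, so I wrote a Baidu scraper the most naive way possible. It is only 16 short lines of code.
First, install the required packages: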

pip3 install bs4
pip3 install requests

Once they are installed, enter:

import requests
from bs4 import BeautifulSoup

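Press F5 to run it. If no error is reported, the installation succeeded.
Open a browser, go to 'www.baidu.com', and search for anything. I use 'python' as the example here.
You can see that the URL of the Baidu results page is: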

https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=python****

which can be simplified to:

https://www.baidu.com/s?wd=python
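The simplified URL carries only the wd parameter, so it can be built for any keyword. Below is a minimal sketch, assuming the standard library's urllib.parse.urlencode is used for escaping; build_search_url and BASE_URL are hypothetical names for illustration and are not part of the original script.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"

def build_search_url(keyword):
    # urlencode percent-encodes characters that are not URL-safe
    # (spaces, '&', non-ASCII text), so the keyword cannot break the query string.
    return BASE_URL + "?" + urlencode({"wd": keyword})

print(build_search_url("python"))
```

requests can also do this encoding itself if the keyword is passed as requests.get(BASE_URL, params={"wd": keyword}).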

So first, try to fetch the HTML of the search results:

import requests
from bs4 import BeautifulSoup

url = 'https://www.baidu.com/s?wd=' + 'python'
# Send browser-like headers so the request looks like it comes from a normal browser
headers = {"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.16 Safari/537.36"}
html = requests.get(url, headers=headers).text
print(html)

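Next, we pick out what we want from the HTML.

You can either read through the scraped data or use the developer tools (F12) in Google Chrome.
Here I use Chrome's F12 as the example.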

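You can see that, among the div tags:

those with class 'result c-container ' are non-Baidu, non-ad content (the content we need)
those with class 'result-op c-container xpath-log' are Baidu's own content (filter as needed)
those with any other class are ads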
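The three categories above can be sketched as a small classifier. This is only an illustration: the SAMPLE markup is made up to mimic the class names listed above, and classify is a hypothetical helper, not something from the original script.

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mimics the three kinds of result blocks described above.
SAMPLE = """
<div class="result c-container "><h3><a href="http://example.com/a">Organic</a></h3></div>
<div class="result-op c-container xpath-log"><h3><a href="http://example.com/b">Baidu</a></h3></div>
<div class="ad-block"><h3><a href="http://example.com/c">Ad</a></h3></div>
"""

def classify(div):
    # bs4 exposes the class attribute as a list of individual class names.
    classes = div.get("class", [])
    if "result" in classes and "c-container" in classes:
        return "organic"
    if "result-op" in classes:
        return "baidu"
    return "ad"

soup = BeautifulSoup(SAMPLE, "html.parser")
kinds = [classify(div) for div in soup.find_all("div")]
print(kinds)
```

Because the class attribute is split on whitespace, the trailing space in 'result c-container ' does not affect the check.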

First, create the parser object used for filtering:

soup = BeautifulSoup(html, 'html.parser')

Use a for loop to find every div tag whose class is 'result c-container':

for div in soup.find_all('div', class_="result c-container"):
    print(div)

Then use another for loop to find the h3 tags inside each of them:

for div in soup.find_all('div', class_="result c-container"):
    # print(div)  # commented out to make the output easier to check
    for h3 in div.find_all('h3'):
        print(h3.text)

Finally, find the title and the link (the a tag):

for div in soup.find_all('div', class_="result c-container"):
    # print(div)
    for h3 in div.find_all('h3'):
        # print(h3.text)
        for a in h3.find_all('a'):
            print(a.text, ' url:', a['href'])
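The nested loops can also be wrapped in a function that returns the results instead of printing them. The sketch below swaps in a CSS selector (soup.select) for the three find_all loops. That is a different technique from the one used in this tutorial, and it also matches a div that carries extra classes besides 'result' and 'c-container'. extract_results and the sample markup are hypothetical, for illustration only.

```python
from bs4 import BeautifulSoup

def extract_results(html):
    """Return (title, link) pairs for the organic result blocks in html."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # The selector walks div -> h3 -> a in one step and matches any div
    # that has both the 'result' and 'c-container' classes.
    for a in soup.select("div.result.c-container h3 a"):
        href = a.get("href")  # None instead of a KeyError when href is missing
        if href:
            results.append((a.get_text(strip=True), href))
    return results

# Made-up sample markup for illustration only.
sample = """
<div class="result c-container "><h3><a href="http://example.com/1"> Python.org </a></h3></div>
<div class="result-op c-container xpath-log"><h3><a href="http://example.com/2">Baike</a></h3></div>
<div class="result c-container"><h3><a>No link</a></h3></div>
"""
print(extract_results(sample))
```

Using a.get("href") rather than a['href'] means a link without an href attribute is skipped rather than stopping the loop with an exception.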

With that, we have successfully filtered out the ads, Baidu Baike entries, and so on.
The complete code is as follows:

import requests
from bs4 import BeautifulSoup
url = 'https://www.baidu.com/s?wd=' + 'python'
headers = {"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.16 Safari/537.36"}
html = requests.get(url, headers=headers).text
print(html)
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_="result c-container"):
    # print(div)
    for h3 in div.find_all('h3'):
        # print(h3.text)
        for a in h3.find_all('a'):
            print(a.text, ' url:', a['href'])
# To write the scraped content to a file, add these two lines:
# with open(r'C:/爬蟲/百度.txt', 'w', encoding='utf-8') as wr:
#     wr.write(html)
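The two optional lines at the end of the script write text to a file. If you would rather save the extracted titles and links than the raw page, one possible sketch follows. save_results, the sample data, and the temporary output path are all hypothetical and chosen only so that the example runs anywhere.

```python
import os
import tempfile

def save_results(results, path):
    # Write one result per line: the title, a tab, then the link.
    with open(path, "w", encoding="utf-8") as wr:
        for title, link in results:
            wr.write(f"{title}\t{link}\n")

# Hypothetical sample data and a temporary path, for illustration only.
results = [("Python.org", "http://example.com/1")]
out_path = os.path.join(tempfile.gettempdir(), "baidu_results.txt")
save_results(results, out_path)
```

Passing encoding='utf-8' explicitly, as the original comment does, keeps Chinese titles intact regardless of the system's default encoding.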