Python Web Data Scraping (Accompaniment)

Date: 2019-11-24


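This is the preliminary chapter, so let's do some preparation. There were too many things earlier that I never got around to blogging about. (This post only applies to Python 3.x; for Python 2.x, adapt the code yourself.)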

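First, you need the bs4 module.

Installing on Windows: pip3 install bs4. If you have both Python 2.x and Python 3.x on your machine, run cmd as administrator and execute pip3 install bs4 so that bs4 is installed for Python 3.x.

Installing on Linux: sudo pip3 install bs4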

You also need the urllib.request module.

urllib.request is part of the Python 3 standard library, so there is nothing to install on either Windows or Linux; it cannot be installed with pip3 and is simply imported.
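Before moving on, it can help to confirm that both modules are importable from the interpreter you actually run. The snippet below is only a minimal sketch; the helper name is_available is my own and not part of any library.

```python
from importlib.util import find_spec


def is_available(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        return find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names when the parent package is missing
        return False


# Report the status of the two modules this post relies on
for name in ("bs4", "urllib.request"):
    print(name, "available" if is_available(name) else "missing")
```

If bs4 is reported as missing, the pip3 command was probably run for a different Python installation.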

Example 1: Fetching the page source

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with the built-in HTML parser
html = urlopen("http://wikipedia.org")
dgc = BeautifulSoup(html, "html.parser")
print(dgc)

The output is shown in the screenshot below:

I forgot to add custom error handling here. You can of course leave it out, but to be on the safe side it is better to add it.
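As far as error handling goes, urlopen reports failures through urllib.error.HTTPError (the server answered with an error status such as 404) and urllib.error.URLError (the server could not be reached). A minimal sketch of a wrapper follows; the function name fetch and the 10 second timeout are my own assumptions, not something from the original examples.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first
        print("HTTP error:", e.code)
    except URLError as e:
        print("URL error:", e.reason)
    return None
```

Calling fetch("http://wikipedia.org") would then return either the page bytes or None, so the caller can check for None before handing the result to BeautifulSoup.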

Example 2: Matching a specific tag

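from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://dlszx.dgjy.net/")
except HTTPError as e:
    # The server answered with an error status such as 404
    print("HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all
    print("URL error:", e.reason)
else:
    dgc = BeautifulSoup(html, "html.parser")
    # Find every <img> whose src is exactly this path
    fbc = dgc.find_all("img", {"src": "uploadfile/201762105219962.jpg"})
    print(fbc)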
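The same attribute match can be tried without any network access by parsing an HTML string directly. This is only a sketch: the HTML below is made up for illustration and merely reuses the src value from Example 2.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
sample_html = """
<html><body>
  <img src="uploadfile/201762105219962.jpg" alt="banner">
  <img src="img/logo.jpg" alt="logo">
  <p>No image here</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Only <img> tags whose src equals the given string are returned
matches = soup.find_all("img", {"src": "uploadfile/201762105219962.jpg"})
for tag in matches:
    print(tag["src"], tag.get("alt"))
```

Because the dictionary value is a plain string, the comparison is exact, so the second image is not returned.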

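Example 3: Matching all corresponding tags with a regular expression

If you don't know regular expressions, go and learn them first.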

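from urllib.request import urlopen
from urllib.error import HTTPError, URLError
import re
from bs4 import BeautifulSoup

try:
    html = urlopen("http://dlszx.dgjy.net/")
except HTTPError as e:
    print("HTTP error:", e.code)
except URLError as e:
    print("URL error:", e.reason)
else:
    dgc = BeautifulSoup(html, "html.parser")
    # Match every <img> whose src contains img/ followed by .jpg
    fbc = dgc.find_all("img", {"src": re.compile(r"img/.*?\.jpg")})
    for inks in fbc:
        print(inks)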
Note!!!: don't use findAll to match against search engine pages; the output is such a mess it will drive you crazy.
Regex matching against search engine pages has to be very precise, for example: http:\/\/[a-zA-Z].*?\.[a-z]+
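To see what the src pattern from Example 3 actually accepts, it can be tried on a few strings with the re module alone. This is a small sketch with made-up src values; I am assuming search semantics here (a match anywhere in the string), which as far as I know is how BeautifulSoup applies a compiled pattern to an attribute value.

```python
import re

# Same pattern as in Example 3, written as a raw string
pattern = re.compile(r"img/.*?\.jpg")

# Hypothetical src values
candidates = [
    "img/logo.jpg",
    "static/img/a/b.jpg",
    "img/logo.png",
    "uploadfile/1.jpg",
]

# Keep the values in which the pattern is found anywhere
matching = [src for src in candidates if pattern.search(src)]
print(matching)
```

The pattern is not anchored, so a path like static/img/a/b.jpg is accepted as well; add ^ and $ if only paths that start with img/ and end in .jpg should pass.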

Example 4:

Matching all the links on a site

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://wikipedia.org")
except HTTPError as e:
    print("HTTP error:", e.code)
except URLError as e:
    print("URL error:", e.reason)
else:
    gfc = BeautifulSoup(html, "html.parser")
    # Print the href of every <a> tag that has one
    for inks in gfc.find_all("a"):
        if "href" in inks.attrs:
            print(inks.attrs["href"])
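The href values collected in Example 4 are often relative (for instance /wiki/... or //host/...), so they usually need to be resolved against the page URL before they can be requested. Below is a minimal sketch using urllib.parse.urljoin from the standard library; the helper name absolute_links and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping duplicates and in-page anchors."""
    seen = set()
    result = []
    for href in hrefs:
        # Ignore empty values and anchors that only point inside the page
        if not href or href.startswith("#"):
            continue
        full = urljoin(base_url, href)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


# Hypothetical href values as Example 4 might print them
sample = ["/wiki/Main_Page", "//en.wikipedia.org/", "#top",
          "/wiki/Main_Page", "https://example.com/a"]
for link in absolute_links("http://wikipedia.org", sample):
    print(link)
```

Keeping the first occurrence of each URL preserves the order in which the links appear on the page.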

The time now is 2017-8-13-13:38
