Crawling an Entire Site for 404 Errors with Python

Date: 2019-11-13



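Links are an important factor in SEO. To rank better in search engines, you should periodically check whether the links on your site are still valid, especially because major changes to a site can introduce broken links. Online tools such as Google Analytics, Bing Webmaster Tools, and brokenlinkcheck.com can detect these link problems. Although ready-made tools exist, we can also write our own, and Python makes this very easy.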

Original article: How to Check Broken Links with 404 Error in Python

Author: Xiao Ling

Translator: yushulx

How to Check a Website for 404 Errors

To help search engines crawl a site, most websites provide a sitemap.xml. The basic steps are therefore:

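1. Read sitemap.xml to get all the pages on the site.
2. From each of those pages, read all the links it contains, which may include inbound and outbound links.
3. Check the status of every link.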
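
The three steps above can be sketched as one small function. This is only a minimal outline under stated assumptions, not the article's implementation: `find_broken_links`, `get_links`, and `get_status` are hypothetical names, and the two callables stand in for the page-reading and status-checking code shown later, so the sketch runs offline on sample data.

```python
from typing import Callable, Dict, Iterable, List


def find_broken_links(
    pages: Iterable[str],
    get_links: Callable[[str], List[str]],
    get_status: Callable[[str], int],
) -> Dict[str, List[str]]:
    """Map each link that answers 404 to the pages that reference it."""
    status_cache: Dict[str, int] = {}
    broken: Dict[str, List[str]] = {}
    for page in pages:                    # step 1: pages listed in the sitemap
        for link in get_links(page):      # step 2: links found on each page
            if link not in status_cache:  # step 3: check each link only once
                status_cache[link] = get_status(link)
            if status_cache[link] == 404:
                broken.setdefault(link, []).append(page)
    return broken


# Offline sample data standing in for a real site (hypothetical URLs).
site = {
    "http://example.com/a": ["http://example.com/b", "http://example.com/gone"],
    "http://example.com/b": ["http://example.com/gone"],
}
statuses = {"http://example.com/b": 200, "http://example.com/gone": 404}

result = find_broken_links(site, site.__getitem__, statuses.__getitem__)
print(result)
```

Caching the status per link avoids requesting the same URL repeatedly when many pages share a navigation menu or footer.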

Installing the Software

The BeautifulSoup library makes it very convenient to parse page elements:

pip install beautifulsoup4

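How to Crawl Web Pages with Python

Because the program may run for a long time, it should be possible to interrupt it at any moment. To do that, install a handler for the keyboard interrupt (Ctrl+C):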

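import signal
import threading

# Shared flag that long-running loops poll to know when to stop.
shutdown_event = threading.Event()

def ctrl_c(signum, frame):
    # Runs on Ctrl+C (SIGINT): flag the shutdown, then exit.
    shutdown_event.set()
    raise SystemExit('\nCancelling...')

signal.signal(signal.SIGINT, ctrl_c)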
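
The handler only sets a flag, so each long-running loop has to check that flag itself. The following is a minimal sketch of that pattern, assuming a module-level `shutdown_event` like the one above; `process_until_cancelled` and `stop_after_two` are hypothetical names, and the event is set from inside the callback here only to simulate a Ctrl+C without sending a real signal.

```python
import threading

# Same kind of shared flag as in the handler above.
shutdown_event = threading.Event()


def process_until_cancelled(items, handle):
    """Call handle(item) for each item, stopping once shutdown_event is set."""
    done = []
    for item in items:
        if shutdown_event.is_set():
            break  # a cancellation was requested; leave the loop early
        handle(item)
        done.append(item)
    return done


def stop_after_two(item):
    # Simulates the user pressing Ctrl+C while item 2 is being handled.
    if item == 2:
        shutdown_event.set()


processed = process_until_cancelled([1, 2, 3, 4], stop_after_two)
print(processed)  # [1, 2]
```

Checking the flag at the top of each iteration means the item in progress is finished before the loop stops.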

Use BeautifulSoup to parse sitemap.xml:

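from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

# build_request() is a helper from the full script that is not shown in this
# article; it is assumed to return a request object that urlopen() accepts.
def read_sitemap(self):  # method name is illustrative; the article shows only the body
    pages = []
    try:
        request = build_request("http://kb.dynamsoft.com/sitemap.xml")
        f = urlopen(request, timeout=3)
        xml = f.read()
        soup = BeautifulSoup(xml, "html.parser")
        urlTags = soup.find_all("url")

        print("The number of url tags in sitemap:", len(urlTags))

        # Each <url> entry stores the page address in its <loc> child.
        for sitemap in urlTags:
            link = sitemap.find_next("loc").text
            pages.append(link)

        f.close()
    except HTTPError as e:
        print(e.code)    # the server answered with an error status
    except URLError as e:
        print(e.reason)  # no HTTP response (DNS failure, timeout, ...)

    return pages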
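
To see what the sitemap parsing step produces without touching the network, here is a small sketch that runs on an inline sample. It deliberately swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so that it needs no extra packages; `parse_sitemap` and the sample URLs are assumptions for illustration, not part of the original script.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

pages = parse_sitemap(sample)
print(pages)  # ['http://example.com/', 'http://example.com/about']
```

Unlike the tag-name search in the BeautifulSoup version, ElementTree needs the sitemap namespace spelled out in the query.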

Parse the HTML elements to collect all the links:

def queryLinks(self, result):
    # result is the list of byte chunks collected by readHref().
    links = []
    content = b''.join(result)
    soup = BeautifulSoup(content, "html.parser")
    elements = soup.select('a')

    for element in elements:
        if shutdown_event.is_set():
            return GAME_OVER  # sentinel constant from the full script (not shown)

        # Anchors without an href return None; keep absolute http(s) links only.
        link = element.get('href')
        if link and link.startswith('http'):
            links.append(link)

    return links

def readHref(self, url):
    result = []
    try:
        request = build_request(url)
        f = urlopen(request, timeout=3)
        # Read the page in 10 KB chunks so Ctrl+C can interrupt a slow download.
        while not shutdown_event.is_set():
            tmp = f.read(10240)
            if len(tmp) == 0:
                break
            result.append(tmp)

        f.close()
    except HTTPError as e:
        print(e.code)
    except URLError as e:
        print(e.reason)

    if shutdown_event.is_set():
        return GAME_OVER

    return self.queryLinks(result)
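
The code above keeps only links that already start with `http`, so relative links such as `/docs` are skipped. If those should be checked too, one possible approach is to resolve them against the page URL first. The sketch below is an assumption-laden illustration rather than the article's method: it uses the standard library's `HTMLParser` instead of BeautifulSoup so it runs without extra packages, and `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the <a> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # anchor without a usable href
        absolute = urljoin(self.base_url, href)  # resolves relative links
        if absolute.startswith(("http://", "https://")):
            self.links.append(absolute)  # skips mailto:, javascript:, ...


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


html = (
    '<a href="/docs">Docs</a>'
    '<a href="http://other.example/x">X</a>'
    '<a>no href</a>'
    '<a href="mailto:a@b.c">mail</a>'
)
links = extract_links(html, "http://example.com/kb/page.html")
print(links)  # ['http://example.com/docs', 'http://other.example/x']
```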

Check the response status of each link:

def crawlLinks(self, links, file=None):
    for link in links:
        if shutdown_event.is_set():
            return GAME_OVER

        status_code = 0  # stays 0 when no HTTP response is received

        try:
            request = build_request(link)
            f = urlopen(request)
            status_code = f.status
            f.close()
        except HTTPError as e:
            status_code = e.code  # e.g. 404 for a broken link
        except URLError:
            pass  # DNS failure, refused connection, ...

        # Record broken links in the optional output file.
        if status_code == 404 and file is not None:
            file.write(link + '\n')

        print(status_code, ':', link)

    return GAME_OVER
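
The status check can also be isolated into a small helper. The sketch below is one possible variant, not the original script's code: it sends a `HEAD` request to avoid downloading the body (an assumption, since some servers reject `HEAD` and would need a fallback to `GET`), and the hypothetical `opener` parameter exists only so the function can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def get_status(url, timeout=5, opener=urlopen):
    """Return the HTTP status code for url, or 0 if no response was received."""
    request = Request(url, method="HEAD")  # headers only, no body download
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:  # must come first: HTTPError subclasses URLError
        return error.code
    except URLError:
        return 0


# Offline stand-ins for urlopen, used here instead of real requests.
def fake_404(request, timeout=None):
    raise HTTPError(request.full_url, 404, "Not Found", None, None)


def fake_unreachable(request, timeout=None):
    raise URLError("connection refused")


print(get_status("http://example.com/missing", opener=fake_404))    # 404
print(get_status("http://example.com/", opener=fake_unreachable))   # 0
```

In real use the `opener` argument is simply omitted, and the function falls back to `urllib.request.urlopen`.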