Python web scraping: errors when some tags lack the specified attribute

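While writing code to scrape the href attribute of the a tags on a page, I ran into a problem: if an a tag has no href attribute, looking it up with tag['href'] raises a KeyError and the script stops.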
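To see the failure in isolation, here is a minimal sketch. The HTML snippet and the helper name collect_hrefs_unsafe are invented for illustration, and it assumes the built-in 'html.parser' backend instead of lxml so that only bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: the second <a> has no href attribute.
html = '<a href="https://example.com/">with href</a><a name="top">no href</a>'
soup = BeautifulSoup(html, 'html.parser')


def collect_hrefs_unsafe(soup):
    """Index every <a> tag directly; fails on the first tag without href."""
    return [a['href'] for a in soup.find_all('a')]


try:
    collect_hrefs_unsafe(soup)
    error_name = None
except KeyError as exc:
    # Tag attributes behave like a dict, so a missing key raises KeyError.
    error_name = type(exc).__name__

print(error_name)
```

The first tag is read without trouble; the exception only appears once the loop reaches a tag that lacks the attribute, which is why the problem tends to show up on some pages and not others.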

 

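A Baidu search turned up people who solve this with regular-expression matching, but none of those approaches felt very good to me. I then checked the official BeautifulSoup documentation and found a better method: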

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

The has_attr() method checks whether a tag has a given attribute and returns True if it does.
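A small sketch of has_attr() on two tags, one with href and one without. The HTML string is made up for the example, and 'html.parser' is assumed as the parser. It also shows Tag.get(), which returns None for a missing attribute instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <a> carries an href.
soup = BeautifulSoup('<a href="/docs">docs</a><a id="anchor">anchor</a>',
                     'html.parser')
first, second = soup.find_all('a')

# has_attr() answers the yes/no question without touching the value.
has_href = [first.has_attr('href'), second.has_attr('href')]

# get() is a dict-style lookup that falls back to None when the key is absent.
missing_value = second.get('href')

print(has_href, missing_value)
```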

 

Solution:

To keep the code tidy, I used an anonymous function (lambda):

 

soup_a = soup.find_all(lambda tag:tag.has_attr('href'))
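One thing to be aware of: this lambda matches every tag that has an href, not only a tags, so link tags in the page head are included as well. The sketch below compares it with two ways of restricting the match to a tags. The sample HTML is invented, and 'html.parser' is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a <link>, an <a> with href, and an <a> without.
html = '''
<html><head><link rel="stylesheet" href="/style.css"></head>
<body><a href="/page">page</a><a name="top">top</a></body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

# The lambda from the text: any tag that carries an href.
any_tag = soup.find_all(lambda tag: tag.has_attr('href'))

# Same idea, limited to <a> tags.
only_a = soup.find_all(lambda tag: tag.name == 'a' and tag.has_attr('href'))

# Keyword form: href=True means "the attribute exists, whatever its value".
keyword_form = soup.find_all('a', href=True)

any_names = [t.name for t in any_tag]
a_hrefs = [t['href'] for t in only_a]
keyword_hrefs = [t['href'] for t in keyword_form]

print(any_names, a_hrefs, keyword_hrefs)
```

Which form to use depends on whether stylesheet and similar links should end up in the result; for collecting page links only, one of the a-restricted variants is probably closer to the intent.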

 

 

The final script for scraping the URLs on a page is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author:Riy

import logging
import sys
import time
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

logging.basicConfig(
    level=logging.DEBUG,
    format='%(levelname)-10s: %(message)s',
)


class down_url:
    def download(self, url):
        '''Scrape the href values from the page at url'''
        # Defined before the try block so it exists even if the request fails
        soup_a_href_list = []
        try:
            start = time.time()
            logging.debug('starting download url...')
            response = requests.get(url)
            page = response.content
            soup = BeautifulSoup(page, 'lxml')
            # Keep only tags that actually carry an href attribute
            soup_a = soup.find_all(lambda tag: tag.has_attr('href'))
            for k in soup_a:
                soup_a_href = k['href']
                # str.find() returns -1 (truthy) on no match, so test membership instead
                if '.' in soup_a_href:
                    soup_a_href_list.append(soup_a_href)
            print(f'Ran for {time.time() - start} seconds')
        except RequestException as e:
            # Network and HTTP errors raised by requests
            print(e)
        return soup_a_href_list

    def write(self, soup_a_href_list, txt):
        '''Write the URLs to a txt file'''
        logging.debug('starting write txt...')
        with open(txt, 'a', encoding='utf-8') as f:
            for i in soup_a_href_list:
                f.write(f'{i}\n')
        print(f'Generated file {txt}')

    def help_memo(self):
        '''Show help'''
        print('''
        -h or --help    show help
        -u or --url     add url
        -t or --txt     write to txt file
        ''')

    def welcome(self):
        '''Welcome page'''
        desc = ('Welcome to the url scraper'.center(30, '*'))
        print(desc)


def main():
    '''Main function'''
    temp = down_url()
    logging.debug('starting run python...')
    try:
        if len(sys.argv) == 1:
            temp.welcome()
            temp.help_memo()
        elif sys.argv[1] in {'-h', '--help'}:
            temp.help_memo()
        elif (len(sys.argv) >= 5 and sys.argv[1] in {'-u', '--url'}
                and sys.argv[3] in {'-t', '--txt'}):
            # One url plus an output file: -u <url> -t <file>
            a = temp.download(sys.argv[2])
            temp.write(a, sys.argv[4])
        elif sys.argv[1] in {'-t', '--txt'}:
            print('Please enter a url first!')
        elif sys.argv[1] in {'-u', '--url'}:
            # Several urls: download them in a pool of 3 worker processes
            url_list = sys.argv[2:]
            print(url_list)
            with Pool(3) as p:
                p_list = []
                for i in url_list:
                    a = p.apply_async(temp.download, args=(i,))
                    p_list.append(a)
                for result in p_list:
                    print(result.get())
        else:
            temp.help_memo()
            print('Invalid arguments!')
    except Exception as e:
        print(e)
        temp.help_memo()


if __name__ == '__main__':
    main()
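A note on filtering hrefs by a substring, since it is easy to get wrong: str.find() returns an index, and -1 when there is no match, so using its result as a boolean gives surprising answers. The sketch below compares that with the in operator. The helper names and sample strings are hypothetical and only serve the comparison.

```python
def has_dot_find(href):
    """Treat the index returned by str.find() as a boolean (the pitfall)."""
    return bool(href.find('.'))


def has_dot_in(href):
    """Membership test: True exactly when '.' occurs in href."""
    return '.' in href


samples = ['index.html', '.hidden', 'about', '']

for s in samples:
    # find() gives 5, 0, -1, -1 for these samples; only 0 is falsy.
    print(repr(s), has_dot_find(s), has_dot_in(s))
```

With find(), a string that starts with a dot is rejected (index 0 is falsy) while strings with no dot at all are accepted (-1 is truthy), which is the opposite of what a "contains a dot" filter should do.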