python-75: BS4 Example 1 Source Code

Date: 2019-11-15

The final source code that implements all of our features looks like this:

#!/usr/bin/env python
# -*- coding:UTF-8 -*-
__author__ = '217小月月坑'
 
'''
Final source code for Example 1
'''
 
from bs4 import BeautifulSoup
import urllib2
# deal with the coding error
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )

url = 'http://beautifulsoup.readthedocs.org/zh_CN/latest/#'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
contents = response.read()
soup = BeautifulSoup(contents)
# get the title
title = soup.title.string
# get the text
result = soup.find(itemprop="articleBody")
for i in result.find_all(attrs={"class": "headerlink"}):
    i.clear()
# write to a file
path = '/home/ym/'+title
f = open (path,"w+")
f.write(result.get_text())
f.close()
print "done!!"
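
The listing above targets Python 2, where `urllib2` and `sys.setdefaultencoding` exist. As a rough sketch only, the same flow might look like this in Python 3, where `urllib2` was folded into `urllib.request`. The function names (`fetch`, `extract_article`, `save_text`), the explicit `'html.parser'` argument, and the replacement of path separators in the title are assumptions added for illustration; they are not part of the original script.

```python
# Python 3 sketch of the script above; all function names are illustrative.
import os
import re
import urllib.request

from bs4 import BeautifulSoup

URL = 'http://beautifulsoup.readthedocs.org/zh_CN/latest/#'


def fetch(url):
    """Download the raw bytes of a page (urllib2 became urllib.request)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_article(html):
    """Return (title, text) for a page whose body has itemprop="articleBody"."""
    # Assumption: name the parser explicitly so results do not depend on
    # which optional parsers happen to be installed.
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string
    body = soup.find(itemprop='articleBody')
    # Empty every header link so its marker character is not in the text.
    for link in body.find_all(attrs={'class': 'headerlink'}):
        link.clear()
    return title, body.get_text()


def save_text(directory, title, text):
    """Write text to directory/<title> as UTF-8 and return the path."""
    # Assumption: a title may contain '/' or '\\', which would break the path.
    safe_name = re.sub(r'[\\/]', '_', title.strip())
    path = os.path.join(directory, safe_name)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path
```

With these helpers, the original flow would be `title, text = extract_article(fetch(URL))` followed by `save_text('/home/ym/', title, text)`. Opening the file with an explicit `encoding='utf-8'` takes the place of the `reload(sys)` / `setdefaultencoding` workaround, which Python 3 does not have.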

The result written to the file looks like this:

[Screenshot: contents of the output file]

And that's it: this example is finished. Let's review the steps we went through:

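1. Use urllib2 to download the page source so that it can be analyzed later.
2. Use BS4 to extract the content we want.
3. The content obtained in step 2 contains characters we do not want, so we use BS4's methods for removing content from the document tree to delete those special characters.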
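
Steps 2 and 3 can be tried without any network access. The following is a minimal sketch, assuming Python 3 and a small hand-written HTML fragment that stands in for the downloaded page (the fragment and the variable names `before` and `after` are made up for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (step 1 is skipped here).
html = """
<div itemprop="articleBody">
<h1>Quick Start<a class="headerlink" href="#quick-start">¶</a></h1>
<p>Find all the links.</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Step 2: search the tree for the part we want.
body = soup.find(itemprop='articleBody')
before = body.get_text()

# Step 3: empty every header link so its marker character disappears.
for link in body.find_all(attrs={'class': 'headerlink'}):
    link.clear()
after = body.get_text()

print(repr(before))  # still contains the header-link marker
print(repr(after))   # marker removed, remaining text unchanged
```

Note that `clear()` only empties a tag: the `<a class="headerlink">` element itself stays in the tree. If the element should go as well, BS4's `decompose()` removes the tag together with its contents.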

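Throughout this process we learned how BS4 searches the document tree, a topic that plays a very important role in web scraping. We also learned part of how BS4 modifies the document tree, such as deleting content from it, along with some of BS4's output and error handling. What we learned while building this example already covers about half of the BS4 documentation; the rest can be picked up gradually as we need it.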