Fetching Zhihu Daily with Python and saving each post as a txt file

Date: 2019-11-24


Preface

This is a practice project. It is fairly simple (and has a bug); comments are welcome.

What it does

Fetches the day's Zhihu Daily content and saves each post as its own txt file, with all files collected in one folder named after the current date.
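The one-folder-per-day layout described above could be sketched as follows. This is a minimal Python 3 sketch rather than the author's code: `make_daily_dir` and `post_path` are hypothetical helper names, and the `YYYY-MM-DD` folder format is an assumption, since the post does not say how the date is formatted.

```python
import datetime
import os


def make_daily_dir(base_dir=".", day=None):
    """Create (if needed) and return a folder named after the given date."""
    # Hypothetical helper: defaults to today's date, formatted YYYY-MM-DD
    day = day or datetime.date.today()
    path = os.path.join(base_dir, day.isoformat())
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


def post_path(folder, title):
    """Build the txt path for one post inside the daily folder."""
    return os.path.join(folder, title + ".txt")
```

A saving function could then open `post_path(make_daily_dir(), title)` instead of a bare file name, so that each day's posts stay together.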

Libraries used

re, BeautifulSoup, sys, urllib2

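Notes

1. The runtime environment is Linux with Python 2.7.x. To run it on Windows, adjust the commands in the script accordingly.

2. The bug: when processing the "如何正確吐槽" ("How to Roast Properly") column, only the first entry is fetched (laziness got the better of me).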

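3. Fetching the content directly, as below, does not work, because Zhihu has anti-scraping measures in place:

urllib2.urlopen(url).read()

Adding request headers gets around this.

4. The site zhihudaily.ahorn.me goes down from time to time, so the script will occasionally fail with an error.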

import urllib2

# Fetch a page with browser-like headers so the request is not rejected (note 3)
def getHtml(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
    request = urllib2.Request(url, None, header)
    response = urllib2.urlopen(request)
    text = response.read()
    return text
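Because zhihudaily.ahorn.me is occasionally unavailable (note 4), one option is to retry a failed request a few times before giving up. The following is a minimal Python 3 sketch under stated assumptions, not part of the original script: `fetch_with_retry` is a hypothetical helper, the `fetch` argument stands in for any function that downloads a URL, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error


def fetch_with_retry(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url), retrying on URLError up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except urllib.error.URLError as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay)  # wait before trying again
    # Every attempt failed: re-raise the last error seen
    raise last_error
```

In the Python 2.7 script the corresponding exception is `urllib2.URLError`, and `getHtml` could be passed in as `fetch`.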

5. For content parsing you can use re directly, or call the functions in BeautifulSoup (regular expressions intimidate me, so I went straight to bs), for example:

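from bs4 import BeautifulSoup

def saveText(text):
    soup = BeautifulSoup(text, "html.parser")
    # The post title (first <h2>) becomes the file name
    filename = soup.h2.get_text() + ".txt"
    # The article body lives in <div class="content">
    content = soup.find('div', "content").get_text()
    with open(filename, 'w') as fp:
        fp.write(content.encode('utf-8'))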
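For comparison, the re-based route mentioned in note 5 might look like the sketch below for the title alone. It is written for Python 3 and is only an illustration: `extract_title` is a hypothetical helper, and it assumes the title sits in the first `<h2>` element, as the BeautifulSoup version does. Regular expressions are fragile on real-world HTML, which is a fair reason to prefer BeautifulSoup.

```python
import html
import re

# Non-greedy match of the first <h2>...</h2>; DOTALL lets the title span lines
H2_PATTERN = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_title(page):
    """Return the text of the first <h2> element, or None if there is none."""
    match = H2_PATTERN.search(page)
    if match is None:
        return None
    inner = TAG_PATTERN.sub("", match.group(1))  # drop nested tags
    return html.unescape(inner).strip()  # decode entities such as &amp;
```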

show me the code

#Filename:getZhihu.py
import re
import sys
import urllib2

from bs4 import BeautifulSoup

# Python 2 workaround: make implicit str/unicode conversions use UTF-8
reload(sys)
sys.setdefaultencoding("utf-8")


# Get the HTML of a page, sending browser-like headers
def getHtml(url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
    request = urllib2.Request(url, None, header)
    response = urllib2.urlopen(request)
    text = response.read()
    return text


# Save the content of one post in a txt file named after its title
def saveText(text):
    soup = BeautifulSoup(text, "html.parser")
    filename = soup.h2.get_text() + ".txt"
    content = soup.find('div', "content").get_text()
    with open(filename, 'w') as fp:
        fp.write(content.encode('utf-8'))


# Get the post URLs from zhihudaily.ahorn.me
def getUrl(url):
    html = getHtml(url)
    soup = BeautifulSoup(html, "html.parser")
    urls_page = soup.find('div', "post-body")
    # The pattern has two groups, so findall returns (full_url, 'http') tuples
    urls = re.findall('"((http)://.*?)"', str(urls_page))
    return urls


# main() function
def main():
    page = "http://zhihudaily.ahorn.me"
    urls = getUrl(page)
    for url in urls:
        # url[0] is the full URL from each tuple
        text = getHtml(url[0])
        saveText(text)


if __name__ == "__main__":
    main()
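One detail of the listing worth spelling out: the pattern in `getUrl` contains two capture groups, so `re.findall` returns tuples, which is why `main()` indexes `url[0]`. A single-group pattern returns plain strings instead. The sketch below (Python 3) is one possible variant, not the author's code; `extract_urls` is a hypothetical helper, and accepting `https` links is an addition that the original pattern does not make.

```python
import re

# One capture group, so findall returns plain strings rather than tuples.
# "https?" also accepts https links (an addition, not in the original pattern).
URL_PATTERN = re.compile(r'"(https?://.*?)"')


def extract_urls(fragment):
    """Return every double-quoted http(s) URL found in an HTML fragment."""
    return URL_PATTERN.findall(fragment)
```

With this variant the loop in `main()` would use `url` directly instead of `url[0]`.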