title: Scraping CSDN Articles
date: 2019-06-09 13:17:26
tags: html
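I recently set up a new blog and wanted to migrate my CSDN posts to it. The one-click migration feature did not work for me, so I decided to scrape the posts directly and republish them.
Time: 3 hours
Expected result: blog posts saved locally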
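- Find the article list, crawl it, and extract each article's URL.
- Parse each article page and extract its content.
- Save the result locally.
- Try to preserve the article's styling.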
Implemented in Python, using requests to fetch pages and the pyquery library to parse them.

    article = doc('.blog-content-box')
    # Article title
    title = article('.title-article').text()
    # Article content
    content = article('.article_content')

    path = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
    with open(path, 'a', encoding='utf-8') as file:
        file.write(title + '\n' + content.text())
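
One thing to watch when using the title as a file name: titles can contain characters such as `/`, `:` or `?`, which Windows rejects in file names. Below is a minimal sketch of one way to handle that, using only the standard library. The helper names `safe_filename` and `save_article` are hypothetical and not part of the original script, and the demo writes to a temporary folder rather than the `F:/` path.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title, fallback="untitled"):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip(" .")
    return cleaned or fallback


def save_article(out_dir, title, text):
    """Write one article to <out_dir>/<safe title>.txt and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path = out_dir / (safe_filename(title) + ".txt")
    # "w" overwrites, so re-running the scraper does not duplicate content
    path.write_text(title + "\n" + text, encoding="utf-8")
    return path


# Demo in a temporary folder (assumption: any writable folder works the same)
demo_dir = tempfile.mkdtemp()
saved = save_article(demo_dir, "C/C++: what is a pointer?", "Article body")
print(saved.name)
```

Note that this sketch opens the file in overwrite mode, whereas the snippet above appends with `'a'`; which one is preferable depends on whether the scraper is expected to be re-run.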

    urls = doc('.article-list .content a')
    return urls

    for i in range(3):
        print(i)
        main(offset=i + 1)
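
Since `range(3)` yields 0, 1, 2, the loop above passes offsets 1 to 3, so it walks the first three list pages. A small sketch of the same idea with the page URLs built up front is shown here; `BASE_URL` is a made-up placeholder and `page_urls` is a hypothetical helper, neither of which comes from the original script.

```python
# Hypothetical list-page prefix; replace it with the real blog list URL
BASE_URL = "https://example.com/blog/article/list/"


def page_urls(base_url, pages):
    """Return the list-page URLs for pages 1..pages (page numbers start at 1)."""
    return [base_url + str(page) for page in range(1, pages + 1)]


for url in page_urls(BASE_URL, 3):
    print(url)
```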
Full code

    #!/usr/bin/env python
    # _*_coding:utf-8 _*_
    #@Time :2019/6/8 0008 PM 11:00
    #@Author :喜歡二福的滄月君(necydcy@gmail.com)
    #@FileName: CSDN.py
    #@Software: PyCharm
    import requests
    from pyquery import PyQuery as pq


    def find_html_content(url):
        """Fetch a page and return its HTML as text."""
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        html = requests.get(url, headers=headers).text
        return html


    def read_and_wiriteblog(html):
        """Parse one article page and save its title and text to a local file."""
        doc = pq(html)
        article = doc('.blog-content-box')
        # Article title
        title = article('.title-article').text()
        # Article content
        content = article('.article_content')
        try:
            path = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
            with open(path, 'a', encoding='utf-8') as file:
                file.write(title + '\n' + content.text())
        except Exception:
            print("Save failed")


    def geturls(url):
        """Return the article links found on one list page."""
        content = find_html_content(url)
        doc = pq(content)
        urls = doc('.article-list .content a')
        return urls


    def main(offset):
        # Placeholder: put the blog URL here; the page number is appended to it
        url = 'blog URL goes here' + str(offset)
        urls = geturls(url)
        for a in urls.items():
            a_url = a.attr('href')
            print(a_url)
            html = find_html_content(a_url)
            read_and_wiriteblog(html)


    if __name__ == '__main__':
        # Walk list pages 1 to 3
        for i in range(3):
            print(i)
            main(offset=i + 1)