Crawling CSDN Articles

Originally published at: the personal blog of 喜歡滄月的二福君


title: Crawling CSDN Articles
date: 2019-06-09 13:17:26
tags:
  - CSDN
  - python
category: Technology
---

Plan

Since I recently set up my own blog, I wanted to migrate my CSDN posts over. The one-click migration feature didn't work for me, so I decided to crawl the articles directly and then repost them.

Time budget: 3 hours
Expected result: blog posts saved locally

Implementation Steps

  1. Find the article list page, crawl it, and extract each article's URL.
  2. Parse each article page and extract its content.
  3. Save the content locally.
  4. Try to preserve the article styling as well.

Technologies Used

The project is written in Python, using the pyquery library for crawling and parsing.

Coding

  1. Analyze the article page; the content-extraction code is as follows:

```python
article = doc('.blog-content-box')
# article title
title = article('.title-article').text()
# article content
content = article('.article_content')
```
  2. Save the article:

```python
dir = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
with open(dir, 'a', encoding='utf-8') as file:
    file.write(title + '\n' + content.text())
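One caveat with the save step: the title goes straight into the file path, and Windows forbids characters like `:`, `?`, and `/` in filenames, so a post titled "C++: FAQ?" would fail to save. A small sketch of a sanitizing helper (the name `sanitize_filename` is my own, not part of the original code):

```python
import re

def sanitize_filename(title):
    # Replace characters Windows forbids in filenames with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(sanitize_filename('C++: FAQ?'))  # C++_ FAQ_
```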
  3. Extract the article URLs:

```python
urls = doc('.article-list .content a')
return urls
```
  4. Crawl page by page:

```python
for i in range(3):
    print(i)
    main(offset=i + 1)
```
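Hammering the list pages back-to-back can get a crawler rate-limited or blocked. A hedged sketch of the same loop with a pause between requests (`fetch` stands in for the `main` function above; the helper name is mine):

```python
import time

def crawl_pages(n_pages, fetch, delay=0.1):
    """Call fetch(offset) for each page, pausing between requests."""
    for offset in range(1, n_pages + 1):
        fetch(offset)
        time.sleep(delay)  # be polite to the server

visited = []
crawl_pages(3, visited.append)
print(visited)  # [1, 2, 3]
```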
  5. Put it all together.

Full code:

```python
#!/usr/bin/env python
# _*_ coding: utf-8 _*_
# @Time    : 2019/6/8 11:00 PM
# @Author  : 喜歡二福的滄月君 (necydcy@gmail.com)
# @FileName: CSDN.py
# @Software: PyCharm

import requests
from pyquery import PyQuery as pq


def find_html_content(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    html = requests.get(url, headers=headers).text
    return html


def read_and_writeblog(html):
    doc = pq(html)
    article = doc('.blog-content-box')
    # article title
    title = article('.title-article').text()
    # article content
    content = article('.article_content')
    try:
        dir = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
        with open(dir, 'a', encoding='utf-8') as file:
            file.write(title + '\n' + content.text())
    except Exception:
        print("Failed to save")


def geturls(url):
    content = find_html_content(url)
    doc = pq(content)
    urls = doc('.article-list .content a')
    return urls


def main(offset):
    url = 'blog list URL goes here' + str(offset)
    urls = geturls(url)
    for a in urls.items():
        a_url = a.attr('href')
        print(a_url)
        html = find_html_content(a_url)
        read_and_writeblog(html)


if __name__ == '__main__':
    for i in range(3):
        print(i)
        main(offset=i + 1)
```