A Beginner's Notes on Building a Site with a Python Crawler: Creating a Scraper Site from Scratch (Part 3: Crawling and Storing)

Date: 2019-12-04

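Last time, I got the crawlers roughly written.
I wrote one content spider, plus one spider that collects the links to the content listed under a tag.
One is still missing: a spider that collects the full list of tags. I won't cover that here, though, because I forgot it last time and don't feel like doing it this time.
There is another reason too: for a real crawl, I could simply use http://segmentfault.com/questions/newest?page=1 to get all the questions and crawl them one by one.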
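
As a rough sketch of that idea, the listing URLs could be generated page by page. The names `newest_page_urls` and `max_pages` are made up for illustration and are not part of the spiders from the previous post; I'm also assuming the `page` parameter simply counts up from 1, as the URL above suggests.

```python
# Sketch only: build the "newest questions" listing URLs page by page.
# newest_page_urls and max_pages are hypothetical names for illustration.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

BASE_URL = 'http://segmentfault.com/questions/newest'


def newest_page_urls(max_pages):
    """Return the listing URLs for pages 1..max_pages."""
    return ['%s?%s' % (BASE_URL, urlencode({'page': page}))
            for page in range(1, max_pages + 1)]


for url in newest_page_urls(3):
    print(url)
```

Each listing page would then be handed to a spider that pulls out the question links, much like the tag spider does for a single tag.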

On to the main topic.

Part 3: crawling and storing to the database.

3.1 Defining the database (or model, or schema)

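To store anything, I need to define a database structure in Django. (I won't get into NoSQL here, or MongoDB, which is also NoSQL but feels a lot like a relational database.)
Remember the app named web? It contains a file called models.py, and that is what I'm going to edit now.

vim ~/python_spider/web/models.py

The contents are as follows: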

# -*- coding: utf-8 -*-
from django.db import models

# Create your models here.


class Tag(models.Model):
    title = models.CharField(max_length=30)

    def __unicode__(self):
        return self.title


class Question(models.Model):
    title = models.CharField(max_length=255)
    content = models.TextField()
    tags = models.ManyToManyField(Tag, related_name='questions')
    sf_id = models.CharField(max_length=16, default='0')  # remembers where the question lives on sf, handy for later updates or other operations
    update_date = models.DateTimeField(auto_now=True)

    def __unicode__(self):
        return self.title


class Answer(models.Model):
    question = models.ForeignKey(Question, related_name='answers')
    content = models.TextField()

    def __unicode__(self):
        return 'To question %s' % self.question.title

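All of this is straightforward; see the Django documentation for details on each field.

Next, I need to tell my python_spider project to load the web app at runtime (a project does not automatically load the apps inside it).

vim ~/python_spider/python_spider/settings.py

Add web to INSTALLED_APPS:

INSTALLED_APPS = (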
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'web',
)

Now Django can generate the database schema automatically:

cd ~/python_spider
python manage.py makemigrations
python manage.py migrate

My ~/python_spider directory now contains a db.sqlite3 file. This is my database.
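
To get a feel for what ends up inside that file, here is a hand-written approximation of the tables using only the standard sqlite3 module. The table and column names (web_tag, web_question, web_question_tags, web_answer) follow Django's usual app_model naming as far as I know, but treat them as an assumption: the real DDL generated by the migration may differ in detail, and `python manage.py sqlmigrate web 0001` should show the actual statements.

```python
import sqlite3

# In-memory database, so this sketch never touches the real db.sqlite3.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE web_tag (id INTEGER PRIMARY KEY, title VARCHAR(30));
CREATE TABLE web_question (
    id INTEGER PRIMARY KEY,
    title VARCHAR(255),
    content TEXT,
    sf_id VARCHAR(16) DEFAULT '0',
    update_date DATETIME
);
-- ManyToManyField: a link table with one row per (question, tag) pair.
CREATE TABLE web_question_tags (
    id INTEGER PRIMARY KEY,
    question_id INTEGER REFERENCES web_question (id),
    tag_id INTEGER REFERENCES web_tag (id),
    UNIQUE (question_id, tag_id)
);
-- ForeignKey: each answer points at exactly one question.
CREATE TABLE web_answer (
    id INTEGER PRIMARY KEY,
    content TEXT,
    question_id INTEGER REFERENCES web_question (id)
);
""")

conn.execute("INSERT INTO web_tag (id, title) VALUES (1, 'wechat')")
conn.execute(
    "INSERT INTO web_question (id, title, content) VALUES (1, 'q1', 'body')")
conn.execute(
    "INSERT INTO web_question_tags (question_id, tag_id) VALUES (1, 1)")

# Roughly what tag.questions.all() has to do: join through the link table.
rows = conn.execute("""
    SELECT q.title FROM web_question q
    JOIN web_question_tags qt ON qt.question_id = q.id
    WHERE qt.tag_id = ?
""", (1,)).fetchall()
titles = [row[0] for row in rows]
print(titles)
```

The point is only that the many-to-many relation needs a third table, while the foreign key is just an extra column on the answer table.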

Now let me play with the models a bit:

>>> from web.models import Answer, Question, Tag
>>> tag = Tag()
>>> tag.title = u'測試標籤'
>>> tag.save()
>>> tag
<Tag: 測試標籤>
>>> question = Question(title=u'測試提問', content=u'提問內容')
>>> question.save()
>>> question.tags.add(tag)
>>> question.save()
>>> answer = Answer(content=u'回答內容', question=question)
>>> answer.save()
>>> tag.questions.all() # find questions by tag
[<Question: 測試提問>]
>>> question.tags.all() # get the question's tags
[<Tag: 測試標籤>]
>>> question.answers.all() # get the question's answers
[<Answer: To question 測試提問>]

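The results above look right, which shows the models I defined are usable.

3.2 Storing the data

Next, I need to store the scraped information. Put plainly, that means using Django's ORM to save what my own spiders collect into the database Django is connected to, so that Django can read it back later when I build the site.

There are many ways to do this; here is just one. I create a spider.py inside the web app and define two spiders in it that inherit from the spiders I wrote earlier and add methods for saving to the database.

vim ~/python_spider/web/spider.py

The code is as follows:

# -*- coding: utf-8 -*-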
from sfspider import spider
from web.models import Answer, Question, Tag


class ContentSpider(spider.SegmentfaultQuestionSpider):

    def save(self): # add a save() method
        sf_id = self.url.split('/')[-1] # 1
        tags = [Tag.objects.get_or_create(title=tag_title)[0] for tag_title in self.tags] # 2
        question, created = Question.objects.get_or_create(
            sf_id=sf_id,
            defaults={'title':self.title, 'content':self.content}
        ) # 3
        question.tags.add(*tags) # 4
        question.save()
        for answer in self.answers:
            Answer.objects.get_or_create(content=answer, question=question)
        return question, created


class TagSpider(spider.SegmentfaultTagSpider):

    def crawl(self): # crawl the current page
        sf_ids = [url.split('/')[-1] for url in self.questions]
        for sf_id in sf_ids:
            question, created = ContentSpider(sf_id).save()

    def crawl_all_pages(self):
        while True:
            print u'正在抓取TAG:%s, 分頁:%s' % (self.tag_name, self.page) # 5
            self.crawl()
            if not self.has_next_page:
                break
            else:
                self.next_page()

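1. This part is clumsy; I should have added this attribute to SegmentfaultQuestionSpider in the first place.

2. Create or fetch the question's tags.

3. Create or fetch the question, using sf_id to avoid duplicates.

4. Add all the tags to the question. The * is there because this method takes its arguments as (tag1, tag2, tag3), but our tags is a list.

5. Handy for watching progress while testing.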
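
The ideas in these notes can be tried without a database. Below is a minimal sketch: `MemoryStore` is a made-up in-memory stand-in for `Question.objects` (it is not Django code), the URL is invented, and `extract_sf_id` additionally strips a trailing slash, which the original one-liner does not do.

```python
def extract_sf_id(url):
    """Take the last path segment of a question URL as its sf_id (note 1)."""
    return url.rstrip('/').split('/')[-1]


class MemoryStore(object):
    """Hypothetical stand-in for Question.objects, keyed by sf_id."""

    def __init__(self):
        self.rows = {}

    def get_or_create(self, sf_id, defaults=None):
        # Same contract as the ORM call: return (object, created).
        # defaults are only used when a new row has to be created (note 3).
        if sf_id in self.rows:
            return self.rows[sf_id], False
        row = dict(defaults or {}, sf_id=sf_id, tags=set())
        self.rows[sf_id] = row
        return row, True


def add_tags(question, *tags):
    """Accept tags as separate positional arguments, like tags.add()."""
    question['tags'].update(tags)


store = MemoryStore()
url = 'http://segmentfault.com/q/1010000000000001/'  # made-up URL
sf_id = extract_sf_id(url)

first, created_first = store.get_or_create(sf_id, defaults={'title': 'demo'})
again, created_again = store.get_or_create(sf_id, defaults={'title': 'other'})

tags = ['wechat', 'openid']
add_tags(first, *tags)  # same as add_tags(first, 'wechat', 'openid') (note 4)
print(created_first, created_again, len(store.rows))
```

Crawling the same question twice therefore leaves a single row, which is the reason for looking it up by sf_id rather than by title.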

Then, let's test our storing script:

python manage.py shell

>>> from web.spider import TagSpider
>>> t = TagSpider(u'微信')
>>> t.crawl_all_pages()
正在抓取TAG:微信, 分頁:1
正在抓取TAG:微信, 分頁:2
正在抓取TAG:微信, 分頁:3
KeyboardInterrupt # interrupted with Control-C; a quick test is enough :)
>>> from web.models import Tag, Question
>>> Question.objects.all()
[<Question: 測試提問>, <Question: 微信支付獲取prepayid，返回簽名不匹配，>, <Question: 微信js怎麼獲取openID的>, <Question: 微信支付時加入attach參數提示簽名錯誤>, <Question: 微信支付JSAPI調用返回fail_invalid_appid>, <Question: 微信消息鏈接打開  和  掃碼打開鏈接有什麼區別>, <Question: django作微信開發後臺時沒法返回response>, <Question: 微信端內置瀏覽器對canvas的支持有問題>, <Question: 分享到微信朋友圈的網頁爲何點開直接跳至頁尾？>, <Question: 微信支付開發：發起微信支付的時候，報錯：invalid signature>, <Question: 前端加密代碼有什麼好辦法不被破解>, <Question: 有沒有桌面移動一體化網站發佈方案,有市場嗎?>, <Question: 微信如何獲取用戶的頭像>, <Question: 從新設置微信自定義菜單 手機端沒有顯示該菜單>, <Question: 如何在用戶輸入關鍵字時自動回覆圖片，一張總體圖。>, <Question: 手機圖片上傳是倒着的>, <Question: 微信內網頁上傳圖片問題>, <Question: 如何轉碼微信多媒體下載接口的音頻文件？>, <Question: 微信開放平臺建立應用時，不能上傳應用圖片>, <Question: 微信頁面中，怎麼打開已安裝的app？>, '...(remaining elements truncated)...']
>>> Question.objects.get(pk=5).tags.all() # tags of the question with id=5 in the database
[<Tag: 微信>, <Tag: 微信公衆平臺>, <Tag: 微信js-sdk>, <Tag: openid>]
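
Instead of pressing Control-C, a test run could also be capped at a fixed number of pages. The following is only a sketch of that idea: `FakeTagSpider` is a made-up stand-in that exposes the same `page`, `has_next_page`, `next_page()` and `crawl()` members the loop in `crawl_all_pages()` relies on, and `max_pages` is a hypothetical parameter, not something the real spider has.

```python
class FakeTagSpider(object):
    """Hypothetical stand-in with the members crawl_all_pages() uses."""

    def __init__(self, total_pages):
        self.page = 1
        self.total_pages = total_pages
        self.crawled = []

    @property
    def has_next_page(self):
        return self.page < self.total_pages

    def next_page(self):
        self.page += 1

    def crawl(self):
        # The real spider would save every question on the current page.
        self.crawled.append(self.page)


def crawl_pages(spider, max_pages=None):
    """Crawl page by page; stop at the last page or after max_pages pages."""
    done = 0
    while True:
        spider.crawl()
        done += 1
        if max_pages is not None and done >= max_pages:
            break
        if not spider.has_next_page:
            break
        spider.next_page()
    return done


spider = FakeTagSpider(total_pages=10)
pages_done = crawl_pages(spider, max_pages=3)
print(pages_done, spider.crawled)
```

With `max_pages=None` the loop behaves like the original and runs until there is no next page.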

3.3 Setting up django.contrib.admin to view and edit content

To look at the scraped data more directly, I can use the admin that ships with Django.
Edit the file:

vim ~/python_spider/web/admin.py

from django.contrib import admin
from web.models import Tag, Question, Answer

admin.site.register(Tag)
admin.site.register(Question)
admin.site.register(Answer)

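Then create a superuser:

python manage.py createsuperuser # follow the prompts

Start the development server:

python manage.py runserver 0.0.0.0:80 # I'm on runabove; locally, plain manage.py runserver is enough

Then I visit http://192.99.71.91/admin/, log in with the account I just created, and can view and edit the content.

OK, that's it for today. The next post covers writing Django views and applying a simple template to build the site.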
