How a Python Backend Engineer with Six Years of Experience Learns to Write APIs in Java (2): Extracter — Scraping, Cleaning, and Storing WeChat Articles

Overview

After settling on the requirements in the previous post (How a Python backend engineer with six years of experience learns to write APIs in Java (1): setting the scene and choosing Dropwizard), the first step is to implement the Extracter module that scrapes WeChat articles. The code lives in pirate.

pirate is built from my Django scaffold, original. The scaffold already provides two file-upload backends (Qiniu and Tencent Cloud) and default deployment config files, so we only need to focus on the WeChat scraping logic itself.

Core tables

pirate/original/extracter/models.py

The WeChat article table. A standard design: large fields are split out into a separate table.

class WXArticle(TimeStampedModel):
    raw_url = models.CharField(max_length=1023)
    title = models.CharField(max_length=255)
    cover = models.CharField(max_length=255)
    description = models.CharField(max_length=255, default='')
    is_active = models.BooleanField(default=False)
    is_delete = models.BooleanField(default=False)


class WXArticleContent(TimeStampedModel):
    article_id = models.IntegerField(unique=True)
    content = models.TextField()
    body_script = models.TextField(default='')

pirate/original/extracter/models.py

The image-upload cache table. The same article is sometimes scraped more than once (for example, after it has been edited and republished), and every scrape uploads the article's images to our CDN. This cache table avoids re-uploading images that have already been uploaded.

class URLFileUploadCache(TimeStampedModel):
    raw_url = models.CharField(blank=True, default='', max_length=512)
    url = models.CharField(blank=True, default='', max_length=255)
    raw_url_md5 = models.CharField(blank=True, default='', max_length=64, db_index=True)
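The cache helpers used later (get_cached_objects, new_cache) are not shown in this excerpt. A minimal in-memory sketch of the idea, assuming the md5 of raw_url serves as the indexed lookup key (the actual Django model methods will query the table instead of a dict):

```python
import hashlib

def url_md5(raw_url):
    # raw_url_md5 carries the db_index, so lookups hit a short indexed
    # column instead of scanning the long raw_url field.
    return hashlib.md5(raw_url.encode('utf-8')).hexdigest()

# Stand-in for the URLFileUploadCache table: raw_url_md5 -> CDN url.
_cache = {}

def get_cached_url(raw_url):
    return _cache.get(url_md5(raw_url))

def new_cache(raw_url, url):
    _cache[url_md5(raw_url)] = url
```

Indexing the fixed-length digest rather than the URL itself also sidesteps index-length limits on long VARCHAR columns in MySQL.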

Scraping

Inspecting a WeChat article page shows that the div with id js_content holds the article body, while the title and cover image live in meta tags. So: fetch the original page with requests, filter out the DOM nodes we need with BeautifulSoup, find the images to replace with regexes, then have upload_handler upload each image to Qiniu and swap the new URL in for the original one.

@classmethod
def spider_url(cls, raw_url):
    assert raw_url.startswith('https://mp.weixin.qq.com/') or raw_url.startswith('http://mp.weixin.qq.com/')
    content = tools.spider_request(raw_url)
    description = content_text = title = cover = ''
    body_script = ''
    if content:
        b = BeautifulSoup(content, 'html.parser')
        content_dom = b.find('div', id='js_content')
        title_dom = b.find('meta', property='og:title')
        cover_dom = b.find('meta', property='og:image')
        if title_dom:
            title = title_dom.attrs['content']
        if cover_dom:
            cover_raw_url = cover_dom.attrs['content']
            cover = cover_raw_url
            upload_cache = URLFileUploadCache.get_cached_objects(cover_raw_url, get_last=True)
            if upload_cache:
                cover = upload_cache[0].url
            else:
                upload_data = tools.upload_file_from_url(cover_raw_url)
                url = upload_data['url']
                if url:
                    URLFileUploadCache.new_cache(cover_raw_url, url)
                    cover = url
        if content_dom:
            content_text = str(content_dom)
            ss = b.body.find_all('script')
            body_script = ''.join(str(s) for s in ss)
            urls = []
            for _re in constants.WEIXIN_IMAGE_RES:
                urls.extend(_re.findall(content_text))
            descriptions = constants.WEIXIN_DESCRIPTION_RE.findall(content)
            if descriptions:
                description = BeautifulSoup(descriptions[0], 'html.parser').meta.attrs.get('content', '')
            mapper = {}
            for image_url in urls:
                if image_url in mapper:
                    continue
                upload_cache = URLFileUploadCache.get_cached_objects(image_url, get_last=True)
                if upload_cache:
                    mapper[image_url] = upload_cache[0].url
                else:
                    upload_data = tools.upload_file_from_url(image_url)
                    url = upload_data['url']
                    if url:
                        URLFileUploadCache.new_cache(image_url, url)
                        mapper[image_url] = url
            for image_url, url in mapper.items():
                if not url:
                    continue
                content_text = content_text.replace(image_url, url)
    return {
        'raw_url': raw_url,
        'title': title,
        'cover': cover,
        'content': content_text,
        'description': description,
        'body_script': body_script,
    }
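WEIXIN_IMAGE_RES and WEIXIN_DESCRIPTION_RE come from constants.py and are not shown in the excerpt. WeChat lazy-loads article images, so the real URL sits in a data-src attribute pointing at mmbiz.qpic.cn. A plausible sketch of one image pattern (the exact regexes in the repo may differ):

```python
import re

# Hypothetical reconstruction of one entry in WEIXIN_IMAGE_RES:
# capture the lazy-loaded image URL from the data-src attribute.
WEIXIN_IMAGE_RES = [
    re.compile(r'data-src="(https?://mmbiz\.qpic\.cn/[^"]+)"'),
]

html = '<img data-src="https://mmbiz.qpic.cn/mmbiz_jpg/abc123/640?wx_fmt=jpeg" />'
urls = []
for _re in WEIXIN_IMAGE_RES:
    urls.extend(_re.findall(html))
```

Each captured URL then goes through the cache-or-upload step above before being substituted back into the article body.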

The upload_handler that uploads files to Qiniu:

POLICY = settings.FILE_CALLBACK_POLICY or {
    'callbackUrl': settings.FILEUPLOAD_CALLBACK_URL,
    'callbackBody': 'bucket=$(bucket)&key=$(key)&filename=$(fname)&filesize=$(fsize)',
    'insertOnly': 1,
}


class UploadHandler(object):

    def __init__(self, key, secret, bucket):
        self.key = key
        self.secret = secret
        self.bucket = bucket
        self.backend = qiniu
        self._backend_auth = Auth(self.key, self.secret)

    def _upload_token(self, key, expires=3600, policy=None):
        token = self._backend_auth.upload_token(self.bucket, key=key, expires=expires, policy=policy)
        return token

    def upload_file(self, key, data, policy=None, fname='file_name'):
        if policy is None:
            policy = POLICY
        uptoken = self._upload_token(key, policy=policy)
        return self.backend.put_data(uptoken, key, data, fname=fname)

    def get_download_url(self, key):
        return '{}{}'.format(settings.FILE_DOWNLOAD_PREFIX, key)
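tools.upload_file_from_url is also not shown; presumably it downloads the image bytes and derives a storage key before delegating to UploadHandler.upload_file. A hedged sketch of one reasonable key-derivation scheme (the repo's actual implementation may differ):

```python
import hashlib
import posixpath
from urllib.parse import urlparse

def storage_key_for(raw_url):
    """Derive a deterministic object key from the source URL, so the
    same image always maps to the same key in the bucket.
    Hypothetical scheme, not taken from the pirate repo."""
    digest = hashlib.md5(raw_url.encode('utf-8')).hexdigest()
    # Keep the original extension (if any) so the CDN can infer a MIME type.
    ext = posixpath.splitext(urlparse(raw_url).path)[1]
    return digest + ext
```

A deterministic key pairs well with the 'insertOnly': 1 upload policy above: re-uploading the same image becomes a no-op on the Qiniu side as well.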

CDN configuration

(screenshot: QQ20200216-234519@2x.png)

  • Create a public storage bucket; its name becomes FILE_UPLOAD_BUCKET.

(screenshot: QQ20200216-234809@2x.png)

  • To make the URLs look nicer, configure a CDN-accelerated domain; just follow the console's steps (this involves some DNS setup). This domain becomes FILE_DOWNLOAD_PREFIX.

(screenshot: image.png)

  • Get the FILE_UPLOAD_KEY and FILE_UPLOAD_SECRET that upload_handler needs from Personal Center → Key Management.

(screenshot: image.png)

Deployment

  • Create the corresponding MySQL database.
  • Create a virtualenv, source it, then pip install requirements/base.txt from the project root.
  • Create config/settings/private_production.py; see the snippet below.
  • Run the database migrations: cd into the original directory and run ./manage.py migrate --settings=config.settings.production
  • Start the project: ./manage.py runserver 0.0.0.0:8976 (any port works).
  • For production you should of course use supervisor + nginx; see the examples in the deploy directory.

    import os

    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': 'dbname',  # change to your database name
            'HOST': os.environ.get('ORIGINAL_MYSQL_HOST', 'localhost'),  # database host
            'USER': os.environ.get('ORIGINAL_MYSQL_USER', 'db_username'),  # change to your database user
            'PASSWORD': os.environ.get('ORIGINAL_MYSQL_PASSWORD', 'db_password'),  # change to your database password
            'PORT': os.environ.get('ORIGINAL_MYSQL_PORT', 3306),
            'OPTIONS': {'charset': 'utf8mb4'},
        }
    }

    FILE_UPLOAD_BACKEND = 'qiniu'
    FILE_UPLOAD_KEY = 'i6fdSECQjLfF'  # change to your qiniu key (this one is fake)
    FILE_UPLOAD_SECRET = 'adfiuerqp'  # change to your qiniu secret
    FILE_UPLOAD_BUCKET = 'reworkdev'  # change to your qiniu bucket
    FILE_CALLBACK_POLICY = {}
    FILE_DOWNLOAD_PREFIX = ''  # change to your host, e.g. http://cdn.myhost.com/

    FILEUPLOAD_CALLBACK_URL = ''  # change to the matching URL on your own host, e.g. https://www.myhost.com/api/v1...
Testing the endpoint with Postman

(screenshot: 4444.png)

Checking the database

(screenshot: QQ20200217-001013@2x.png)

(screenshot: QQ20200217-001037@2x.png)

What's next

With the basic content APIs in place, we can now properly begin learning the Java framework.

todo: How a Python backend engineer with six years of experience learns to write APIs in Java (3): REST API — the first set of APIs in Dropwizard
