After settling the requirements in the previous post (How a Python backend developer with six years of experience learns to write APIs in Java (1): setting the scene and choosing the Dropwizard framework), the first step is to implement the Extracter module that crawls WeChat articles. The code is named pirate.
pirate is built on my Django scaffold, original, which already ships two file-upload backends (Qiniu and Tencent Cloud) and default deployment config files, so all that's left is the WeChat-specific crawling logic.
pirate/original/extracter/models.py
The WeChat article tables follow a standard design, with the large text fields split out into a separate table.
```python
class WXArticle(TimeStampedModel):
    raw_url = models.CharField(max_length=1023)
    title = models.CharField(max_length=255)
    cover = models.CharField(max_length=255)
    description = models.CharField(max_length=255, default='')
    is_active = models.BooleanField(default=False)
    is_delete = models.BooleanField(default=False)


class WXArticleContent(TimeStampedModel):
    article_id = models.IntegerField(unique=True)
    content = models.TextField()
    body_script = models.TextField(default='')
```
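Because the large fields live in a separate table keyed by article_id, reading a full article takes a lookup in each table. A minimal sketch of a read helper under that assumption (get_full_article is not part of the original code, just an illustration):

```python
def get_full_article(article_id):
    """Stitch an article row and its content row back together.

    Hypothetical helper for illustration; the original project may expose
    this differently.
    """
    article = WXArticle.objects.get(pk=article_id, is_delete=False)
    try:
        detail = WXArticleContent.objects.get(article_id=article.id)
    except WXArticleContent.DoesNotExist:
        detail = None
    return {
        'title': article.title,
        'cover': article.cover,
        'description': article.description,
        'content': detail.content if detail else '',
        'body_script': detail.body_script if detail else '',
    }
```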
pirate/original/extracter/models.py
The image-upload cache table: sometimes the same article is crawled more than once (for example, it has to be re-crawled after being edited), and every crawl uploads the images inside it to our CDN, so an upload cache table is added to cut down on duplicate image uploads.
```python
class URLFileUploadCache(TimeStampedModel):
    raw_url = models.CharField(blank=True, default='', max_length=512)
    url = models.CharField(blank=True, default='', max_length=255)
    raw_url_md5 = models.CharField(blank=True, default='', max_length=64, db_index=True)
```
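The crawler below relies on two classmethods of this model, get_cached_objects and new_cache, which are not shown in the excerpt. A rough sketch of what they might look like, assuming the lookup goes through the indexed raw_url_md5 column:

```python
import hashlib


class URLFileUploadCache(TimeStampedModel):
    # ... fields as in the excerpt above ...

    @classmethod
    def get_cached_objects(cls, raw_url, get_last=False):
        # look up earlier uploads of this source URL via the indexed md5 column
        raw_url_md5 = hashlib.md5(raw_url.encode('utf-8')).hexdigest()
        qs = cls.objects.filter(raw_url_md5=raw_url_md5)
        return qs.order_by('-id') if get_last else qs

    @classmethod
    def new_cache(cls, raw_url, url):
        # remember that raw_url is already available on our CDN as url
        raw_url_md5 = hashlib.md5(raw_url.encode('utf-8')).hexdigest()
        return cls.objects.create(raw_url=raw_url, url=url, raw_url_md5=raw_url_md5)
```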
Inspecting a WeChat article shows that the div with id=js_content holds the article body, while the title and cover image live in the meta tags. So the flow is: fetch the original page with requests, filter the DOM we need with BeautifulSoup, find the images to replace with regular expressions, then have the upload_handler (using the Qiniu backend) upload the images and swap in the new addresses for the originals.
```python
@classmethod
def spider_url(cls, raw_url):
    # only WeChat article URLs are accepted
    assert raw_url.startswith('https://mp.weixin.qq.com/') or raw_url.startswith('http://mp.weixin.qq.com/')
    content = tools.spider_request(raw_url)
    description = content_text = title = cover = ''
    body_script = ''
    if content:
        b = BeautifulSoup(content, 'html.parser')
        # the article body sits in <div id="js_content">, title/cover in the meta tags
        content_dom = b.find('div', id='js_content')
        title_dom = b.find('meta', property="og:title")
        cover_dome = b.find('meta', property="og:image")
        if title_dom:
            title = title_dom.attrs['content']
        if cover_dome:
            cover = cover_dome.attrs['content']
            raw_url = cover
            # re-use a cached upload of the cover if we have one, otherwise upload it
            upload_cache = URLFileUploadCache.get_cached_objects(raw_url, get_last=True)
            if upload_cache:
                cover = upload_cache[0].url
            else:
                upload_data = tools.upload_file_from_url(raw_url)
                url = upload_data['url']
                if url:
                    URLFileUploadCache.new_cache(raw_url, url)
                    cover = url
        if content_dom:
            content_text = unicode(content_dom)
            ss = b.body.find_all('script')
            body_script = ''.join(unicode(s) for s in ss)
            # collect every image URL in the body and upload the ones not cached yet
            urls = []
            for _re in constants.WEIXIN_IMAGE_RES:
                urls.extend(_re.findall(content_text))
            descriptions = constants.WEIXIN_DESCRIPTION_RE.findall(content)
            if descriptions:
                description = BeautifulSoup(descriptions[0]).meta.attrs.get('content', '')
            mapper = {}
            for raw_url in urls:
                if raw_url in mapper:
                    continue
                upload_cache = URLFileUploadCache.get_cached_objects(raw_url, get_last=True)
                if upload_cache:
                    mapper[raw_url] = upload_cache[0].url
                else:
                    upload_data = tools.upload_file_from_url(raw_url)
                    url = upload_data['url']
                    if url:
                        URLFileUploadCache.new_cache(raw_url, url)
                        mapper[raw_url] = url
            # swap the original image addresses for the CDN ones
            for raw_url, url in mapper.iteritems():
                if not url:
                    continue
                content_text = content_text.replace(raw_url, url)
    return {
        'raw_url': raw_url,
        'title': title,
        'cover': cover,
        'content': content_text,
        'description': description,
        'body_script': body_script,
    }
```
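spider_url only returns a dict; the post does not show how the result is persisted. One plausible way to write it into the two tables above (a sketch; the save_spidered_article helper and the exact field mapping are my assumption):

```python
def save_spidered_article(raw_url):
    # crawl the page, then store the result in the split tables
    data = WXArticle.spider_url(raw_url)
    article = WXArticle.objects.create(
        raw_url=raw_url,
        title=data['title'],
        cover=data['cover'],
        description=data['description'],
    )
    WXArticleContent.objects.create(
        article_id=article.id,
        content=data['content'],
        body_script=data['body_script'],
    )
    return article
```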
The upload_handler that pushes files to Qiniu:
```python
POLICY = settings.FILE_CALLBACK_POLICY or {
    'callbackUrl': settings.FILEUPLOAD_CALLBACK_URL,
    'callbackBody': 'bucket=$(bucket)&key=$(key)&filename=$(fname)&filesize=$(fsize)',
    'insertOnly': 1,
}


class UploadHandler(object):

    def __init__(self, key, secret, bucket):
        self.key = key
        self.secret = secret
        self.bucket = bucket
        self.backend = qiniu
        self._backend_auth = Auth(self.key, self.secret)

    def _upload_token(self, key, expires=3600, policy=None):
        token = self._backend_auth.upload_token(self.bucket, key=key, expires=expires, policy=policy)
        return token

    def upload_file(self, key, data, policy=None, fname='file_name'):
        if policy is None:
            policy = POLICY
        uptoken = self._upload_token(key, policy=policy)
        return self.backend.put_data(uptoken, key, data, fname=fname)

    def get_download_url(self, key):
        return '{}{}'.format(settings.FILE_DOWNLOAD_PREFIX, key)
```
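tools.upload_file_from_url, which the crawler calls for every image, is not shown either; presumably it downloads the image and hands the bytes to this handler. A hedged sketch under that assumption (the key scheme and return shape are guesses, kept consistent with how spider_url reads upload_data['url']):

```python
import hashlib

import requests
from django.conf import settings


def upload_file_from_url(raw_url, handler=None):
    """Download raw_url and push it to the CDN via UploadHandler.

    Sketch only: the real tools.upload_file_from_url may differ.
    """
    if handler is None:
        handler = UploadHandler(settings.FILE_UPLOAD_KEY,
                                settings.FILE_UPLOAD_SECRET,
                                settings.FILE_UPLOAD_BUCKET)
    resp = requests.get(raw_url, timeout=10)
    if resp.status_code != 200:
        return {'url': ''}
    # use the md5 of the source URL as the object key so repeated uploads are idempotent
    key = hashlib.md5(raw_url.encode('utf-8')).hexdigest()
    handler.upload_file(key, resp.content)
    return {'url': handler.get_download_url(key)}
```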
For deployment, supervisor + nginx is of course recommended; see the examples under the deploy directory. The local settings that need filling in look like this:
```python
import os

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'dbname',  # change to your database name
        'HOST': os.environ.get('ORIGINAL_MYSQL_HOST', 'localhost'),  # database host
        'USER': os.environ.get('ORIGINAL_MYSQL_USER', 'db_username'),  # change to your database user
        'PASSWORD': os.environ.get('ORIGINAL_MYSQL_PASSWORD', 'db_password'),  # change to your database password
        'PORT': os.environ.get('ORIGINAL_MYSQL_PORT', 3306),
        'OPTIONS': {'charset': 'utf8mb4'},
    }
}

FILE_UPLOAD_BACKEND = 'qiniu'
FILE_UPLOAD_KEY = 'i6fdSECQjLfF'  # change to your qiniu key (this one is fake)
FILE_UPLOAD_SECRET = 'adfiuerqp'  # change to your qiniu secret
FILE_UPLOAD_BUCKET = 'reworkdev'  # change to your qiniu bucket
FILE_CALLBACK_POLICY = {}
FILE_DOWNLOAD_PREFIX = ''  # change to your host, e.g. http://cdn.myhost.com/
FILEUPLOAD_CALLBACK_URL = ''  # change to the matching URL on your own host, e.g. https://www.myhost.com/api/v1...
```
Check the database to confirm the crawled articles and the rewritten image URLs actually landed in the tables.
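One quick way is the Django shell (python manage.py shell); the import path below is a guess based on the repo layout:

```python
from extracter.models import WXArticle, WXArticleContent

article = WXArticle.objects.order_by('-id').first()
print(article.title)
print(article.cover)  # should now point at your CDN host
print(WXArticleContent.objects.get(article_id=article.id).content[:100])
```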
With the basic content APIs ready, the real study of the Java framework can officially begin.
todo: How a Python backend developer with six years of experience learns to write APIs in Java (3): REST API, the first group of Dropwizard APIs