How do you export all of your articles from Jianshu (images included)?

Date: 2020-04-01

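The previous post talked you into leaving Jianshu; this one explains exactly how to do it.

This post shows how to export every article you have on Jianshu, including the images inside them, so that you can use those articles to build your own blog.

First, export the articles you already have on Jianshu. Jianshu has a built-in export feature for this; the steps are: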

After logging in, choose "Settings".

On the settings page, click "Account management" on the left, then choose "Download all articles" on the right.

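That downloads all of your articles, but only their text. Every image you uploaded appears in the files as nothing more than a link.

Open any one of the files and you can see this for yourself.

That clearly won't do: only the text was downloaded, not the images. Don't worry, though. I wrote a small tool in Python that downloads every image in the articles and replaces each image link with a link to the local copy.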

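How it works, in brief

I have written about handling images in Markdown before; see: "Parsing the images in a Markdown document with Python and saving them locally".

This time I handle the Markdown differently. The earlier approach was roundabout, and it only downloaded the images without replacing the image links with local relative paths.

This time I use the mistletoe library to render the Markdown to HTML, which BeautifulSoup then parses into a tree. That saves a step compared with my earlier approach: every image in the Markdown can be read directly and downloaded one by one, and each image link is then replaced via regular-expression matching. The download folder follows Typora's convention: a folder named after the Markdown file with an .assets suffix (post.md gets post.assets).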
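
The link-rewriting idea can be tried on its own before any downloading happens. The sketch below is only an illustration, not the tool's code: it uses a plain regular expression from the standard library instead of mistletoe, and the names IMAGE_LINK and rewrite_image_links, as well as the example.com URL, are made up for the example. Unlike the tool, it keeps the alt text of each image.

```python
import posixpath
import re

# Matches ![alt](url) and captures the alt text and the URL.
IMAGE_LINK = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)[^)]*\)')


def rewrite_image_links(markdown_text: str, md_filename: str):
    """Return (new_text, downloads); downloads maps remote URL -> local path."""
    # Typora-style folder: "my-post.md" -> "my-post.assets"
    assets_dir = posixpath.splitext(posixpath.basename(md_filename))[0] + '.assets'
    downloads = {}

    def replace(match):
        alt, url = match.group(1), match.group(2)
        if not url.startswith(('http://', 'https://')):
            return match.group(0)  # leave local or relative links untouched
        # Drop any query string, then use the last path segment as the file name.
        name = posixpath.basename(url.split('?', 1)[0])
        if '.' not in name:
            name += '.jpg'  # same fallback the article uses for extension-less URLs
        local_path = posixpath.join(assets_dir, name)
        downloads[url] = local_path
        return f'![{alt}]({local_path})'

    return IMAGE_LINK.sub(replace, markdown_text), downloads


# Hypothetical sample document with one remote and one local image.
sample = 'Intro\n![shot](https://example.com/img/abc123?w=1240)\n![local](pics/a.png)\n'
new_text, downloads = rewrite_image_links(sample, 'docs/my-post.md')
print(new_text)
print(downloads)
```

The returned dictionary lists what still has to be fetched, so the rewriting and the downloading can be tested separately.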

The code

The function that downloads and replaces the images:

import os
import re

import mistletoe
from bs4 import BeautifulSoup

# Note: url_suffix and download_pics() are not shown in this excerpt.


def download_and_replace_image(filepath: str):
    print(f'Processing file: {filepath}')
    with open(filepath, 'r', encoding='utf-8') as f:
        file_content = f.read()

    # Render the Markdown to HTML, then collect every <img> tag.
    html = mistletoe.markdown(file_content)
    soup = BeautifulSoup(html, features='html.parser')
    for img in soup.find_all('img'):
        img_url: str = img.get('src') or ''
        if not img_url.startswith(('http://', 'https://')):
            print('Not a valid network image link, skipping')
            continue  # skip this image only; a return here would drop all changes
        img_name = os.path.basename(img_url.replace(url_suffix, ''))
        print(f'Downloading image: {img_name}')
        download_pics(img_url, filepath)
        if '.' in img_name:
            img_base_name = img_name[0:img_name.index('.')]
        else:
            # No extension in the URL: match on the bare name, save as .jpg
            img_base_name = img_name
            img_name += '.jpg'

        assets_dir = os.path.splitext(os.path.basename(filepath))[0] + '.assets'
        img_relative_path = os.path.join(assets_dir, img_name)
        print(f'Replacing image link: {img_url} with {img_relative_path}')

        # A function replacement keeps backslashes in Windows paths literal.
        pattern = f"!\\[.*?\\]\\((.*?){re.escape(img_base_name)}(.*?)\\)"
        file_content = re.sub(pattern, lambda _m: f'![]({img_relative_path})', file_content)

    with open(filepath, 'w', encoding='utf-8') as f:
        print(f'Writing changes to file: {filepath}')
        f.write(file_content)
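
The function above calls download_pics(), which is not shown in this post. As a rough idea of what such a helper might look like, here is a sketch that uses only the standard library. The name download_image_sketch is hypothetical, and it assumes that any size or format suffix on the image URL is a query string, so the URL path alone yields the file name.

```python
import os
import urllib.parse
import urllib.request


def download_image_sketch(img_url: str, md_filepath: str) -> str:
    """Download img_url into '<markdown name>.assets/' next to md_filepath.

    Hypothetical stand-in for the article's download_pics(), which is not shown.
    """
    assets_dir = os.path.splitext(md_filepath)[0] + '.assets'
    os.makedirs(assets_dir, exist_ok=True)

    # Assumption: any size/format suffix is a query string, so the URL path
    # alone gives the file name.
    url_path = urllib.parse.urlsplit(img_url).path
    img_name = os.path.basename(urllib.parse.unquote(url_path))
    if '.' not in img_name:
        img_name += '.jpg'  # mirror the article's fallback extension

    target = os.path.join(assets_dir, img_name)
    with urllib.request.urlopen(img_url, timeout=30) as response:
        data = response.read()
    with open(target, 'wb') as out:
        out.write(data)
    return target
```

Calling download_image_sketch(img_url, filepath) for a file docs/post.md would place the image under docs/post.assets/, which matches the relative path written back into the Markdown.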

Paired with a thread pool to speed things up:

import os

import threadpool


def run():
    print('Processing.')

    work_path = os.path.join('.', 'docs')

    pool = threadpool.ThreadPool(4)
    args = []

    # Collect the absolute path of every Markdown file under ./docs
    for root, dirs, files in os.walk(work_path):
        for filename in files:
            if filename.endswith('.md'):
                filepath = os.path.abspath(os.path.join(root, filename))
                args.append(filepath)

    tasks = threadpool.makeRequests(download_and_replace_image, args)
    for task in tasks:
        pool.putRequest(task)

    print('=> Thread pool started')
    pool.wait()
    print('All tasks finished.')
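
If you would rather not install the third-party threadpool package, the same fan-out can probably be done with the standard library. The sketch below is one possible variant, not the tool's code; find_markdown_files and run_with_executor are names invented for the example, and the worker is passed in as a parameter so that the function can be tried with any callable.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def find_markdown_files(work_path: str) -> list:
    """Collect the absolute path of every .md file below work_path."""
    found = []
    for root, _dirs, files in os.walk(work_path):
        for filename in files:
            if filename.endswith('.md'):
                found.append(os.path.abspath(os.path.join(root, filename)))
    return sorted(found)


def run_with_executor(worker, work_path: str, max_workers: int = 4) -> list:
    """Apply worker to every Markdown file using a pool of threads."""
    paths = find_markdown_files(work_path)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces iteration, so an exception raised in a worker surfaces here.
        return list(executor.map(worker, paths))
```

With the function from this post, the call would be run_with_executor(download_and_replace_image, os.path.join('.', 'docs')). Each worker touches a different file, so the threads do not need to coordinate with each other.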