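Tinkering Series: Uploading Large Files from the Front End

One day, while browsing a certain developer community, I came across an article on uploading large files from the front end. I had studied similar ideas before but never built one myself, so my understanding always felt a bit hollow. Recently I spent some time and carefully prepared (read: stayed up late grinding out) an example to share with you.

Code for this article: GitHub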

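The problem

Knowing the time available to provide a response can avoid problems with timeouts. Current implementations select times between 30 and 120 seconds.

tools.ietf.org/id/draft-th…

If a file is too large, such as audio or video data or a downloaded Excel spreadsheet, and the upload waits more than 30 to 120 seconds without the server returning any data, the request may be treated as timed out, and the upload is interrupted.

Another problem: during a large upload, if a server or network failure interrupts the transfer or causes a timeout, the data already uploaded is not saved, so that part of the upload is wasted.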

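How it works

Large file upload relies on chunking: split one large file into several small pieces, upload them separately, and once all the pieces have been uploaded, notify the server to merge them. That completes the upload.

This approach solves several problems:

Request timeouts caused by an oversized file
One request is split into several (mainstream browsers generally allow 6 concurrent requests per origin by default), which increases concurrency and speeds up the transfer
Small pieces are easy for the server to store; if the network drops, pieces that were already uploaded need not be uploaded again next time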
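
To make the concurrency point concrete, here is a minimal sketch of a promise pool that keeps at most a fixed number of chunk uploads in flight. The name runWithLimit and the default of 6 are assumptions for illustration and are not taken from the original project; each task is a function that returns a promise, for example one chunk upload.

```javascript
// Hypothetical helper: run async task functions with at most `limit` in flight.
// Results come back in the same order as `tasks`.
async function runWithLimit(tasks, limit = 6) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

If any task rejects, the returned promise rejects as well, so the caller can decide whether to retry or stop.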

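Implementation

File chunking

The File interface is based on Blob, so we can split the uploaded file object with the slice method. The implementation is as follows: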

// CHUNK_SIZE (bytes per chunk) is assumed to be defined elsewhere in the module
export const slice = (file, piece = CHUNK_SIZE) => {
  return new Promise((resolve, reject) => {
    const totalSize = file.size;
    const chunks = [];
    const blobSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
    let start = 0;

    while (start < totalSize) {
      // Clamp the last chunk to the end of the file
      const end = Math.min(start + piece, totalSize);
      const chunk = blobSlice.call(file, start, end);
      chunks.push(chunk);

      start = end;
    }

    resolve(chunks);
  });
};

Then upload each small piece as form data:

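_chunkUploadTask(chunks) {
    // Build one request per chunk; returning inside a loop would upload only the first chunk
    const tasks = chunks.map((chunk) => {
        const fd = new FormData();
        fd.append('chunk', chunk);

        return axios({
          url: '/upload',
          method: 'post',
          data: fd,
        }).then((res) => res.data);
    });

    // Resolves once every chunk has been uploaded; rejects if any upload fails
    return Promise.all(tasks);
}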

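The back end uses express, and files are received with the [multer](https://github.com/expressjs/multer) library.

multer offers the upload modes single, array, fields, none, and any. For a single-file upload, either single or array works and both are simple to use; the uploaded file's information is available through req.file or req.files.

In addition, disk storage is needed to customize the name of the stored file, so that every uploaded chunk file has a unique name.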

const storage = multer.diskStorage({
  destination: uploadTmp,
  filename: (req, file, cb) => {
    // Set the stored file name; if omitted, multer generates a random one
    cb(null, file.fieldname);
  },
});
const multerUpload = multer({ storage });

// router
router.post('/upload', multerUpload.any(), uploadService.uploadChunk);

// service
uploadChunk: async (req, res) => {
  const file = req.files[0];
  const chunkName = file.filename;

  try {
    const checksum = req.body.checksum;
    const chunkId = req.body.chunkId;

    const message = Messages.success(modules.UPLOAD, actions.UPLOAD, chunkName);
    logger.info(message);
    res.json({ code: 200, message });
  } catch (err) {
    const errMessage = Messages.fail(modules.UPLOAD, actions.UPLOAD, err);
    logger.error(errMessage);
    // Set the status before sending; calling res.status() after res.json() does not change the response
    res.status(500).json({ code: 500, message: errMessage });
  }
}

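Uploaded files are saved under uploads/tmp; multer does this for us automatically. After a successful upload, req.files gives the file information, including the chunk's name and path, which is handy for writing records to the database later.

Why must chunk file names be unique?

If file names are random, then once the network drops before a chunk has finished uploading, the database has no record of that chunk, so the next upload cannot find it. As a result, many orphaned chunks pile up in the tmp directory and never get deleted.
Also, when an upload is paused, the temporary chunk can be deleted by its name (this step is optional, since multer automatically overwrites a chunk that already exists).

There are two ways to make chunk names unique:

When splitting the file, generate a fingerprint for each chunk (chunkmd5)
Use the fingerprint of the whole file plus the chunk's sequence number (filemd5 + chunkIndex)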

// Modify the code above
const chunkName = `${chunkIndex}.${filemd5}.chunk`;
const fd = new FormData();
fd.append(chunkName, chunk);

At this point, chunked upload is basically complete.
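
Because the storage code uses the multipart field name as the file name on disk, and that field name comes from the client, it may be worth validating it on the server before use. The following is a minimal sketch under that assumption; buildChunkName and parseChunkName are hypothetical helpers, not part of the original project, and they assume the fingerprint is a 32-character hex md5.

```javascript
// Hypothetical helpers mirroring the `${chunkIndex}.${filemd5}.chunk` naming scheme.
function buildChunkName(chunkIndex, filemd5) {
  return `${chunkIndex}.${filemd5}.chunk`;
}

// Returns { chunkIndex, filemd5 }, or null when the name does not match the scheme
// (for example when it contains path separators).
function parseChunkName(name) {
  const match = /^(\d+)\.([0-9a-f]{32})\.chunk$/i.exec(name);
  if (!match) return null;
  return { chunkIndex: Number(match[1]), filemd5: match[2] };
}
```

A server could call parseChunkName on file.fieldname inside the multer filename callback and reject the upload when it returns null.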

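File merging

Merging means reading the uploaded chunks one by one and combining them into a new file. This is I/O-heavy, so it can be done in a separate thread.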

// Server side: append the chunks to the target file in order
for (let chunkId = 0; chunkId < chunks; chunkId++) {
  const file = `${uploadTmp}/${chunkId}.${checksum}.chunk`;
  const content = await fsPromises.readFile(file);
  logger.info(Messages.success(modules.UPLOAD, actions.GET, file));
  try {
    await fsPromises.access(path, fs.constants.F_OK);
    await appendFile({ path, content, file, checksum, chunkId });
    if (chunkId === chunks - 1) {
        res.json({ code: 200, message });
    }
  } catch (err) {
    await createFile({ path, content, file, checksum, chunkId });
  }
}

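// Client side: once every chunk has been uploaded, ask the server to merge them
Promise.all(tasks).then(() => {
  // Only send the /makefile request while the status is still UPLOADING.
  // If the upload was canceled, sending it would delete chunks that were already uploaded.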
  if (this.status === fileStatus.UPLOADING) {
    const data = { chunks: this.chunks.length, filename, checksum: this.checksum };
    axios({
      url: '/makefile',
      method: 'post',
      data,
    })
      .then((res) => {
        if (res.data.code === 200) {
          this._setDoneProgress(this.checksum, fileStatus.DONE);
          toastr.success(`file ${filename} upload successfully!`);
        }
      })
      .catch((err) => {
        console.error(err);
        toastr.error(`file ${filename} upload failed!`);
      });
  }
});

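First use access to check whether the target file exists; if it does not, create it and write the chunk's content into it
If the target file already exists, append the chunk's content to it
After each chunk has been read successfully, delete the chunk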
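
The steps above can also be sketched with streams, so that no chunk has to be loaded into memory in full. This is a minimal sketch under assumptions, not the project's implementation: mergeChunks is a hypothetical helper, and it takes an ordered list of chunk paths rather than building them from the checksum.

```javascript
const fs = require('fs');
const { pipeline } = require('stream/promises');

// Hypothetical helper: append each chunk file to targetPath in order, then delete it.
async function mergeChunks(chunkPaths, targetPath) {
  // Start from an empty target so that a retried merge does not duplicate data.
  await fs.promises.writeFile(targetPath, '');

  for (const chunkPath of chunkPaths) {
    // Flag 'a' opens the target for appending; pipeline resolves once the write has finished.
    await pipeline(
      fs.createReadStream(chunkPath),
      fs.createWriteStream(targetPath, { flags: 'a' })
    );
    await fs.promises.unlink(chunkPath);
  }
}
```

Chunks are processed sequentially on purpose: appending in parallel would not preserve their order in the merged file.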

A few points need attention here:

If a file splits into only one chunk, the response must be sent in the createFile branch; otherwise the request stays pending forever.

await createFile({ path, content, file, checksum, chunkId });

// chunks is the chunk count (a number) sent by the client, as in the merge loop
if (chunks === 1) {
  res.json({ code: 200, message });
}

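Before calling makefile, be sure to check that the file is still in the uploading state. Otherwise the upload continues even in the canceled state: after a chunk is uploaded, its file gets deleted while its database record remains, and the merged file ends up corrupt.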

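Instant upload

How can a file be uploaded instantly? Think for three seconds, then here is the answer: 3. 2. 1..... It is really just a sleight of hand.

Why a sleight of hand? Because nothing is actually transferred; the file comes from the server. That raises a few questions:

How do we determine that the file already exists on the server?
Should the upload information be stored in a database or on the client?
What should happen when the file name differs but the content is the same?

Question 1: how do we determine that the file already exists?

We can generate a fingerprint for each uploaded file. But if the file is very large, computing the fingerprint on the client takes much longer. How do we solve that?

Remember slice, the file chunking from earlier? A large file is hard to handle in one go, so apply the same idea: cut it into small pieces and feed them to the md5 computation one at a time. Here the spark-md5 library generates the file hash. The slice method above is reworked as follows.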

export const checkSum = (file, piece = CHUNK_SIZE) => {
  return new Promise((resolve, reject) => {
    let totalSize = file.size;
    let start = 0;
    const blobSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
    const chunks = [];
    const spark = new SparkMD5.ArrayBuffer();
    const fileReader = new FileReader();

    const loadNext = () => {
      const end = start + piece >= totalSize ? totalSize : start + piece;
      const chunk = blobSlice.call(file, start, end);

      start = end;
      chunks.push(chunk);
      fileReader.readAsArrayBuffer(chunk);
    };

    fileReader.onload = (event) => {
      spark.append(event.target.result);

      if (start < totalSize) {
        loadNext();
      } else {
        const checksum = spark.end();
        resolve({ chunks, checksum });
      }
    };

    fileReader.onerror = () => {
      console.warn('oops, something went wrong.');
      reject();
    };

    loadNext();
  });
};
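
The reason this works is that md5 can be computed incrementally: feeding the pieces in order gives the same digest as hashing the whole file at once. The sketch below checks that property on the server side with Node's built-in crypto module instead of spark-md5 (a deliberate swap for illustration); md5InPieces is a hypothetical helper.

```javascript
const { createHash } = require('crypto');

// Hypothetical helper: hash a buffer piece by piece, the way the FileReader loop
// above feeds one chunk at a time to spark.append().
function md5InPieces(buffer, pieceSize) {
  const hash = createHash('md5');
  for (let start = 0; start < buffer.length; start += pieceSize) {
    // subarray returns a view without copying; the end index is clamped to buffer.length.
    hash.update(buffer.subarray(start, start + pieceSize));
  }
  return hash.digest('hex');
}

// Hashing in pieces matches hashing everything in one call.
const data = Buffer.from('hello world');
const whole = createHash('md5').update(data).digest('hex');
console.log(md5InPieces(data, 4) === whole);
```

The same helper could be used on the server to verify that a merged file matches the checksum sent by the client.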

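Question 2: should the upload information be stored in a database or on the client?

It is best to store upload information in a server-side database (on the client, IndexedDB is an option). This has a couple of advantages:

A database service provides a full set of CRUD operations, which makes the data easy to work with
When the user refreshes the browser or switches to another browser, the upload information is not lost

The main point here is the second one, since the client can do the first as well 😁😁😁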

const saveFileRecordToDB = async (params) => {
  const { filename, checksum, chunks, isCopy, res } = params;
  await uploadRepository.create({ name: filename, checksum, chunks, isCopy });

  const message = Messages.success(modules.UPLOAD, actions.UPLOAD, filename);
  logger.info(message);
  res.json({ code: 200, message });
};

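Question 3: what should happen when the file name differs but the content is the same?

Again there are two solutions:

File copy: copy the file directly, then update the database record and mark it with isCopy
File reference: save a database record marked with isCopy and linkTo

How do the two differ?

With file copy, deletion is more flexible, because the original and the copies exist independently and deleting one does not affect the others. The drawback is that many files with identical content accumulate.

With references, deletion is more involved. Deleting a copy is fine, but when deleting the original, the source file must first be copied to one of its copies, and isCopy in that copy's record set to false; only then can the original's database record be deleted.

I drew a diagram for this as well (the image is not included here).

In theory the reference approach is probably the better one; I was lazy here and went with file copy.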
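
For comparison, the deletion rule for the reference approach can be sketched on in-memory records. This is a minimal sketch under assumptions: deleteWithReferences is a hypothetical helper, the record shape { name, isCopy, linkTo } follows the fields named above, and repointing the remaining references at the promoted copy is an added step that the text does not spell out.

```javascript
// Hypothetical in-memory records: { name, isCopy, linkTo }, where linkTo names the original.
// Returns the records that remain after deleting `name` under the reference scheme.
function deleteWithReferences(records, name) {
  const target = records.find((r) => r.name === name);
  if (!target) return records;

  const remaining = records.filter((r) => r.name !== name);
  if (target.isCopy) return remaining; // deleting a reference affects nothing else

  // Deleting the original: promote the first reference to be the new original.
  const heir = remaining.find((r) => r.linkTo === name);
  if (!heir) return remaining; // no references left, so the stored file can go too

  // The stored file content would be moved to the heir here (not modelled),
  // and every other reference is repointed at the heir.
  return remaining.map((r) => {
    if (r === heir) return { ...r, isCopy: false, linkTo: null };
    if (r.linkTo === name) return { ...r, linkTo: heir.name };
    return r;
  });
}
```

With file copy, none of this bookkeeping is needed, which is the trade-off against storing duplicate content.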

// Client
uploadFileInSecond() {
  const id = ID();
  const filename = this.file.name;
  this._renderProgressBar(id);

  const names = this.serverFiles.map((file) => file.name);
  if (names.indexOf(filename) === -1) {
    const sourceFilename = names[0];
    const targetFilename = filename;

    this._setDoneProgress(id, fileStatus.DONE_IN_SECOND);
    axios({
      url: '/copyfile',
      method: 'get',
      params: { targetFilename, sourceFilename, checksum: this.checksum },
    })
      .then((res) => {
        if (res.data.code === 200) {
          toastr.success(`file ${filename} upload successfully!`);
        }
      })
      .catch((err) => {
        console.error(err);
        toastr.error(`file ${filename} upload failed!`);
      });
  } else {
    this._setDoneProgress(id, fileStatus.EXISTED);
    toastr.success(`file ${filename} has existed`);
  }
}

// Server
copyFile: async (req, res) => {
  const sourceFilename = req.query.sourceFilename;
  const targetFilename = req.query.targetFilename;
  const checksum = req.query.checksum;
  const sourceFile = `${uploadPath}/${sourceFilename}`;
  const targetFile = `${uploadPath}/${targetFilename}`;

  try {
    await fsPromises.copyFile(sourceFile, targetFile);
    await saveFileRecordToDB({ filename: targetFilename, checksum, chunks: 0, isCopy: true, res });
  } catch (err) {
    const message = Messages.fail(modules.UPLOAD, actions.UPLOAD, err.message);
    logger.error(message);
    // Set the status before sending; calling res.status() after res.json() does not change the response
    res.status(500).json({ code: 500, message });
  }
}
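
The client-side branch above can be summarized as a small decision function. This is a minimal sketch under assumptions: decideUpload is a hypothetical helper, serverFiles is taken to be the list of server records that share the file's checksum, and the 'upload' outcome for an empty list is inferred rather than shown in the code above.

```javascript
// Hypothetical summary of the possible outcomes for a file with a given checksum.
function decideUpload(serverFiles, filename) {
  // No record with this checksum: a normal chunked upload is needed (assumed branch).
  if (serverFiles.length === 0) return { action: 'upload' };

  const names = serverFiles.map((file) => file.name);
  // Same content and same name: nothing to do.
  if (names.includes(filename)) return { action: 'exists' };

  // Same content under a different name: copy an existing file on the server.
  return { action: 'copy', sourceFilename: names[0] };
}
```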

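Pausing and resuming uploads

Pausing an upload relies on the abort method of xhr. The example uses axios, which wraps XMLHttpRequest and provides its own cancellation mechanism on top of it.

Here is the code for pausing: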

const CancelToken = axios.CancelToken;

axios({
  url: '/upload',
  method: 'post',
  data: fd,
  cancelToken: new CancelToken((c) => {
    // The executor function receives this request's cancel function as a parameter;
    // keep it so that the request can be canceled later
    this.cancelers.push(c);
  }),
})

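Each axios request accepts a cancelToken option. The CancelToken is constructed with an executor function that receives the request's cancel function, so we can use it to save a cancel handle for every request.

Then, when cancel is clicked, cancel the upload of every chunk, as follows:

// The HTML here was written with jQuery; yes, it really was painful to write 🤮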

$(`#cancel${id}`).on('click', (event) => {
  const $this = $(event.target);
  $this.addClass('hidden');
  $this.next('.resume').removeClass('hidden');

  this.status = fileStatus.CANCELED;
  if (this.cancelers.length > 0) {
    for (const canceler of this.cancelers) {
      canceler();
    }
  }
});
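
Newer axios releases (0.22 and later) also accept a standard AbortController signal and mark CancelToken as deprecated. The sketch below shows the same cancel-everything idea with one shared controller; it does not call axios. UploadSession and fakeUpload are hypothetical names, and fakeUpload merely stands in for a request that would receive the signal through the axios signal option.

```javascript
// Hypothetical stand-in for one chunk upload; a real call would pass `signal` to axios.
function fakeUpload(signal, ms = 1000) {
  return new Promise((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error('canceled'));
      return;
    }
    const timer = setTimeout(() => resolve('done'), ms);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);
      reject(new Error('canceled'));
    }, { once: true });
  });
}

// One controller is shared by all chunk requests of an upload session.
class UploadSession {
  constructor() {
    this.controller = new AbortController();
  }

  // Start one request, handing it the shared signal.
  start(uploadFn) {
    return uploadFn(this.controller.signal);
  }

  // Abort every in-flight request; an aborted controller cannot be reused,
  // so create a fresh one for a later resume.
  cancel() {
    this.controller.abort();
    this.controller = new AbortController();
  }
}
```

Because the signal is created before any request starts, there is no per-request handle to collect, and a cancel click reaches all requests at once.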

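While uploading each chunk, we also need to check whether that chunk already exists. Why?

Because when the network drops unexpectedly, the information for the chunks already uploaded has been saved to the database. So when resuming, chunks that already exist do not need to be sent again, which saves time.

That raises a question: should each chunk be checked individually, or should the list of chunks already on the server be fetched in advance?

You can think about this one for three seconds too; after all, I spent a long time debugging it.

3.. 2.. 1......

It depends on your own coding strategy, since everyone writes code differently. The principle is that no loop iteration may block, because the loop has to create the cancelToken for each chunk. If every chunk fetches data from the server inside the loop, the later chunks cannot create their cancelToken in time, so after cancel is clicked those later chunks can still go on uploading.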

// Client
const chunksExisted = await this._isChunksExists();

for (let chunkId = 0; chunkId < this.chunks.length; chunkId++) {
  const chunk = this.chunks[chunkId];
  // A much earlier version looked like this,
  // which blocks the creation of the cancelToken:
  // const chunkExists = await isChunkExisted(this.checksum, chunkId);

  const chunkExists = chunksExisted[chunkId];

  if (!chunkExists) {
    const task = this._chunkUploadTask({ chunk, chunkId });
    tasks.push(task);
  } else {
    // If the chunk already exists, set the width of its progress bar to 100%
    this._setUploadingChunkProgress(this.checksum, chunkId, 100);
    this.progresses[chunkId] = chunk.size;
  }
}

// Server
chunksExist: async (req, res) => {
  const checksum = req.query.checksum;
  try {
    const chunks = await chunkRepository.findAllBy({ checksum });
    const exists = chunks.reduce((cur, chunk) => {
      cur[chunk.chunkId] = true;
      return cur;
    }, {});
    const message = Messages.success(modules.UPLOAD, actions.CHECK, `chunk ${JSON.stringify(exists)} exists`);
    logger.info(message);
    res.json({ code: 200, message: message, data: exists });
  } catch (err) {
    const errMessage = Messages.fail(modules.UPLOAD, actions.CHECK, err);
    logger.error(errMessage);
    // Set the status before sending; calling res.status() after res.json() does not change the response
    res.status(500).json({ code: 500, message: errMessage });
  }
}
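
The key property of the prefetch approach is that, once the map has been fetched, deciding which chunks to send is purely synchronous. A minimal sketch of that decision, with planUploads as a hypothetical helper and existsMap shaped like the data field returned by the server code above:

```javascript
// Hypothetical helper: split chunk indices into those to upload and those to skip,
// given a map such as { "0": true, "2": true } fetched once before the loop.
function planUploads(chunkCount, existsMap) {
  const toUpload = [];
  const skipped = [];
  for (let chunkId = 0; chunkId < chunkCount; chunkId++) {
    // No await inside the loop, so every upload task (and its cancel handle)
    // can be created in the same tick.
    (existsMap[chunkId] ? skipped : toUpload).push(chunkId);
  }
  return { toUpload, skipped };
}
```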

Resuming simply means uploading the file again. There is not much to say about it; the main thing is to solve the problem above.

$(`#resume${id}`).on('click', async (event) => {
  const $this = $(event.target);
  $this.addClass('hidden');
  $this.prev('.cancel').removeClass('hidden');

  this.status = fileStatus.UPLOADING;
  await this.uploadFile();
});

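Progress reporting

Progress reporting relies on XMLHttpRequest.upload, which axios also wraps. Two kinds of progress need to be shown here:

The progress of each chunk
The overall progress of all chunks

Each chunk's progress is calculated from the loaded and total values of the upload; there is not much to say here either.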

axios({
  url: '/upload',
  method: 'post',
  data: fd,
  onUploadProgress: (progressEvent) => {
    const loaded = progressEvent.loaded;
    const chunkPercent = ((loaded / progressEvent.total) * 100).toFixed(0);

    this._setUploadingChunkProgress(this.checksum, chunkId, chunkPercent);
  },
})

The overall progress is the sum of the bytes loaded for every chunk, divided by file.size.

constructor(checksum, chunks, file) {
  // One slot per chunk, holding the bytes uploaded so far
  this.progresses = Array(chunks.length).fill(0);
}

axios({
  url: '/upload',
  method: 'post',
  data: fd,
  onUploadProgress: (progressEvent) => {
    const chunkProgress = this.progresses[chunkId];
    const loaded = progressEvent.loaded;
    this.progresses[chunkId] = loaded >= chunkProgress ? loaded : chunkProgress;
    const percent = ((this._getCurrentLoaded(this.progresses) / this.file.size) * 100).toFixed(0);

    this._setUploadingProgress(this.checksum, percent);
  },
})

_setUploadingProgress(id, percent) {
  // ...

  // for some reason, progressEvent.loaded bytes will greater than file size
  const isUploadChunkDone = Number(percent) >= 100;
  // 1% to make file
  const ratio = isUploadChunkDone ? 99 : percent;
}

One thing to note here is the expression loaded >= chunkProgress ? loaded : chunkProgress. During a resume, some chunks may have to restart uploading from 0; without this check, the progress bar would jump backward.
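
The two rules described in this section, never moving a chunk's progress backward and holding the overall value at 99% until the merge is done, can be put together in a small tracker. This is a minimal sketch: ProgressTracker is a hypothetical class, and it rounds down with Math.floor where the code above uses toFixed(0).

```javascript
// Hypothetical tracker mirroring the progresses array and the 99% cap described above.
class ProgressTracker {
  constructor(chunkCount, fileSize) {
    this.progresses = Array(chunkCount).fill(0);
    this.fileSize = fileSize;
  }

  // Record the bytes loaded for one chunk, never letting its value move backward.
  update(chunkId, loaded) {
    this.progresses[chunkId] = Math.max(this.progresses[chunkId], loaded);
    return this.percent();
  }

  // Overall percentage, capped at 99 so that the last 1% is reserved for the merge step.
  // The cap also absorbs loaded values that exceed the file size.
  percent() {
    const total = this.progresses.reduce((sum, n) => sum + n, 0);
    const raw = Math.floor((total / this.fileSize) * 100);
    return Math.min(raw, 99);
  }
}
```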

Database configuration

The database uses sequelize + mysql. The initialization code is as follows:

const initialize = async () => {
  // create db if it doesn't already exist
  const { DATABASE, USER, PASSWORD, HOST } = config;
  const connection = await mysql.createConnection({ host: HOST, user: USER, password: PASSWORD });
  try {
    await connection.query(`CREATE DATABASE IF NOT EXISTS ${DATABASE};`);
  } catch (err) {
    logger.error(Messages.fail(modules.DB, actions.CONNECT, `create database ${DATABASE}`));
    throw err;
  }

  // connect to db
  const sequelize = new Sequelize(DATABASE, USER, PASSWORD, {
    host: HOST,
    dialect: 'mysql',
    logging: (msg) => logger.info(Messages.info(modules.DB, actions.CONNECT, msg)),
  });

  // init models and add them to the exported db object
  db.Upload = require('./models/upload')(sequelize);
  db.Chunk = require('./models/chunk')(sequelize);

  // sync all models with database
  await sequelize.sync({ alter: true });
};

Deployment

The production deployment uses docker-compose. The configuration is as follows:

Dockerfile

FROM node:16-alpine3.11

# Create app directory
WORKDIR /usr/src/app

# A wildcard is used to ensure both package.json AND package-lock.json are copied
# where available (npm@5+)
COPY package*.json ./

# If you are building your code for production
# RUN npm ci --only=production

# Bundle app source
COPY . .

# Install app dependencies
RUN npm install
RUN npm run build:prod

docker-compose.yml

version: "3.9"
services:
  web:
    build: .
    # sleep for 20 sec, wait for database server start
    command: sh -c "sleep 20 && npm start"
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: prod
    depends_on:
      - db
  db:
    image: mysql:8
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    ports:
      - "3306:3306"
    environment:
      MYSQL_ROOT_PASSWORD: pwd123

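One thing to note: the database service has to be up before the web service starts, otherwise the web service fails with an error. That is why a 20-second delay was added to the command.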
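
A fixed delay either waits too long or not long enough. An alternative is to retry the connection from the application itself. This is a minimal sketch under assumptions: connectWithRetry is a hypothetical helper, the retry count and delay are arbitrary, and connect is any function that returns a promise, such as the initialize function from the database section.

```javascript
// Hypothetical helper: call `connect` until it succeeds or the attempts run out.
async function connectWithRetry(connect, { retries = 10, delayMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < retries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

With something like this in place, the web service can start right away and simply keep trying until the database accepts connections.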

Deploying to Heroku

create heroku.yml

build:
  docker:
    web: Dockerfile
run:
  web: npm run start:heroku

modify package.json

{
  "scripts": {
    "start:heroku": "NODE_ENV=heroku node ./bin/www"
  }
}

deploy to heroku

# create heroku repos
heroku create upload-demos
heroku stack:set container 

# before adding addons, remember to configure your billing card in heroku [important]
# add mysql addons
heroku addons:create cleardb:ignite 
# get mysql connection url
heroku config | grep CLEARDB_DATABASE_URL
# will echo => DATABASE_URL: mysql://xxxxxxx:xxxxxx@xx-xxxx-east-xx.cleardb.com/heroku_9ab10c66a98486e?reconnect=true

# set mysql database url
heroku config:set DATABASE_URL='mysql://xxxxxxx:xxxxxx@xx-xxxx-east-xx.cleardb.com/heroku_9ab10c66a98486e?reconnect=true'

# add heroku.js to the src/db/config folder
# use the DATABASE_URL from the previous step to fill in the js file
module.exports = {
  HOST: 'xx-xxxx-east-xx.cleardb.com',
  USER: 'xxxxxxx',
  PASSWORD: 'xxxxxx',
  DATABASE: 'heroku_9ab10c66a98486e',
};

# push source code to remote
git push heroku master

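Summary

That covers all of the problems. My overall impression is that there are a great many details to handle. Some things cannot be understood just by reading about them; taking the time to build them gives a much better grasp of the principles and more motivation to learn new things.

What you learn on paper always feels shallow; to truly understand something, you have to practice it yourself.

The GitHub repository contains many more details, including the local development server configuration and log storage. If you are interested, fork it and take a look. Writing this took effort, so stars are appreciated ⭐️⭐️.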