Learn MP4 Well and Give Your Live Streaming a Boost

Date: 2019-11-16


Original source: villainhr

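MP4 is short for MPEG-4 Part 14, that is, Part 14 of the MPEG-4 standard, published as ISO/IEC 14496-14. Its main strengths are fast seeking and progressive playback: a file can be played while it is still being downloaded. It grew out of the MOV (QuickTime) format and then developed its own conventions. Closely related file formats include 3GP and M4V.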

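The MP4 format is somewhat more complex than FLV. It carries all of its data through nesting. In other words, every piece of content is an object, and to play something back you only need to get hold of the corresponding object.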

The most basic unit of MP4 is the box; a file is simply a sequence of independent boxes joined together. So we start with the box itself.

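PS: for a front-end developer, understanding MP4 is useless in most situations and arguably a waste of time. This article is aimed at readers interested in audio/video development, especially those working on live streaming or video players.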

MP4 box

MP4 boxes come in two kinds: basic boxes and full boxes.

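basic box: a plain box with only a size and a type, used for the basic structural boxes such as ftyp and moov.
full box: a basic box extended with version and flags fields, used mainly by the boxes that describe the media itself.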

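To stress it once more: the box is the core of MP4. When writing a decoder or an encoder, it pays to learn the basic layout by heart; you will enjoy the work a lot more (speaking from experience).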

OK, let's look at the concrete structure of a box.

basic box

First, the structure of a basic box. In the specification's syntax description language it looks like this:

aligned(8) class Box (unsigned int(32) boxtype, optional unsigned int(8)[16] extended_type) {
   unsigned int(32) size;
   unsigned int(32) type = boxtype;
   if (size==1) {
      unsigned int(64) largesize;
   }  else if (size==0) {
      // box extends to end of file
   }
   // the MP4 extension box type; rarely encountered in practice
   if (boxtype=='uuid') {
      unsigned int(8)[16] usertype = extended_type;
   }
}

The code above is already fairly clear, but here is a brief walkthrough.

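size[4B]: the size of the whole box, header and body included. Because 32 bits cannot describe very large boxes, there are two special cases. When size == 1, an 8-byte largesize field follows the type and holds the real size. When size == 0, the box is the last one in the file and extends to the end of the file.
type[4B]: identifies the kind of box. Its content is simply the ASCII codes of the box's four-letter name, so calling the charCodeAt API four times is enough.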

// get the bytes of the type field for a given box name (val is the 4-character name, e.g. 'ftyp')
val.charCodeAt(0)
val.charCodeAt(1)
val.charCodeAt(2)
val.charCodeAt(3)
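
Going the other way, the size and type rules above can be sketched as a small header reader. This is a minimal sketch, not a full demuxer: parseBoxHeader is a hypothetical helper name, and it assumes the whole file is already in a Uint8Array. MP4 fields are big-endian, which is also what DataView reads by default.

```javascript
// Hypothetical helper: read the header of the box that starts at `offset`.
function parseBoxHeader(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let size = view.getUint32(offset); // DataView reads big-endian by default
  const type = String.fromCharCode(
    bytes[offset + 4], bytes[offset + 5], bytes[offset + 6], bytes[offset + 7]
  );
  let headerSize = 8;

  if (size === 1) {
    // size == 1: the real size is in the 8-byte largesize field after type
    const high = view.getUint32(offset + 8);
    const low = view.getUint32(offset + 12);
    size = high * 2 ** 32 + low;
    headerSize = 16;
  } else if (size === 0) {
    // size == 0: the box runs to the end of the file
    size = bytes.byteLength - offset;
  }
  return { size, type, headerSize };
}

// An empty 'free' box: size = 8, type = 'free'
const freeBox = new Uint8Array([0x00, 0x00, 0x00, 0x08, 0x66, 0x72, 0x65, 0x65]);
const freeHeader = parseBoxHeader(freeBox);
console.log(freeHeader); // { size: 8, type: 'free', headerSize: 8 }
```

Walking a file is then a loop that reads a header, handles the body, and advances offset by size.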

The overall structure of a box can be pictured as follows:

[Figure: overall structure of a box]

One point worth stressing: MP4 writes everything in big-endian byte order. So every 4-byte or 8-byte field mentioned above is written big-endian.

Didn't we say above that boxes come in two basic forms? The other one is the full box.

full box

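The main difference between a full box and a basic box is the added version and flags fields. It is not needed for the plain container boxes; it is used mainly for the boxes inside trak. Its basic format is: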

aligned(8) class FullBox(unsigned int(32) boxtype, unsigned int(8) v, bit(24) f) extends Box(boxtype) {
    unsigned int(8) version = v;
    bit(24) flags = f;
}

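In practice, if you have no use case that depends on version and flags, you can simply set them to their default value, 0x00. The basic layout is:

[Figure: layout of a full box]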

During an actual remux, the box is the smallest unit of composition. For example, here is a minimal box builder in JS:

MP4.box = function (type, ...buffers) {
  // total length = 8-byte header (size + type) + all payload buffers
  let boxLength = 8;
  buffers.forEach(val => {
    boxLength += val.byteLength;
  });

  const boxBuffer = new Uint8Array(boxLength);

  // bytes 0-3: box size, big-endian
  boxBuffer[0] = (boxLength >>> 24) & 0xff;
  boxBuffer[1] = (boxLength >>> 16) & 0xff;
  boxBuffer[2] = (boxLength >>> 8) & 0xff;
  boxBuffer[3] = boxLength & 0xff;

  // bytes 4-7: box type (type holds the 4 ASCII codes of the box name)
  boxBuffer.set(type, 4);

  // copy each payload buffer after the 8-byte header
  let offset = 8;
  buffers.forEach(val => {
    boxBuffer.set(val, offset);
    offset += val.byteLength;
  });

  return boxBuffer;
};

A typical call looks like this:

// MP4.symbolValue.FTYP is a concrete buffer (a Uint8Array)
MP4.box(MP4.types.ftyp, MP4.symbolValue.FTYP);
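
The same idea extends naturally to full boxes: prepend one version byte and three flags bytes to the payload. The sketch below is stand-alone and does not use the article's MP4 object; box, fullBox and typeBytes are hypothetical names chosen for illustration.

```javascript
// Hypothetical stand-alone helpers (not the article's MP4 object).
function typeBytes(name) {
  // 4-character box name -> 4 ASCII codes
  return Uint8Array.from(name, ch => ch.charCodeAt(0));
}

function box(name, ...payloads) {
  const size = 8 + payloads.reduce((sum, p) => sum + p.byteLength, 0);
  const out = new Uint8Array(size);
  new DataView(out.buffer).setUint32(0, size); // big-endian size
  out.set(typeBytes(name), 4);
  let offset = 8;
  for (const p of payloads) {
    out.set(p, offset);
    offset += p.byteLength;
  }
  return out;
}

function fullBox(name, version, flags, ...payloads) {
  // 1 byte of version followed by 24 bits of flags
  const head = new Uint8Array([
    version, (flags >> 16) & 0xff, (flags >> 8) & 0xff, flags & 0xff
  ]);
  return box(name, head, ...payloads);
}

// vmhd: version 0, flags 1, then 8 bytes of defaults (graphicsmode + opcolor)
const vmhd = fullBox('vmhd', 0, 1, new Uint8Array(8));
console.log(vmhd.length); // 20
```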

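Next, let's look at the boxes that MP4 actually uses.

We will go through them following MP4's own box hierarchy. First, the structure table given by the MP4 specification:

[Figure: box hierarchy table from the specification, with mandatory boxes marked by an asterisk]

A note: we only cover the boxes marked with an asterisk. The others are not mandatory, so we skip them. Even so, there are quite a few starred boxes. Our goal is to generate an MP4 file, and a normally playable MP4 file does not actually need every starred box.

Playable MP4 files fall into two kinds: unfragmented MP4 (MP4 for short) and fragmented MP4 (fMP4 for short). So what is the difference between the two?

They are completely different, because the way each one locates the media stream for playback follows an entirely different model.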

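The MP4 format

The basic boxes are:

[Figure: the minimal set of MP4 boxes]

The above is the bare minimum of MP4 boxes. A more complete set looks like this:

[Figure: a more complete MP4 box tree]

An MP4 file uses the basic boxes under stbl in each trak (stts, stsc and so on) to index into the mdat box. So what about fMP4? Compare the top-level boxes of the two layouts:

Fragmented (non-standard): commonly used to generate files with a single trak.
- ftyp
- moov
- moof
- mdat
Unfragmented (standard): used to generate files containing multiple traks.
- ftyp
- moov
- mdat

At first glance the fragmented layout has one box more. But when it comes to actual muxing and demuxing, the standard layout forces you to pay much more attention to encoding the child boxes of stbl: stts, stco, ctts and so on. The fragmented layout lets you ignore stbl; the data that would have lived in stbl is moved into moof instead. And the format inside moof is far simpler than stbl. So below we go through the boxes of both layouts.

Standard MP4 boxes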

ftyp

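The ftyp box is effectively the file's mission statement: it tells the decoder which specification the file follows and which formats it is compatible with. In short, it tells the client which standard to use when decoding this MP4. ftyp normally sits at the very start of the file.

Its format is:

aligned(8) class FileTypeBox
   extends Box('ftyp') {
   unsigned int(32)  major_brand;
   unsigned int(32)  minor_version;
   unsigned int(32)  compatible_brands[];
}

All of the fields above live in the box's data section (see the description of box above).

major_brand: compatibility comes in two flavors, preferred and fallback, and major_brand is the preferred one. For decoding on the Web, the general-purpose isom brand is usually enough. If you need a specific format, you can set a different brand.
minor_version: an informative minor version number for the major brand.
compatible_brands: like major_brand, but a list. It usually names the additional formats found in the file, for example the brands for codecs such as AVC and AAC.

Rather than pile up more concepts, here is some code. This is how a generic ftyp payload can be created: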

FTYP: new Uint8Array([
    0x69, 0x73, 0x6F, 0x6D, // major_brand: isom
    0x0, 0x0, 0x0, 0x1, // minor_version: 0x01
    0x69, 0x73, 0x6F, 0x6D, // isom
    0x61, 0x76, 0x63, 0x31 // avc1
  ])
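
To check that the byte layout matches the field list, the payload can be read back. This is only an illustrative sketch: parseFtyp is a hypothetical helper, and it expects the ftyp payload without the 8-byte box header.

```javascript
// Hypothetical helper: decode an ftyp payload (box header already stripped).
function parseFtyp(payload) {
  const fourcc = (i) => String.fromCharCode(...payload.subarray(i, i + 4));
  const view = new DataView(payload.buffer, payload.byteOffset, payload.byteLength);

  // everything after major_brand and minor_version is a list of 4-byte brands
  const compatibleBrands = [];
  for (let i = 8; i + 4 <= payload.byteLength; i += 4) {
    compatibleBrands.push(fourcc(i));
  }
  return {
    majorBrand: fourcc(0),
    minorVersion: view.getUint32(4), // big-endian
    compatibleBrands
  };
}

const ftypPayload = new Uint8Array([
  0x69, 0x73, 0x6F, 0x6D, // major_brand: isom
  0x00, 0x00, 0x00, 0x01, // minor_version: 1
  0x69, 0x73, 0x6F, 0x6D, // isom
  0x61, 0x76, 0x63, 0x31  // avc1
]);
const ftypInfo = parseFtyp(ftypPayload);
console.log(ftypInfo.majorBrand, ftypInfo.compatibleBrands); // isom [ 'isom', 'avc1' ]
```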

moov

The moov box exists mainly as an important container; it has no content of its own worth discussing. Its job is to hold the trak boxes. Its basic format is:

aligned(8) class MovieBox extends Box('moov') { }

mvhd

mvhd is the first box under moov and describes the media as a whole. Its contents are:

aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
   if (version==1) {
      unsigned int(64)  creation_time;
      unsigned int(64)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(64)  duration;
   } else { // version==0
      unsigned int(32)  creation_time;
      unsigned int(32)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(32)  duration;
   }
   template int(32)  rate = 0x00010000; // typically 1.0
   template int(16)  volume = 0x0100;   // typically, full volume
   const bit(16)  reserved = 0;
   const unsigned int(32)[2]  reserved = 0;
   template int(32)[9]  matrix =
      { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
      // Unity matrix
   bit(32)[6]  pre_defined = 0;
   unsigned int(32)  next_track_ID;
}

version: usually 0.
creation_time: creation time, in seconds counted from the year 1904.
timescale: the number of time units per second. Together with duration it gives the real length.
duration: the length of the presentation, in timescale units. Real time is duration / timescale seconds.
rate: playback rate. 0x00010000 is a 16.16 fixed-point 1.0, i.e. normal speed.
volume: playback volume. 0x0100 is an 8.8 fixed-point 1.0, i.e. full volume.
matrix: a transformation matrix for the video. I won't go into it (I don't fully understand it myself); the unity matrix above is the usual default.
next_track_ID: must be larger than the largest track_ID in use. In practice any sufficiently large value will do.
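
As a worked example of the timescale and duration relationship, the sketch below converts a length in seconds to timescale units and splits it into the four big-endian bytes that a version 0 mvhd expects. The 90000 timescale and the 2.5 second length are arbitrary values picked for illustration, and u32 is a hypothetical helper.

```javascript
// Hypothetical helper: split an unsigned 32-bit value into 4 big-endian bytes.
function u32(value) {
  return [
    (value >>> 24) & 0xff,
    (value >>> 16) & 0xff,
    (value >>> 8) & 0xff,
    value & 0xff
  ];
}

const timescale = 90000;   // assumed: 90000 units per second
const lengthSeconds = 2.5; // assumed presentation length

// duration is expressed in timescale units
const duration = Math.round(lengthSeconds * timescale); // 225000
const durationBytes = u32(duration);                    // [0x00, 0x03, 0x6e, 0xe8]

// and back again: real time = duration / timescale
const roundTrip = duration / timescale;                 // 2.5
console.log(duration, durationBytes, roundTrip);
```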

In practice, most mvhd fields can be set to fixed values:

new Uint8Array([
        0x00, 0x00, 0x00, 0x00, // version(0) + flags
        0x00, 0x00, 0x00, 0x00, // creation_time
        0x00, 0x00, 0x00, 0x00, // modification_time
        (timescale >>> 24) & 0xFF, // timescale: 4 bytes
        (timescale >>> 16) & 0xFF,
        (timescale >>> 8) & 0xFF,
        (timescale) & 0xFF,
        (duration >>> 24) & 0xFF, // duration: 4 bytes
        (duration >>> 16) & 0xFF,
        (duration >>> 8) & 0xFF,
        (duration) & 0xFF,
        0x00, 0x01, 0x00, 0x00, // Preferred rate: 1.0
        0x01, 0x00, 0x00, 0x00, // PreferredVolume(1.0, 2bytes) + reserved(2bytes)
        0x00, 0x00, 0x00, 0x00, // reserved: 4 + 4 bytes
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x01, 0x00, 0x00, // ----begin composition matrix----
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x01, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x40, 0x00, 0x00, 0x00, // ----end composition matrix----
        0x00, 0x00, 0x00, 0x00, // ----begin pre_defined 6 * 4 bytes----
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, // ----end pre_defined 6 * 4 bytes----
        0xFF, 0xFF, 0xFF, 0xFF // next_track_ID
    ]);

trak

The trak box holds everything related to one media stream. Its format is just a plain box:

aligned(8) class TrackBox extends Box('trak') { }

Inside it, however, you will find boxes that describe that media stream, starting with tkhd:

tkhd

tkhd is a direct child of the trak box. It describes the properties of that particular trak. Its contents are:

aligned(8) class TrackHeaderBox
   extends FullBox('tkhd', version, flags) {
   if (version==1) {
      unsigned int(64)  creation_time;
      unsigned int(64)  modification_time;
      unsigned int(32)  track_ID;
      const unsigned int(32)  reserved = 0;
      unsigned int(64)  duration;
   } else { // version==0
      unsigned int(32)  creation_time;
      unsigned int(32)  modification_time;
      unsigned int(32)  track_ID;
      const unsigned int(32)  reserved = 0;
      unsigned int(32)  duration;
   }
   const unsigned int(32)[2] reserved = 0;
   template int(16) layer = 0;
   template int(16) alternate_group = 0;
   template int(16) volume = {if track_is_audio 0x0100 else 0};
   const unsigned int(16) reserved = 0;
   template int(32)[9] matrix =
      { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
      // unity matrix
   unsigned int(32) width;
   unsigned int(32) height;
}

That is a lot of fields, but not all of them need meaningful values. Briefly:

creation_time: creation time; optional.
modification_time: modification time; optional.
track_ID: the ID of the track being described.
duration: the length of this track. It is interpreted together with the timescale (the one in mvhd).
layer: rarely useful. It sets the front-to-back layering of video traks.
alternate_group: marks alternative track sources. 0 means this track has no alternative; a non-zero value means the track belongs to a group of alternative sources.
volume: the playback volume. Full volume is 1.0 (0x0100).
width and height: the width and height of the video, stored as 16.16 fixed-point values.
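
Because width and height are 16.16 fixed-point numbers, a 1280 pixel width is not written as 0x00000500. A minimal sketch of the conversion follows; fixed16_16 is a hypothetical helper and the 1280x720 size is just an example.

```javascript
// Hypothetical helper: encode a non-negative number as 16.16 fixed point,
// returned as 4 big-endian bytes (integer part in the upper 16 bits).
function fixed16_16(value) {
  const raw = Math.round(value * 0x10000);
  return [
    (raw >>> 24) & 0xff,
    (raw >>> 16) & 0xff,
    (raw >>> 8) & 0xff,
    raw & 0xff
  ];
}

const widthBytes = fixed16_16(1280); // [0x05, 0x00, 0x00, 0x00]
const heightBytes = fixed16_16(720); // [0x02, 0xd0, 0x00, 0x00]
console.log(widthBytes, heightBytes);
```

The same layout explains the 0x00010000 entries in the matrix and the rate field of mvhd: each is 1.0 in 16.16 fixed point.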

mdia

mdia wraps the media-related information. There is not much to say about the box itself; its format is:

aligned(8) class MediaBox extends Box('mdia') { }

mdhd

mdhd and tkhd carry broadly similar content. The difference is that tkhd sets properties of a given track, while mdhd describes the media itself, independently of the track. In practice the two are usually filled in the same way.

The format is:

aligned(8) class MediaHeaderBox extends FullBox('mdhd', version, 0) {
   if (version==1) {
      unsigned int(64)  creation_time;
      unsigned int(64)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(64)  duration;
   } else { // version==0
      unsigned int(32)  creation_time;
      unsigned int(32)  modification_time;
      unsigned int(32)  timescale;
      unsigned int(32)  duration;
   }
   bit(1) pad = 0;
   unsigned int(5)[3] language; // ISO-639-2/T language code
   unsigned int(16) pre_defined = 0;
}

There are three extra fields here: pad, language and pre_defined.

They mean what their names suggest:

pad: a padding bit, normally 0.
language: the language of the current trak, as three 5-bit characters. The field is 15 bits long, so together with pad it fills exactly 2 bytes.
pre_defined: 0 by default.

In code this comes out as:

new Uint8Array([
    0x00, 0x00, 0x00, 0x00, // version(0) + flags
    0x00, 0x00, 0x00, 0x00, // creation_time
    0x00, 0x00, 0x00, 0x00, // modification_time
    (timescale >>> 24) & 0xFF, // timescale: 4 bytes
    (timescale >>> 16) & 0xFF,
    (timescale >>> 8) & 0xFF,
    (timescale) & 0xFF,
    (duration >>> 24) & 0xFF, // duration: 4 bytes
    (duration >>> 16) & 0xFF,
    (duration >>> 8) & 0xFF,
    (duration) & 0xFF,
    0x55, 0xC4, // language: und (undetermined)
    0x00, 0x00 // pre_defined = 0
  ])
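
The 0x55, 0xC4 pair for language is not arbitrary. In the ISO base media file format each of the three letters is stored as its ASCII code minus 0x60, which fits in 5 bits, and the three values are packed after the pad bit. A minimal sketch, with packLanguage as a hypothetical helper name:

```javascript
// Hypothetical helper: pack a 3-letter lowercase ISO-639-2/T code into the
// 2 bytes used by mdhd (1 pad bit + 3 x 5 bits).
function packLanguage(code) {
  if (!/^[a-z]{3}$/.test(code)) {
    throw new RangeError('expected three lowercase letters, got: ' + code);
  }
  // each letter becomes a 5-bit value: ASCII code minus 0x60
  const [a, b, c] = Array.from(code, ch => ch.charCodeAt(0) - 0x60);
  const packed = (a << 10) | (b << 5) | c;
  return [packed >> 8, packed & 0xff];
}

const und = packLanguage('und'); // [0x55, 0xc4]
const eng = packLanguage('eng'); // [0x15, 0xc7]
console.log(und, eng);
```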

hdlr

hdlr declares how a trak should be handled. The common handler types are:

vide : Video track
soun : Audio track
hint : Hint track
meta : Timed Metadata track
auxv : Auxiliary Video track

This plays the same role as the Content-Type field we set when sending or receiving a resource, for example application/javascript.

Its basic format is:

aligned(8) class HandlerBox extends FullBox('hdlr', version = 0, 0) {
   unsigned int(32) pre_defined = 0;
   unsigned int(32) handler_type;
   const unsigned int(32)[3] reserved = 0;
   string   name;
}

Two fields deserve a little more explanation:

handler_type: the handler type of the trak, i.e. one of the values listed above (vide, soun, hint and so on).
name: a human-readable name. It is meant for people rather than machines, so anything that describes the track clearly will do.

The value of handler_type is simply the bytes of the string written out in hex. For example:

vide is 0x76, 0x69, 0x64, 0x65
soun is 0x73, 0x6F, 0x75, 0x6E
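
Putting the fields together, an hdlr payload could be assembled as below. This is a sketch under assumptions: hdlrPayload is a hypothetical helper, the name 'VideoHandler' is an arbitrary label, and the name is written as a null-terminated ASCII string.

```javascript
// Hypothetical helper: build the payload of an hdlr full box
// (version/flags included, 8-byte box header not included).
function hdlrPayload(handlerType, name) {
  const ascii = (s) => Array.from(s, ch => ch.charCodeAt(0));
  return new Uint8Array([
    0x00, 0x00, 0x00, 0x00, // version(0) + flags
    0x00, 0x00, 0x00, 0x00, // pre_defined
    ...ascii(handlerType),  // handler_type, e.g. 'vide' or 'soun'
    0x00, 0x00, 0x00, 0x00, // reserved: 3 * 4 bytes
    0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
    ...ascii(name), 0x00    // name, null-terminated
  ]);
}

const videoHdlr = hdlrPayload('vide', 'VideoHandler');
console.log(videoHdlr.length); // 37
```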

minf

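minf is another important container among the child boxes. It holds the basic descriptive information for the current track. There is not much to say about it; its format is:

aligned(8) class MediaInformationBox extends Box('minf') { }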

v/smhd

v/smhd describe the current trak: vmhd is for video and smhd is for audio. Depending on the player, decoding may work without them, but a file that lacks them may be rejected as malformed.

Let's look at the basic format of vmhd first:

aligned(8) class VideoMediaHeaderBox
   extends FullBox('vmhd', version = 0, 1) {
   template unsigned int(16) graphicsmode = 0; // copy, see below
   template unsigned int(16)[3] opcolor = {0, 0, 0};
}

It is all default values, so I won't say more.

The format of smhd is just as simple:

aligned(8) class SoundMediaHeaderBox
   extends FullBox('smhd', version = 0, 0) {
   template int(16) balance = 0;
   const unsigned int(16)  reserved = 0;
}

The balance field corresponds to the usual left/right channel setting.

balance: an 8.8 fixed-point value; 0 is center, 1.0 is full right and -1.0 is full left.

dinf

dinf states where the media data described by the trak is located. It is just a container with no content of its own:

aligned(8) class DataInformationBox extends Box('dinf') { }

dref

dref holds the data_entry items that say where the media data lives. Its basic format is:

aligned(8) class DataReferenceBox
   extends FullBox('dref', version = 0, 0) {
   unsigned int(32)  entry_count;
   for (i=1; i <= entry_count; i++) {
      DataEntryBox(entry_version, entry_flags) data_entry;
   }
}

DataEntryBox here is one of DataEntryUrlBox or DataEntryUrnBox. Put simply, the children of dref are url or urn boxes. entry_version and entry_flags need a little more explanation.

entry_version: the version of the entry's format.
entry_flags: not a fixed value, but one value is special: 0x000001 means the media data is in the same file as the moov box.

That said, I have never actually needed a dref that carries real data, so I won't go any deeper here.

url

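The url box is a direct child of dref. It holds location information for the media data that the samples refer to. It normally only appears embedded in dref rather than on its own. Its basic format is:

aligned(8) class DataEntryUrlBox (bit(24) flags) extends FullBox('url ', version = 0, flags) {
   string location;
}

I have never actually used the location field, so it can generally be left out.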

stts

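stts stores the sample deltas, that is, the time between two adjacent frames. Its basic format is:

aligned(8) class TimeToSampleBox
   extends FullBox('stts', version = 0, 0) {
   unsigned int(32)  entry_count;
      int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32)  sample_count;
      unsigned int(32)  sample_delta;
   }
}

The code alone does not reveal much, so let's use an actual capture. Take the following frames:

[Figure: a list of frames with their Decode delta values]

As you can see, every Decode delta above is 10. That is the sample_delta value. sample_count is the number of consecutive samples that share that sample_delta. For instance, the delta of 10 appears 14 times above, so sample_count is 14.

In terms of RTMP video messages, sample_delta is the timestamp delta carried in the header of the following RTMP message.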
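
The merging rule above is a run-length encoding of the per-frame deltas. Here is a minimal sketch; sttsEntries is a hypothetical helper, and the input is assumed to be the list of decode deltas in decode order.

```javascript
// Hypothetical helper: run-length encode decode deltas into stts entries.
function sttsEntries(deltas) {
  const entries = [];
  for (const delta of deltas) {
    const last = entries[entries.length - 1];
    if (last && last.sampleDelta === delta) {
      last.sampleCount += 1; // same delta as the previous sample: extend the run
    } else {
      entries.push({ sampleCount: 1, sampleDelta: delta });
    }
  }
  return entries;
}

// 14 frames that are all 10 units apart, as in the example above
const constantRate = sttsEntries(new Array(14).fill(10));
console.log(constantRate); // [ { sampleCount: 14, sampleDelta: 10 } ]

// a delta that changes mid-stream produces additional entries
const variableRate = sttsEntries([10, 10, 20, 10]);
console.log(variableRate.length); // 3
```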

stco

stco is a key box inside stbl. It records where the media data sits in mdat; strictly speaking, each entry is the offset of one chunk. Its basic format is:

aligned(8) class ChunkOffsetBox
   extends FullBox('stco', version = 0, 0) {
   unsigned int(32) entry_count;
   for (i=1; i <= entry_count; i++) {
      unsigned int(32)  chunk_offset;
   }
}

For a concrete example, see:

[Figure: stco example]

stco comes in two forms. If your video is very large, a chunk offset can exceed the 32-bit limit. For such large videos there is an additional box, co64. It does the same job as stco and likewise records positions in the mdat box, except that chunk_offset is 64 bits wide.

aligned(8) class ChunkLargeOffsetBox extends FullBox('co64', version = 0, 0) {
   unsigned int(32) entry_count;
   for (i=1; i <= entry_count; i++) {
      unsigned int(64)  chunk_offset;
   }
}
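
A muxer can choose between the two boxes automatically by checking whether any offset overflows 32 bits. The following is a sketch under assumptions: chunkOffsetPayload is a hypothetical helper, offsets are plain non-negative integers, and the returned payload includes version and flags but not the 8-byte box header.

```javascript
// Hypothetical helper: serialize chunk offsets as an stco or co64 payload.
function chunkOffsetPayload(offsets) {
  const needs64 = offsets.some(o => o > 0xffffffff);
  const entrySize = needs64 ? 8 : 4;
  const payload = new Uint8Array(8 + offsets.length * entrySize);
  const view = new DataView(payload.buffer);

  // bytes 0-3 stay zero: version(0) + flags(0)
  view.setUint32(4, offsets.length); // entry_count, big-endian
  offsets.forEach((offset, i) => {
    const pos = 8 + i * entrySize;
    if (needs64) {
      view.setBigUint64(pos, BigInt(offset));
    } else {
      view.setUint32(pos, offset);
    }
  });
  return { type: needs64 ? 'co64' : 'stco', payload };
}

const small = chunkOffsetPayload([48, 1000]);
const large = chunkOffsetPayload([2 ** 32 + 5]);
console.log(small.type, large.type); // stco co64
```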

stsc

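stsc is a slightly convoluted box; not because it has many fields, but because what the fields mean is a little odd. Its basic format is:

aligned(8) class SampleToChunkBox
   extends FullBox('stsc', version = 0, 0) {
   unsigned int(32) entry_count;
   for (i=1; i <= entry_count; i++) {
      unsigned int(32) first_chunk;
      unsigned int(32) samples_per_chunk;
      unsigned int(32) sample_description_index;
   }
}

The key lies in its three fields: first_chunk, samples_per_chunk and sample_description_index.

first_chunk: the index of the chunk at which each entry starts.
samples_per_chunk: how many samples each chunk contains.
sample_description_index: the index of the sample description used by the samples. It can usually be set to 1.

Together, these three fields determine how many chunks an MP4 has and how many samples each chunk holds. A quick word on chunks and samples: in an MP4 file, the basic unit located in mdat is the chunk, not the sample.

sample: a slice holding the smallest unit of data. It contains the actual NAL data.
chunk: a run of samples, grouped to make reads more I/O-efficient.

If the field list above was enough for you to understand, you are either an expert or bluffing. The official documentation says much the same thing, and I was just as lost after the second read as after the first. So here is some additional explanation.

As said above, the basic unit in MP4 is the chunk, and the chunk_offset values defined in stco give the position of each chunk in mdat. Each stco chunk_offset corresponds to the chunk with the same index. first_chunk, then, says at which chunk index an stsc entry starts to apply.

Does that mean stsc has to describe every single chunk?

No. An stsc entry describes a whole run of chunks: as long as samples_per_chunk and sample_description_index stay the same, all following chunks use the same pattern.

So if your stsc contains only:

first_chunk: 1
samples_per_chunk: 4
sample_description_index: 1

then, starting from the first chunk, every 4 samples form one chunk, and every sample uses sample description 1. This pattern continues to the end. Of course, if the number of samples is not divisible by 4, the last few samples have to be handled as a special case.

More commonly, stsc has several entries:

[Figure: an stsc table with several entries]

In the case above, chunk 1 contains 2 samples, chunks 2-4 contain 1 sample each, chunk 5 contains 2 samples, and chunk 6 through the last chunk contain 1 sample each.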
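
Expanding stsc entries into a per-chunk sample count makes the rule concrete. In this sketch, samplesPerChunk is a hypothetical helper; the entries mirror the multi-entry case just described, and the total of 8 chunks is an assumed value (in a real file it comes from the entry_count of stco).

```javascript
// Hypothetical helper: expand stsc entries into one sample count per chunk.
// `chunkCount` is the total number of chunks (the stco entry_count).
function samplesPerChunk(entries, chunkCount) {
  const counts = [];
  entries.forEach((entry, i) => {
    // an entry applies until the next entry's first_chunk (chunks are 1-based)
    const nextFirst = i + 1 < entries.length
      ? entries[i + 1].firstChunk
      : chunkCount + 1;
    for (let chunk = entry.firstChunk; chunk < nextFirst; chunk++) {
      counts.push(entry.samplesPerChunk);
    }
  });
  return counts;
}

const stscEntries = [
  { firstChunk: 1, samplesPerChunk: 2, sampleDescriptionIndex: 1 },
  { firstChunk: 2, samplesPerChunk: 1, sampleDescriptionIndex: 1 },
  { firstChunk: 5, samplesPerChunk: 2, sampleDescriptionIndex: 1 },
  { firstChunk: 6, samplesPerChunk: 1, sampleDescriptionIndex: 1 }
];
const perChunk = samplesPerChunk(stscEntries, 8);
console.log(perChunk); // [ 2, 1, 1, 1, 2, 1, 1, 1 ]
```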

ctts

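ctts exists mainly because of B-frames in video. In other words, if your video has no B-frames, the ctts structure is very simple. Its job is to record the composition time offset (cts) of every sample. The format is:

aligned(8) class CompositionOffsetBox extends FullBox('ctts', version = 0, 0) {
   unsigned int(32) entry_count;
      int i;
   for (i=0; i < entry_count; i++) {
      unsigned int(32)  sample_count;
      unsigned int(32)  sample_offset;
   }
}

Let's look at an example. Suppose the frames in your video are arranged like this:

[Figure: frame sequence with the Composition offset of each frame]

Here sample_offset is the Composition offset. Merging consecutive samples that share the same Composition offset gives the matching sample_count. The resulting ctts is:

[Figure: the resulting ctts entries]

An actual capture looks like this:

[Figure: ctts entries from a real file]

If you are handling RTMP video, which usually carries no B-frames, the whole ctts consists of a single sample_count and sample_offset. For example:

sample_count: 100
sample_offset: 0

Usually only video tracks need ctts.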
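
The composition offset of a sample is its presentation time minus its decode time, and consecutive equal offsets are merged into one entry. A minimal sketch follows; cttsEntries is a hypothetical helper and the timestamps are invented values for an I, P, B, B sequence in decode order.

```javascript
// Hypothetical helper: derive ctts entries from samples in decode order.
// Each sample carries its decode time (dts) and presentation time (pts).
function cttsEntries(samples) {
  const entries = [];
  for (const { dts, pts } of samples) {
    const offset = pts - dts; // composition offset of this sample
    const last = entries[entries.length - 1];
    if (last && last.sampleOffset === offset) {
      last.sampleCount += 1;
    } else {
      entries.push({ sampleCount: 1, sampleOffset: offset });
    }
  }
  return entries;
}

// invented timestamps: I, P, B, B in decode order
const withBFrames = cttsEntries([
  { dts: 0, pts: 10 },  // I: offset 10
  { dts: 10, pts: 40 }, // P: offset 30
  { dts: 20, pts: 20 }, // B: offset 0
  { dts: 30, pts: 30 }  // B: offset 0
]);
console.log(withBFrames);
// [ { sampleCount: 1, sampleOffset: 10 },
//   { sampleCount: 1, sampleOffset: 30 },
//   { sampleCount: 2, sampleOffset: 0 } ]
```

With no B-frames every offset is the same, so the result collapses to the single-entry case described above.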

stsz

stsz stores the size of every sample. Its basic format is:

aligned(8) class SampleSizeBox extends FullBox('stsz', version = 0, 0) {
   unsigned int(32) sample_size;
   unsigned int(32) sample_count;
   if (sample_size==0) {
      for (i=1; i <= sample_count; i++) {
         unsigned int(32)  entry_size;
      }
   }
}

Nothing special here: if sample_size is non-zero, all samples share that size; if it is 0, the samples differ in size and each one gets its own entry_size.
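
The sample_size shortcut can be sketched like this: when every sample has the same size, only the 12 fixed bytes are written; otherwise sample_size is 0 and one entry per sample follows. stszPayload is a hypothetical helper, and the payload it returns includes version and flags but not the box header.

```javascript
// Hypothetical helper: serialize sample sizes as an stsz payload.
function stszPayload(sizes) {
  const uniform = sizes.length > 0 && sizes.every(s => s === sizes[0]);
  const payload = new Uint8Array(12 + (uniform ? 0 : sizes.length * 4));
  const view = new DataView(payload.buffer);

  // bytes 0-3 stay zero: version(0) + flags(0)
  view.setUint32(4, uniform ? sizes[0] : 0); // sample_size (0 = per-sample table)
  view.setUint32(8, sizes.length);           // sample_count
  if (!uniform) {
    sizes.forEach((size, i) => view.setUint32(12 + i * 4, size)); // entry_size
  }
  return payload;
}

const sameSize = stszPayload([100, 100, 100]);
const mixedSize = stszPayload([10, 20]);
console.log(sameSize.length, mixedSize.length); // 12 20
```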

fragmented MP4

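Everything so far covered the standard boxes. fMP4 shares a great deal with the standard MP4 format, so I won't repeat it; only the boxes that differ are picked out below.

mvex

mvex is a standard fMP4 box. It tells the decoder that this is an fMP4 file: the sample information is no longer stored in trak but in each moof. Its basic format is:

aligned(8) class MovieExtendsBox extends Box('mvex'){ }

trex

trex is a direct child of mvex and sets the default values for the samples of an fMP4. Its contents are:

aligned(8) class TrackExtendsBox extends FullBox('trex', 0, 0){
   unsigned int(32) track_ID;
   unsigned int(32) default_sample_description_index;
   unsigned int(32) default_sample_duration;
   unsigned int(32) default_sample_size;
   unsigned int(32) default_sample_flags;
}

Which values to use depends on the requirements of your own application. If you really don't know, most of them can simply be set to 0: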

new Uint8Array([
        0x00, 0x00, 0x00, 0x00, // version(0) + flags
        (trackId >>> 24) & 0xFF, // track_ID
        (trackId >>> 16) & 0xFF,
        (trackId >>> 8) & 0xFF,
        (trackId) & 0xFF,
        0x00, 0x00, 0x00, 0x01, // default_sample_description_index
        0x00, 0x00, 0x00, 0x00, // default_sample_duration
        0x00, 0x00, 0x00, 0x00, // default_sample_size
        0x00, 0x01, 0x00, 0x01 // default_sample_flags
    ])

moof

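moof holds the fragment-specific content of an fMP4. It is a container with little content of its own; the per-track boxes described below (tfhd, tfdt, trun) sit inside its traf child:

aligned(8) class MovieFragmentBox extends Box('moof'){ }
aligned(8) class TrackFragmentBox extends Box('traf'){ }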

tfhd

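tfhd sets defaults for the given trak, for example the duration, size and offset of its samples. All of these can be left out, as long as the information is fully specified in other boxes:

aligned(8) class TrackFragmentHeaderBox extends FullBox('tfhd', 0, tf_flags){
   unsigned int(32) track_ID;
   // all the following are optional fields
   unsigned int(64) base_data_offset;
   unsigned int(32) sample_description_index;
   unsigned int(32) default_sample_duration;
   unsigned int(32) default_sample_size;
   unsigned int(32) default_sample_flags;
}

base_data_offset is the base used when computing the data offsets that follow. If it is present it is used; otherwise offsets are measured from a default starting point, typically the start of the enclosing moof.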

tfdt

tfdt stores the absolute decode time of the samples. Because fMP4 is a streaming-oriented format, you cannot seek directly to a position by sample as you can in MP4. A common time reference is needed to locate a particular fragment quickly.

Its basic format is:

aligned(8) class TrackFragmentBaseMediaDecodeTimeBox extends FullBox('tfdt', version, 0) {
   if (version==1) {
      unsigned int(64) baseMediaDecodeTime;
   } else { // version==0
      unsigned int(32) baseMediaDecodeTime;
   }
}

baseMediaDecodeTime is the sum of the durations of all earlier samples of the given track_ID, which is the same as the dts of the first sample in the current traf.
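
That running sum can be sketched as follows. baseDecodeTimes is a hypothetical helper; each inner array holds the sample durations of one fragment of a single track, and the 3000-unit durations are made-up values.

```javascript
// Hypothetical helper: compute baseMediaDecodeTime for each fragment of one
// track, given the sample durations of every fragment in order.
function baseDecodeTimes(fragments) {
  let elapsed = 0;
  return fragments.map(durations => {
    const base = elapsed; // dts of the first sample in this fragment
    elapsed += durations.reduce((sum, d) => sum + d, 0);
    return base;
  });
}

// three fragments with 3, 2 and 1 samples of 3000 units each
const bases = baseDecodeTimes([[3000, 3000, 3000], [3000, 3000], [3000]]);
console.log(bases); // [ 0, 9000, 15000 ]
```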

trun

trun stores the sample information of this moof, for example the size, duration and offset of each sample. Its contents are:

aligned(8) class TrackRunBox
   extends FullBox('trun', version, tr_flags) {
   unsigned int(32) sample_count;
   // the following are optional fields
   signed int(32) data_offset;
   unsigned int(32) first_sample_flags;
   // all fields in the following array are optional
   {
      unsigned int(32)  sample_duration;
      unsigned int(32)  sample_size;
      unsigned int(32)  sample_flags
      if (version == 0)
         { unsigned int(32) sample_composition_time_offset; }
      else
         { signed int(32) sample_composition_time_offset; }
   }[ sample_count ]
}

The trun fields are arguably the most important ones in a traf.

tr_flags indicates which of the optional fields are present:

0x000001: data-offset-present; the data_offset field is present.
0x000004: flags are given for the first sample only (first_sample_flags); the flags of the remaining samples are left to the defaults.
0x000100: this one is important; every sample has its own duration, otherwise the default is used.
0x000200: every sample has its own sample_size, otherwise the default is used.
0x000400: every sample has its own flags, otherwise the default is used.
0x000800: every sample has its own cts value.

A brief look at the remaining fields:

data_offset: the number of bytes from the start of this moof to the actual data in the mdat that goes with it. That amounts to moof.byteLength + mdat.headerSize.
sample_count: the total number of samples.
first_sample_flags: applies to the first sample only. It can usually be set to 0.
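
Since tr_flags decides which fields are written, it also decides how large the trun payload is. The sketch below combines the flag bits listed above and computes that size; TR_FLAGS and trunPayloadSize are hypothetical names, and the size excludes the 8-byte box header.

```javascript
// Flag bits from the list above (the property names are illustrative).
const TR_FLAGS = {
  dataOffset: 0x000001,
  firstSampleFlags: 0x000004,
  sampleDuration: 0x000100,
  sampleSize: 0x000200,
  sampleFlags: 0x000400,
  sampleCtsOffset: 0x000800
};

// Hypothetical helper: size in bytes of a trun payload for the given flags.
function trunPayloadSize(flags, sampleCount) {
  let size = 4 + 4; // version/flags + sample_count
  if (flags & TR_FLAGS.dataOffset) size += 4;
  if (flags & TR_FLAGS.firstSampleFlags) size += 4;

  // each per-sample field that is present costs 4 bytes per sample
  let perSample = 0;
  const perSampleBits = [
    TR_FLAGS.sampleDuration, TR_FLAGS.sampleSize,
    TR_FLAGS.sampleFlags, TR_FLAGS.sampleCtsOffset
  ];
  for (const bit of perSampleBits) {
    if (flags & bit) perSample += 4;
  }
  return size + perSample * sampleCount;
}

// data offset plus all four per-sample fields
const trFlags = TR_FLAGS.dataOffset | TR_FLAGS.sampleDuration |
  TR_FLAGS.sampleSize | TR_FLAGS.sampleFlags | TR_FLAGS.sampleCtsOffset;
console.log(trFlags.toString(16), trunPayloadSize(trFlags, 3)); // f01 60
```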

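I won't go through the remaining fields one by one. One thing to note, though: sample_flags is very important, since it is commonly used to mark which sample is a key frame. It is computed like this: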

(flags.isLeading << 2) | flags.dependsOn, // sample_flags
(flags.isDepended << 6) | (flags.hasRedundancy << 4) | flags.isNonSync
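
The two expressions above produce the first two bytes of the 4-byte sample_flags field; in the ISO base media file format the last two bytes hold sample_degradation_priority. A sketch of the full field follows. sampleFlags is a hypothetical helper, and the key-frame and non-key-frame settings are typical choices rather than the only valid ones.

```javascript
// Hypothetical helper: build the 4 bytes of a sample_flags field.
function sampleFlags({
  isLeading = 0, dependsOn = 0, isDependedOn = 0,
  hasRedundancy = 0, isNonSync = 0, degradationPriority = 0
} = {}) {
  return [
    (isLeading << 2) | dependsOn,                           // 4 reserved bits + 2 + 2
    (isDependedOn << 6) | (hasRedundancy << 4) | isNonSync, // 2 + 2 + 3 padding bits + 1
    (degradationPriority >> 8) & 0xff,                      // 16-bit degradation priority
    degradationPriority & 0xff
  ];
}

// key frame: depends on no other sample (2) and is a sync sample
const keyFrameFlags = sampleFlags({ dependsOn: 2, isNonSync: 0 });
// non-key frame: depends on other samples (1) and is not a sync sample
const deltaFrameFlags = sampleFlags({ dependsOn: 1, isNonSync: 1 });
console.log(keyFrameFlags, deltaFrameFlags); // [ 2, 0, 0, 0 ] [ 1, 1, 0, 0 ]
```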

sdtp

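sdtp describes, for each sample, whether it is an I-frame, whether it is a leading frame, and similar properties. It mainly serves as synchronization reference information during on-demand playback. It has four fields:

is_leading: whether the sample is a leading sample.
- 0: the leading nature of this sample is unknown (the common case)
- 1: this sample is a leading sample and cannot be decoded
- 2: this sample is not a leading sample
- 3: this sample is a leading sample and can be decoded
sample_depends_on: whether the sample is an I-frame.
- 0: it is unknown whether this sample depends on other frames
- 1: this sample is a B/P-frame
- 2: this sample is an I-frame
- 3: reserved
sample_is_depended_on: whether other frames depend on this one.
- 0: unknown whether it is depended on; used for B/P-frames
- 1: it is depended on; used for I-frames
- 2: no other sample depends on it
- 3: reserved
sample_has_redundancy: whether redundant coding is present.
- 0: unknown whether there is redundancy
- 1: there is redundant coding
- 2: there is no redundant coding
- 3: reserved

The overall format is:

aligned(8) class SampleDependencyTypeBox extends FullBox('sdtp', version = 0, 0) {
  for (i=0; i < sample_count; i++){
    unsigned int(2) is_leading;
    unsigned int(2) sample_depends_on;
    unsigned int(2) sample_is_depended_on;
    unsigned int(2) sample_has_redundancy;
  }
}

sdtp matters a lot for video, since its fields were designed mainly for video frames. For audio, the defaults are normally used directly: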

isLeading: 0,
dependsOn: 1, 
isDepended: 0,
hasRedundancy: 0
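
Each sample occupies exactly one byte in sdtp, with the four 2-bit fields packed from the most significant bits down. A minimal sketch, with sdtpByte as a hypothetical helper:

```javascript
// Hypothetical helper: pack the four 2-bit sdtp fields of one sample into a byte.
function sdtpByte({ isLeading = 0, dependsOn = 0, isDepended = 0, hasRedundancy = 0 } = {}) {
  return (isLeading << 6) | (dependsOn << 4) | (isDepended << 2) | hasRedundancy;
}

// the audio defaults listed above
const audioByte = sdtpByte({ isLeading: 0, dependsOn: 1, isDepended: 0, hasRedundancy: 0 });
// a video key frame: depends on nothing (2), other frames depend on it (1)
const keyFrameByte = sdtpByte({ dependsOn: 2, isDepended: 1 });
console.log(audioByte.toString(16), keyFrameByte.toString(16)); // 10 24
```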

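That completes the tour of MP4 and fMP4. For more detail, see the MP4 & FMP4 doc.

Of course, this only scratches the surface. Knowing what the boxes contain is not enough for real audio/video processing. Much more depends on audio/video fundamentals such as dts/pts, audio/video synchronization, and how video is packaged into boxes.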
