Diving Into The Ethereum VM (Part 1)

Date: 2019-11-16

Original: Diving Into The Ethereum VM
Author: Howard
Translator: 187J3X1

Solidity offers many high-level language abstractions, but these features make it hard to understand what’s really going on when my program is running. Reading the Solidity documentation still left me confused over very basic things.

What are the differences between string, bytes32, byte[], bytes?

Which one do I use, when?
What’s happening when I cast a string to bytes? Can I cast to byte[]?
How much do they cost?

How are mappings stored by the EVM?

Why can’t I delete a mapping?
Can I have mappings of mappings? (Yes, but how does that work?)
Why is there storage mapping, but no memory mapping?

How does a compiled contract look to the EVM?

How is a contract created?
What is a constructor, really?
What is the fallback function?

I think it’s a good investment to learn how a high-level language like Solidity runs on the Ethereum VM (EVM), for a couple of reasons:

Solidity is not the last word. Better EVM languages are coming. (Pretty please?)
The EVM is a database engine. To understand how smart contracts work in any EVM language, you have to understand how data is organized, stored, and manipulated.
Know-how to be a contributor. The Ethereum toolchain is still very early. Knowing the EVM well would help you make awesome tools for yourself and others.
Intellectual challenge. The EVM gives you a good excuse to play at the intersection of cryptography, data structures, and programming language design.

Translator's note: Solidity is a high-level language, and a dedicated compiler turns it into the binary encoding that the EVM understands. Any language that can generate this bytecode will do, so nothing ties the EVM to Solidity. This is somewhat similar to how Java relies on the JVM for cross-platform portability.

In a series of articles, I’d like to deconstruct simple Solidity contracts in order to understand how they work as EVM bytecode.

An outline of what I hope to learn and write about:

The basics of EVM bytecode.
How different types (mappings, arrays) are represented.
What is going on when a new contract is created.
What is going on when a method is called.
How the ABI bridges different EVM languages.

My final goal is to be able to understand a compiled Solidity contract in its entirety. Let’s start by reading some basic EVM bytecode!

This table of the EVM Instruction Set would be a helpful reference.

Translator's note: the instruction set corresponds to core/vm/opcodes.go in the source code.

A Simple Contract

Our first contract has a constructor and a state variable:

// c1.sol
pragma solidity ^0.4.11;

contract C {
    uint256 a;

    function C() {
      a = 1;
    }
}

Compile this contract with solc:

$ solc --bin --asm c1.sol

======= c1.sol:C =======
EVM assembly:
    /* "c1.sol":26:94  contract C {... */
  mstore(0x40, 0x60)
    /* "c1.sol":59:92  function C() {... */
  jumpi(tag_1, iszero(callvalue))
  0x0
  dup1
  revert
tag_1:
tag_2:
    /* "c1.sol":84:85  1 */
  0x1
    /* "c1.sol":80:81  a */
  0x0
    /* "c1.sol":80:85  a = 1 */
  dup2
  swap1
  sstore
  pop
    /* "c1.sol":59:92  function C() {... */
tag_3:
    /* "c1.sol":26:94  contract C {... */
tag_4:
  dataSize(sub_0)
  dup1
  dataOffset(sub_0)
  0x0
  codecopy
  0x0
  return
stop

sub_0: assembly {
        /* "c1.sol":26:94  contract C {... */
      mstore(0x40, 0x60)
    tag_1:
      0x0
      dup1
      revert

auxdata: 0xa165627a7a72305820af3193f6fd31031a0e0d2de1ad2c27352b1ce081b4f3c92b5650ca4dd542bb770029
}

Binary:
60606040523415600e57600080fd5b5b60016000819055505b5b60368060266000396000f30060606040525b600080fd00a165627a7a72305820af3193f6fd31031a0e0d2de1ad2c27352b1ce081b4f3c92b5650ca4dd542bb770029

The number 6060604052... is the bytecode that the EVM actually runs.

In Baby Steps

Half of the compiled assembly is boilerplate that’s similar across most Solidity programs. We’ll look at that later. For now, let’s examine the unique part of our contract, the humble storage variable assignment:

a = 1

This assignment is represented by the bytecode 6001600081905550. Let’s break it up into one instruction per line:

60 01
60 00
81
90
55
50

The EVM is basically a loop that executes each instruction from top to bottom. Let’s annotate the assembly code (indented under the label tag_2) with the corresponding bytecode to better see how they are associated:

tag_2:
  // 60 01
  0x1
  // 60 00
  0x0
  // 81
  dup2
  // 90
  swap1
  // 55
  sstore
  // 50
  pop

Note that 0x1 in the assembly code is actually shorthand for push(0x1). This instruction pushes the number 1 onto the stack.

It’s still hard to grok what’s going on just by staring at it. Don’t worry though, it’s simple to simulate the EVM line by line.
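Splitting bytecode into one instruction per line can be sketched in a few lines of Python. This is a minimal illustration, not a full disassembler: the function name `disassemble` and the small `OPCODES` table are assumptions made for this example, and only the opcodes that appear in this article are covered.

```python
# Minimal opcode table: only the opcodes that appear in this article.
OPCODES = {
    0x01: "add", 0x02: "mul", 0x03: "sub", 0x0a: "exp",
    0x16: "and", 0x17: "or", 0x19: "not",
    0x50: "pop", 0x54: "sload", 0x55: "sstore",
}

def disassemble(bytecode_hex):
    """Split a hex bytecode string into one mnemonic per instruction."""
    code = bytes.fromhex(bytecode_hex)
    out = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7f:
            # push1..push32 are followed by 1..32 immediate bytes
            n = op - 0x5f
            arg = code[pc + 1 : pc + 1 + n]
            out.append(f"push{n} 0x{arg.hex()}")
            pc += 1 + n
        elif 0x80 <= op <= 0x8f:
            out.append(f"dup{op - 0x7f}")      # dup1..dup16
            pc += 1
        elif 0x90 <= op <= 0x9f:
            out.append(f"swap{op - 0x8f}")     # swap1..swap16
            pc += 1
        else:
            out.append(OPCODES.get(op, f"unknown(0x{op:02x})"))
            pc += 1
    return out

for line in disassemble("6001600081905550"):
    print(line)
```

The only subtlety is that push instructions carry their argument inline, so the program counter has to skip over those bytes instead of decoding them as opcodes.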

Simulating The EVM

The EVM is a stack machine. Instructions might use values on the stack as arguments, and push values onto the stack as results. Let’s consider the operation add.

Assume that there are two values on the stack:

[1 2]

When the EVM sees add, it pops the top 2 items, adds them together, and pushes the answer back onto the stack, resulting in:

[3]

In what follows, we’ll notate the stack with []:

// The empty stack
stack: []
// Stack with three items. The top item is 3. The bottom item is 1.
stack: [3 2 1]

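And notate the contract storage with {}:

// Nothing in storage.
store: {}
// The value 0x1 is stored at the position 0x0.
store: { 0x0 => 0x1 }

Translator's note: in the Ethereum source code, the Stack data structure represents the stack. The Memory structure there models the EVM's volatile memory, which is separate from the persistent contract storage notated here with {}.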

Let’s now look at some real bytecode. We’ll simulate the bytecode sequence 6001600081905550 as the EVM would, and print out the machine state after each instruction:

// 60 01: pushes 1 onto stack
0x1
  stack: [0x1]

// 60 00: pushes 0 onto stack
0x0
  stack: [0x0 0x1]

// 81: duplicate the second item on the stack
dup2
  stack: [0x1 0x0 0x1]

// 90: swap the top two items
swap1
  stack: [0x0 0x1 0x1]

// 55: store the value 0x1 at position 0x0
// This instruction consumes the top 2 items
sstore
  stack: [0x1]
  store: { 0x0 => 0x1 }

// 50: pop (throw away the top item)
pop
  stack: []
  store: { 0x0 => 0x1 }

The end. The stack is empty, and there’s one item in storage.
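The hand simulation above can also be run mechanically. The following is a rough sketch of a stack machine that understands only the handful of opcodes used so far (push, dup, swap, sstore, pop); the function name `run` and the choice to model the stack as a Python list with the top at the end are assumptions of this example, not how any real client is written.

```python
def run(bytecode_hex, storage=None):
    """Execute a tiny subset of EVM bytecode and return (stack, storage).

    The stack is a list whose LAST element is the top of the stack.
    """
    code = bytes.fromhex(bytecode_hex)
    stack = []
    storage = dict(storage or {})
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7f:                 # push1..push32
            n = op - 0x5f
            stack.append(int.from_bytes(code[pc:pc + n], "big"))
            pc += n
        elif 0x80 <= op <= 0x8f:               # dupN: copy the N-th item from the top
            stack.append(stack[-(op - 0x7f)])
        elif 0x90 <= op <= 0x9f:               # swapN: swap top with the (N+1)-th item
            n = op - 0x8f
            stack[-1], stack[-1 - n] = stack[-1 - n], stack[-1]
        elif op == 0x55:                       # sstore: key is on top, value below it
            key = stack.pop()
            value = stack.pop()
            storage[key] = value
        elif op == 0x50:                       # pop
            stack.pop()
        else:
            raise ValueError(f"unsupported opcode 0x{op:02x}")
    return stack, storage

stack, store = run("6001600081905550")
print(stack, store)
```

Running the same function on 6001600055 (push 1, push 0, sstore) is a quick way to compare a shorter instruction sequence against the compiler's output.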

What’s worth noting is that Solidity has decided to store the state variable uint256 a at position 0x0. It’s perfectly possible for other languages to choose to store the state variable elsewhere.

In pseudocode, what the EVM does for 6001600081905550 is essentially:

// a = 1
sstore(0x0, 0x1)

Looking carefully, you’d see that dup2, swap1, and pop are superfluous. The assembly code could be simpler:

0x1
0x0
sstore

You could try to simulate the above 3 instructions, and satisfy yourself that they indeed result in the same machine state:

stack: []
store: { 0x0 => 0x1 }

Two Storage Variables

Let’s add one extra storage variable of the same type:

// c2.sol
pragma solidity ^0.4.11;

contract C {
    uint256 a;
    uint256 b;

    function C() {
      a = 1;
      b = 2;
    }
}

Compile, focusing on tag_2:

$ solc --bin --asm c2.sol

// ... more stuff omitted
tag_2:
    /* "c2.sol":99:100  1 */
  0x1
    /* "c2.sol":95:96  a */
  0x0
    /* "c2.sol":95:100  a = 1 */
  dup2
  swap1
  sstore
  pop
    /* "c2.sol":112:113  2 */
  0x2
    /* "c2.sol":108:109  b */
  0x1
    /* "c2.sol":108:113  b = 2 */
  dup2
  swap1
  sstore
  pop

The assembly in pseudocode:

// a = 1
sstore(0x0, 0x1)
// b = 2
sstore(0x1, 0x2)

What we learn here is that the two storage variables are positioned one after the other, with a in position 0x0 and b in position 0x1.

Storage Packing

Each storage slot can store 32 bytes. It’d be wasteful to use all 32 bytes if a variable only needs 16 bytes. Solidity optimizes for storage efficiency by packing two smaller data types into one storage slot if possible.

Let’s change a and b so they are only 16 bytes each:

pragma solidity ^0.4.11;

contract C {
    uint128 a;
    uint128 b;

    function C() {
      a = 1;
      b = 2;
    }
}

Compile the contract:

$ solc --bin --asm c3.sol

The generated assembly is now more complex:

tag_2:
  // a = 1
  0x1
stack: [0x1]
  0x0
stack: [0x0,0x1]
  dup1
stack: [0x0,0x0,0x1]
  0x100
stack: [0x100,0x0,0x0,0x1]
  exp
stack: [0x1,0x0,0x1]
  dup2
stack: [0x0,0x1,0x0,0x1]
  sload
stack: [0x0,0x1,0x0,0x1]
  dup2
stack: [0x1,0x0,0x1,0x0,0x1]
  0xffffffffffffffffffffffffffffffff
stack: [0xff..ff,0x1,0x0,0x1,0x0,0x1]
  mul
stack: [0xff..ff,0x0,0x1,0x0,0x1]
  not
stack: [0xff..ff00..00,0x0,0x1,0x0,0x1]
  and
stack: [0x0,0x1,0x0,0x1]
  swap1
stack: [0x1,0x0,0x0,0x1]
  dup4
stack: [0x1,0x1,0x0,0x0,0x1]
  0xffffffffffffffffffffffffffffffff
stack: [0xff..ff,0x1,0x1,0x0,0x0,0x1]
  and
stack: [0x1,0x1,0x0,0x0,0x1]
  mul
stack: [0x1,0x0,0x0,0x1]
  or
stack: [0x1,0x0,0x1]
  swap1
stack: [0x0,0x1,0x1]
  sstore
stack: [0x1]
store: { 0x0 => 0x1 }
  pop
stack: []
  // b = 2
  0x2
stack: [0x2]
  0x0
stack: [0x0,0x2]
  0x10
stack: [0x10,0x0,0x2]
  0x100
stack: [0x100,0x10,0x0,0x2]
  exp
stack: [0x100..00,0x0,0x2]
  dup2
stack: [0x0,0x100..00,0x0,0x2]
  sload
stack: [0x1,0x100..00,0x0,0x2]
  dup2
stack: [0x100..00,0x1,0x100..00,0x0,0x2]
  0xffffffffffffffffffffffffffffffff
stack: [0xff..ff,0x100..00,0x1,0x100..00,0x0,0x2]
  mul
stack: [0xff..ff00..00,0x1,0x100..00,0x0,0x2]
  not
stack: [0x00..00ff..ff,0x1,0x100..00,0x0,0x2]
  and
stack: [0x1,0x100..00,0x0,0x2]
  swap1
stack: [0x100..00,0x1,0x0,0x2]
  dup4
stack: [0x2,0x100..00,0x1,0x0,0x2]
  0xffffffffffffffffffffffffffffffff
stack: [0xff..ff,0x2,0x100..00,0x1,0x0,0x2]
  and
stack: [0x2,0x100..00,0x1,0x0,0x2]
  mul
stack: [0x200..00,0x1,0x0,0x2]
  or
stack: [0x200..01,0x0,0x2]
  swap1
stack: [0x0,0x200..01,0x2]
  sstore
stack: [0x2]
store: { 0x0 => 0x200..01 }
  pop
stack: []

The above assembly code packs these two variables together in one storage position (0x0), like this:

[         b         ][         a         ]
[16 bytes / 128 bits][16 bytes / 128 bits]

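Translator's note: I have also shown the state of the stack after every step. The result is as expected, although I do not understand why the code seems to take such a roundabout route.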
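The roundabout look of the unoptimized assembly comes from it being a generic read-modify-write of one field inside a 32-byte slot: load the slot, clear the 16 bytes that belong to the variable, then OR in the new value shifted to its offset. A hedged Python sketch of that pattern follows; the helper name `write_uint128` is made up for this example, and the comments map each step to the opcodes seen in the listing.

```python
WORD = (1 << 256) - 1        # the EVM works on 256-bit words
MASK128 = (1 << 128) - 1     # 0xffffffffffffffffffffffffffffffff

def write_uint128(slot_value, byte_offset, value):
    """Return the slot contents after writing a uint128 at byte_offset."""
    factor = 0x100 ** byte_offset              # exp: 0x100 ** offset shifts by whole bytes
    keep = ~(MASK128 * factor) & WORD          # mul, not: mask that preserves the other field
    cleared = slot_value & keep                # and: wipe the old value of this field
    shifted = (value & MASK128) * factor       # and, mul: truncate and move into position
    return cleared | shifted                   # or: combine, ready for sstore

slot = 0
slot = write_uint128(slot, 0, 1)     # a = 1 goes into the lower 16 bytes
slot = write_uint128(slot, 16, 2)    # b = 2 goes into the upper 16 bytes
print(hex(slot))
```

Because each assignment is compiled independently, each one repeats the full load, mask, and store sequence, which is what the optimizer later collapses.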

The reason to pack is that storage operations are by far the most expensive:

sstore costs 20000 gas for the first write to a new position.
sstore costs 5000 gas for subsequent writes to an existing position.
sload costs 500 gas.
Most instructions cost 3 to 10 gas.

By using the same storage position, Solidity pays 5000 for the second store variable instead of 20000, saving us 15000 in gas.

Translator's note: storage is expensive! The cost of each instruction is obtained from the gasCost function of the instruction table in core/vm/jump_table.go.
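To make the saving concrete, here is a small back-of-the-envelope calculator. It only models the two sstore prices quoted above and ignores every other instruction, so treat it as an illustration of the arithmetic rather than a gas estimator; `sstore_gas` is a name invented for this sketch.

```python
SSTORE_NEW = 20000        # first write to a new position
SSTORE_EXISTING = 5000    # subsequent write to an existing position

def sstore_gas(slots_written):
    """Sum the sstore cost for a sequence of writes to the given slot numbers."""
    seen = set()
    total = 0
    for slot in slots_written:
        total += SSTORE_EXISTING if slot in seen else SSTORE_NEW
        seen.add(slot)
    return total

unpacked = sstore_gas([0, 1])   # a in slot 0, b in slot 1
packed = sstore_gas([0, 0])     # a and b share slot 0
print(unpacked, packed, unpacked - packed)
```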

More Optimization

Instead of storing a and b with two separate sstore instructions, it should be possible to pack the two 128-bit numbers together in memory, then store them using just one sstore, saving an additional 5000 gas.

You can ask Solidity to make this optimization by turning on the optimize flag:

$ solc --bin --asm --optimize c3.sol

This produces assembly code that uses just one sload and one sstore:

tag_2:
    /* "c3.sol":95:96  a */
  0x0
    /* "c3.sol":95:100  a = 1 */
  dup1
  sload
    /* "c3.sol":108:113  b = 2 */
  0x200000000000000000000000000000000
  not(sub(exp(0x2, 0x80), 0x1))
    /* "c3.sol":95:100  a = 1 */
  swap1
  swap2
  and
    /* "c3.sol":99:100  1 */
  0x1
    /* "c3.sol":95:100  a = 1 */
  or
  sub(exp(0x2, 0x80), 0x1)
    /* "c3.sol":108:113  b = 2 */
  and
  or
  swap1
  sstore

The bytecode is:

600080547002000000000000000000000000000000006001608060020a03199091166001176001608060020a0316179055

And formatting the bytecode to one instruction per line (translator's note: as before, the state of the stack is shown after every step):

// push 0x0
60 00
stack: [0x0]
// dup1
80
stack: [0x0,0x0]
// sload
54
stack: [0x0,0x0]
// push17: push the next 17 bytes as a 32-byte number
70 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
stack: [0x200..00,0x0,0x0]

/* not(sub(exp(0x2, 0x80), 0x1)) */
// push 0x1
60 01
stack: [0x1,0x200000000000000000000000000000000,0x0,0x0]
// push 0x80 (128)
60 80
stack: [0x80,0x1,0x200000000000000000000000000000000,0x0,0x0]
// push 0x02 (2)
60 02
stack: [0x02,0x80,0x1,0x200000000000000000000000000000000,0x0,0x0]
// exp
0a
stack: [0x100000000000000000000000000000000,0x1,0x200000000000000000000000000000000,0x0,0x0]
// sub
03
stack: [0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,0x200000000000000000000000000000000,0x0,0x0]
// not
19
stack: [0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00000000000000000000000000000000,0x200000000000000000000000000000000,0x0,0x0]
// swap1
90
stack: [0x200000000000000000000000000000000,0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00000000000000000000000000000000,0x0,0x0]
// swap2
91
stack: [0x0,0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF00000000000000000000000000000000,0x200000000000000000000000000000000,0x0]
// and
16
stack: [0x0,0x200000000000000000000000000000000,0x0]
// push 0x1
60 01
stack: [0x1, 0x0,0x200000000000000000000000000000000,0x0]
// or
17
stack: [0x1,0x200000000000000000000000000000000,0x0]

/* sub(exp(0x2, 0x80), 0x1) */
// push 0x1
60 01
stack: [0x1,0x1,0x200000000000000000000000000000000,0x0]
// push 0x80
60 80
stack: [0x80,0x1,0x1,0x200000000000000000000000000000000,0x0]
// push 0x02
60 02
stack: [0x2,0x80,0x1,0x1,0x200000000000000000000000000000000,0x0]
// exp
0a
stack: [0x100000000000000000000000000000000,0x1,0x1,0x200000000000000000000000000000000,0x0]
// sub
03
stack: [0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,0x1,0x200000000000000000000000000000000,0x0]
// and
16
stack: [0x1,0x200000000000000000000000000000000,0x0]
// or
17
stack: [0x200000000000000000000000000000001,0x0]
// swap1
90
stack: [0x0,0x200000000000000000000000000000001]
// sstore
55
stack: []
store: { 0x0 => 0x200..01 }

There are four magic values used in the assembly code:

0x1 (16 bytes), using the lower 16 bytes

// Represented as 0x01 in bytecode
16:32 0x00000000000000000000000000000000
00:16 0x00000000000000000000000000000001

0x2 (16 bytes), using the higher 16 bytes

// Represented as 0x200000000000000000000000000000000 in bytecode
16:32 0x00000000000000000000000000000002
00:16 0x00000000000000000000000000000000

not(sub(exp(0x2, 0x80), 0x1))

// Bitmask for the upper 16 bytes
16:32 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
00:16 0x00000000000000000000000000000000

sub(exp(0x2, 0x80), 0x1)

// Bitmask for the lower 16 bytes
16:32 0x00000000000000000000000000000000 
00:16 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

The code does some bit-shuffling with these values to arrive at the desired result:

16:32 0x00000000000000000000000000000002 
00:16 0x00000000000000000000000000000001

Finally, this 32-byte value is stored at position 0x0.
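The bit-shuffling can be written out as ordinary integer arithmetic. The sketch below follows the order of the and/or instructions in the optimized listing; the constant names `LOW`, `HIGH`, and `B_SHIFTED` and the function `optimized_store` are labels chosen for this example only.

```python
WORD = (1 << 256) - 1
LOW = 2 ** 0x80 - 1          # sub(exp(0x2, 0x80), 0x1): mask for the lower 16 bytes
HIGH = ~LOW & WORD           # not(...): mask for the upper 16 bytes
B_SHIFTED = 0x2 << 128       # 0x2 placed in the upper 16 bytes (the push17 constant)

def optimized_store(old_slot):
    """Value written by the single sstore, given the value read by sload."""
    v = old_slot & HIGH      # and: drop the old a, keep the upper half
    v = v | 0x1              # or:  a = 1
    v = v & LOW              # and: drop the old b
    v = v | B_SHIFTED        # or:  b = 2
    return v

print(hex(optimized_store(0)))
```

Since both halves of the slot are overwritten, the result does not depend on the old slot value at all, which suggests the sload could in principle be dropped too.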

Gas Usage

600080547002000000000000000000000000000000006001608060020a03199091166001176001608060020a0316179055

Notice that 0x200000000000000000000000000000000 is embedded in the bytecode. But the compiler could’ve also chosen to calculate the value with the instructions exp(0x2, 0x81), which results in a shorter bytecode sequence.

But it turns out that 0x200000000000000000000000000000000 is cheaper than exp(0x2, 0x81). Let’s look at the gas fees involved:

4 gas paid for every zero byte of data or code for a transaction.
68 gas for every non-zero byte of data or code for a transaction.

Let’s compare how much either representation costs in gas.

The bytecode 0x200000000000000000000000000000000. It has one non-zero byte and 16 zero bytes, and zeroes are cheap.

(1 * 68) + (16 * 4) = 132.

(Counting the push17 opcode byte 0x70 as well adds another 68, for 200.)

The bytecode 608160020a. Shorter, but no zeroes.

5 * 68 = 340.

The longer sequence with more zeroes is actually cheaper!
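The comparison is easy to reproduce for any byte sequence. This is a minimal sketch using only the two per-byte prices quoted above; `data_gas` is a hypothetical helper name, and the function deliberately ignores execution costs such as the price of running exp itself.

```python
ZERO_BYTE_GAS = 4
NONZERO_BYTE_GAS = 68

def data_gas(hex_bytes):
    """Gas charged for carrying these bytes in a transaction."""
    data = bytes.fromhex(hex_bytes)
    zeros = data.count(0)
    return zeros * ZERO_BYTE_GAS + (len(data) - zeros) * NONZERO_BYTE_GAS

embedded = data_gas("70" + "02" + "00" * 16)   # push17 plus the 17-byte constant
computed = data_gas("608160020a")              # push 0x81, push 0x02, exp
print(embedded, computed, embedded < computed)
```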

Summary

An EVM compiler doesn’t exactly optimize for bytecode size or speed or memory efficiency. Instead, it optimizes for gas usage, which is a layer of indirection that incentivizes the sort of calculation that the Ethereum blockchain can do efficiently.

We’ve seen some quirky aspects of the EVM:

The EVM is a 256-bit machine. It is most natural to manipulate data in chunks of 32 bytes.
Persistent storage is quite expensive.
The Solidity compiler makes interesting choices in order to minimize gas usage.

Gas costs are set somewhat arbitrarily, and could well change in the future. As costs change, compilers would make different choices.