Original: Diving Into The Ethereum VM
Author: Howard
Translator: 187J3X1
Solidity offers many high-level language abstractions, but these features make it hard to understand what’s really going on when my program is running. Reading the Solidity documentation still left me confused over very basic things.
What are the differences between string, bytes32, byte[], and bytes?
How are mappings stored by the EVM?
How does a compiled contract look to the EVM?
I think it’s a good investment to learn how a high-level language like Solidity runs on the Ethereum VM (EVM), for a couple of reasons:
Solidity is not the last word. Better EVM languages are coming. (Pretty please?)
The EVM is a database engine. To understand how smart contracts work in any EVM language, you have to understand how data is organized, stored, and manipulated.
Know-how to be a contributor. The Ethereum toolchain is still very early. Knowing the EVM well would help you make awesome tools for yourself and others.
Intellectual challenge. The EVM gives you a good excuse to play at the intersection of cryptography, data structures, and programming language design.
Translator’s note: Solidity is a high-level language, and a purpose-built compiler turns it into the binary encoding that the EVM understands. Any language that can produce this bytecode will do, so you are not limited to Solidity. This is somewhat like how Java relies on the JVM to run across platforms.
In a series of articles, I’d like to deconstruct simple Solidity contracts in order to understand how they work as EVM bytecode.
An outline of what I hope to learn and write about:
My final goal is to be able to understand a compiled Solidity contract in its entirety. Let’s start by reading some basic EVM bytecode!
This table of the EVM instruction set would be a helpful reference.
Translator’s note: the instruction set corresponds to core/vm/opcodes.go in the source code.
Our first contract has a constructor and a state variable:
    // c1.sol
    pragma solidity ^0.4.11;

    contract C {
        uint256 a;

        function C() {
            a = 1;
        }
    }
Compile this contract with solc:
    $ solc --bin --asm c1.sol

    ======= c1.sol:C =======
    EVM assembly:
        /* "c1.sol":26:94  contract C {... */
      mstore(0x40, 0x60)
        /* "c1.sol":59:92  function C() {... */
      jumpi(tag_1, iszero(callvalue))
      0x0
      dup1
      revert
    tag_1:
    tag_2:
        /* "c1.sol":84:85  1 */
      0x1
        /* "c1.sol":80:81  a */
      0x0
        /* "c1.sol":80:85  a = 1 */
      dup2
      swap1
      sstore
      pop
        /* "c1.sol":59:92  function C() {... */
    tag_3:
        /* "c1.sol":26:94  contract C {... */
    tag_4:
      dataSize(sub_0)
      dup1
      dataOffset(sub_0)
      0x0
      codecopy
      0x0
      return
    stop

    sub_0: assembly {
            /* "c1.sol":26:94  contract C {... */
          mstore(0x40, 0x60)
        tag_1:
          0x0
          dup1
          revert

        auxdata: 0xa165627a7a72305820af3193f6fd31031a0e0d2de1ad2c27352b1ce081b4f3c92b5650ca4dd542bb770029
    }

    Binary:
    60606040523415600e57600080fd5b5b60016000819055505b5b60368060266000396000f30060606040525b600080fd00a165627a7a72305820af3193f6fd31031a0e0d2de1ad2c27352b1ce081b4f3c92b5650ca4dd542bb770029
The number 6060604052... is the bytecode that the EVM actually runs.
Half of the compiled assembly is boilerplate that’s similar across most Solidity programs. We’ll look at that later. For now, let’s examine the unique part of our contract, the humble storage variable assignment:
a = 1
This assignment is represented by the bytecode 6001600081905550. Let’s break it up into one instruction per line:
    60 01
    60 00
    81
    90
    55
    50
The EVM is basically a loop that executes each instruction from top to bottom. Let’s annotate the assembly code (indented under the label tag_2) with the corresponding bytecode to better see how they are associated:
    tag_2:
      // 60 01
      0x1
      // 60 00
      0x0
      // 81
      dup2
      // 90
      swap1
      // 55
      sstore
      // 50
      pop
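The split into instructions above can be reproduced with a few lines of Python. This is only a sketch: the `disassemble` function and its small `OPCODES` table are made up for this illustration and cover just the opcodes that appear in this article, not the full instruction set.

```python
# Hypothetical, partial opcode table: only opcodes used in this article.
OPCODES = {
    0x03: "sub", 0x0a: "exp", 0x16: "and", 0x17: "or", 0x19: "not",
    0x50: "pop", 0x54: "sload", 0x55: "sstore",
}

def disassemble(hex_code):
    """Split a hex bytecode string into one mnemonic per instruction."""
    code = bytes.fromhex(hex_code)
    out = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7f:
            # push1..push32 are followed by 1..32 bytes of immediate data
            n = op - 0x5f
            out.append(f"push{n} 0x{code[pc:pc + n].hex()}")
            pc += n
        elif 0x80 <= op <= 0x8f:
            out.append(f"dup{op - 0x7f}")
        elif 0x90 <= op <= 0x9f:
            out.append(f"swap{op - 0x8f}")
        else:
            out.append(OPCODES.get(op, f"unknown(0x{op:02x})"))
    return out

for line in disassemble("6001600081905550"):
    print(line)
```

The only instruction here that takes bytes from the code itself is the push family, which is why `60 01` and `60 00` occupy two bytes each while the others occupy one.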
Note that 0x1 in the assembly code is actually shorthand for push(0x1). This instruction pushes the number 1 onto the stack.
It’s still hard to grok what’s going on just by staring at it. Don’t worry though, it’s simple to simulate the EVM line by line.
The EVM is a stack machine. Instructions might use values on the stack as arguments, and push values onto the stack as results. Let’s consider the operation add.
Assume that there are two values on the stack:
[1 2]
When the EVM sees add, it pops the top 2 items, adds them together, and pushes the answer back onto the stack, resulting in:
[3]
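As a rough illustration of that behaviour, here is a minimal Python sketch. The helper names `show` and `add` are invented for this example; the stack is a plain list whose last element is the top, and it is printed top-first to match the notation in this article.

```python
def show(stack):
    """Render the stack top-first, e.g. [1 2] means 1 is on top."""
    return "[" + " ".join(str(v) for v in reversed(stack)) + "]"

def add(stack):
    """Pop the top two items and push their sum, wrapping at 256 bits."""
    a = stack.pop()   # top item
    b = stack.pop()   # second item
    stack.append((a + b) % 2**256)

stack = [2, 1]        # the last element (1) is the top of the stack
print(show(stack))    # [1 2]
add(stack)
print(show(stack))    # [3]
```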
In what follows, we’ll notate the stack with []:
    // The empty stack
    stack: []
    // Stack with three items. The top item is 3. The bottom item is 1.
    stack: [3 2 1]
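And notate the contract storage with {}:
    // Nothing in storage.
    store: {}
    // The value 0x1 is stored at the position 0x0.
    store: { 0x0 => 0x1 }
Translator’s note: in the Ethereum source code, the Stack data structure represents the stack. (The Memory structure there models the EVM’s volatile memory, which is distinct from the persistent contract storage notated here.)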
Let’s now look at some real bytecode. We’ll simulate the bytecode sequence 6001600081905550 as the EVM would, and print out the machine state after each instruction:
    // 60 01: pushes 1 onto stack
    0x1
      stack: [0x1]

    // 60 00: pushes 0 onto stack
    0x0
      stack: [0x0 0x1]

    // 81: duplicate the second item on the stack
    dup2
      stack: [0x1 0x0 0x1]

    // 90: swap the top two items
    swap1
      stack: [0x0 0x1 0x1]

    // 55: store the value 0x1 at position 0x0
    // This instruction consumes the top 2 items
    sstore
      stack: [0x1]
      store: { 0x0 => 0x1 }

    // 50: pop (throw away the top item)
    pop
      stack: []
      store: { 0x0 => 0x1 }
The end. The stack is empty, and there’s one item in storage.
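The hand simulation above is easy to mechanize. The following is a toy interpreter, assuming only the five instructions this bytecode uses; the `run` function and its output format are made up for this sketch and are not how a real EVM implementation is organized.

```python
def run(hex_code):
    """Execute a tiny subset of EVM bytecode, printing the state after each step."""
    code = bytes.fromhex(hex_code)
    stack, store = [], {}          # the top of the stack is the END of the list
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if op == 0x60:             # push1: push the next byte
            stack.append(code[pc])
            pc += 1
        elif 0x80 <= op <= 0x8f:   # dupN: copy the N-th item to the top
            stack.append(stack[-(op - 0x7f)])
        elif 0x90 <= op <= 0x9f:   # swapN: swap the top with the (N+1)-th item
            n = op - 0x8f
            stack[-1], stack[-1 - n] = stack[-1 - n], stack[-1]
        elif op == 0x55:           # sstore: key is the top item, value the second
            key = stack.pop()
            value = stack.pop()
            store[key] = value
        elif op == 0x50:           # pop: throw away the top item
            stack.pop()
        else:
            raise ValueError(f"unsupported opcode 0x{op:02x}")
        top_first = " ".join(hex(v) for v in reversed(stack))
        print(f"0x{op:02x}  stack: [{top_first}]  store: {store}")
    return stack, store

final_stack, final_store = run("6001600081905550")
```

Each printed line should correspond to one step of the trace above, ending with an empty stack and the value 1 stored at position 0.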
What’s worth noting is that Solidity has decided to store the state variable uint256 a at the position 0x0. It’s perfectly possible for other languages to choose to store the state variable elsewhere.
In pseudocode, what the EVM does for 6001600081905550 is essentially:
    // a = 1
    sstore(0x0, 0x1)
Looking carefully, you’d see that dup2, swap1, and pop are superfluous. The assembly code could be simpler:
    0x1
    0x0
    sstore
You could try to simulate the above 3 instructions, and satisfy yourself that they indeed result in the same machine state:
    stack: []
    store: { 0x0 => 0x1 }
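One way to satisfy yourself is to run both instruction sequences and compare the results. The sketch below works on assembly mnemonics rather than bytecode; `run_asm` is a hypothetical helper that understands only the handful of instructions used so far.

```python
def run_asm(source):
    """Run whitespace-separated assembly tokens; return (stack, store)."""
    stack, store = [], {}              # the top of the stack is the end of the list
    for token in source.split():
        if token.startswith("0x"):     # a bare literal is shorthand for push
            stack.append(int(token, 16))
        elif token.startswith("dup"):
            stack.append(stack[-int(token[3:])])
        elif token.startswith("swap"):
            n = int(token[4:])
            stack[-1], stack[-1 - n] = stack[-1 - n], stack[-1]
        elif token == "sstore":        # key is the top item, value the second
            key = stack.pop()
            store[key] = stack.pop()
        elif token == "pop":
            stack.pop()
        else:
            raise ValueError(f"unsupported instruction {token}")
    return stack, store

verbose = run_asm("0x1 0x0 dup2 swap1 sstore pop")
simple = run_asm("0x1 0x0 sstore")
print(verbose == simple)   # True
```

Both sequences leave the stack empty and storage as { 0x0 => 0x1 }.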
Let’s add one extra storage variable of the same type:
    // c2.sol
    pragma solidity ^0.4.11;

    contract C {
        uint256 a;
        uint256 b;

        function C() {
            a = 1;
            b = 2;
        }
    }
Compile, focusing on tag_2:
    $ solc --bin --asm c2.sol
    // ... more stuff omitted
    tag_2:
        /* "c2.sol":99:100  1 */
      0x1
        /* "c2.sol":95:96  a */
      0x0
        /* "c2.sol":95:100  a = 1 */
      dup2
      swap1
      sstore
      pop
        /* "c2.sol":112:113  2 */
      0x2
        /* "c2.sol":108:109  b */
      0x1
        /* "c2.sol":108:113  b = 2 */
      dup2
      swap1
      sstore
      pop
The assembly in pseudocode:
    // a = 1
    sstore(0x0, 0x1)
    // b = 2
    sstore(0x1, 0x2)
What we learn here is that the two storage variables are positioned one after the other, with a in position 0x0 and b in position 0x1.
Each storage slot can store 32 bytes. It’d be wasteful to use all 32 bytes if a variable only needs 16 bytes. Solidity optimizes for storage efficiency by packing two smaller data types into one storage slot if possible.
Let’s change a and b so they are only 16 bytes each:
    pragma solidity ^0.4.11;

    contract C {
        uint128 a;
        uint128 b;

        function C() {
            a = 1;
            b = 2;
        }
    }
Compile the contract:
    $ solc --bin --asm c3.sol
The generated assembly is now more complex. Each instruction below is followed by the stack after it executes, top item first. Here 0xff..ff abbreviates the 16-byte mask 0xffffffffffffffffffffffffffffffff, 0x100..00 abbreviates 2^128, and 0xff..ff00..00 is the 32-byte word whose upper 16 bytes are all ff:
    tag_2:
      // a = 1
      0x1        stack: [0x1]
      0x0        stack: [0x0,0x1]
      dup1       stack: [0x0,0x0,0x1]
      0x100      stack: [0x100,0x0,0x0,0x1]
      exp        stack: [0x1,0x0,0x1]
      dup2       stack: [0x0,0x1,0x0,0x1]
      sload      stack: [0x0,0x1,0x0,0x1]
      dup2       stack: [0x1,0x0,0x1,0x0,0x1]
      0xffffffffffffffffffffffffffffffff
                 stack: [0xff..ff,0x1,0x0,0x1,0x0,0x1]
      mul        stack: [0xff..ff,0x0,0x1,0x0,0x1]
      not        stack: [0xff..ff00..00,0x0,0x1,0x0,0x1]
      and        stack: [0x0,0x1,0x0,0x1]
      swap1      stack: [0x1,0x0,0x0,0x1]
      dup4       stack: [0x1,0x1,0x0,0x0,0x1]
      0xffffffffffffffffffffffffffffffff
                 stack: [0xff..ff,0x1,0x1,0x0,0x0,0x1]
      and        stack: [0x1,0x1,0x0,0x0,0x1]
      mul        stack: [0x1,0x0,0x0,0x1]
      or         stack: [0x1,0x0,0x1]
      swap1      stack: [0x0,0x1,0x1]
      sstore     stack: [0x1]    store: { 0x0 => 0x1 }
      pop        stack: []
      // b = 2
      0x2        stack: [0x2]
      0x0        stack: [0x0,0x2]
      0x10       stack: [0x10,0x0,0x2]
      0x100      stack: [0x100,0x10,0x0,0x2]
      exp        stack: [0x100..00,0x0,0x2]
      dup2       stack: [0x0,0x100..00,0x0,0x2]
      sload      stack: [0x1,0x100..00,0x0,0x2]
      dup2       stack: [0x100..00,0x1,0x100..00,0x0,0x2]
      0xffffffffffffffffffffffffffffffff
                 stack: [0xff..ff,0x100..00,0x1,0x100..00,0x0,0x2]
      mul        stack: [0xff..ff00..00,0x1,0x100..00,0x0,0x2]
      not        stack: [0x00..00ff..ff,0x1,0x100..00,0x0,0x2]
      and        stack: [0x1,0x100..00,0x0,0x2]
      swap1      stack: [0x100..00,0x1,0x0,0x2]
      dup4       stack: [0x2,0x100..00,0x1,0x0,0x2]
      0xffffffffffffffffffffffffffffffff
                 stack: [0xff..ff,0x2,0x100..00,0x1,0x0,0x2]
      and        stack: [0x2,0x100..00,0x1,0x0,0x2]
      mul        stack: [0x200..00,0x1,0x0,0x2]
      or         stack: [0x200..01,0x0,0x2]
      swap1      stack: [0x0,0x200..01,0x2]
      sstore     stack: [0x2]    store: { 0x0 => 0x200..01 }
      pop        stack: []
The above assembly code packs these two variables together in one storage position (0x0), like this:
    [         b         ][         a         ]
    [16 bytes / 128 bits][16 bytes / 128 bits]
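In terms of plain integer arithmetic, this layout amounts to shifting b up by 128 bits and OR-ing in a. The following sketch shows that idea in Python; `pack` and `unpack` are names invented here, and this is not the code the compiler emits.

```python
MASK128 = (1 << 128) - 1   # sixteen bytes of 0xff

def pack(a, b):
    """Put a in the low 16 bytes and b in the high 16 bytes of one 32-byte word."""
    assert 0 <= a <= MASK128 and 0 <= b <= MASK128
    return (b << 128) | a

def unpack(word):
    """Recover (a, b) from a packed 32-byte word."""
    return word & MASK128, word >> 128

word = pack(1, 2)
print(f"0x{word:064x}")    # 64 hex digits: b's 16 bytes, then a's 16 bytes
print(unpack(word))        # (1, 2)
```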
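Translator’s note: I have also shown the state of the stack after each step. The result is as expected, but I don’t understand why the code seems to take such a roundabout route.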
The reason to pack is that storage operations are by far the most expensive:
sstore costs 20000 gas for the first write to a new position.
sstore costs 5000 gas for subsequent writes to an existing position.
sload costs 500 gas.
By using the same storage position, Solidity pays 5000 for storing the second variable instead of 20000, saving us 15000 in gas.
Translator’s note: storage is expensive! The cost of each instruction comes from the gasCost function in the instruction table in core/vm/jump_table.go.
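The saving can be tallied with a small calculation. This sketch uses only the two sstore prices quoted above and assumes every slot starts out empty; it ignores sload and all other instruction costs, and `sstore_gas` is a name made up for the example.

```python
SSTORE_NEW = 20000       # first write to a new position (figure quoted above)
SSTORE_EXISTING = 5000   # later write to an existing position (figure quoted above)

def sstore_gas(slots_written):
    """Sum the sstore gas for a sequence of writes, given the slot of each write."""
    seen = set()
    total = 0
    for slot in slots_written:
        total += SSTORE_EXISTING if slot in seen else SSTORE_NEW
        seen.add(slot)
    return total

separate = sstore_gas([0, 1])   # a in slot 0, b in slot 1 (two uint256)
packed = sstore_gas([0, 0])     # a and b share slot 0 (two uint128)
print(separate, packed, separate - packed)   # 40000 25000 15000
```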
Instead of storing a and b with two separate sstore instructions, it should be possible to pack the two 128-bit numbers together in memory, then store them using just one sstore, saving an additional 5000 gas.
You can ask Solidity to make this optimization by turning on the optimize flag:
$ solc --bin --asm --optimize c3.sol
Which produces assembly code that uses just one sload and one sstore:
    tag_2:
        /* "c3.sol":95:96  a */
      0x0
        /* "c3.sol":95:100  a = 1 */
      dup1
      sload
        /* "c3.sol":108:113  b = 2 */
      0x200000000000000000000000000000000
      not(sub(exp(0x2, 0x80), 0x1))
        /* "c3.sol":95:100  a = 1 */
      swap1
      swap2
      and
        /* "c3.sol":99:100  1 */
      0x1
        /* "c3.sol":95:100  a = 1 */
      or
      sub(exp(0x2, 0x80), 0x1)
        /* "c3.sol":108:113  b = 2 */
      and
      or
      swap1
      sstore
The bytecode is:
600080547002000000000000000000000000000000006001608060020a03199091166001176001608060020a0316179055
And formatting the bytecode to one instruction per line:
Translator’s note: again, I show the stack after each step, top item first. 0x200..00 abbreviates 0x200000000000000000000000000000000, 0x100..00 abbreviates 2^128, 0xff..ff is the mask for the lower 16 bytes, and 0xff..ff00..00 is the mask for the upper 16 bytes.
    // push 0x0
    60 00
      stack: [0x0]
    // dup1
    80
      stack: [0x0,0x0]
    // sload
    54
      stack: [0x0,0x0]
    // push17: push the next 17 bytes as a 32-byte number
    70 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      stack: [0x200..00,0x0,0x0]

    /* not(sub(exp(0x2, 0x80), 0x1)) */
    // push 0x1
    60 01
      stack: [0x1,0x200..00,0x0,0x0]
    // push 0x80 (128)
    60 80
      stack: [0x80,0x1,0x200..00,0x0,0x0]
    // push 0x02 (2)
    60 02
      stack: [0x2,0x80,0x1,0x200..00,0x0,0x0]
    // exp
    0a
      stack: [0x100..00,0x1,0x200..00,0x0,0x0]
    // sub
    03
      stack: [0xff..ff,0x200..00,0x0,0x0]
    // not
    19
      stack: [0xff..ff00..00,0x200..00,0x0,0x0]

    // swap1
    90
      stack: [0x200..00,0xff..ff00..00,0x0,0x0]
    // swap2
    91
      stack: [0x0,0xff..ff00..00,0x200..00,0x0]
    // and
    16
      stack: [0x0,0x200..00,0x0]
    // push 0x1
    60 01
      stack: [0x1,0x0,0x200..00,0x0]
    // or
    17
      stack: [0x1,0x200..00,0x0]

    /* sub(exp(0x2, 0x80), 0x1) */
    // push 0x1
    60 01
      stack: [0x1,0x1,0x200..00,0x0]
    // push 0x80
    60 80
      stack: [0x80,0x1,0x1,0x200..00,0x0]
    // push 0x02
    60 02
      stack: [0x2,0x80,0x1,0x1,0x200..00,0x0]
    // exp
    0a
      stack: [0x100..00,0x1,0x1,0x200..00,0x0]
    // sub
    03
      stack: [0xff..ff,0x1,0x200..00,0x0]

    // and
    16
      stack: [0x1,0x200..00,0x0]
    // or
    17
      stack: [0x200..01,0x0]
    // swap1
    90
      stack: [0x0,0x200..01]
    // sstore
    55
      stack: []
      store: { 0x0 => 0x200..01 }
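The step-by-step trace above can also be checked mechanically. Below is a toy interpreter, a sketch that supports only the opcodes appearing in this bytecode; `run`, `BINARY`, and the way `BYTECODE` is assembled from pieces are all choices made for this illustration.

```python
WORD = 1 << 256  # EVM words are 256 bits wide

# The optimized bytecode, split so the push17 constant is easy to see.
BYTECODE = (
    "6000" + "80" + "54"
    + "70" + "02" + "00" * 16
    + "6001608060020a03" + "19" + "90" + "91" + "16"
    + "6001" + "17"
    + "6001608060020a03" + "16" + "17" + "90" + "55"
)

# Two-operand instructions: a is the top item, b the second item.
BINARY = {
    0x0a: lambda a, b: pow(a, b, WORD),   # exp
    0x03: lambda a, b: a - b,             # sub
    0x16: lambda a, b: a & b,             # and
    0x17: lambda a, b: a | b,             # or
}

def run(hex_code, store=None):
    """Execute the bytecode and return (stack, store)."""
    code = bytes.fromhex(hex_code)
    store = dict(store or {})             # copy so the caller's dict is untouched
    stack = []                            # the top of the stack is the end of the list
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7f:            # push1..push32
            n = op - 0x5f
            stack.append(int.from_bytes(code[pc:pc + n], "big"))
            pc += n
        elif 0x80 <= op <= 0x8f:          # dup1..dup16
            stack.append(stack[-(op - 0x7f)])
        elif 0x90 <= op <= 0x9f:          # swap1..swap16
            n = op - 0x8f
            stack[-1], stack[-1 - n] = stack[-1 - n], stack[-1]
        elif op == 0x54:                  # sload: unset positions read as 0
            stack.append(store.get(stack.pop(), 0))
        elif op == 0x55:                  # sstore: key on top, value below it
            key = stack.pop()
            store[key] = stack.pop()
        elif op == 0x19:                  # not: flip all 256 bits
            stack.append((WORD - 1) ^ stack.pop())
        elif op in BINARY:
            a, b = stack.pop(), stack.pop()
            stack.append(BINARY[op](a, b) % WORD)
        else:
            raise ValueError(f"unsupported opcode 0x{op:02x}")
    return stack, store

stack, store = run(BYTECODE)
print(stack, {hex(k): hex(v) for k, v in store.items()})
```

Starting from empty storage, the run should end with an empty stack and position 0x0 holding 0x200..01, matching the last step of the trace.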
There are four magic values used in the assembly code:
    // Represented as 0x01 in bytecode
    16:32 0x00000000000000000000000000000000
    00:16 0x00000000000000000000000000000001

    // Represented as 0x200000000000000000000000000000000 in bytecode
    16:32 0x00000000000000000000000000000002
    00:16 0x00000000000000000000000000000000

    // Bitmask for the upper 16 bytes
    16:32 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    00:16 0x00000000000000000000000000000000

    // Bitmask for the lower 16 bytes
    16:32 0x00000000000000000000000000000000
    00:16 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The code does some bit-shuffling with these values to arrive at the desired result:
    16:32 0x00000000000000000000000000000002
    00:16 0x00000000000000000000000000000001
Finally, this 32-byte value is stored at position 0x0.
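The bit-shuffling can be written out as ordinary integer operations. This is a sketch of what the optimized instruction sequence appears to compute, following the order of the and/or instructions in the assembly; the names `LOW_MASK`, `HIGH_MASK`, `B_SHIFTED`, and `store_a_and_b` are invented for this example.

```python
LOW_MASK = (1 << 128) - 1                 # sub(exp(0x2, 0x80), 0x1)
HIGH_MASK = ((1 << 256) - 1) ^ LOW_MASK   # not(sub(exp(0x2, 0x80), 0x1))
B_SHIFTED = 0x2 << 128                    # the embedded constant 0x200..00

def store_a_and_b(old_slot):
    """Compute the new value of slot 0x0 from its previous value."""
    kept = old_slot & HIGH_MASK   # and: clear the lower 16 bytes, where a lives
    with_a = kept | 0x1           # or:  write a = 1 into the lower 16 bytes
    low = with_a & LOW_MASK       # and: keep only the lower 16 bytes
    return low | B_SHIFTED        # or:  write b = 2 into the upper 16 bytes

new_slot = store_a_and_b(0)
print(f"16:32 0x{new_slot >> 128:032x}")
print(f"00:16 0x{new_slot & LOW_MASK:032x}")
```

Since both halves of the slot are overwritten, the result does not depend on what the slot held before.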
    600080547002000000000000000000000000000000006001608060020a03199091166001176001608060020a0316179055
Notice that 0x200000000000000000000000000000000 is embedded in the bytecode. But the compiler could’ve also chosen to calculate the value with the instructions exp(0x2, 0x81), which results in a shorter bytecode sequence.
But it turns out that 0x200000000000000000000000000000000 is cheaper than exp(0x2, 0x81). Let’s compare how much either representation costs in gas, counting 68 gas for each non-zero byte and 4 gas for each zero byte:
The bytecode 0x200000000000000000000000000000000. It has many zeroes, which are cheap. (1 * 68) + (16 * 4) = 132.
The bytecode 608160020a. Shorter, but no zeroes. 5 * 68 = 340.
The longer sequence with more zeroes is actually cheaper!
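The comparison is simple enough to script. The sketch below assumes the per-byte rates that the figures above are based on, 4 gas for a zero byte and 68 gas for a non-zero byte; `data_gas` is a made-up helper and the rates are hard-coded assumptions that may not match current fee schedules.

```python
ZERO_BYTE_GAS = 4       # assumed cost of a zero byte
NONZERO_BYTE_GAS = 68   # assumed cost of a non-zero byte

def data_gas(hex_bytes):
    """Gas for a byte string under the per-byte rates assumed above."""
    data = bytes.fromhex(hex_bytes)
    zeros = data.count(0)
    return zeros * ZERO_BYTE_GAS + (len(data) - zeros) * NONZERO_BYTE_GAS

constant = "02" + "00" * 16   # the 17 bytes of 0x200000000000000000000000000000000
computed = "608160020a"       # push 0x81, push 0x02, exp

print(data_gas(constant))          # 132
print(data_gas(computed))          # 340
print(data_gas("70" + constant))   # 200, counting the push17 opcode byte as well
```

Even when the push17 opcode byte is included, the embedded constant stays well below the cost of the shorter, all-non-zero sequence.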
An EVM compiler doesn’t exactly optimize for bytecode size or speed or memory efficiency. Instead, it optimizes for gas usage, which is a layer of indirection that incentivizes the sort of calculation that the Ethereum blockchain can do efficiently.
We’ve seen some quirky aspects of the EVM:
Gas costs are set somewhat arbitrarily, and could well change in the future. As costs change, compilers would make different choices.