Exploring How Git Works Through Its First Commit

Date: 2020-06-29

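I recently wanted to understand how Git is implemented, so I started with the Git internals chapter of Pro Git. That gave me a rough picture of the implementation, but it left me wanting more. I then cloned the Git source to see how things are actually done, but the codebase is simply too large and would take far too much effort to read.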

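So I checked out Git's very first commit, where the amount of code drops dramatically to only about 1,000 lines. After a few small changes I got it to compile. Looking at this initial version, I found that many of its concepts carry over to today's Git, so it is well worth reading.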

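Later I came across Jacob Stopak's walkthrough, which annotates the Git code heavily. It was a pleasant surprise, because analyses like this are rare online, and it was published on May 22 of this year, so it is still fresh >_<.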

What Can We Learn from the Code in Git’s Initial Commit?

Getting the source code and compiling it

Here I use Jacob Stopak's project. It adds extensive comments to the code in Git's first commit and also tweaks some of the code so that it compiles on modern operating systems.

Get the code:

git clone https://bitbucket.org/jacobstopak/baby-git.git

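Compiling should just be a matter of running make, but it still fails with an error. The fix is to add extern to the following variable declarations in cache.h. (The likely cause is that these globals are defined in a header, and GCC 10 and later default to -fno-common, which turns the resulting duplicate definitions into link errors.)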

extern const char *sha1_file_directory;
extern struct cache_entry **active_cache;
extern unsigned int active_nr, active_alloc;

Then run make:

make

If you would rather build from the Git source on GitHub, do the following:

# clone the source first
git clone https://github.com/git/git.git
# find the first commit in the log
git log --reverse
# check out the first commit
git checkout e83c5163316f89bfbde7d9ab23ca2e25604af290

Running make directly on this source fails, so a few more changes are needed.

First, in the Makefile:

# change the LIBS line to this
LIBS= -lcrypto -lz

Then, in cache.h:

/* add this header */
#include <string.h>

As before, add extern to the following variables:

extern const char *sha1_file_directory;
extern struct cache_entry **active_cache;
extern unsigned int active_nr, active_alloc;

After that, make works. If that is too much hassle, you can use my z.diff file and simply apply it:

git apply z.diff

The result is the same.

Source layout

At the very beginning, Git consisted of just 8 .c files and 1 .h file.

Code and comments together come to only 1,037 lines:

cat *.c *.h | wc -l
1037

Seven of the .c files correspond to seven commands in today's Git:

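File	Current Git	Purpose
init-db	git init	Initializes a Git repository
update-cache	git add	Adds files to the staging area
write-tree	git write-tree	Writes the contents of the staging area to the repository as a tree object
commit-tree	git commit	Creates a commit object in the repository from a given tree object
read-tree	git read-tree	Displays the contents of a tree object in the repository
show-diff	git diff	Shows the differences between staged files and files in the working directory
cat-file	git cat-file	Displays the contents of an object stored in the repository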

The remaining file, read-cache.c, defines some functions and a few global variables shared by the other programs.

Concepts

Linus Torvalds explains some of the principles behind Git's implementation in the README.

First, he gives the reasons for choosing the name git:

A random three-letter combination that does not clash with any other UNIX command
Stupid, simple
"global information tracker" when it works well
"goddamn idiotic truckload of sh*t" when it does not

random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronounciation of "get" may or may not be relevant.

stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.

"global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.

"goddamn idiotic truckload of sh*t": when it breaks

Git has two core concepts:

object database
current directory cache (the equivalent of today's staging area)

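Object Database

The object database is essentially a key-value store built on the file system. It uses the SHA-1 of an object's content as the key and stores various types of data under .dircache/objects.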

Three object types are covered here:

blob
tree
commit

A blob object stores the complete contents of a file; it is how Git tracks file content. Its structure is:

blob size\0blobdata(file content)
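
To make the layout concrete, here is a minimal Python sketch (not the original C) that builds and splits a blob in the form shown above. The helper names make_blob and parse_object are made up for illustration, and the sketch only covers the uncompressed layout, not hashing or storage.

```python
def make_blob(data: bytes) -> bytes:
    """Prefix file content with the 'blob <size>' header and a NUL byte."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return header + data


def parse_object(raw: bytes) -> tuple[str, int, bytes]:
    """Split a raw object into (type, declared size, payload)."""
    header, _, payload = raw.partition(b"\0")
    obj_type, size_text = header.decode("ascii").split(" ")
    return obj_type, int(size_text), payload


blob = make_blob(b"123456\n")
print(blob)                # b'blob 7\x00123456\n'
print(parse_object(blob))  # ('blob', 7, b'123456\n')
```

The size in the header counts only the file content, so a reader knows how many bytes follow the NUL separator.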

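Besides file contents, Git also has to track file names, paths, permission bits, and similar information. That is what the tree object is for. A tree only records the SHA-1 of each blob; it does not hold file contents itself. (Tree objects in today's Git can also nest other tree objects; that is not yet possible here.)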

The rough structure looks like this (header omitted):

100644 a.txt (6e666502660a7e810b276afd62523c56b34c1671)
100644 b.txt (SHA-1 of b)
100644 c.txt (SHA-1 of c)
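
The listing above is the human-readable form. As a rough Python sketch, assuming each entry is stored on disk as "<octal mode> <name>", a NUL byte, and then the 20 raw SHA-1 bytes (the layout modern Git uses for tree entries; I am assuming the first commit matches), packing and parsing could look like this. The function names are hypothetical.

```python
def pack_entry(mode: int, name: str, sha1_hex: str) -> bytes:
    """Encode one tree entry: '<octal mode> <name>' + NUL + 20 raw SHA-1 bytes."""
    return f"{mode:o} {name}".encode() + b"\0" + bytes.fromhex(sha1_hex)


def parse_tree(body: bytes) -> list[tuple[int, str, str]]:
    """Decode a tree body (header already stripped) into (mode, name, sha1_hex)."""
    entries = []
    pos = 0
    while pos < len(body):
        nul = body.index(b"\0", pos)               # end of '<mode> <name>'
        mode_text, name = body[pos:nul].decode().split(" ", 1)
        sha1_hex = body[nul + 1:nul + 21].hex()    # 20 raw bytes -> 40 hex chars
        entries.append((int(mode_text, 8), name, sha1_hex))
        pos = nul + 21
    return entries


body = pack_entry(0o100644, "a.txt", "6e666502660a7e810b276afd62523c56b34c1671")
for mode, name, sha1_hex in parse_tree(body):
    print(f"{mode:o} {name} ({sha1_hex})")
```

The loop prints entries in the same "mode name (sha1)" shape as the listing above.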

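A commit object corresponds to what we think of as a commit. It records the directory tree at a point in time (a tree object), the SHA-1 IDs of its parent commits (zero or more), author and committer information, and the commit message. Git uses commit objects to track the full development history of a repository.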

Again, the rough structure:

tree 1c93ac491de01f734fedbe70f31c47ca965c93b6
parent 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
author  <zzk@archlinux> Sat Jun 27 14:41:43 2020
committer  <zzk@archlinux> Sat Jun 27 14:41:43 2020

second commit
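
Since a commit is plain text, pulling the fields out is straightforward. The following Python sketch parses the structure shown above; parse_commit is a name I made up, and it assumes header lines come first, then a blank line, then the message.

```python
def parse_commit(text: str) -> dict:
    """Split a commit body into its header fields and message."""
    headers, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)   # a commit may have zero or more parents
        else:
            commit[key] = value.strip()
    return commit


example = """tree 1c93ac491de01f734fedbe70f31c47ca965c93b6
parent 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
author  <zzk@archlinux> Sat Jun 27 14:41:43 2020
committer  <zzk@archlinux> Sat Jun 27 14:41:43 2020

second commit
"""
commit = parse_commit(example)
print(commit["tree"])
print(commit["parents"])
print(commit["message"])
```

Following the parents list from commit to commit is what reconstructs the history.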

Today's Git still uses all of these concepts and implements version control by managing these objects.

Git stores every version of every file in full. Doesn't that take up a lot of space? Isn't that too brute-force?

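Git compresses every object with zlib, and text such as source code compresses very well (code is full of repetition, such as keywords and function calls). Today's Git also has packfiles, which look for files with similar names and sizes and store only the differences between versions, so storing just a diff is possible too. As for being brute-force, it is a space-for-time trade-off: switching versions is fast because a file can be read out directly instead of applying diffs one by one.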
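
To get a feel for the compression claim, here is a small Python sketch using the standard zlib module. The "source code" is made-up repetitive text, so the ratio it prints is only illustrative and will differ for real files.

```python
import zlib

# Made-up, repetitive "source code" standing in for a real file.
source = b"".join(
    b"int handler_%d(int argc, char **argv) { return run(argc, argv); }\n" % i
    for i in range(200)
)

compressed = zlib.compress(source)
print(len(source), "bytes raw ->", len(compressed), "bytes compressed")

# Decompression gives back the exact original bytes.
assert zlib.decompress(compressed) == source
```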

How does Git store these objects?

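They all live in the .dircache/objects directory and are looked up by the SHA-1 of their content. Under .dircache/objects there are 256 subdirectories (00 to ff), one for each possible value of the first two hex digits of the SHA-1.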
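
In other words, the first two hex digits pick the subdirectory and the remaining 38 become the file name. A minimal Python sketch of that mapping (object_path is a hypothetical helper, not a function from the Git source):

```python
def object_path(sha1_hex: str, db_dir: str = ".dircache/objects") -> str:
    """Map a 40-character hex SHA-1 to its location in the object database."""
    if len(sha1_hex) != 40:
        raise ValueError("expected a 40-character hex SHA-1")
    return f"{db_dir}/{sha1_hex[:2]}/{sha1_hex[2:]}"


print(object_path("6e666502660a7e810b276afd62523c56b34c1671"))
# .dircache/objects/6e/666502660a7e810b276afd62523c56b34c1671

# The 256 possible subdirectory names, 00 through ff.
subdirs = [f"{i:02x}" for i in range(256)]
print(subdirs[0], subdirs[-1], len(subdirs))  # 00 ff 256
```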

Current directory cache

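The current directory cache is what we now call the staging area. It is stored in the .dircache/index file and can be thought of as a temporary tree object. Running update-cache creates a blob object for each given file and adds its tree information to .dircache/index.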

How to use it

The concepts above may be hard to grasp without actually using them, so let's get hands-on with the initial version of Git.

Using the commands available, we will:

Restore a file from the staging area (the equivalent of git checkout -- file)
Create two commits and switch between them

First, create the repository with init-db:

$ ./init-db

This creates the .dircache directory, which plays the same role as today's .git directory.

Create a file and add it to the staging area. Running update-cache generates a blob object under .dircache/objects and adds it to the .dircache/index staging area.

$ echo "123456" > a.txt
$ ./update-cache a.txt

Use find to see the blob object that was created:

$ find .dircache/objects -type f
.dircache/objects/6e/666502660a7e810b276afd62523c56b34c1671

Take a look at it:

$ cat .dircache/objects/6e/666502660a7e810b276afd62523c56b34c1671
xKOR0g0426156%

Because the object is zlib-compressed, nothing useful is visible without decompressing it. We can use cat-file to view it:

$ ./cat-file 6e666502660a7e810b276afd62523c56b34c1671
temp_git_file_O5J3ZL: blob

It generates a temporary file containing the decompressed content, where we can see the contents of a.txt as saved in the repository:

$ cat temp_git_file_O5J3ZL
123456
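
What cat-file does here can be approximated in a few lines of Python. This is only a sketch under one assumption: that an object file is a single zlib stream containing the header, a NUL byte, and the content. The read_object helper is hypothetical, and the demo writes its own stand-in object file instead of touching a real .dircache.

```python
import os
import tempfile
import zlib


def read_object(path: str) -> tuple[str, bytes]:
    """Decompress an object file and return (header, content)."""
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, content = raw.partition(b"\0")
    return header.decode("ascii"), content


# Self-contained demo: write a stand-in object file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-object")
    with open(demo_path, "wb") as f:
        f.write(zlib.compress(b"blob 7\0" + b"123456\n"))
    print(read_object(demo_path))  # ('blob 7', b'123456\n')
```

Pointing read_object at a file under .dircache/objects should work the same way if the assumption about the file format holds.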

Now save the current contents of the staging area to .dircache/objects as a tree object:

$ ./write-tree
433aef473a665a9efe1cf21fbc617fbf833c71b5

On success it prints the object's SHA-1. We can use read-tree to see what the tree object contains:

$ ./read-tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
100644 a.txt (6e666502660a7e810b276afd62523c56b34c1671)

Now we can make our first commit. After typing the commit message, press Ctrl+D to finish:

$ ./commit-tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
Committing initial tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
first commit 
4ae1f8aae02d7178aa4fc52f45f068cd750aa3de

Let's see what the commit object contains:

$ ./cat-file 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
temp_git_file_pNLmni: commit
$ cat temp_git_file_pNLmni
tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
author  <zzk@archlinux> Sat Jun 27 14:20:55 2020
committer  <zzk@archlinux> Sat Jun 27 14:20:55 2020

first commit

That completes the first commit. Now make a change to a.txt:

$ echo "version2" > a.txt

show-diff shows the differences from the staged version:

$ ./show-diff
a.txt:  6e666502660a7e810b276afd62523c56b34c1671
--- -   2020-06-27 14:27:19.975168003 +0800
+++ a.txt       2020-06-27 14:27:18.048129094 +0800
@@ -1 +1 @@
-123456
+version2

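There are two ways to restore the file from the staging area:

Use the generated diff to patch the file
Use cat-file to extract the staged content (the show-diff output includes the file's SHA-1, right after the file name)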

Try the diff first:

$ ./show-diff > a.diff
$ patch --reverse a.txt a.diff
patching file a.txt
$ cat a.txt
123456

The file is back to the staged version. Now change the file again and try cat-file:

$ echo "version2" > a.txt
$ ./show-diff
a.txt:  6e666502660a7e810b276afd62523c56b34c1671
--- -   2020-06-27 14:33:53.795263485 +0800
+++ a.txt       2020-06-27 14:33:50.726893274 +0800
@@ -1 +1 @@
-123456
+version2
$ ./cat-file 6e666502660a7e810b276afd62523c56b34c1671
temp_git_file_d4cTZ7: blob
$ mv temp_git_file_d4cTZ7 a.txt
$ cat a.txt
123456

Now prepare the second commit by repeating the earlier steps:

$ echo "verison2" > a.txt
$ ./update-cache a.txt
$ ./write-tree
1c93ac491de01f734fedbe70f31c47ca965c93b6

Take care when running commit-tree: since this is the second commit, use -p to specify the SHA-1 ID of the parent commit, that is, the first commit:

$ ./commit-tree 1c93ac491de01f734fedbe70f31c47ca965c93b6 -p 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
second commit
3d4340867cb1dd857987fc2db9b1b4bc14af7051

Take a look at this commit:

$ ./cat-file 3d4340867cb1dd857987fc2db9b1b4bc14af7051
temp_git_file_8Cguru: commit
$ cat temp_git_file_8Cguru 
tree 1c93ac491de01f734fedbe70f31c47ca965c93b6
parent 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
author  <zzk@archlinux> Sat Jun 27 14:41:43 2020
committer  <zzk@archlinux> Sat Jun 27 14:41:43 2020

second commit

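Now we want to switch a.txt back to the version in the previous commit. The parent commit's SHA-1 ID can be found in the output above.

Using the parent's SHA-1 ID, we find the SHA-1 ID of its tree object, and from the tree object the SHA-1 ID of the blob object for a.txt. Finally, cat-file restores the original contents of a.txt.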

$ ./cat-file 4ae1f8aae02d7178aa4fc52f45f068cd750aa3de
temp_git_file_MssRdY: commit
$ cat temp_git_file_MssRdY
tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
author  <zzk@archlinux> Sat Jun 27 14:20:55 2020
committer  <zzk@archlinux> Sat Jun 27 14:20:55 2020

first commit
$ ./read-tree 433aef473a665a9efe1cf21fbc617fbf833c71b5
100644 a.txt (6e666502660a7e810b276afd62523c56b34c1671)
$ ./cat-file 6e666502660a7e810b276afd62523c56b34c1671
temp_git_file_2RlDAI: blob
$ mv temp_git_file_2RlDAI a.txt
$ cat a.txt
123456

a.txt is now restored to its state as of the first commit.
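
The manual walk from commit to tree to blob can be summarized in a small Python sketch. The store dictionary is a hypothetical in-memory stand-in for .dircache/objects, filled with the IDs from this session and already-parsed content; file_at_commit is a made-up helper.

```python
# Hypothetical in-memory object store: SHA-1 -> (type, parsed content).
store = {
    "3d4340867cb1dd857987fc2db9b1b4bc14af7051": ("commit", {
        "tree": "1c93ac491de01f734fedbe70f31c47ca965c93b6",
        "parents": ["4ae1f8aae02d7178aa4fc52f45f068cd750aa3de"],
    }),
    "4ae1f8aae02d7178aa4fc52f45f068cd750aa3de": ("commit", {
        "tree": "433aef473a665a9efe1cf21fbc617fbf833c71b5",
        "parents": [],
    }),
    "433aef473a665a9efe1cf21fbc617fbf833c71b5": ("tree", {
        "a.txt": "6e666502660a7e810b276afd62523c56b34c1671",
    }),
    "6e666502660a7e810b276afd62523c56b34c1671": ("blob", b"123456\n"),
}


def file_at_commit(store: dict, commit_id: str, name: str) -> bytes:
    """Follow commit -> tree -> blob and return the file content."""
    kind, commit = store[commit_id]
    assert kind == "commit"
    kind, tree = store[commit["tree"]]
    assert kind == "tree"
    kind, content = store[tree[name]]
    assert kind == "blob"
    return content


head = "3d4340867cb1dd857987fc2db9b1b4bc14af7051"
parent = store[head][1]["parents"][0]          # step back one commit
print(file_at_commit(store, parent, "a.txt"))  # b'123456\n'
```

Each step is one key lookup, which is why the object database works so well as a plain key-value store.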

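Summary: the original Git was hard to use. It took continual polishing to become as usable as it is today. Many people now get by with just add, commit, and push, without needing to understand how Git works.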

Code analysis

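This section looks at part of Git's source code. Since the article is already long, I won't go through much of it. The code is very simple, and Linus writes clean code. Reading it yourself won't take long, and with Jacob Stopak's extensive comments there is really not much left to explain. You could probably even guess roughly what the source looks like from what was covered above >_<.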

I also recommend Sourcetrail as a tool for reading source code; it makes browsing very convenient.

init-db.c

int main(int argc, char **argv)
{
	char *sha1_dir = getenv(DB_ENVIRONMENT), *path;
	int len, i, fd;
	
	// create the .dircache directory
	if (mkdir(".dircache", 0700) < 0) {
		perror("unable to create .dircache");
		exit(1);
	}

	/*
	 * If you want to, you can share the DB area with any number of branches.
	 * That has advantages: you can save space by sharing all the SHA1 objects.
	 * On the other hand, it might just make lookup slower and messier. You
	 * be the judge.
	 */
	sha1_dir = getenv(DB_ENVIRONMENT);
	if (sha1_dir) {
		struct stat st;
		/* note: "!stat(...) < 0" is always false, so this early return never runs */
		if (!stat(sha1_dir, &st) < 0 && S_ISDIR(st.st_mode))
			return 1;
		fprintf(stderr, "DB_ENVIRONMENT set to bad directory %s: ", sha1_dir);
	}

	/*
	 * The default case is to have a DB per managed directory. 
	 */
	sha1_dir = DEFAULT_DB_ENVIRONMENT;
	fprintf(stderr, "defaulting to private storage area\n");
	len = strlen(sha1_dir);
	if (mkdir(sha1_dir, 0700) < 0) {
		if (errno != EEXIST) {
			perror(sha1_dir);
			exit(1);
		}
	}
	path = malloc(len + 40);
	memcpy(path, sha1_dir, len);
	// create the 256 subdirectories under .dircache/objects
	for (i = 0; i < 256; i++) {
		sprintf(path+len, "/%02x", i);
		if (mkdir(path, 0700) < 0) {
			if (errno != EEXIST) {
				perror(path);
				exit(1);
			}
		}
	}
	return 0;
}

init-db simply creates these directories, which matches what we saw earlier.
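
For comparison, the same directory setup can be sketched in Python. This is not a port of the C code: it takes a root directory argument so the demo can run in a temporary directory, and unlike the original it silently accepts an existing .dircache.

```python
import os
import tempfile


def init_db(root: str) -> str:
    """Create <root>/.dircache/objects and its 256 subdirectories (00..ff)."""
    objects = os.path.join(root, ".dircache", "objects")
    os.makedirs(objects, mode=0o700, exist_ok=True)
    for i in range(256):
        os.makedirs(os.path.join(objects, f"{i:02x}"), mode=0o700, exist_ok=True)
    return objects


with tempfile.TemporaryDirectory() as tmp:
    objects = init_db(tmp)
    names = sorted(os.listdir(objects))
    print(len(names), names[0], names[-1])  # 256 00 ff
```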

Next is update-cache.c. The focus is on how a blob is created and how it is added to the .dircache/index file.

Start with main:

int main(int argc, char **argv) {
	int i, newfd, entries;
	
	// read the contents of .dircache/index into the global variable active_cache
	entries = read_cache();
	if (entries < 0) {
		perror("cache corrupted");
		return -1;
	}
	
	// create a lock file so that multiple update-cache processes cannot run at the same time
	newfd = open(".dircache/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600);
	if (newfd < 0) {
		perror("unable to create new cachefile");
		return -1;
	}
	for (i = 1 ; i < argc; i++) {
		char *path = argv[i];
		// validate the path
		if (!verify_path(path)) {
			fprintf(stderr, "Ignoring path %s\n", argv[i]);
			continue;
		}
		// add the file to the object database
		if (add_file_to_cache(path)) {
			fprintf(stderr, "Unable to add %s to database\n", path);
			goto out;
		}
	}
	// update the index file
	if (!write_cache(newfd, active_cache, active_nr) && !rename(".dircache/index.lock", ".dircache/index"))
		return 0;
out:
	unlink(".dircache/index.lock");
}

From main we can see that blob creation happens in add_file_to_cache, and the index update happens in write_cache. You can trace the rest yourself; here is roughly what they do.

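add_file_to_cache reads the file, builds a blob object and compresses it, then computes the SHA-1 and stores the compressed blob at the location given by that SHA-1. (According to the current documentation, today's Git seems to compute the SHA-1 first and then compress.) The entry is then inserted into active_cache using a binary search on the file name. active_cache is the in-memory form of the .dircache/index staging area: a global variable loaded from .dircache/index at startup. Because the staging area is sorted by file name, checking whether a file is staged is fast; binary search takes only O(log n).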
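
Here is a rough Python sketch of those two steps, following the order described above (compress first, then hash). It is not a translation of the C code: the cache is a plain sorted list of (name, sha1) tuples, a dictionary stands in for .dircache/objects, and the hashes it produces are not expected to match the ones in this article, since zlib output depends on the compression settings.

```python
import bisect
import hashlib
import zlib


def add_file_to_cache(cache: list, objects: dict, name: str, data: bytes) -> str:
    """Store data as a blob and record (name, sha1) in the sorted cache."""
    blob = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    compressed = zlib.compress(blob)
    # Order described for the first commit: hash the compressed bytes.
    # (Modern Git hashes the uncompressed object instead.)
    sha1_hex = hashlib.sha1(compressed).hexdigest()
    objects[sha1_hex] = compressed               # dict stands in for .dircache/objects

    pos = bisect.bisect_left(cache, (name,))     # binary search by file name
    if pos < len(cache) and cache[pos][0] == name:
        cache[pos] = (name, sha1_hex)            # already staged: replace the entry
    else:
        cache.insert(pos, (name, sha1_hex))      # new file: insert at sorted position
    return sha1_hex


cache, objects = [], {}
for name in ["c.txt", "a.txt", "b.txt"]:
    add_file_to_cache(cache, objects, name, name.encode())
add_file_to_cache(cache, objects, "a.txt", b"version2\n")
print([entry[0] for entry in cache])  # ['a.txt', 'b.txt', 'c.txt']
```

Searching with the one-element tuple (name,) works because it sorts just before any (name, sha1) pair with the same name.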

write_cache writes active_cache back out to the .dircache/index.lock file; main then renames the lock file to .dircache/index, the staging-area file.
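
The lock-file-then-rename pattern is worth a closer look, so here is a hedged Python sketch of it. write_index_atomically is a made-up name, the payload is arbitrary bytes instead of a real index, and os.replace is used as the portable counterpart of rename().

```python
import os
import tempfile


def write_index_atomically(index_path: str, payload: bytes) -> None:
    """Write payload to <index>.lock, then rename it over the index file."""
    lock_path = index_path + ".lock"
    # O_EXCL makes the open fail if the lock file already exists,
    # so only one writer can proceed at a time.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(lock_path, index_path)   # publish the new index in one step
    except BaseException:
        os.unlink(lock_path)                # clean up, like the out: label in main
        raise


with tempfile.TemporaryDirectory() as tmp:
    index = os.path.join(tmp, "index")
    write_index_atomically(index, b"v1")
    write_index_atomically(index, b"v2")
    with open(index, "rb") as f:
        print(f.read(), os.path.exists(index + ".lock"))  # b'v2' False
```

Readers never see a half-written index: they get either the old file or the complete new one. If another process already holds the lock, the os.open call raises FileExistsError and the existing lock file is left alone.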

You can also visit my blog.