Chapter 4: Repositories
This is part of an online book called Source Control HOWTO, a best practices guide on source control, version control, and configuration management.
Cars and clocks
In previous chapters I have mentioned the concept of a repository, but I haven't said much further about it. In this chapter, I want to provide a lot more detail. Please bear with me as I spend a little time talking about how an SCM tool works "under the hood". I am doing this because an SCM tool is more like a car than a clock.
- An SCM tool is not like a clock. Clock users have no need to know how a clock works inside. We just want to know what time it is. Those who understand the inner workings of a clock cannot tell time any more skillfully than the rest of us.
- An SCM tool is more like a car. Lots of people do use cars without knowing how they work. However, people who really understand cars tend to get better performance out of them.
Rest assured that this book is still a "HOWTO". My goal here remains to create a practical explanation of how to do source control. However, I believe that you can use an SCM tool more effectively if you know a little bit about what's happening inside.
Repository = File System * Time
A repository is the official place where you store all your source code. It keeps track of all your files, as well as the layout of the directories in which they are stored. It resides on a server where it can be shared by all the members of your team.
But there has to be more. If the definition in the previous paragraph were the whole story, then an SCM repository would be no more than a network file system. A repository is much more than that. A repository contains history.
A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every version of your source code that has ever existed. The additional dimension creates some rather interesting challenges in the architecture of a repository and the decisions about how it manages data.
How do we store all those old versions of everything?
As a first guess, let's not be terribly clever. We need to store every version of the source tree. Why not just keep a complete copy of the entire tree for every change that has happened?
We obviously use Vault as the SCM tool for our own development of Vault. We began development of Vault in the fall of 2001. In the summer of 2002, we started "dogfooding". On October 25th, 2002, we abandoned our repository history and started a fresh repository for the core components of Vault. Since that day, this tree has been modified 4,686 times.
(Translator's note: "dogfooding" is slang for a self-testing evaluation scheme in which a team uses its own beta or release-candidate software.)
This repository contains approximately 40 MB of source code. If we chose to store the entire tree for every change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression. At today's prices for disk space, this option is worth considering.
However, this particular repository is just not very large. We have several others as well, but the sum total of all the code we have ever written still doesn't qualify as "large". Many of our Vault customers have trees which are a lot bigger.
As an example, consider the source tree for OpenOffice.org. This tree is approximately 634 MB. Based on their claim of 270 developers and the fact that their repository is almost four years old, I'm going to conservatively estimate that they have made perhaps 20,000 checkins. So, if we used the dumb approach of storing a full copy of their tree for every change, we'll need around 12 TB of disk space. That's 12 terabytes.
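The two estimates above can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes binary units (1 GB = 1024 MB, 1 TB = 1024 GB), which happens to reproduce the figures quoted in the text.

```python
# Back-of-the-envelope cost of keeping one full copy of the tree per checkin.
# Assumption: binary units (1 GB = 1024 MB, 1 TB = 1024 GB).

def full_copy_storage_mb(tree_size_mb, checkins):
    """Total megabytes needed to store one full tree for every checkin."""
    return tree_size_mb * checkins

vault_mb = full_copy_storage_mb(40, 4_686)          # the Vault tree
openoffice_mb = full_copy_storage_mb(634, 20_000)   # the OpenOffice.org estimate

vault_gb = vault_mb / 1024
openoffice_tb = openoffice_mb / 1024 / 1024

print(f"Vault: about {vault_gb:.0f} GB")
print(f"OpenOffice.org: about {openoffice_tb:.0f} TB")
```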
(Translator's note: 1 TB = 1024 GB.)
At this point, the argument that "disk space is cheap" starts to break down. The disk space for 12 TB of data is cheaper than it has ever been in the history of the planet. But this is mission critical data. We have to consider things like performance and backups and RAID and administration. The cost of storing 12 TB of ultra-important data is more than just the cost of the actual disk platters.
So we actually do have an incentive to store this information a bit more efficiently. Fortunately, there is an obvious reason why this is going to be easy to do. We observe that tree N is often not terribly different from tree N-1. By definition, each version of the tree is derived from its predecessor. A checkin might be as simple as a one-line fix to a single file. All of the other files are unchanged, so we don't really need to store another copy of them.
So, we don't want to store the full contents of the tree for every single change. Instead, we want a way to store a tree represented as a set of changes to another tree. We call this a "delta".
Delta direction
As we decide to store our repositories using deltas, we must be concerned about performance. Retrieving a tree which is in a deltified representation requires more effort than retrieving one which is stored in full. For example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as a delta from its predecessor. This means that in order to retrieve version 4,686, we must first retrieve version 1 and then apply 4,685 deltas. Obviously, this approach would mean that retrieving some versions will be faster than others. When using this approach we say that we are using "forward deltas", because each delta expresses the set of changes from one version to the next.
We observe that not all versions of the tree are equally likely to be retrieved. For example, version 83 of the Vault tree is not special in any way. It is likely that we have not retrieved that version in over a year. I suspect that we will never retrieve it again. However, we retrieve the latest version of the tree many times per day. In fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is probably the most likely one to be needed.
The simplistic use of forward deltas delivers its worst performance for the most common case. Not good.
Another idea is to use "reverse deltas". In this approach, we store the most recent tree in full. Every other tree N is represented as a set of differences from tree N+1. This approach delivers its best performance for the most common case, but it can still take an awfully long time to retrieve older trees.
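To make the difference between forward and reverse deltas concrete, here is a minimal sketch. It is not how any particular SCM tool stores data: a "tree" is simplified to a dict of path to contents, and the helper names (`make_delta`, `apply_delta`, `retrieve_forward`, `retrieve_reverse`) are invented for this illustration.

```python
def make_delta(old, new):
    """Return the changes that turn tree `old` into tree `new`.

    A tree is a dict mapping path -> contents; None marks a deleted path.
    """
    delta = {path: c for path, c in new.items() if old.get(path) != c}
    delta.update({path: None for path in old if path not in new})
    return delta

def apply_delta(tree, delta):
    """Return a new tree with `delta` applied to `tree`."""
    result = dict(tree)
    for path, contents in delta.items():
        if contents is None:
            result.pop(path, None)
        else:
            result[path] = contents
    return result

# Three hypothetical versions of a tiny tree.
versions = [
    {"a.c": "v1"},
    {"a.c": "v2", "b.c": "new"},
    {"b.c": "new"},
]
last = len(versions) - 1

# Forward deltas: version 1 in full, plus a delta from each version to the next.
forward = [make_delta(versions[i], versions[i + 1]) for i in range(last)]
# Reverse deltas: latest version in full, plus a delta back to each predecessor.
reverse = [make_delta(versions[i + 1], versions[i]) for i in range(last)]

def retrieve_forward(n):
    """Rebuild version n (1-based) from version 1: applies n - 1 deltas."""
    tree = versions[0]
    for delta in forward[: n - 1]:
        tree = apply_delta(tree, delta)
    return tree

def retrieve_reverse(n):
    """Rebuild version n (1-based) from the latest: applies (latest - n) deltas."""
    tree = versions[-1]
    for delta in reversed(reverse[n - 1:]):
        tree = apply_delta(tree, delta)
    return tree
```

With reverse deltas, `retrieve_reverse(3)` applies no deltas at all, while `retrieve_forward(3)` has to walk the whole chain. That is the trade-off described above.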
Some SCM tools use some sort of a compromise design. In one approach, instead of storing just one full tree and representing every other tree as a delta, we sprinkle a few more full trees along the way. For example, suppose that we store a full tree for every 10th version. This approach uses more disk space, but the SCM server never has to apply more than 9 deltas to retrieve any tree.
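The effect of sprinkling in full trees is easy to quantify. The sketch below assumes forward deltas with full trees stored at versions 1, 11, 21 and so on; that placement is my assumption for illustration, and the function names are hypothetical.

```python
def deltas_forward_only(version):
    """Deltas applied when only version 1 is stored in full."""
    return version - 1

def deltas_with_snapshots(version, interval=10):
    """Deltas applied when versions 1, 1 + interval, 1 + 2 * interval, ...
    are stored in full and every other version is a forward delta."""
    return (version - 1) % interval

for v in (1, 10, 11, 83, 4_686):
    print(v, deltas_forward_only(v), deltas_with_snapshots(v))
```

Under these assumptions, retrieving version 4,686 takes 5 delta applications instead of 4,685, and no version ever needs more than 9.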
What is a delta?
I've been throwing around this concept of deltas, but I haven't stopped to describe them.
A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees do not need to be related. However, in practice, the only reason we calculate the difference between them is because one of them is derived from the other. Some developer started with tree N and made one or more changes, resulting in tree N+1.
We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this purpose. A changeset is merely a list of the changes which express the difference between two trees.
For example, let's suppose that Wilbur starts with tree N and makes the following changes:
1. He deletes $/top/subfolder/foo.c because it is no longer needed.
2. He edits $/top/subfolder/Makefile to remove foo.c from the list of file names.
3. He edits $/top/bar.c to remove all the calls to the functions in foo.c.
4. He renames $/top/hello.c and gives it the new name hola.c.
5. He adds a new file called feature_creep.c to $/top/.
6. He edits $/top/Makefile to add feature_creep.c to the list of file names.
7. He moves $/top/subfolder/readme.txt into $/top.
At this point, he commits all of these changes to the repository as a single transaction. When the SCM server stores this delta, it must remember all of these changes.
For changeset item 1 above, the delete of foo.c is easily represented. We simply remember that foo.c existed in tree N but does not exist in tree N+1.
For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we need each object in the repository to have an identifier which never changes, even when the name or location of the item changes.
For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item. If we simply remember every item by its path, we cannot remember the occasions when that path changes.
Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item we need to remember that tree N+1 has a file called feature_creep.c which was never present in tree N. However, a full representation of this changeset item needs to contain the entire contents of that file.
Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some way. We could handle these items the same way as item 5, by storing the entire contents of the new version of the file. However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree level.
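One way to picture a changeset built on permanent item IDs is sketched below. The record layout, the operation names and the `apply_changeset` helper are all hypothetical, chosen only to mirror some of Wilbur's changes; real tools use their own formats, and the `"..."` contents are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Item:
    folder: str     # e.g. "$/top"
    name: str       # e.g. "hello.c"
    contents: str   # placeholder for the file's data

def apply_changeset(tree, changeset):
    """Return a new tree (dict of item ID -> Item) with the changes applied."""
    new_tree = {i: Item(it.folder, it.name, it.contents) for i, it in tree.items()}
    for change in changeset:
        op, item_id = change["op"], change["id"]
        if op == "delete":
            del new_tree[item_id]
        elif op == "add":
            # An added file must carry its entire contents.
            new_tree[item_id] = Item(change["folder"], change["name"], change["contents"])
        elif op == "rename":
            new_tree[item_id].name = change["name"]      # the ID stays the same
        elif op == "move":
            new_tree[item_id].folder = change["folder"]  # the ID stays the same
        elif op == "edit":
            new_tree[item_id].contents = change["contents"]
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_tree

# Part of tree N, keyed by IDs that never change.
tree_n = {
    1: Item("$/top/subfolder", "foo.c", "..."),
    2: Item("$/top", "hello.c", "..."),
    3: Item("$/top/subfolder", "readme.txt", "..."),
}

changeset = [
    {"op": "delete", "id": 1},
    {"op": "rename", "id": 2, "name": "hola.c"},
    {"op": "add", "id": 4, "folder": "$/top", "name": "feature_creep.c",
     "contents": "/* new file */"},
    {"op": "move", "id": 3, "folder": "$/top"},
]

tree_n_plus_1 = apply_changeset(tree_n, changeset)
```

Because the rename and the move refer to items 2 and 3 by ID, the history of each file survives the change of name or path.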
File deltas
A file delta merely expresses the difference between two files. Once again, the reason we calculate a file delta is because we believe it will be smaller than the file itself, usually because one of the files is derived from the other.
For text files, a well-known approach to the file delta problem is to compare line-by-line and output a list of lines which have been inserted, deleted or changed. This is the same kind of result which is produced by the Unix 'diff' command. The bad news is that this approach only works for text files. The good news is that software developers and web developers have a lot of text files.
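As a small illustration of a line-oriented comparison, Python's standard `difflib` module can produce output in the same style as `diff -u`. The two file versions below are invented for the example.

```python
import difflib

# Two hypothetical versions of the same small source file, as lists of lines.
old = ["int main(void) {", "    hello();", "    return 0;", "}"]
new = ["int main(void) {", "    hola();", "    return 0;", "}"]

diff = list(
    difflib.unified_diff(
        old, new, fromfile="bar.c (tree N)", tofile="bar.c (tree N+1)", lineterm=""
    )
)
print("\n".join(diff))
```

Only the changed line appears with `-` and `+` markers; the unchanged lines around it are shown as context.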
CVS and Perforce use this approach for repository storage. Text files are deltified using a line-oriented diff. Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them.
Subversion and Vault are examples of tools which use binary file deltas for repository storage. Vault uses a file delta algorithm called VCDiff, as described in RFC 3284. This algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been changed. This means it can handle any kind of file, binary or text. As an ancillary benefit, the VCDiff algorithm compresses the data at the same time.
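The idea of a byte-oriented delta can be sketched in a few lines. To be clear, this is not VCDiff and not what Vault or Subversion actually do: it borrows Python's `difflib.SequenceMatcher` to find matching byte ranges, and the "copy"/"add" instruction format and function names are invented for illustration.

```python
from difflib import SequenceMatcher

def make_byte_delta(old: bytes, new: bytes):
    """Describe `new` as copies of byte ranges from `old` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse old[i1:i2]
        elif j2 > j1:                         # "replace" or "insert"
            ops.append(("add", new[j1:j2]))   # carry the new bytes literally
        # "delete" needs no instruction: those old bytes are just not copied
    return ops

def apply_byte_delta(old: bytes, ops):
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

# A hypothetical binary file in which two bytes are changed.
old = bytes(range(200))
new = old[:50] + b"\xff\xfe" + old[52:]

delta = make_byte_delta(old, new)
rebuilt = apply_byte_delta(old, delta)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "add")
```

The delta only has to carry the literal bytes that changed; everything else is described as ranges to copy from the old file, which is why this works equally well for binary and text files.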
Binary deltas are a critical feature for some SCM tool users, especially in situations where the binary files are large. Consider the case where a user checks out a 10 MB file, changes a few bytes, and checks it back in. In CVS, the size of the repository will increase by 10 MB. In Subversion and Vault, the repository will only grow by a small amount.
Deltas and diffs are different
Please note that I make a distinction between the terms "delta" and "diff".
- A "delta" is the difference between two versions. If we have one full file and a delta, then we can construct the other full file. A delta is used primarily because it is smaller than the full file, not because it is useful for a human being to read. The purpose of a delta is efficiency. When deltas are done at the level of bytes instead of textual lines, that efficiency becomes available to all kinds of files, not just text files.
- A "diff" is the human-readable difference between two versions of a text file. It is usually line-oriented, but really cool visual diff tools can also highlight the specific characters on a line which differ. The purpose of a diff is to show a developer exactly what has changed between two versions of a file. Diffs are really useful for text files, because human beings tend to read text files. Most human beings don't read binary files, and human-readable diffs of binary files are similarly uninteresting.
As mentioned above, some SCM tools use binary deltas for repository storage or to improve performance over slow network lines. However, those tools also support textual diffs. Deltas and diffs serve two distinct purposes, both of which are important. It is merely coincidence that some SCM tools use textual diffs as their repository deltas.
The evolution of source control technology
At this point I should admit that I have presented a somewhat idealized view of the world. Not all SCM tools work the way I have described. In fact, I have presented things exactly backwards, discussing tree-wide deltas before file deltas. That is not the way the history of the world unfolded.
Prehistoric ancestors of modern programmers had to live with extremely primitive tools. Early version control systems like RCS only handled file deltas. There was no way for the system to remember folder-level operations like adding, renaming or deleting files.
Over time, the design of SCM tools matured. CVS is probably the most popular source control tool in the world today. It was originally developed as a set of wrappers around RCS which essentially provided support for some folder-level operations. Although CVS still has some important limitations, it was a big step forward.
Today, several modern source control systems are designed around the notion of tree-wide deltas. By accurately remembering every possible operation which can happen to a repository, these tools provide a truly complete history of a project.
What can be stored in a repository?
Best Practice: Checkin all the canonical stuff, and nothing else
Although you can store anything you want in a repository, that doesn't mean you should. The best practice here is to store everything which is necessary to do a build, and nothing else. I call this "the canonical stuff".
To put this another way, I recommend that you do not store any file which is automatically generated. Checkin your hand-edited source code. Don't checkin EXEs and DLLs. If you use a code generation tool, checkin the input file, not the generated code file. If you generate your product documentation in several different formats, checkin the original format, the one that you manually edit.
If you have two files, one of which is automatically generated from the other, then you just don't need to checkin both of them. You would in effect be managing two expressions of the same thing. If one of them gets out of sync with the other, then you have a problem.
People sometimes ask us what kind of things can be stored in a repository. In general, the answer is: "Any file". It is true that I am focusing on tools which are designed for software developers and web developers. However, those tools don't really care what kind of file you store inside them. Vault doesn't care. Perforce, Subversion and CVS don't care. Any of these tools will gratefully accept any file you want to store.
If you will be storing a lot of binary files, it is helpful to know how your SCM tool handles them. A tool which uses binary deltas in the repository may be a better choice.
If all of your files are binary, you may want to explore other solutions. Tools like Vault and Subversion were designed for programmers. These products contain features designed specifically for use with source code, including diff and automerge. You can use these systems to store all of your Excel spreadsheets, but they are probably not the best tool for the job. Consider exploring "document management" systems instead.
How is the repository itself stored?
We need to descend through one more layer of abstraction before we turn our attention back to more practical matters. So far I have been talking about how things are stored and managed within a repository, but I have not broached the subject of how the repository itself is stored.
A repository must store every version of every file. It must remember the hierarchy of files and folders for every version of the tree. It must remember metadata, information about every file and folder. It must remember checkin comments, explanations provided by the developer for each checkin. For large trees and trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably. There are several different ways of approaching the problem.
RCS kept one archive file for every file being managed. If your file was called "foo.c" then the archive file was called "foo.c,v". Usually these archive files were kept in a subdirectory of the working directory, just one level down. RCS files were plain text; you could just look at them with any editor. Inside the file you would find a bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one for each previous version. (Please forgive me for speaking of RCS in the past tense. Despite all the fond memories, that particular phase of my life is over.)
CVS uses a similar design, albeit with a lot more capabilities. A CVS repository is distinct, completely separate from the working directory, but it still uses ",v" files just like RCS. The directory structure of a CVS repository contains some additional metadata.
When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are exactly the same as the storage challenges of a database. For this reason, many SCM tools use an actual database as the backend data store. Subversion uses Berkeley DB. Vault uses SQL Server 2000. The benefit of this approach is enormous, especially for SCM tools which support atomic transactions. Microsoft has invested lots of time and money to ensure that SQL Server is a safe place to store important information. Data corruption simply doesn't happen. All of the ultra-tricky details of transactions are handled by the underlying database.
Perforce uses somewhat of a hybrid approach, storing all of the metadata in a database but keeping all of the actual file contents in RCS files. This approach trades some safety for speed. Since Perforce manages its own archive files, it has to take responsibility for all the strange things that threaten to corrupt them. On the other hand, writing a file is a bit faster than writing a blob into a SQL database. Perforce has the reputation of being one of the fastest SCM tools.
Managing repositories
Best Practice: Use separate repositories for things which are truly separate
Most SCM tools offer the ability to have multiple distinct repositories. Vault can even host multiple repositories on the same Vault server. People often ask us when this capability should be used.
In general, you should store related items in the same repository. Start a separate repository only in situations where the contents of the two are completely unrelated. In a small ISV, it may be quite logical to have only one repository which contains every project.
Creating a source control repository is kind of a special event. It's a little bit like adopting a cat. People often get a cat without realizing the animal is going to be around for 10-20 years. Your repository may have similar longevity, or even longer.
Shortly after SourceGear was founded in 1997, we created a SourceSafe repository. Over seven years later, that repository is still in use, almost every day. (Along with a whole bunch of legacy projects, it contains the source code for SourceOffSite. We never migrated that project to Vault because we wanted the SourceOffSite developers to continue eating their own dogfood.)
That repository is well over a gigabyte in size (which is actually rather small, but then SourceGear has never been a very big company). It contains thousands of files, thousands of checkins, and has been backed up thousands of times.
Treat your repository well and it will serve you well:
- Obviously you should do regular backups. That repository contains everything your fussy and expensive programmers have ever created. Don't risk losing it.
- Just for fun, take an hour this week and check your backup to see if it actually works. It's shocking how many people are doing daily backups that cannot actually be restored when they are needed.
- Put your repository on a reliable server. If your repository goes down, your entire team is blocked from doing work. Disk drives like to fail, so use RAID. Power supplies like to fail, so get a server with redundant power supplies. The electrical grid likes to fail, so get a good Uninterruptible Power Supply (UPS).
- Be conservative in the way your SCM server machine is managed. Don't put anything on that machine that doesn't need to be there. Don't feel the need to install every single Service Pack on the day it gets released. I've been shocked how many times one of our servers went south simply because we installed a service pack or hotfix from Windows Update. Obviously I want our machines to be kept current with the latest security fixes, but I've been burned too many times not to be cautious. Install those patches on some other machine before you put them on critical servers.
- Keep your SCM server inside a firewall. If you need to allow your developers to access the repository from home, carefully poke a hole, but leave everything else as tight as you can. Make sure your developers are using some sort of bulk encryption. Vault uses SSL. Tools like Perforce, CVS and Subversion can be tunneled through ssh or something similar.
This brief list of tips is hardly a complete guide for administrators. I am merely trying to describe the level of care and caution which should be used for your SCM repository.
Undo
As I have mentioned, one of the best things about source control is that it contains your entire history. Every version of everything is stored. Nothing is ever deleted.
However, sometimes this benefit can be a real pain. What if I made a mistake and checked in something that should not be checked in? My history contains something I would rather forget. I want to pretend that it never happened. Isn't there some way to really delete from a repository?
In general, the recommended way to fix a problem is to checkin a new version which fixes it. Try not to worry about the fact that your repository contains a full history of the error. Your mistakes are a part of your past. Accept them and move on with your life.
However, most SCM tools do provide one or more ways of dealing with this situation. First, there is a command I call "rollback". This command is essentially an "undo" for revisions of a file. For example, let's say that a certain file is at version 7 and we want to go back to version 6. In Vault, we select version 6 and choose the Rollback command.
To be fair, I should admit that the rollback command is not always destructive. In some SCM tools, the rollback feature really does make version 7 disappear forever. Vault's rollback is non-destructive. It simply creates a version 8 which is identical to version 6. The designers of Vault are fanatical purists, or at the very least, one of them is.
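A non-destructive rollback of the kind described for Vault can be pictured as follows. This is a sketch only: the history is simplified to a Python list of file contents, and the `rollback` function is hypothetical, not Vault's actual implementation.

```python
def rollback(history, target_version):
    """Non-destructive rollback: append a new version whose contents match
    `target_version` (1-based). Earlier versions are left untouched."""
    if not 1 <= target_version <= len(history):
        raise ValueError("no such version")
    history.append(history[target_version - 1])
    return len(history)   # the number of the newly created version

# A hypothetical file currently at version 7.
history = [f"contents of version {n}" for n in range(1, 8)]
new_version = rollback(history, 6)
```

After the call, version 8 is identical to version 6, and version 7 is still in the history for anyone who needs to see what went wrong.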
As a concession to those who are less fanatical, Vault does support a way to truly destroy things in a repository. We call this feature "obliterate". I believe Subversion and Perforce use the same term. The obliterate command is the only way to delete something and make it truly gone forever.
Best Practice: Never obliterate anything that was real work
The purist in me wants to recommend that nothing should ever be obliterated. However, my pragmatist side prevails. There are situations where obliterate is not sinful.
However, obliterate should never be used to delete actual work. Don't obliterate a file simply because you discovered it to be a bad idea. Don't obliterate a file simply because you don't need it anymore. Obliterate is for situations where something in the repository should never have been there at all. For example, if you accidentally checkin a gigabyte of MP3s alongside your C++ include files, obliterate is a justifiable choice.
固然,刪除應該決不用於刪除真正的工做。不要由於你發現它很差就刪除一個文件。也不要由於再也不須要就刪除。刪除是爲了一些在庫中根本不須要的。例如,若是你意外的簽入一個MP3的文件到你的C++文件裏面,那刪除就是一個正確的選擇。
In my original spec for Vault, I had decided that we would not implement any form of destructive delete. We eventually decided to compromise and implement this command, but I really wanted to discourage its use. SourceSafe makes it far too easy to rewrite history and pretend that something never happened. In the Delete dialog box, SourceSafe includes a checkbox called "Destroy Permanently". This is an atrocious design decision, roughly equivalent to leaving a sledgehammer next to the server machine so that people can bash the hard disks with it every once in a while. This checkbox is almost irresistible. It simply begs to be checked, even though it is very rarely the right thing to do.
在Vault的原始規則裏面,我曾經肯定咱們不會執行任何破壞性的刪除。咱們最後決定妥協並使用這個命令,可是我真正的但願阻止它的使用。SourceSafe使這個命令很簡單快速的重寫歷史和假設什麼都沒有發生過。在刪除對話框,SourceSafe包含了一個成爲「永久破壞」的選擇框。這是一個很兇悍的設計思想,粗糙的等於拿一個大的錘子讓人們能夠在硬盤旋轉中去敲打服務器。這個選擇框是至關有誘惑的。它簡單的要求檢查,儘管不多有正確的事情來作。
When we first designed the obliterate command for Vault, I wanted its user interface to somehow make the user feel guilty. I argued that the obliterate dialog box should include a photograph of a 75-year-old Catholic nun scowling and holding a yardstick.
當咱們開始爲Vault設計刪除命令的時候,我但願它的用戶界面可以使用戶莫名其妙的以爲不舒服。我辯論說這個刪除對話框包含了一個拿着一根繩子的75歲的修女。
The rest of the team agreed that we should discourage people from using this command, but in the end, we settled on a less graphical approach. In Vault, the obliterate command is available only in the Admin client, not the regular client people use every day. In effect, we made the obliterate command available, but inconvenient. People who really need to obliterate can find the command and get it done. Everyone else has to think twice before they try to rewrite history and pretend something never happened.
其餘的團隊成員贊成我應該勸阻人民不要使用這個命令,可是到最後,咱們決定採起了一個小的圖形方式。在Vault裏面,刪除命令是僅僅在管理員端可使用的,不是其餘的客戶端的客戶能夠天天使用的。咱們還使這個命令可用,卻並不方便。真正須要刪除的人們能夠找這個命令而後執行。其餘的人在他們試圖重寫歷史並假裝什麼事情都沒有發生以前須要思考兩次。
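The "available, but inconvenient" idea can be sketched in a few lines. This is an illustration only, not Vault's API: the `Repository` class, the `from_admin_client` flag, and the file path are all assumptions made up for the example.

```python
class Repository:
    """Hypothetical repository: maps each path to its list of versions."""

    def __init__(self):
        self._histories = {}

    def checkin(self, path, content):
        self._histories.setdefault(path, []).append(content)

    def has_history(self, path):
        return path in self._histories

    def obliterate(self, path, *, from_admin_client=False):
        # Destructive: every version of the file is gone afterwards.
        if not from_admin_client:
            raise PermissionError("obliterate is only available in the admin client")
        del self._histories[path]


repo = Repository()
repo.checkin("music/song.mp3", b"accidental checkin")

try:
    repo.obliterate("music/song.mp3")  # attempt from the regular client
    refused = False
except PermissionError:
    refused = True

still_there = repo.has_history("music/song.mp3")
repo.obliterate("music/song.mp3", from_admin_client=True)
gone = not repo.has_history("music/song.mp3")
```

The command is still available to anyone who really needs it, but the everyday client cannot reach it.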
Kimchi again?
再來點韓國泡菜?
Recently when I asked my fifth grade daughter what she had learned in school, she proudly informed me that "everyone in Korea eats kimchi at every meal, every day". In the world of a ten-year-old, things are simpler. Rules don't have exceptions. Generalizations always apply.
最近我問我五年級的女兒她從學校學到了什麼,她驕傲的告訴我「在韓國的人天天、每頓都吃韓國泡菜」。在一個十歲的年紀,事情很是簡單。規則沒有例外。一般老是被運用。
This is how we learn. We understand the basic rules first and see the finer points later. First we learn that memory leaks are impossible in the CLR. Later, when our app consumes all available RAM, we learn more.
這就是咱們如何來學習。咱們首先了解了基本規則,而後再看重點。首先咱們認識到內存泄漏在語音錄音器裏面是不可能的。後來,當咱們的程序消耗了全部可用的RAM,咱們就學到了更多。
My habit as I write these chapters is to first present the basics in a "matter of fact" fashion, rarely acknowledging that there are exceptions to my broad generalizations. I did this during the chapter on checkins, failing to mention the "edit-merge-commit" model until I had thoroughly explored "checkout-edit-checkin".
個人習慣就象我寫這些文章同樣,首先以一種事實方式呈現基礎,個人寬泛的歸納很罕見的獲得承認。我在章節簽入裏面作這些事情,直到我完全的研究了「簽出-編輯-簽入」以前我都沒有說起「編輯-合併-提交」。
In this chapter, I have written everything from the perspective of just one specific architecture. SCM tools like Vault, Perforce, CVS and Subversion are based on the concept of a centralized server which hosts a single repository. Each client has a working folder. All clients contact the same server.
在這個章節,我只以一個特定結構的見解去描述每件事情。配置管理工具,好比Vault,Perforce,CVS和Subversion都是基於集中只有一個單獨的庫的服務器的概念。每一個客戶端有一個工做目錄,全部的客戶端同同一臺服務器聯繫。
I confess that not all SCM tools work this way. Tools like BitKeeper and Arch are based on the concept of distributed repositories. Instead of one repository, there can be several, or even many. Things can be retrieved from or committed to any repository at any time. The repositories are synchronized by migrating changesets from one repository to another. This results in a merge situation which is not altogether different from merging branches.
我認可不是全部的配置管理工具都是用那種方式工做。好比BitKeeper 和Arch都是基於分佈式數據庫的。一個庫能夠有好幾個,甚至更多。工做可以在任什麼時候間從任何庫中得到或提交。這個庫是經過從一個庫移動變動到另外一個庫同步的。在一個合併的地方這個結果不是同合併分支差別相同的。
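As a rough illustration of synchronizing by migrating changesets, here is a toy model. It does not reflect how BitKeeper or Arch are actually implemented; the `Repo` class, the changeset ID scheme, and `pull_from` are hypothetical. A changeset is reduced to an ID and a description, and the sketch tracks only which changesets each repository knows about. A real tool would also have to merge the file contents.

```python
import itertools


class Repo:
    """Hypothetical distributed repository holding a set of changesets."""

    def __init__(self, name):
        self.name = name
        self.changesets = {}  # changeset id -> description
        self._counter = itertools.count(1)

    def commit(self, description):
        """Record a new local changeset under an ID unique to this repository."""
        cs_id = f"{self.name}-{next(self._counter)}"
        self.changesets[cs_id] = description
        return cs_id

    def pull_from(self, other):
        """Copy over every changeset the other repository has and we lack."""
        missing = {
            cs_id: desc
            for cs_id, desc in other.changesets.items()
            if cs_id not in self.changesets
        }
        self.changesets.update(missing)
        return sorted(missing)


alice = Repo("alice")
bob = Repo("bob")
alice.commit("fix parser")
bob.commit("add logging")

pulled = alice.pull_from(bob)  # alice now has both changesets
bob.pull_from(alice)           # and so does bob
print(pulled)
print(alice.changesets == bob.changesets)
```

After both pulls, the two repositories hold the same set of changesets, even though neither one is "the" server.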
From the perspective of this SCM geek, distributed repositories are an attractive concept. Admittedly, they are advanced and complex, requiring a bit more of a learning curve on the part of the end user. But for the power user, this paradigm for source control is very cool.
關於這個配置管理討厭的見解,分佈庫是一個吸引人的概念。誠然,他們是高級和複雜的,須要終端用戶更多的學習。可是對高級用戶,這個例子對版本控制很是酷。
Having no experience in the implementation of these systems, I will not be explaining their behavior in any detail. Suffice it to say that this approach is similar in some ways, but very different in others. This series of articles will continue to focus on the more mainstream architecture for source control.
尚未執行這些系統的經驗,我將不會解釋他們的行爲。有力的說明這個方式在某些地方是相同的,可是又同其餘的很是不一樣。這個系列文章將繼續關注主流結構的版本控制工具。
Looking ahead
In this chapter, I discussed the details of repositories. In the next chapter, I'll go back over to the client side and dive into the details of working folders.
這一章節,我論述了關於庫的狀況。下一章節,我將回頭來描述客戶端和深刻鑽研工做目錄