Serialization is the process of converting structured objects into a byte stream, so that they can be transmitted over a network or written to disk for persistent storage; conversely, deserialization is the process of turning a byte stream back into structured objects.
In a distributed system, one process serializes an object into a byte stream and sends it over the network to another process, which receives the byte stream and deserializes it back into a structured object, thereby achieving inter-process communication. In Hadoop, communication between stages such as Mapper, Combiner, and Reducer all relies on serialization and deserialization. For example, the intermediate results produced by a Mapper (<key: value1, value2...>) must be written to local disk, which is a serialization process (structured objects are converted into a byte stream and written to disk), while the Reducer reading the Mapper's intermediate results is a deserialization process (the byte-stream file stored on disk is read and converted back into structured objects). Note that only byte streams can travel over the network: when the Mapper's intermediate results are shuffled between different hosts, the objects go through both serialization and deserialization.
Serialization is part of Hadoop's core. In Hadoop, the Writable interface in the org.apache.hadoop.io package is the implementation of Hadoop's serialization format.
Hadoop's Writable interface is a serialization protocol built on DataInput and DataOutput; it is compact (it uses storage space efficiently) and fast (the overhead of reading, writing, serializing, and deserializing data is low). Keys and values in Hadoop must be objects that implement the Writable interface (keys must additionally implement WritableComparable so that they can be sorted).
The following is the declaration of the Writable interface in Hadoop (Hadoop 1.1.2 is used here):
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;

public interface Writable {

  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOuput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * <p>For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.</p>
   *
   * @param in <code>DataInput</code> to deseriablize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}
Hadoop itself provides a variety of concrete Writable classes, covering the common Java primitive types (boolean, byte, short, int, float, long, double, and so on) and collection types (BytesWritable, ArrayWritable, MapWritable, and so on). These types all live in the org.apache.hadoop.io package.
(Image source: safaribooksonline.com)
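As a quick illustration of these built-in types, the snippet below (our own sketch, with illustrative class and variable names) serializes an IntWritable and a Text into an in-memory byte stream and reads them back in the same order:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BuiltInWritableDemo {

    public static void main(String[] args) throws IOException {
        // Serialize: Writable objects -> byte stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        new IntWritable(163).write(out);
        new Text("hadoop").write(out);
        out.close();

        // Deserialize: byte stream -> Writable objects, read back in write order
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        IntWritable i = new IntWritable();
        Text t = new Text();
        i.readFields(in);
        t.readFields(in);
        in.close();

        System.out.println(i.get() + ", " + t); // prints: 163, hadoop
    }
}

The same write()/readFields() pattern applies to every Writable, which is what makes the custom class shown later so mechanical to implement.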
Although Hadoop offers many built-in Writable classes to choose from, and the Writable wrappers for the Java primitive types ship with registered RawComparator implementations that let these objects be sorted at the byte-stream level, with no deserialization required, which greatly reduces comparison overhead, the built-in Writable classes cannot satisfy us once we need more complex objects (note that the Writable collection types Hadoop provides have no registered raw comparators, so they do not meet this need either). In such cases we have to write our own custom Writable class, especially when the object serves as a key, in order to achieve compact storage and fast comparison.
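To make the byte-level comparison concrete, the following sketch (our own example against the Hadoop 1.x API) obtains the raw comparator registered for IntWritable via WritableComparator.get() and compares two serialized values without deserializing either one:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparator;

public class RawCompareDemo {

    // Helper of our own: serialize a Writable into a byte array
    private static byte[] serialize(IntWritable w) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        w.write(out);
        out.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] b1 = serialize(new IntWritable(100));
        byte[] b2 = serialize(new IntWritable(200));

        // The comparator registered for IntWritable works on the raw bytes directly
        WritableComparator comparator = WritableComparator.get(IntWritable.class);
        int result = comparator.compare(b1, 0, b1.length, b2, 0, b2.length);
        System.out.println(result); // negative: 100 sorts before 200
    }
}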
The example below shows how to write a custom Writable class. A custom Writable class must first implement the Writable (or WritableComparable) interface, and then provide write(DataOutput out) and readFields(DataInput in) methods to control how the class is converted into a byte stream (the write method) and how it is converted back from a byte stream into a Writable object (the readFields method).
package com.yoyzhou.weibo;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.io.Writable;

/**
 * This MyWritable class demonstrates how to write a custom Writable class.
 */
public class MyWritable implements Writable {

    private VLongWritable field1;
    private VLongWritable field2;

    public MyWritable() {
        this.set(new VLongWritable(), new VLongWritable());
    }

    public MyWritable(VLongWritable fld1, VLongWritable fld2) {
        this.set(fld1, fld2);
    }

    public void set(VLongWritable fld1, VLongWritable fld2) {
        // make sure the smaller field is always put as field1
        if (fld1.get() <= fld2.get()) {
            this.field1 = fld1;
            this.field2 = fld2;
        } else {
            this.field1 = fld2;
            this.field2 = fld1;
        }
    }

    // How to write and read MyWritable fields to/from DataOutput and DataInput streams
    @Override
    public void write(DataOutput out) throws IOException {
        field1.write(out);
        field2.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        field1.readFields(in);
        field2.readFields(in);
    }

    /** Returns true if <code>o</code> is a MyWritable with the same values. */
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MyWritable))
            return false;
        MyWritable other = (MyWritable) o;
        return field1.equals(other.field1) && field2.equals(other.field2);
    }

    @Override
    public int hashCode() {
        return field1.hashCode() * 163 + field2.hashCode();
    }

    @Override
    public String toString() {
        return field1.toString() + "\t" + field2.toString();
    }
}
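A simple way to verify that write() and readFields() mirror each other is to round-trip an instance through an in-memory stream. The test-style snippet below is our own sketch (assuming it sits in the same com.yoyzhou.weibo package as MyWritable):

package com.yoyzhou.weibo;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.VLongWritable;

public class MyWritableRoundTrip {

    public static void main(String[] args) throws IOException {
        MyWritable original = new MyWritable(new VLongWritable(9000L), new VLongWritable(90L));

        // Serialize: MyWritable -> byte stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        original.write(out);
        out.close();

        // Deserialize: byte stream -> a fresh MyWritable
        MyWritable restored = new MyWritable();
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        restored.readFields(in);
        in.close();

        // set() keeps the smaller value in field1, so both objects hold (90, 9000)
        System.out.println(original.equals(restored)); // true
        System.out.println(restored);                  // 90	9000
    }
}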
To be continued: the next post will look at how many bytes a Writable object occupies when serialized to a byte stream, and how that byte sequence is laid out.
Reference: Tom White, Hadoop: The Definitive Guide, 3rd Edition