Hadoop Study Notes (8): Building an Inverted Index in Practice

Date: 2019-12-11

Tags: hadoop, study notes, hands-on, index

Original article: http://www.cnblogs.com/zjfstudio/p/3913549.html


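The inverted index is the most commonly used data structure in document retrieval systems. Instead of going from a document to the words it contains, it goes the other way: starting from a word, it finds the documents that contain it and how often the word occurs in each. That reversal is where the name inverted index comes from.

In the index table, each word maps to the list of documents in which it occurs, and the weight is the number of times the word occurs in that document. Now suppose the input is the following list of files: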

T1: hello world hello china

T2: hello hadoop

T3: bye world bye hadoop bye bye

Given these files as input, we should end up with an index file like this:

bye T3:4;

china T1:1;

hadoop T2:1;T3:1;

hello T1:2;T2:1;

world T1:1;T3:1;
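
Before bringing in Hadoop, the target mapping can be checked with plain Java. The following is only a minimal sketch, not the MapReduce job itself: the class name `InvertedIndexSketch` and the methods `build` and `format` are hypothetical names chosen for illustration, and the three documents are held in memory instead of being read from HDFS.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Builds word -> (document -> count); TreeMap keeps both levels sorted by key.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            StringTokenizer itr = new StringTokenizer(doc.getValue());
            while (itr.hasMoreTokens()) {
                index.computeIfAbsent(itr.nextToken(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Formats one posting list in the "T1:2;T2:1;" style used above.
    static String format(Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> p : postings.entrySet()) {
            sb.append(p.getKey()).append(':').append(p.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("T1", "hello world hello china");
        docs.put("T2", "hello hadoop");
        docs.put("T3", "bye world bye hadoop bye bye");
        // Prints one line per word, e.g. "hello T1:2;T2:1;"
        for (Map.Entry<String, Map<String, Integer>> e : build(docs).entrySet()) {
            System.out.println(e.getKey() + " " + format(e.getValue()));
        }
    }
}
```

Running `main` prints the five index lines listed above. The nested map groups by word first and by file second, which is the same grouping the Hadoop job has to arrive at through its key design.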

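Next, we need to find a way to use Hadoop to turn this input into that output. As the previous chapter showed, this comes down to working out how to customize the steps of a Hadoop job so that they do the work for us. Of all the steps, map and reduce matter most, and the rest play supporting roles, so let us first work out what the map and reduce phases should look like.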

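Start with the map phase. Its input is text, fed in one line record at a time. What about its output? It should carry three things: the word, the file the word came from, and a count. Map output is a key-value pair, so which of the three belongs in the key and which in the value? The count has to be accumulated, so it clearly belongs in the value, and the word belongs in the key. What about the file? Occurrences of the same word in different files must not be added together, so the file belongs in the key as well. The key therefore holds two values, the word and the file, while the value is the default count of 1, which reduce later merges.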

So the map output should look like this:

Key value

hello:T1 1

hello:T1 1

world:T1 1

china:T1 1

hello:T2 1

…
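
The emission rule described above can be sketched in plain Java without any Hadoop classes. This is only an illustration under assumptions: `MapEmitSketch` and `emit` are hypothetical names, each pair is rendered as the string `word:file 1`, and pairs come out in the order the tokens appear in the line, before any sorting by the framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MapEmitSketch {
    // For one line of one file, emits a "word:file 1" pair per token.
    static List<String> emit(String file, String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // Key = word plus file, value = the default count of 1.
            pairs.add(itr.nextToken() + ":" + file + " 1");
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : emit("T1", "hello world hello china")) {
            System.out.println(pair);
        }
    }
}
```

Note that `hello:T1` is emitted twice with a value of 1 each time. Nothing is summed during map, and the two pairs share a key only because both the word and the file match.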

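Because this key is composite, the usual built-in types no longer meet our needs, so we have to define a composite key. How to write one was covered in the previous chapter, so here we go straight to the code: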

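import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key made of a word and the file it was found in.
public static class MyType implements WritableComparable<MyType> {
    private String word;
    private String filePath;

    // Hadoop needs a public no-argument constructor to deserialize the key.
    public MyType() {
    }

    public String Getword() { return word; }
    public void Setword(String value) { word = value; }
    public String GetfilePath() { return filePath; }
    public void SetfilePath(String value) { filePath = value; }

    // Serialize both fields; readFields must read them in the same order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeUTF(filePath);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        filePath = in.readUTF();
    }

    // Sort by word first, then by file path.
    @Override
    public int compareTo(MyType arg0) {
        // Use equals(): != on Strings compares references, not contents.
        if (!word.equals(arg0.word))
            return word.compareTo(arg0.word);
        return filePath.compareTo(arg0.filePath);
    }
}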
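
To see the serialization and ordering of such a key without a Hadoop cluster, the same field layout can be exercised against the JDK's `DataOutputStream` and `DataInputStream`, which implement `DataOutput` and `DataInput`. This is a sketch under assumptions: `WordFileKey` is a hypothetical stand-in that mirrors the composite key but implements only `Comparable`, and `roundTrip` is a helper invented for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class WordFileKey implements Comparable<WordFileKey> {
    String word;
    String filePath;

    WordFileKey() { }

    WordFileKey(String word, String filePath) {
        this.word = word;
        this.filePath = filePath;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeUTF(filePath);
    }

    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        filePath = in.readUTF();
    }

    // Orders by word first, then by file path.
    @Override
    public int compareTo(WordFileKey other) {
        int byWord = word.compareTo(other.word);
        return byWord != 0 ? byWord : filePath.compareTo(other.filePath);
    }

    // Writes the key to a byte array and reads it back into a fresh instance.
    static WordFileKey roundTrip(WordFileKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordFileKey copy = new WordFileKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String toString() {
        return word + ":" + filePath;
    }

    public static void main(String[] args) {
        List<WordFileKey> keys = new ArrayList<>();
        keys.add(new WordFileKey("hello", "T2"));
        keys.add(new WordFileKey("bye", "T3"));
        keys.add(new WordFileKey("hello", "T1"));
        Collections.sort(keys);
        System.out.println(keys); // prints [bye:T3, hello:T1, hello:T2]
        System.out.println(roundTrip(keys.get(1))); // prints hello:T1
    }
}
```

A deserialized key holds new `String` objects, so a reference comparison between its fields and those of another key would report a difference even when the text is identical. Comparing by content is what keeps equal keys adjacent after sorting.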

With the composite key defined, the map function is easy to write:

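import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public static class InvertedIndexMapper extends
        Mapper<Object, Text, MyType, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws InterruptedException, IOException {
        // The input split tells us which file the current line came from.
        FileSplit split = (FileSplit) context.getInputSplit();
        // Strip the input directory so only the file name (e.g. T1) remains.
        String file = split.getPath().toUri().getPath().replace("/user/zjf/in/", "");
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            MyType keyInfo = new MyType();
            keyInfo.Setword(itr.nextToken());
            keyInfo.SetfilePath(file);
            // Emit ((word, file), "1"); the counts are merged later.
            context.write(keyInfo, new Text("1"));
        }
    }
}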
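
The file name in the key comes from stripping the input directory off the split's path. The string handling can be tried in isolation with `java.net.URI`. This is a sketch under assumptions: the `hdfs://namenode:9000` prefix is a made-up example URI, and `FileNameSketch`, `viaReplace`, and `viaLastSegment` are hypothetical names. Only the `/user/zjf/in/` directory comes from the mapper above.

```java
import java.net.URI;

class FileNameSketch {
    // Same approach as the mapper: drop a hard-coded input directory.
    static String viaReplace(String uri) {
        return URI.create(uri).getPath().replace("/user/zjf/in/", "");
    }

    // Alternative that does not depend on the directory: keep the last path segment.
    static String viaLastSegment(String uri) {
        String path = URI.create(uri).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String uri = "hdfs://namenode:9000/user/zjf/in/T1"; // hypothetical location
        System.out.println(viaReplace(uri));     // prints T1
        System.out.println(viaLastSegment(uri)); // prints T1
    }
}
```

The `replace` approach only yields a bare file name while the input stays under `/user/zjf/in/`. Taking the last path segment, which is also what Hadoop's `Path.getName()` returns, avoids tying the mapper to one directory.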