Indexing and Searching on a Hadoop Distributed File System

FROM: http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241?pgno=3

In today's information-saturated world, the tremendous growth of geographically distributed data calls for systems that enable fast parsing and retrieval of meaningful results. A searchable index of the distributed data goes a long way toward speeding up that process. In this article, I demonstrate how to use Lucene and Java for basic data indexing and searching, how to use RAM directories for indexing and searching, how to create an index of data residing on HDFS, and how to search those indexes. The development environment was Eclipse 3.4.2, Java 1.6, Lucene 2.4.0, and Hadoop 0.19.1, running on Microsoft Windows XP SP3.

To tackle this task, I turned to Hadoop. The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, and the Hadoop Distributed File System (HDFS) is designed for storing and sharing files across wide area networks. HDFS is built to run on commodity hardware and provides fault tolerance, resource management, and, most importantly, high-throughput access to application data.

Creating an Index on the Local File System

The first step is to create an index of the data stored on the local file system. Start by creating an Eclipse project, creating a class, and then adding the required JAR files to the project. Take as an example this application data found in a web server log file:

2010-04-21 02:24:01 GET /blank 200 120

This data maps to the following fields:

  • 2010-04-21 - the date field
  • 02:24:01 - the time field
  • GET - the method field (GET or POST) - we will denote it as "cs-method"
  • /blank - the requested URL field - we will denote it as "cs-uri"
  • 200 - the status code of the request - we will denote it as "sc-status"
  • 120 - the time-taken field (the time required to complete the request)

The data in our sample file, named "test.txt" and located in "E:\DataFile", is as follows:
    2010-04-21 02:24:01 GET /blank 200 120
    2010-04-21 02:24:01 GET /US/registrationFrame 200 605
    2010-04-21 02:24:02 GET /US/kids/boys 200 785
    2010-04-21 02:24:02 POST /blank 304 56
    2010-04-21 02:24:04 GET /blank 304 233
    2010-04-21 02:24:04 GET /blank 500 567
    2010-04-21 02:24:04 GET /blank 200 897
    2010-04-21 02:24:04 POST /blank 200 567
    2010-04-21 02:24:05 GET /US/search 200 658
    2010-04-21 02:24:05 POST /US/shop 200 768
    2010-04-21 02:24:05 GET /blank 200 347
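The field mapping above can be sketched in plain Java with String.split, no Lucene required; the LogRow class name is just for illustration:

```java
public class LogRow {
    // Splits one space-delimited log line into the six fields the article indexes.
    // Field order: date, time, cs-method, cs-uri, sc-status, time-taken.
    public static String[] parse(String row) {
        return row.split(" ");
    }

    public static void main(String[] args) {
        String[] f = parse("2010-04-21 02:24:01 GET /blank 200 120");
        System.out.println("date=" + f[0] + " cs-uri=" + f[3]);
    }
}
```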


We want to index the data present in this "test.txt" file and save the index to the local file system. The following Java code does this. (Note the comments detailing what each part of the code does.)

// Creating an IndexWriter object and specifying the path where the indexed
// files are to be stored.
IndexWriter indexWriter = new IndexWriter("E://DataFile/IndexFiles", new StandardAnalyzer(), true);

// Creating a BufferedReader object and specifying the path of the file
// whose data needs to be indexed.
BufferedReader reader = new BufferedReader(new FileReader("E://DataFile/Test.txt"));

String row = null;

// Reading each line present in the file.
while ((row = reader.readLine()) != null)
{
    // Getting each field present in a row into an array; the file delimiter is a space.
    String Arow[] = row.split(" ");

    // For each row, creating a document and adding data to the document with the associated fields.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();

    document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));

    // Adding the document to the index.
    indexWriter.addDocument(document);
}
indexWriter.optimize();
indexWriter.close();
reader.close();

Once this Java code is executed, the index files are created and stored at the location "E://DataFile/IndexFiles".

Now we can search for data in the index files we just created. Basically, searching is done on the field data. Lucene supports a variety of search semantics: you can perform a search on a particular field or on a combination of fields. The following Java code searches the index:

// Creating a Searcher object and specifying the path where the index files are stored.
Searcher searcher = new IndexSearcher("E://DataFile/IndexFiles");
Analyzer analyzer = new StandardAnalyzer();

// Printing the total number of documents or entries present in the index.
System.out.println("Total Documents = " + searcher.maxDoc());

// Creating the QueryParser object and specifying the field name on
// which the search has to be done.
QueryParser parser = new QueryParser("cs-uri", analyzer);

// Creating the Query object and specifying the text for which the search has to be done.
Query query = parser.parse("/blank");

// The line below performs the search on the index.
Hits hits = searcher.search(query);

// Printing the number of documents or entries that match the search query.
System.out.println("Number of matching documents = " + hits.length());

// Printing the documents (or rows of the file) that matched the search criteria.
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("date") + " " + doc.get("time") + " " +
        doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
}

In this example, the search is done on the field "cs-uri", and the text searched for within the "cs-uri" field is "/blank". So, when the search code is run, all the documents (or rows) whose cs-uri field contains "/blank" are displayed in the output. The output is shown below:


Total Documents = 11
Number of matching documents = 7
2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /blank 200 347
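As a sanity check, the match count above can be reproduced without Lucene by filtering the sample rows on their cs-uri field. This only illustrates what the query matches, not how Lucene evaluates it; the BlankFilter class is hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class BlankFilter {
    static final List<String> ROWS = Arrays.asList(
        "2010-04-21 02:24:01 GET /blank 200 120",
        "2010-04-21 02:24:01 GET /US/registrationFrame 200 605",
        "2010-04-21 02:24:02 GET /US/kids/boys 200 785",
        "2010-04-21 02:24:02 POST /blank 304 56",
        "2010-04-21 02:24:04 GET /blank 304 233",
        "2010-04-21 02:24:04 GET /blank 500 567",
        "2010-04-21 02:24:04 GET /blank 200 897",
        "2010-04-21 02:24:04 POST /blank 200 567",
        "2010-04-21 02:24:05 GET /US/search 200 658",
        "2010-04-21 02:24:05 POST /US/shop 200 768",
        "2010-04-21 02:24:05 GET /blank 200 347");

    // Counts rows whose cs-uri field (index 3 after splitting on spaces) equals the given value.
    static long countMatches(String uri) {
        return ROWS.stream().filter(r -> r.split(" ")[3].equals(uri)).count();
    }

    public static void main(String[] args) {
        System.out.println("Number of matching documents = " + countMatches("/blank")); // prints 7
    }
}
```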


Memory-Based Indexing on HDFS

Now consider the case where the data is located on a distributed file system such as Hadoop DFS. The code above will not work for creating an index directly on distributed data, so we would have to complete several preliminary steps, such as copying the data from HDFS to the local file system, creating an index of the data now on the local file system, and finally storing the index files back to HDFS. The same steps would be required for searching. This approach is time-consuming and far from optimal; instead, let us index and search our data using the memory of the HDFS nodes where the data resides.

Assume that the data file "Test.txt" used earlier now resides on HDFS, inside a working-directory folder, as "/DataFile/Test.txt". Create another folder, "/IndexFiles", inside the HDFS working directory, where our generated index files will be stored. The following Java code creates index files in memory for a file stored on HDFS:


// Path where the index files will be stored.
String Index_DIR = "/IndexFiles/";
// Path where the data file is stored.
String File_DIR = "/DataFile/test.txt";

// Creating a FileSystem object, to be able to work with HDFS.
Configuration config = new Configuration();
config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
FileSystem dfs = FileSystem.get(config);

// Creating a RAMDirectory (memory) object, to be able to create the index in memory.
RAMDirectory rdir = new RAMDirectory();

// Creating an IndexWriter object for the RAM directory.
IndexWriter indexWriter = new IndexWriter(rdir, new StandardAnalyzer(), true);

// Creating an FSDataInputStream object, for reading the data from the "Test.txt" file residing on HDFS.
FSDataInputStream filereader = dfs.open(new Path(dfs.getWorkingDirectory() + File_DIR));
String row = null;

// Reading each line present in the file.
while ((row = filereader.readLine()) != null)
{
    // Getting each field present in a row into an array; the file delimiter is a space.
    String Arow[] = row.split(" ");

    // For each row, creating a document and adding data to the document
    // with the associated fields.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();

    document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));

    // Adding the document to the index.
    indexWriter.addDocument(document);
}
indexWriter.optimize();
indexWriter.close();
filereader.close();

So, for the "Test.txt" file residing on HDFS, we now have index files created in memory. To store the index files in the HDFS folder:

// Getting the files present in memory into an array.
String fileList[] = rdir.list();

// Reading the index files from memory and storing them to HDFS.
for (int i = 0; i < fileList.length; i++)
{
    IndexInput indxfile = rdir.openInput(fileList[i].trim());
    long len = indxfile.length();
    int len1 = (int) len;

    // Reading data from the file into a byte array.
    byte[] bytarr = new byte[len1];
    indxfile.readBytes(bytarr, 0, len1);

    // Creating a file in the HDFS directory with the same name as that of the
    // index file.
    Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim());
    dfs.createNewFile(src);

    // Writing the data from the byte array to the file in HDFS.
    FSDataOutputStream fs = dfs.create(new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim()), true);
    fs.write(bytarr);
    fs.close();
}

We now have the index files for the "Test.txt" data file created and stored in the HDFS directory.
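The read-into-a-byte-array, write-out pattern used for each index file can be sketched with plain java.io streams; ByteCopy and slurp are illustrative stand-ins for the IndexInput/FSDataOutputStream calls, and no Hadoop is required:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ByteCopy {
    // Reads exactly len bytes from the source into a byte array, mirroring how
    // the article shuttles each index file between memory and HDFS.
    static byte[] slurp(InputStream in, int len) {
        try {
            byte[] bytarr = new byte[len];
            int off = 0;
            while (off < len) {
                int n = in.read(bytarr, off, len - off);
                if (n < 0) throw new IOException("unexpected EOF");
                off += n;
            }
            return bytarr;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "lucene segment bytes".getBytes();
        byte[] copy = slurp(new ByteArrayInputStream(data), data.length);
        System.out.println(copy.length); // prints 20
    }
}
```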

Memory-Based Searching on HDFS

We can now search the index stored in HDFS. First, we have to make the HDFS index files available in memory for searching. The following code is used for this process:

// Creating a FileSystem object, to be able to work with HDFS.
Configuration config = new Configuration();
config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
FileSystem dfs = FileSystem.get(config);

// Creating a RAMDirectory (memory) object, to be able to load the index into memory.
RAMDirectory rdir = new RAMDirectory();

// Getting the list of index files present in the directory into an array.
Path pth = new Path(dfs.getWorkingDirectory() + Index_DIR);
FileSystemDirectory fsdir = new FileSystemDirectory(dfs, pth, false, config);
String filelst[] = fsdir.list();
FSDataInputStream filereader = null;
for (int i = 0; i < filelst.length; i++)
{
    // Reading data from the index files in the HDFS directory into the filereader object.
    filereader = dfs.open(new Path(dfs.getWorkingDirectory() + Index_DIR + filelst[i]));

    int size = filereader.available();

    // Reading data from the file into a byte array.
    byte[] bytarr = new byte[size];
    filereader.read(bytarr, 0, size);

    // Creating a file in the RAM directory with the same name as that of the
    // index file present in the HDFS directory.
    IndexOutput indxout = rdir.createOutput(filelst[i]);

    // Writing data from the byte array to the file in the RAM directory.
    indxout.writeBytes(bytarr, bytarr.length);
    indxout.flush();
    indxout.close();
}
filereader.close();

Now that we have all the required index files in the RAM directory (memory), we can directly perform a search on those index files. The search code is similar to that used for the local file system; the only change is that the search object is now created using the RAM directory object (rdir) instead of the local file-system directory path:


Searcher searcher = new IndexSearcher(rdir);
Analyzer analyzer = new StandardAnalyzer();

System.out.println("Total Documents = " + searcher.maxDoc());

QueryParser parser = new QueryParser("time", analyzer);

Query query = parser.parse("02\\:24\\:04");

Hits hits = searcher.search(query);

System.out.println("Number of matching documents = " + hits.length());

for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("date") + " " + doc.get("time") + " " +
        doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
}


In the following output, the search is done on the field "time", and the text searched for within the "time" field is "02\:24\:04" (the colons are escaped because ":" is a special character for Lucene's QueryParser). So, when the code is run, all the documents (or rows) whose "time" field contains "02:24:04" are displayed in the output:


Total Documents = 11
Number of matching documents = 4
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
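A minimal sketch of that colon escaping in plain Java: because ":" normally separates a field name from its value in a Lucene query string, a literal time value must have each colon prefixed with a backslash before parsing. The escapeColons helper is hypothetical, not a Lucene API:

```java
public class QueryEscape {
    // Prefixes each ':' with a backslash so a query parser would treat it as
    // literal text rather than a field separator.
    static String escapeColons(String text) {
        return text.replace(":", "\\:");
    }

    public static void main(String[] args) {
        System.out.println(escapeColons("02:24:04")); // prints 02\:24\:04
    }
}
```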

Conclusion

Distributed file systems like HDFS are a powerful tool for storing and accessing the vast amounts of data available to us today. With memory-based indexing and searching, getting to the data you actually want among the mountains of data you don't care about becomes a little easier.
