FROM:http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241?pgno=3html
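In today's information-saturated world, the enormous growth of geographically distributed data calls for a system that supports fast parsing and retrieval of meaningful results. A searchable index of the distributed data goes a long way toward speeding up that process. In this article, I demonstrate how to use Lucene and Java for basic data indexing and searching, how to use a RAM directory for indexing and searching, how to create indexes for data residing on HDFS, and how to search those indexes. The development environment consisted of Java 1.6, Eclipse 3.4.2, Lucene 2.4.0, and Hadoop 0.19.1, running on Microsoft Windows XP SP3.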
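To tackle this task, I turned to Hadoop. The Apache Hadoop project develops open source software for reliable, scalable, distributed computing, and the Hadoop Distributed File System (HDFS) is designed for storing and sharing files across wide-area networks. HDFS is built to run on commodity hardware and provides fault tolerance, resource management, and, most importantly, high-throughput access to application data.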
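The first step is to create an index on data stored on the local file system. Start by creating a project in Eclipse, create a class, and then add the required JAR files to the project. Take this example of application data found in a web server log file: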
2010-04-21 02:24:01 GET /blank 200 120
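Each row of this data maps to a fixed set of fields (date, time, cs-method, cs-uri, sc-status, and time-taken, as named in the indexing code). The full data set is: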
    2010-04-21 02:24:01 GET /blank 200 120
    2010-04-21 02:24:01 GET /US/registrationFrame 200 605
    2010-04-21 02:24:02 GET /US/kids/boys 200 785
    2010-04-21 02:24:02 POST /blank 304 56
    2010-04-21 02:24:04 GET /blank 304 233
    2010-04-21 02:24:04 GET /blank 500 567
    2010-04-21 02:24:04 GET /blank 200 897
    2010-04-21 02:24:04 POST /blank 200 567
    2010-04-21 02:24:05 GET /US/search 200 658
    2010-04-21 02:24:05 POST /US/shop 200 768
    2010-04-21 02:24:05 GET /blank 200 347
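The row-to-field mapping can be sketched in plain Java, without Lucene. The field names are the ones used in this article's indexing code; the class name `LogRowParser` and its `parse` method are hypothetical helpers introduced only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LogRowParser {
    // Field names in column order, as used by the indexing code in this article.
    static final String[] FIELDS = {
        "date", "time", "cs-method", "cs-uri", "sc-status", "time-taken"
    };

    // Splits one space-delimited log row and pairs each value with its field name.
    static Map<String, String> parse(String row) {
        String[] values = row.trim().split(" ");
        if (values.length != FIELDS.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELDS.length + " fields but got " + values.length + ": " + row);
        }
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (int i = 0; i < FIELDS.length; i++) {
            fields.put(FIELDS[i], values[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("2010-04-21 02:24:01 GET /blank 200 120");
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Rejecting rows with an unexpected number of columns is a design choice of this sketch; indexing code that reads the array positions directly would instead fail with an ArrayIndexOutOfBoundsException on a short row.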
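The data we want to index is in this "Test.txt" file, and the index will be saved to the local file system. The following Java code does this. (See the comments for details on what each part of the code does.)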
    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Creating IndexWriter object and specifying the path where the index
    // files are to be stored.
    IndexWriter indexWriter = new IndexWriter("E://DataFile/IndexFiles", new StandardAnalyzer(), true);

    // Creating BufferedReader object and specifying the path of the file
    // whose data is to be indexed.
    BufferedReader reader = new BufferedReader(new FileReader("E://DataFile/Test.txt"));

    String row = null;

    // Reading each line present in the file.
    while ((row = reader.readLine()) != null)
    {
        // Splitting the space-delimited row into an array of field values.
        String Arow[] = row.split(" ");

        // For each row, creating a document and adding the data to it under the associated fields.
        org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();

        document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));

        // Adding the document to the index.
        indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
    reader.close();
Once the Java code is executed, the index files are created and stored at the location "E://DataFile/IndexFiles."
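Now we can search the data in the index files we just created. Basically, a search is done on "field" data. Lucene supports a variety of search semantics, and you can search on a particular field or on a combination of fields. The following Java code searches the index: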
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    // Creating Searcher object and specifying the path where the index files are stored.
    Searcher searcher = new IndexSearcher("E://DataFile/IndexFiles");
    Analyzer analyzer = new StandardAnalyzer();

    // Printing the total number of documents or entries present in the index.
    System.out.println("Total Documents = " + searcher.maxDoc());

    // Creating the QueryParser object and specifying the field name on
    // which the search has to be done.
    QueryParser parser = new QueryParser("cs-uri", analyzer);

    // Creating the Query object and specifying the text to search for.
    Query query = parser.parse("/blank");

    // The line below performs the search on the index and returns the matching documents.
    Hits hits = searcher.search(query);

    // Printing the number of documents or entries that match the search query.
    System.out.println("Number of matching documents = " + hits.length());

    // Printing the documents (or rows of the file) that matched the search criteria.
    for (int i = 0; i < hits.length(); i++)
    {
        Document doc = hits.doc(i);
        System.out.println(doc.get("date") + " " + doc.get("time") + " " +
            doc.get("cs-method") + " " + doc.get("cs-uri") + " " +
            doc.get("sc-status") + " " + doc.get("time-taken"));
    }
    searcher.close();
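In this example, the search is done on the cs-uri field, and the text searched for within that field is /blank. So when the search code runs, all documents (or rows) whose cs-uri field contains /blank are displayed in the output. The output is shown below: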
    Total Documents = 11
    Number of matching documents = 7
    2010-04-21 02:24:01 GET /blank 200 120
    2010-04-21 02:24:02 POST /blank 304 56
    2010-04-21 02:24:04 GET /blank 304 233
    2010-04-21 02:24:04 GET /blank 500 567
    2010-04-21 02:24:04 GET /blank 200 897
    2010-04-21 02:24:04 POST /blank 200 567
    2010-04-21 02:24:05 GET /blank 200 347
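To make the idea of a field-level index more concrete, here is a minimal sketch of an inverted index over the cs-uri column, written in plain Java. It is not how Lucene is implemented; the class `TinyInvertedIndex`, its methods, and the crude tokenizer (lower-casing and splitting on non-alphanumeric characters) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TinyInvertedIndex {
    // The eleven sample rows from Test.txt.
    static final String[] SAMPLE_ROWS = {
        "2010-04-21 02:24:01 GET /blank 200 120",
        "2010-04-21 02:24:01 GET /US/registrationFrame 200 605",
        "2010-04-21 02:24:02 GET /US/kids/boys 200 785",
        "2010-04-21 02:24:02 POST /blank 304 56",
        "2010-04-21 02:24:04 GET /blank 304 233",
        "2010-04-21 02:24:04 GET /blank 500 567",
        "2010-04-21 02:24:04 GET /blank 200 897",
        "2010-04-21 02:24:04 POST /blank 200 567",
        "2010-04-21 02:24:05 GET /US/search 200 658",
        "2010-04-21 02:24:05 POST /US/shop 200 768",
        "2010-04-21 02:24:05 GET /blank 200 347"
    };

    // Maps each term to the ids (positions) of the rows whose cs-uri contains it.
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();
    private final List<String> rows = new ArrayList<String>();

    // Crude stand-in for an analyzer (an assumption): lower-case, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<String>();
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!term.isEmpty()) terms.add(term);
        }
        return terms;
    }

    // Adds one log row, indexing only its cs-uri column (the fourth field).
    void add(String row) {
        int rowId = rows.size();
        rows.add(row);
        for (String term : tokenize(row.split(" ")[3])) {
            List<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                postings.put(term, ids);
            }
            if (!ids.contains(rowId)) ids.add(rowId);
        }
    }

    // Returns the rows whose cs-uri contains every term of the query.
    List<String> search(String query) {
        List<String> hits = new ArrayList<String>();
        List<Integer> matching = null;
        for (String term : tokenize(query)) {
            List<Integer> ids = postings.get(term);
            if (ids == null) return hits; // no row contains this term
            if (matching == null) matching = new ArrayList<Integer>(ids);
            else matching.retainAll(ids);
        }
        if (matching != null) {
            for (int id : matching) hits.add(rows.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        for (String row : SAMPLE_ROWS) index.add(row);
        List<String> hits = index.search("/blank");
        System.out.println("Number of matching documents = " + hits.size());
        for (String hit : hits) System.out.println(hit);
    }
}
```

Run against the eleven sample rows, the sketch reports seven matches for /blank, the same rows as in the Lucene output. The lookup goes through the term-to-rows map rather than scanning every row, which is the property that makes an index faster than a linear search.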
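Now consider the case where the data resides on a distributed file system such as Hadoop DFS. The code above will not work directly on distributed data to create an index, so we would have to complete a few preliminary steps: copy the data from HDFS to the local file system, create an index of the data on the local file system, and finally store the index files back to HDFS. The same steps would be needed for searching. This approach is time consuming and suboptimal. Instead, let's index and search our data using the memory of the HDFS node where the data resides.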
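Assume that the data file "Test.txt" used earlier now resides on HDFS, under the working directory, at "/DataFile/Test.txt." Create another folder called "/IndexFiles" inside the HDFS working directory, where our generated index files will be stored. The following Java code creates, in memory, index files for the file stored on HDFS: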
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    // Path where the index files will be stored.
    String Index_DIR = "/IndexFiles/";
    // Path where the data file is stored. HDFS paths are case-sensitive,
    // so this must match the actual name of the file.
    String File_DIR = "/DataFile/test.txt";

    // Creating FileSystem object, to be able to work with HDFS.
    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
    FileSystem dfs = FileSystem.get(config);

    // Creating a RAMDirectory (memory) object, to be able to create the index in memory.
    RAMDirectory rdir = new RAMDirectory();

    // Creating IndexWriter object for the RAM directory.
    IndexWriter indexWriter = new IndexWriter(rdir, new StandardAnalyzer(), true);

    // Creating FSDataInputStream object for reading the data file residing on HDFS,
    // and wrapping it in a BufferedReader so it can be read line by line.
    FSDataInputStream filereader = dfs.open(new Path(dfs.getWorkingDirectory() + File_DIR));
    BufferedReader reader = new BufferedReader(new InputStreamReader(filereader));
    String row = null;

    // Reading each line present in the file.
    while ((row = reader.readLine()) != null)
    {
        // Splitting the space-delimited row into an array of field values.
        String Arow[] = row.split(" ");

        // For each row, creating a document and adding the data to it
        // under the associated fields.
        org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();

        document.add(new Field("date", Arow[0], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("time", Arow[1], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("cs-method", Arow[2], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("cs-uri", Arow[3], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("sc-status", Arow[4], Field.Store.YES, Field.Index.ANALYZED));
        document.add(new Field("time-taken", Arow[5], Field.Store.YES, Field.Index.ANALYZED));

        // Adding the document to the index.
        indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
    reader.close();
Thus, for the "Test.txt" file residing on HDFS, we now have index files created in memory. To store the index files in the HDFS folder:
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.store.IndexInput;

    // Getting the names of the files present in memory into an array.
    String fileList[] = rdir.list();

    // Reading index files from memory and storing them to HDFS.
    for (int i = 0; i < fileList.length; i++)
    {
        IndexInput indxfile = rdir.openInput(fileList[i].trim());
        long len = indxfile.length();
        int len1 = (int) len;

        // Reading data from the file into a byte array.
        byte[] bytarr = new byte[len1];
        indxfile.readBytes(bytarr, 0, len1);
        indxfile.close();

        // Creating a file in the HDFS directory with the same name as
        // the index file.
        Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim());
        dfs.createNewFile(src);

        // Writing data from the byte array to the file in HDFS.
        FSDataOutputStream fs = dfs.create(src, true);
        fs.write(bytarr);
        fs.close();
    }
Now we have the necessary index files for the "Test.txt" data file created and stored in the HDFS directory.
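The pattern here is "copy every named in-memory file to persistent storage, and later read the same files back into memory." As a rough sketch of that round trip, the following plain-Java example uses a map of file names to byte arrays in place of the RAM directory and a local temporary directory in place of HDFS. Both substitutions are assumptions for illustration, as are the class name `InMemoryFiles` and the sample file names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

class InMemoryFiles {
    // Writes every named in-memory file into targetDir, replacing existing files.
    static void saveAll(Map<String, byte[]> files, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            Files.write(targetDir.resolve(entry.getKey()), entry.getValue());
        }
    }

    // Reads every regular file in sourceDir back into memory, keyed by file name.
    static Map<String, byte[]> loadAll(Path sourceDir) throws IOException {
        Map<String, byte[]> files = new TreeMap<String, byte[]>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(sourceDir)) {
            for (Path path : entries) {
                if (Files.isRegularFile(path)) {
                    files.put(path.getFileName().toString(), Files.readAllBytes(path));
                }
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names and contents standing in for real index files.
        Map<String, byte[]> index = new TreeMap<String, byte[]>();
        index.put("segments_1", new byte[] {1, 2, 3});
        index.put("_0.cfs", new byte[] {4, 5});

        Path dir = Files.createTempDirectory("index-files");
        saveAll(index, dir);
        Map<String, byte[]> restored = loadAll(dir);
        System.out.println("Restored files: " + restored.keySet());
    }
}
```

Each file is held entirely in one byte array, as in the article's code, so this approach only suits indexes that fit comfortably in memory.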
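We can now search the indexes stored in HDFS. First, we have to bring the index files from HDFS into memory for searching. The following code is used for this process: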
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.store.IndexOutput;
    import org.apache.lucene.store.RAMDirectory;

    // Path where the index files are stored (same value as in the indexing code).
    String Index_DIR = "/IndexFiles/";

    // Creating FileSystem object, to be able to work with HDFS.
    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
    FileSystem dfs = FileSystem.get(config);

    // Creating a RAMDirectory (memory) object, to hold the index in memory.
    RAMDirectory rdir = new RAMDirectory();

    // Getting the list of index files present in the directory into an array.
    // FileSystemDirectory is not part of Lucene core; it is presumably the class
    // shipped in Hadoop's contrib index module.
    Path pth = new Path(dfs.getWorkingDirectory() + Index_DIR);
    FileSystemDirectory fsdir = new FileSystemDirectory(dfs, pth, false, config);
    String filelst[] = fsdir.list();

    for (int i = 0; i < filelst.length; i++)
    {
        // Opening the index file in the HDFS directory.
        Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + filelst[i]);
        FSDataInputStream filereader = dfs.open(src);

        // Using the file length reported by HDFS; available() is only an
        // estimate and is not a reliable file size.
        int size = (int) dfs.getFileStatus(src).getLen();

        // Reading the whole file into a byte array. readFully() keeps reading
        // until the array is full, unlike a single read() call.
        byte[] bytarr = new byte[size];
        filereader.readFully(bytarr);
        filereader.close();

        // Creating a file in the RAM directory with the same name as
        // the index file present in the HDFS directory.
        IndexOutput indxout = rdir.createOutput(filelst[i]);

        // Writing data from the byte array to the file in the RAM directory.
        indxout.writeBytes(bytarr, bytarr.length);
        indxout.flush();
        indxout.close();
    }
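One detail worth care when copying a file into a byte array: `InputStream.available()` only estimates how many bytes can be read without blocking, and a single `read` call may return fewer bytes than requested. When the length is not known up front, a loop that reads until end-of-file is a safer pattern. The sketch below shows it with plain `java.io` streams; the class name `StreamBytes` and the 4096-byte buffer size are arbitrary choices for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamBytes {
    // Reads the stream until end-of-file, however many bytes each read() call returns.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Sample payload larger than the buffer, so several read() calls are needed.
        byte[] data = new byte[10000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        byte[] copy = readAll(new ByteArrayInputStream(data));
        System.out.println("Read " + copy.length + " bytes");
    }
}
```

Because `FSDataInputStream` is an `InputStream`, the same loop could in principle be applied to a stream opened from HDFS.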
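Now we have all the required index files in the RAM directory (memory), so we can search the index files directly. The search code is similar to the code used for searching the local file system; the only change is that the Searcher object is now created from the RAM directory object (rdir) rather than from a local file system directory path.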
    // Creating the Searcher from the RAM directory instead of a local path.
    Searcher searcher = new IndexSearcher(rdir);
    Analyzer analyzer = new StandardAnalyzer();

    System.out.println("Total Documents = " + searcher.maxDoc());

    // Searching on the "time" field.
    QueryParser parser = new QueryParser("time", analyzer);

    // The colons are escaped because ':' is part of the query syntax.
    Query query = parser.parse("02\\:24\\:04");

    Hits hits = searcher.search(query);

    System.out.println("Number of matching documents = " + hits.length());

    for (int i = 0; i < hits.length(); i++)
    {
        Document doc = hits.doc(i);
        System.out.println(doc.get("date") + " " + doc.get("time") + " " +
            doc.get("cs-method") + " " + doc.get("cs-uri") + " " +
            doc.get("sc-status") + " " + doc.get("time-taken"));
    }
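For the output below, the search is done on the "time" field, and the text searched for within that field is "02:24:04" (written as "02\\:24\\:04" in the Java source, because the colons are escaped for the query parser). So when the code runs, all documents (or rows) whose "time" field contains "02:24:04" are shown in the output: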
    Total Documents = 11
    Number of matching documents = 4
    2010-04-21 02:24:04 GET /blank 304 233
    2010-04-21 02:24:04 GET /blank 500 567
    2010-04-21 02:24:04 GET /blank 200 897
    2010-04-21 02:24:04 POST /blank 200 567
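The backslashes in the time query exist because the colon is part of Lucene's query syntax (it separates a field name from a term), so a literal colon has to be escaped. Lucene's QueryParser offers a static `escape` method for this purpose. The hand-written helper below is only an illustrative sketch of the same idea; the class name `QueryEscaper` is hypothetical, and its character list is an assumption based on the documented Lucene query syntax.

```java
class QueryEscaper {
    // Characters treated as query syntax; each one gets a backslash in front of it.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Returns the text with every special character backslash-escaped.
    static String escape(String text) {
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        // Prints 02\:24\:04
        System.out.println(escape("02:24:04"));
    }
}
```

In real code it is probably better to call the library's own escape method than to maintain a character list by hand, since the set of special characters can differ between Lucene versions.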
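A distributed file system like HDFS is a powerful tool for storing and accessing the vast amounts of data available to us today. With in-memory indexing and searching, getting at the data you really want, amid the mountains of data you don't care about, becomes a little easier.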