Hadoop: HDFS file patterns (globs)

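An example: suppose you use MapReduce to run statistics over one or two months of log files, and there may be a very large number of them. How do you collect the file paths quickly?
HDFS supports wildcards (globs) for exactly this problem. They work the same way as wildcards in the Linux shell.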

For example:

Glob           Expansion
2016/*         2016/05 2016/04
2016/0[45]     2016/05 2016/04
2016/0[4-5]    2016/05 2016/04
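To get a feel for how the three patterns in the table behave, here is a small local sketch. It does not talk to HDFS at all: it uses the glob matcher from java.nio as a stand-in, on the assumption that `*`, `[45]` and `[4-5]` mean the same thing there as in the HDFS globs above. The class name `GlobTableDemo` and the extra directory `2016/06` are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobTableDemo {

    // Returns true if the relative path matches the glob pattern.
    // Uses java.nio's glob matcher as a local stand-in, not Hadoop's own matcher.
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        String[] globs = {"2016/*", "2016/0[45]", "2016/0[4-5]"};
        // 2016/06 is a hypothetical extra directory, added to show a non-match.
        String[] paths = {"2016/04", "2016/05", "2016/06"};
        for (String glob : globs) {
            for (String path : paths) {
                System.out.println(glob + " vs " + path + " -> " + matches(glob, path));
            }
        }
    }
}
```

With only 2016/04 and 2016/05 present, all three patterns expand to the same two directories, which is what the table shows. The character-class patterns would leave out a 2016/06 directory, while `2016/*` would pick it up.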

Code:

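import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Prints every path that matches the glob pattern.
// "configuration" is assumed to be a Configuration field of the enclosing class.
public static void globFiles(String pattern) {
    try {
        FileSystem fileSystem = FileSystem.get(configuration);

        // Expand the glob into the statuses of all matching files.
        FileStatus[] statuses = fileSystem.globStatus(new Path(pattern));
        if (statuses == null) {
            // globStatus returns null when the pattern has no glob
            // and the path does not exist.
            return;
        }
        for (Path path : FileUtil.stat2Paths(statuses)) {
            System.out.println(path);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}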
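For the log-file scenario from the beginning, the pattern string itself can be generated rather than typed by hand. The sketch below builds a glob for a range of months within one year, using the `{a,b}` alternation form that `globStatus` documents alongside `*` and `[...]`. The helper `monthGlob` and the class `MonthGlob` are hypothetical, and the `yyyy/MM` directory layout is simply the one assumed in the table above.

```java
public class MonthGlob {

    // Builds a glob such as "2016/{04,05}" covering firstMonth..lastMonth of one year.
    // Hypothetical helper; assumes directories are laid out as yyyy/MM.
    static String monthGlob(int year, int firstMonth, int lastMonth) {
        if (firstMonth < 1 || lastMonth > 12 || firstMonth > lastMonth) {
            throw new IllegalArgumentException("invalid month range");
        }
        if (firstMonth == lastMonth) {
            // A single month needs no alternation.
            return year + "/" + String.format("%02d", firstMonth);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(year).append("/{");
        for (int month = firstMonth; month <= lastMonth; month++) {
            if (month > firstMonth) {
                sb.append(',');
            }
            sb.append(String.format("%02d", month));
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Pattern for April and May 2016.
        System.out.println(monthGlob(2016, 4, 5));
    }
}
```

The resulting string could then be passed to a method like globFiles. Ranges that cross a year boundary are deliberately left out of this sketch.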

HDFS also provides PathFilter for filtering the file paths we obtain. It is similar to java.io.FileFilter.

/**
   * Return an array of FileStatus objects whose path names match pathPattern
   * and is accepted by the user-supplied path filter. Results are sorted by
   * their path names.
   * Return null if pathPattern has no glob and the path does not exist.
   * Return an empty array if pathPattern has a glob and no path matches it. 
   * 
   * @param pathPattern
   *          a regular expression specifying the path pattern
   * @param filter
   *          a user-supplied path filter
   * @return an array of FileStatus objects
   * @throws IOException if any I/O error occurs when fetching file status
   */
  public FileStatus[] globStatus(Path pathPattern, PathFilter filter)
      throws IOException {
    return new Globber(this, pathPattern, filter).glob();
  }

HDFS ships with a number of filters of its own, and Hadoop: The Definitive Guide gives an implementation of a regular-expression filter:

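import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Rejects every path whose full string form matches the regular expression.
public class RegexExcludePathFilter implements PathFilter {

    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // String.matches requires the whole path string to match.
        return !path.toString().matches(regex);
    }
}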

Use a regular expression to refine the result:

fileSystem.listStatus(new Path(uri),new RegexExcludePathFilter("^.*/2016/0$"));

The output is as follows:

hdfs://hadoop:9000/hadoop/2016/04
hdfs://hadoop:9000/hadoop/2016/05

A filter is handed only a Path, so it can act only on the file name and path.
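Since the filter only ever sees the path string, its behaviour can be tried out without a cluster. The sketch below repeats the accept() logic of RegexExcludePathFilter on plain strings, using the two paths from the output above. The class `RegexExcludeDemo`, the method `exclude` and the pattern `^.*/2016/04$` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegexExcludeDemo {

    // Keeps only the paths that do NOT fully match the regular expression,
    // mirroring the accept() logic of RegexExcludePathFilter on plain strings.
    static List<String> exclude(List<String> paths, String regex) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (!path.matches(regex)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "hdfs://hadoop:9000/hadoop/2016/04",
                "hdfs://hadoop:9000/hadoop/2016/05");
        // Hypothetical pattern: drop the April directory.
        System.out.println(exclude(paths, "^.*/2016/04$"));
        // Pattern from the listStatus call above: neither path ends in "/2016/0".
        System.out.println(exclude(paths, "^.*/2016/0$"));
    }
}
```

Because String.matches has to match the whole string, the pattern `^.*/2016/0$` excludes nothing here, which fits the output above listing both directories.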
