An example: suppose we use MapReduce to analyze one or two months of log files, which may mean a very large number of files. How do we specify all those file paths quickly?
In HDFS, wildcards (globs) solve this problem, and they work the same way as Linux shell wildcards.

For example:
| Glob pattern | Matches |
| --- | --- |
| 2016/* | 2016/05 2016/04 |
| 2016/0[45] | 2016/05 2016/04 |
| 2016/0[4-5] | 2016/05 2016/04 |
Code:
```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// 'configuration' is a Hadoop Configuration field initialized elsewhere in the class.
public static void globFiles(String pattern) {
    try {
        FileSystem fileSystem = FileSystem.get(configuration);
        // Expand the glob pattern into the matching FileStatus entries
        FileStatus[] statuses = fileSystem.globStatus(new Path(pattern));
        // Convert the FileStatus array into plain Paths
        Path[] listPaths = FileUtil.stat2Paths(statuses);
        for (Path path : listPaths) {
            System.out.println(path);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
```
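As a quick check, the helper above can be called with one of the glob patterns from the table; the URI and directory layout below are hypothetical, chosen to match the output shown later in this post:

```java
// Hypothetical HDFS URI and directory layout, matching the table above.
globFiles("hdfs://hadoop:9000/hadoop/2016/0[4-5]");
// Expected output (assuming both directories exist):
// hdfs://hadoop:9000/hadoop/2016/04
// hdfs://hadoop:9000/hadoop/2016/05
```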
HDFS also provides a PathFilter for filtering the paths we retrieve, similar to java.io.FileFilter:
```java
/**
 * Return an array of FileStatus objects whose path names match pathPattern
 * and is accepted by the user-supplied path filter. Results are sorted by
 * their path names.
 * Return null if pathPattern has no glob and the path does not exist.
 * Return an empty array if pathPattern has a glob and no path matches it.
 *
 * @param pathPattern
 *          a regular expression specifying the path pattern
 * @param filter
 *          a user-supplied path filter
 * @return an array of FileStatus objects
 * @throws IOException if any I/O error occurs when fetching file status
 */
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException {
    return new Globber(this, pathPattern, filter).glob();
}
```
HDFS itself ships with a number of filters. Hadoop: The Definitive Guide gives a regular-expression-based filter implementation:
```java
public class RegexExcludePathFilter implements PathFilter {

    private String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // Reject any path whose full string matches the regex
        return !path.toString().matches(regex);
    }
}
```
Using the regular-expression filter to refine the results:
```java
// 'uri' is the base directory to list, defined elsewhere.
fileSystem.listStatus(new Path(uri), new RegexExcludePathFilter("^.*/2016/0$"));
```
The output is as follows:
```
hdfs://hadoop:9000/hadoop/2016/04
hdfs://hadoop:9000/hadoop/2016/05
```
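The filter can also be combined with a glob via the globStatus overload shown earlier. A minimal sketch, where the base URI and the excluded month are assumptions for illustration:

```java
// Sketch: glob first, then let the PathFilter drop unwanted matches.
// The base URI and the excluded month (04) are assumptions for illustration.
FileStatus[] statuses = fileSystem.globStatus(
        new Path("hdfs://hadoop:9000/hadoop/2016/*"),
        new RegexExcludePathFilter("^.*/2016/04$"));
for (Path path : FileUtil.stat2Paths(statuses)) {
    System.out.println(path); // only .../2016/05 remains
}
```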
A filter works only with Path objects, so it can match on file names and paths but not on other file properties.
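Because a PathFilter only sees the Path, conditions on file attributes (for example, modification time) have to be checked on the returned FileStatus objects instead. A minimal sketch, with an arbitrary cutoff chosen for illustration:

```java
// Sketch: PathFilter cannot inspect file attributes, so filter the
// FileStatus results afterwards. The 24-hour cutoff is arbitrary.
long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
FileStatus[] statuses = fileSystem.globStatus(new Path("hdfs://hadoop:9000/hadoop/2016/*"));
for (FileStatus status : statuses) {
    if (status.getModificationTime() >= cutoff) {
        System.out.println(status.getPath());
    }
}
```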