Reposted from:
http://blog.fens.me/hadoop-mapreduce-log-kpi/html
I studied this blog post today; it is very well written, and I typed through the code while following it.
Along the way I ran into a few problems.
First, the Hadoop version used in that blog post is quite old. If you run the code on Hadoop 2.x, the result files may end up with no data written to them. To solve this I worked from the official MapReduce tutorial at http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html, which is very detailed and plainly written; anyone with a little English can follow it. Compared with Hadoop 1.x, in Hadoop 2.x the Mapper class can directly extend org.apache.hadoop.mapreduce.Mapper instead of implementing the old Mapper interface, and the map method changes as follows:
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    KPI kpi = KPI.filterPVs(value.toString());
    System.out.println(kpi);
    if (kpi.isValid()) {
        word.set(kpi.getIp());
        context.write(word, one);
    }
}
The Hadoop 1.x version looks like this:
@Override
public void map(Object key, Text value, OutputCollector output, Reporter reporter)
        throws IOException {
    KPI kpi = KPI.filterPVs(value.toString());
    if (kpi.isValid()) {
        word.set(kpi.getRequest());
        output.collect(word, one);
    }
}
With Hadoop 2.x the code has to be written in this new style, and the reduce method of the Reducer changes correspondingly (a sketch of the new-API reduce method is given after the download note below). At first I did not notice the GitHub address given in the article, so I searched on Baidu and, after quite some effort, found a log file of about 150 MB; take it if you need it:
Link: https://pan.baidu.com/s/1hz5dTX69Hc_l9Aj-axvfqw  extraction code: ssys. Note that this log file is not identical to the one used in the blog post (it is missing two fields), so adjust the code against it yourself.
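For reference, here is a minimal sketch of the corresponding Hadoop 2.x reduce method, assuming the same Text/IntWritable key and value types as the PV job shown later in this post (the class name and imports are illustrative, not the original author's code):

// assumes: import java.io.IOException;
//          import org.apache.hadoop.io.IntWritable;
//          import org.apache.hadoop.io.Text;
//          import org.apache.hadoop.mapreduce.Reducer;
public static class KPIPVReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mapper for this key.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}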
Second, running on Hadoop 2.x with the job parameters configured in the main method. This time I used Hadoop 2.9.2, which on Windows requires the winutils.exe and hadoop.dll tools. I have uploaded them to Baidu pan: https://pan.baidu.com/s/1RTSeGjV2VwWxRAvsUMkkrA  extraction code: dkxt. The share contains three items: hadoop 2.9.2, the Eclipse plugin, and winutils. Copy all the files from the hadoop 2.6.x winutils package into the hadoop-2.9.2/bin folder, and copy hadoop.dll from the 2.6.x package into C:\Windows\System32. Close all applications, restart the computer, and then set the following system properties in the main method:
System.setProperty("HADOOP_HOME", "E:\\hadoop\\hadoop2.6"); System.setProperty("hadoop.home.dir", "E:\\hadoop\\hadoop-2.9.2"); System.setProperty("HADOOP_USER_NAME", "hadoop");
With this in place, running the job may still fail with an error mentioning access0 (NativeIO$Windows.access0). If that happens, create a NativeIO.java file under the project's src directory (package org.apache.hadoop.io.nativeio) with the following content:
/** * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements. See the NOTICE file * distributed with this work for additional information * regarding copyright ownership. The ASF licenses this file * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ package org.apache.hadoop.io.nativeio; import java.io.File; import java.io.FileDescriptor; import java.io.FileInputStream; import java.io.FileOutputStream; import java.io.IOException; import java.io.RandomAccessFile; import java.lang.reflect.Field; import java.nio.ByteBuffer; import java.nio.MappedByteBuffer; import java.nio.channels.FileChannel; import java.util.Map; import java.util.concurrent.ConcurrentHashMap; import org.apache.hadoop.classification.InterfaceAudience; import org.apache.hadoop.classification.InterfaceStability; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.CommonConfigurationKeys; import org.apache.hadoop.fs.HardLink; import org.apache.hadoop.io.IOUtils; import org.apache.hadoop.io.SecureIOUtils.AlreadyExistsException; import org.apache.hadoop.util.NativeCodeLoader; import org.apache.hadoop.util.Shell; import org.apache.hadoop.util.PerformanceAdvisory; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import sun.misc.Unsafe; import com.google.common.annotations.VisibleForTesting; /** * JNI wrappers for various native IO-related calls not available in Java. * These functions should generally be used alongside a fallback to another * more portable mechanism. */ @InterfaceAudience.Private @InterfaceStability.Unstable public class NativeIO { public static class POSIX { // Flags for open() call from bits/fcntl.h - Set by JNI public static int O_RDONLY = -1; public static int O_WRONLY = -1; public static int O_RDWR = -1; public static int O_CREAT = -1; public static int O_EXCL = -1; public static int O_NOCTTY = -1; public static int O_TRUNC = -1; public static int O_APPEND = -1; public static int O_NONBLOCK = -1; public static int O_SYNC = -1; // Flags for posix_fadvise() from bits/fcntl.h - Set by JNI /* No further special treatment. */ public static int POSIX_FADV_NORMAL = -1; /* Expect random page references. */ public static int POSIX_FADV_RANDOM = -1; /* Expect sequential page references. */ public static int POSIX_FADV_SEQUENTIAL = -1; /* Will need these pages. */ public static int POSIX_FADV_WILLNEED = -1; /* Don't need these pages. */ public static int POSIX_FADV_DONTNEED = -1; /* Data will be accessed once. */ public static int POSIX_FADV_NOREUSE = -1; // Updated by JNI when supported by glibc. Leave defaults in case kernel // supports sync_file_range, but glibc does not. /* Wait upon writeout of all pages in the range before performing the write. */ public static int SYNC_FILE_RANGE_WAIT_BEFORE = 1; /* Initiate writeout of all those dirty pages in the range which are not presently under writeback. */ public static int SYNC_FILE_RANGE_WRITE = 2; /* Wait upon writeout of all pages in the range after performing the write. 
*/ public static int SYNC_FILE_RANGE_WAIT_AFTER = 4; private static final Logger LOG = LoggerFactory.getLogger(NativeIO.class); // Set to true via JNI if possible public static boolean fadvisePossible = false; private static boolean nativeLoaded = false; private static boolean syncFileRangePossible = true; static final String WORKAROUND_NON_THREADSAFE_CALLS_KEY = "hadoop.workaround.non.threadsafe.getpwuid"; static final boolean WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT = true; private static long cacheTimeout = -1; private static CacheManipulator cacheManipulator = new CacheManipulator(); public static CacheManipulator getCacheManipulator() { return cacheManipulator; } public static void setCacheManipulator(CacheManipulator cacheManipulator) { POSIX.cacheManipulator = cacheManipulator; } /** * Used to manipulate the operating system cache. */ @VisibleForTesting public static class CacheManipulator { public void mlock(String identifier, ByteBuffer buffer, long len) throws IOException { POSIX.mlock(buffer, len); } public long getMemlockLimit() { return NativeIO.getMemlockLimit(); } public long getOperatingSystemPageSize() { return NativeIO.getOperatingSystemPageSize(); } public void posixFadviseIfPossible(String identifier, FileDescriptor fd, long offset, long len, int flags) throws NativeIOException { NativeIO.POSIX.posixFadviseIfPossible(identifier, fd, offset, len, flags); } public boolean verifyCanMlock() { return NativeIO.isAvailable(); } } /** * A CacheManipulator used for testing which does not actually call mlock. * This allows many tests to be run even when the operating system does not * allow mlock, or only allows limited mlocking. */ @VisibleForTesting public static class NoMlockCacheManipulator extends CacheManipulator { public void mlock(String identifier, ByteBuffer buffer, long len) throws IOException { LOG.info("mlocking " + identifier); } public long getMemlockLimit() { return 1125899906842624L; } public long getOperatingSystemPageSize() { return 4096; } public boolean verifyCanMlock() { return true; } } static { if (NativeCodeLoader.isNativeCodeLoaded()) { try { Configuration conf = new Configuration(); workaroundNonThreadSafePasswdCalls = conf.getBoolean( WORKAROUND_NON_THREADSAFE_CALLS_KEY, WORKAROUND_NON_THREADSAFE_CALLS_DEFAULT); initNative(); nativeLoaded = true; cacheTimeout = conf.getLong( CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_KEY, CommonConfigurationKeys.HADOOP_SECURITY_UID_NAME_CACHE_TIMEOUT_DEFAULT) * 1000; LOG.debug("Initialized cache for IDs to User/Group mapping with a " + " cache timeout of " + cacheTimeout/1000 + " seconds."); } catch (Throwable t) { // This can happen if the user has an older version of libhadoop.so // installed - in this case we can continue without native IO // after warning PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t); } } } /** * Return true if the JNI-based native IO extensions are available. */ public static boolean isAvailable() { return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded; } private static void assertCodeLoaded() throws IOException { if (!isAvailable()) { throw new IOException("NativeIO was not loaded"); } } /** Wrapper around open(2) */ public static native FileDescriptor open(String path, int flags, int mode) throws IOException; /** Wrapper around fstat(2) */ private static native Stat fstat(FileDescriptor fd) throws IOException; /** Native chmod implementation. 
On UNIX, it is a wrapper around chmod(2) */ private static native void chmodImpl(String path, int mode) throws IOException; public static void chmod(String path, int mode) throws IOException { if (!Shell.WINDOWS) { chmodImpl(path, mode); } else { try { chmodImpl(path, mode); } catch (NativeIOException nioe) { if (nioe.getErrorCode() == 3) { throw new NativeIOException("No such file or directory", Errno.ENOENT); } else { LOG.warn(String.format("NativeIO.chmod error (%d): %s", nioe.getErrorCode(), nioe.getMessage())); throw new NativeIOException("Unknown error", Errno.UNKNOWN); } } } } /** Wrapper around posix_fadvise(2) */ static native void posix_fadvise( FileDescriptor fd, long offset, long len, int flags) throws NativeIOException; /** Wrapper around sync_file_range(2) */ static native void sync_file_range( FileDescriptor fd, long offset, long nbytes, int flags) throws NativeIOException; /** * Call posix_fadvise on the given file descriptor. See the manpage * for this syscall for more information. On systems where this * call is not available, does nothing. * * @throws NativeIOException if there is an error with the syscall */ static void posixFadviseIfPossible(String identifier, FileDescriptor fd, long offset, long len, int flags) throws NativeIOException { if (nativeLoaded && fadvisePossible) { try { posix_fadvise(fd, offset, len, flags); } catch (UnsatisfiedLinkError ule) { fadvisePossible = false; } } } /** * Call sync_file_range on the given file descriptor. See the manpage * for this syscall for more information. On systems where this * call is not available, does nothing. * * @throws NativeIOException if there is an error with the syscall */ public static void syncFileRangeIfPossible( FileDescriptor fd, long offset, long nbytes, int flags) throws NativeIOException { if (nativeLoaded && syncFileRangePossible) { try { sync_file_range(fd, offset, nbytes, flags); } catch (UnsupportedOperationException uoe) { syncFileRangePossible = false; } catch (UnsatisfiedLinkError ule) { syncFileRangePossible = false; } } } static native void mlock_native( ByteBuffer buffer, long len) throws NativeIOException; /** * Locks the provided direct ByteBuffer into memory, preventing it from * swapping out. After a buffer is locked, future accesses will not incur * a page fault. * * See the mlock(2) man page for more information. * * @throws NativeIOException */ static void mlock(ByteBuffer buffer, long len) throws IOException { assertCodeLoaded(); if (!buffer.isDirect()) { throw new IOException("Cannot mlock a non-direct ByteBuffer"); } mlock_native(buffer, len); } /** * Unmaps the block from memory. See munmap(2). * * There isn't any portable way to unmap a memory region in Java. * So we use the sun.nio method here. * Note that unmapping a memory region could cause crashes if code * continues to reference the unmapped code. However, if we don't * manually unmap the memory, we are dependent on the finalizer to * do it, and we have no idea when the finalizer will run. * * @param buffer The buffer to unmap. 
*/ public static void munmap(MappedByteBuffer buffer) { if (buffer instanceof sun.nio.ch.DirectBuffer) { sun.misc.Cleaner cleaner = ((sun.nio.ch.DirectBuffer)buffer).cleaner(); cleaner.clean(); } } /** Linux only methods used for getOwner() implementation */ private static native long getUIDforFDOwnerforOwner(FileDescriptor fd) throws IOException; private static native String getUserName(long uid) throws IOException; /** * Result type of the fstat call */ public static class Stat { private int ownerId, groupId; private String owner, group; private int mode; // Mode constants - Set by JNI public static int S_IFMT = -1; /* type of file */ public static int S_IFIFO = -1; /* named pipe (fifo) */ public static int S_IFCHR = -1; /* character special */ public static int S_IFDIR = -1; /* directory */ public static int S_IFBLK = -1; /* block special */ public static int S_IFREG = -1; /* regular */ public static int S_IFLNK = -1; /* symbolic link */ public static int S_IFSOCK = -1; /* socket */ public static int S_ISUID = -1; /* set user id on execution */ public static int S_ISGID = -1; /* set group id on execution */ public static int S_ISVTX = -1; /* save swapped text even after use */ public static int S_IRUSR = -1; /* read permission, owner */ public static int S_IWUSR = -1; /* write permission, owner */ public static int S_IXUSR = -1; /* execute/search permission, owner */ Stat(int ownerId, int groupId, int mode) { this.ownerId = ownerId; this.groupId = groupId; this.mode = mode; } Stat(String owner, String group, int mode) { if (!Shell.WINDOWS) { this.owner = owner; } else { this.owner = stripDomain(owner); } if (!Shell.WINDOWS) { this.group = group; } else { this.group = stripDomain(group); } this.mode = mode; } @Override public String toString() { return "Stat(owner='" + owner + "', group='" + group + "'" + ", mode=" + mode + ")"; } public String getOwner() { return owner; } public String getGroup() { return group; } public int getMode() { return mode; } } /** * Returns the file stat for a file descriptor. * * @param fd file descriptor. * @return the file descriptor file stat. * @throws IOException thrown if there was an IO error while obtaining the file stat. */ public static Stat getFstat(FileDescriptor fd) throws IOException { Stat stat = null; if (!Shell.WINDOWS) { stat = fstat(fd); stat.owner = getName(IdCache.USER, stat.ownerId); stat.group = getName(IdCache.GROUP, stat.groupId); } else { try { stat = fstat(fd); } catch (NativeIOException nioe) { if (nioe.getErrorCode() == 6) { throw new NativeIOException("The handle is invalid.", Errno.EBADF); } else { LOG.warn(String.format("NativeIO.getFstat error (%d): %s", nioe.getErrorCode(), nioe.getMessage())); throw new NativeIOException("Unknown error", Errno.UNKNOWN); } } } return stat; } private static String getName(IdCache domain, int id) throws IOException { Map<Integer, CachedName> idNameCache = (domain == IdCache.USER) ? USER_ID_NAME_CACHE : GROUP_ID_NAME_CACHE; String name; CachedName cachedName = idNameCache.get(id); long now = System.currentTimeMillis(); if (cachedName != null && (cachedName.timestamp + cacheTimeout) > now) { name = cachedName.name; } else { name = (domain == IdCache.USER) ? getUserName(id) : getGroupName(id); if (LOG.isDebugEnabled()) { String type = (domain == IdCache.USER) ? 
"UserName" : "GroupName"; LOG.debug("Got " + type + " " + name + " for ID " + id + " from the native implementation"); } cachedName = new CachedName(name, now); idNameCache.put(id, cachedName); } return name; } static native String getUserName(int uid) throws IOException; static native String getGroupName(int uid) throws IOException; private static class CachedName { final long timestamp; final String name; public CachedName(String name, long timestamp) { this.name = name; this.timestamp = timestamp; } } private static final Map<Integer, CachedName> USER_ID_NAME_CACHE = new ConcurrentHashMap<Integer, CachedName>(); private static final Map<Integer, CachedName> GROUP_ID_NAME_CACHE = new ConcurrentHashMap<Integer, CachedName>(); private enum IdCache { USER, GROUP } public final static int MMAP_PROT_READ = 0x1; public final static int MMAP_PROT_WRITE = 0x2; public final static int MMAP_PROT_EXEC = 0x4; public static native long mmap(FileDescriptor fd, int prot, boolean shared, long length) throws IOException; public static native void munmap(long addr, long length) throws IOException; } private static boolean workaroundNonThreadSafePasswdCalls = false; public static class Windows { // Flags for CreateFile() call on Windows public static final long GENERIC_READ = 0x80000000L; public static final long GENERIC_WRITE = 0x40000000L; public static final long FILE_SHARE_READ = 0x00000001L; public static final long FILE_SHARE_WRITE = 0x00000002L; public static final long FILE_SHARE_DELETE = 0x00000004L; public static final long CREATE_NEW = 1; public static final long CREATE_ALWAYS = 2; public static final long OPEN_EXISTING = 3; public static final long OPEN_ALWAYS = 4; public static final long TRUNCATE_EXISTING = 5; public static final long FILE_BEGIN = 0; public static final long FILE_CURRENT = 1; public static final long FILE_END = 2; public static final long FILE_ATTRIBUTE_NORMAL = 0x00000080L; /** * Create a directory with permissions set to the specified mode. By setting * permissions at creation time, we avoid issues related to the user lacking * WRITE_DAC rights on subsequent chmod calls. One example where this can * occur is writing to an SMB share where the user does not have Full Control * rights, and therefore WRITE_DAC is denied. * * @param path directory to create * @param mode permissions of new directory * @throws IOException if there is an I/O error */ public static void createDirectoryWithMode(File path, int mode) throws IOException { createDirectoryWithMode0(path.getAbsolutePath(), mode); } /** Wrapper around CreateDirectory() on Windows */ private static native void createDirectoryWithMode0(String path, int mode) throws NativeIOException; /** Wrapper around CreateFile() on Windows */ public static native FileDescriptor createFile(String path, long desiredAccess, long shareMode, long creationDisposition) throws IOException; /** * Create a file for write with permissions set to the specified mode. By * setting permissions at creation time, we avoid issues related to the user * lacking WRITE_DAC rights on subsequent chmod calls. One example where * this can occur is writing to an SMB share where the user does not have * Full Control rights, and therefore WRITE_DAC is denied. * * This method mimics the semantics implemented by the JDK in * {@link java.io.FileOutputStream}. The file is opened for truncate or * append, the sharing mode allows other readers and writers, and paths * longer than MAX_PATH are supported. (See io_util_md.c in the JDK.) 
* * @param path file to create * @param append if true, then open file for append * @param mode permissions of new directory * @return FileOutputStream of opened file * @throws IOException if there is an I/O error */ public static FileOutputStream createFileOutputStreamWithMode(File path, boolean append, int mode) throws IOException { long desiredAccess = GENERIC_WRITE; long shareMode = FILE_SHARE_READ | FILE_SHARE_WRITE; long creationDisposition = append ? OPEN_ALWAYS : CREATE_ALWAYS; return new FileOutputStream(createFileWithMode0(path.getAbsolutePath(), desiredAccess, shareMode, creationDisposition, mode)); } /** Wrapper around CreateFile() with security descriptor on Windows */ private static native FileDescriptor createFileWithMode0(String path, long desiredAccess, long shareMode, long creationDisposition, int mode) throws NativeIOException; /** Wrapper around SetFilePointer() on Windows */ public static native long setFilePointer(FileDescriptor fd, long distanceToMove, long moveMethod) throws IOException; /** Windows only methods used for getOwner() implementation */ private static native String getOwner(FileDescriptor fd) throws IOException; /** Supported list of Windows access right flags */ public static enum AccessRight { ACCESS_READ (0x0001), // FILE_READ_DATA ACCESS_WRITE (0x0002), // FILE_WRITE_DATA ACCESS_EXECUTE (0x0020); // FILE_EXECUTE private final int accessRight; AccessRight(int access) { accessRight = access; } public int accessRight() { return accessRight; } }; /** Windows only method used to check if the current process has requested * access rights on the given path. */ private static native boolean access0(String path, int requestedAccess); /** * Checks whether the current process has desired access rights on * the given path. * * Longer term this native function can be substituted with JDK7 * function Files#isReadable, isWritable, isExecutable. * * @param path input path * @param desiredAccess ACCESS_READ, ACCESS_WRITE or ACCESS_EXECUTE * @return true if access is allowed * @throws IOException I/O exception on error */ public static boolean access(String path, AccessRight desiredAccess) throws IOException { return true; } /** * Extends both the minimum and maximum working set size of the current * process. This method gets the current minimum and maximum working set * size, adds the requested amount to each and then sets the minimum and * maximum working set size to the new values. Controlling the working set * size of the process also controls the amount of memory it can lock. 
* * @param delta amount to increment minimum and maximum working set size * @throws IOException for any error * @see POSIX#mlock(ByteBuffer, long) */ public static native void extendWorkingSetSize(long delta) throws IOException; static { if (NativeCodeLoader.isNativeCodeLoaded()) { try { initNative(); nativeLoaded = true; } catch (Throwable t) { // This can happen if the user has an older version of libhadoop.so // installed - in this case we can continue without native IO // after warning PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t); } } } } private static final Logger LOG = LoggerFactory.getLogger(NativeIO.class); private static boolean nativeLoaded = false; static { if (NativeCodeLoader.isNativeCodeLoaded()) { try { initNative(); nativeLoaded = true; } catch (Throwable t) { // This can happen if the user has an older version of libhadoop.so // installed - in this case we can continue without native IO // after warning PerformanceAdvisory.LOG.debug("Unable to initialize NativeIO libraries", t); } } } /** * Return true if the JNI-based native IO extensions are available. */ public static boolean isAvailable() { return NativeCodeLoader.isNativeCodeLoaded() && nativeLoaded; } /** Initialize the JNI method ID and class ID cache */ private static native void initNative(); /** * Get the maximum number of bytes that can be locked into memory at any * given point. * * @return 0 if no bytes can be locked into memory; * Long.MAX_VALUE if there is no limit; * The number of bytes that can be locked into memory otherwise. */ static long getMemlockLimit() { return isAvailable() ? getMemlockLimit0() : 0; } private static native long getMemlockLimit0(); /** * @return the operating system's page size. */ static long getOperatingSystemPageSize() { try { Field f = Unsafe.class.getDeclaredField("theUnsafe"); f.setAccessible(true); Unsafe unsafe = (Unsafe)f.get(null); return unsafe.pageSize(); } catch (Throwable e) { LOG.warn("Unable to get operating system page size. Guessing 4096.", e); return 4096; } } private static class CachedUid { final long timestamp; final String username; public CachedUid(String username, long timestamp) { this.timestamp = timestamp; this.username = username; } } private static final Map<Long, CachedUid> uidCache = new ConcurrentHashMap<Long, CachedUid>(); private static long cacheTimeout; private static boolean initialized = false; /** * The Windows logon name has two part, NetBIOS domain name and * user account name, of the format DOMAIN\UserName. This method * will remove the domain part of the full logon name. * * @param Fthe full principal name containing the domain * @return name with domain removed */ private static String stripDomain(String name) { int i = name.indexOf('\\'); if (i != -1) name = name.substring(i + 1); return name; } public static String getOwner(FileDescriptor fd) throws IOException { ensureInitialized(); if (Shell.WINDOWS) { String owner = Windows.getOwner(fd); owner = stripDomain(owner); return owner; } else { long uid = POSIX.getUIDforFDOwnerforOwner(fd); CachedUid cUid = uidCache.get(uid); long now = System.currentTimeMillis(); if (cUid != null && (cUid.timestamp + cacheTimeout) > now) { return cUid.username; } String user = POSIX.getUserName(uid); LOG.info("Got UserName " + user + " for UID " + uid + " from the native implementation"); cUid = new CachedUid(user, now); uidCache.put(uid, cUid); return user; } } /** * Create a FileDescriptor that shares delete permission on the * file opened at a given offset, i.e. 
other process can delete * the file the FileDescriptor is reading. Only Windows implementation * uses the native interface. */ public static FileDescriptor getShareDeleteFileDescriptor( File f, long seekOffset) throws IOException { if (!Shell.WINDOWS) { RandomAccessFile rf = new RandomAccessFile(f, "r"); if (seekOffset > 0) { rf.seek(seekOffset); } return rf.getFD(); } else { // Use Windows native interface to create a FileDescriptor that // shares delete permission on the file opened, and set it to the // given offset. // FileDescriptor fd = NativeIO.Windows.createFile( f.getAbsolutePath(), NativeIO.Windows.GENERIC_READ, NativeIO.Windows.FILE_SHARE_READ | NativeIO.Windows.FILE_SHARE_WRITE | NativeIO.Windows.FILE_SHARE_DELETE, NativeIO.Windows.OPEN_EXISTING); if (seekOffset > 0) NativeIO.Windows.setFilePointer(fd, seekOffset, NativeIO.Windows.FILE_BEGIN); return fd; } } /** * Create the specified File for write access, ensuring that it does not exist. * @param f the file that we want to create * @param permissions we want to have on the file (if security is enabled) * * @throws AlreadyExistsException if the file already exists * @throws IOException if any other error occurred */ public static FileOutputStream getCreateForWriteFileOutputStream(File f, int permissions) throws IOException { if (!Shell.WINDOWS) { // Use the native wrapper around open(2) try { FileDescriptor fd = NativeIO.POSIX.open(f.getAbsolutePath(), NativeIO.POSIX.O_WRONLY | NativeIO.POSIX.O_CREAT | NativeIO.POSIX.O_EXCL, permissions); return new FileOutputStream(fd); } catch (NativeIOException nioe) { if (nioe.getErrno() == Errno.EEXIST) { throw new AlreadyExistsException(nioe); } throw nioe; } } else { // Use the Windows native APIs to create equivalent FileOutputStream try { FileDescriptor fd = NativeIO.Windows.createFile(f.getCanonicalPath(), NativeIO.Windows.GENERIC_WRITE, NativeIO.Windows.FILE_SHARE_DELETE | NativeIO.Windows.FILE_SHARE_READ | NativeIO.Windows.FILE_SHARE_WRITE, NativeIO.Windows.CREATE_NEW); NativeIO.POSIX.chmod(f.getCanonicalPath(), permissions); return new FileOutputStream(fd); } catch (NativeIOException nioe) { if (nioe.getErrorCode() == 80) { // ERROR_FILE_EXISTS // 80 (0x50) // The file exists throw new AlreadyExistsException(nioe); } throw nioe; } } } private synchronized static void ensureInitialized() { if (!initialized) { cacheTimeout = new Configuration().getLong("hadoop.security.uid.cache.secs", 4*60*60) * 1000; LOG.info("Initialized cache for UID to User mapping with a cache" + " timeout of " + cacheTimeout/1000 + " seconds."); initialized = true; } } /** * A version of renameTo that throws a descriptive exception when it fails. * * @param src The source path * @param dst The destination path * * @throws NativeIOException On failure. */ public static void renameTo(File src, File dst) throws IOException { if (!nativeLoaded) { if (!src.renameTo(dst)) { throw new IOException("renameTo(src=" + src + ", dst=" + dst + ") failed."); } } else { renameTo0(src.getAbsolutePath(), dst.getAbsolutePath()); } } /** * Creates a hardlink "dst" that points to "src". * * This is deprecated since JDK7 NIO can create hardlinks via the * {@link java.nio.file.Files} API. 
* * @param src source file * @param dst hardlink location * @throws IOException */ @Deprecated public static void link(File src, File dst) throws IOException { if (!nativeLoaded) { HardLink.createHardLink(src, dst); } else { link0(src.getAbsolutePath(), dst.getAbsolutePath()); } } /** * A version of renameTo that throws a descriptive exception when it fails. * * @param src The source path * @param dst The destination path * * @throws NativeIOException On failure. */ private static native void renameTo0(String src, String dst) throws NativeIOException; private static native void link0(String src, String dst) throws NativeIOException; /** * Unbuffered file copy from src to dst without tainting OS buffer cache * * In POSIX platform: * It uses FileChannel#transferTo() which internally attempts * unbuffered IO on OS with native sendfile64() support and falls back to * buffered IO otherwise. * * It minimizes the number of FileChannel#transferTo call by passing the the * src file size directly instead of a smaller size as the 3rd parameter. * This saves the number of sendfile64() system call when native sendfile64() * is supported. In the two fall back cases where sendfile is not supported, * FileChannle#transferTo already has its own batching of size 8 MB and 8 KB, * respectively. * * In Windows Platform: * It uses its own native wrapper of CopyFileEx with COPY_FILE_NO_BUFFERING * flag, which is supported on Windows Server 2008 and above. * * Ideally, we should use FileChannel#transferTo() across both POSIX and Windows * platform. Unfortunately, the wrapper(Java_sun_nio_ch_FileChannelImpl_transferTo0) * used by FileChannel#transferTo for unbuffered IO is not implemented on Windows. * Based on OpenJDK 6/7/8 source code, Java_sun_nio_ch_FileChannelImpl_transferTo0 * on Windows simply returns IOS_UNSUPPORTED. * * Note: This simple native wrapper does minimal parameter checking before copy and * consistency check (e.g., size) after copy. * It is recommended to use wrapper function like * the Storage#nativeCopyFileUnbuffered() function in hadoop-hdfs with pre/post copy * checks. * * @param src The source path * @param dst The destination path * @throws IOException */ public static void copyFileUnbuffered(File src, File dst) throws IOException { if (nativeLoaded && Shell.WINDOWS) { copyFileUnbuffered0(src.getAbsolutePath(), dst.getAbsolutePath()); } else { FileInputStream fis = new FileInputStream(src); FileChannel input = null; try { input = fis.getChannel(); try (FileOutputStream fos = new FileOutputStream(dst); FileChannel output = fos.getChannel()) { long remaining = input.size(); long position = 0; long transferred = 0; while (remaining > 0) { transferred = input.transferTo(position, remaining, output); remaining -= transferred; position += transferred; } } } finally { IOUtils.cleanupWithLogger(LOG, input, fis); } } } private static native void copyFileUnbuffered0(String src, String dst) throws NativeIOException; }
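The file above is essentially the stock Hadoop source; the change that matters for the access0 error is inside the Windows class, where the access check is short-circuited so that it always grants access instead of delegating to the native access0 call. That fragment (already contained in the code above) is:

/** Checks whether the current process has the desired access rights on
 *  the given path. In this local workaround copy, the native access0
 *  call is skipped and access is always reported as granted. */
public static boolean access(String path, AccessRight desiredAccess)
    throws IOException {
  return true;
}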
Third, about the Maven-built project: because I was working on the company intranet, downloads were very slow, so I changed strategy. I created a plain Java project and copied in the jar files from the common, hdfs, httpfs, yarn, and mapreduce directories under hadoop-2.9.2/share, and ran into quite a few bugs along the way. The jars I ended up with are listed below.
hadoop-hdfs-2.9.2.jar hadoop-hdfs-client-2.9.2.jar hadoop-mapreduce-client-app-2.9.2.jar hadoop-mapreduce-client-common-2.9.2.jar hadoop-mapreduce-client-core-2.9.2.jar hadoop-mapreduce-client-hs-2.9.2.jar hadoop-mapreduce-client-jobclient-2.9.2-tests.jar hadoop-mapreduce-client-shuffle-2.9.2.jar hadoop-yarn-api-2.9.2.jar hadoop-yarn-applications-distributedshell-2.9.2.jar hadoop-yarn-applications-unmanaged-am-launcher-2.9.2.jar hadoop-yarn-client-2.9.2.jar activation-1.1.jar aopalliance-1.0.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar asm-3.2.jar avro-1.7.7.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-io-2.4.jar commons-lang-2.6.jar commons-lang3-3.4.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar ehcache-3.3.1.jar fst-2.50.jar geronimo-jcache_1.0_spec-1.0-alpha-1.jar gson-2.2.4.jar guava-11.0.2.jar guice-3.0.jar guice-servlet-3.0.jar HikariCP-java7-2.4.12.jar htrace-core4-4.1.0-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar jackson-core-asl-1.9.13.jar jackson-jaxrs-1.9.13.jar jackson-mapper-asl-1.9.13.jar jackson-xc-1.9.13.jar java-util-1.9.0.jar java-xmlbuilder-0.4.jar javax.inject-1.jar jaxb-api-2.2.2.jar jaxb-impl-2.2.3-1.jar jcip-annotations-1.0-1.jar jersey-client-1.9.jar jersey-core-1.9.jar jersey-guice-1.9.jar jersey-json-1.9.jar jersey-server-1.9.jar jets3t-0.9.0.jar jettison-1.1.jar jetty-6.1.26.jar jetty-sslengine-6.1.26.jar jetty-util-6.1.26.jar jsch-0.1.54.jar json-io-2.5.1.jar json-smart-1.3.1.jar jsp-api-2.1.jar jsr305-3.0.0.jar leveldbjni-all-1.8.jar log4j-1.2.17.jar metrics-core-3.0.1.jar mssql-jdbc-6.2.1.jre7.jar netty-3.6.2.Final.jar nimbus-jose-jwt-4.41.1.jar paranamer-2.3.jar protobuf-java-2.5.0.jar servlet-api-2.5.jar snappy-java-1.0.5.jar stax-api-1.0-2.jar stax2-api-3.1.4.jar woodstox-core-5.0.3.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar hadoop-common-2.9.2.jar slf4j-api-1.7.25.jar slf4j-log4j12-1.7.25.jar hadoop-yarn-server-nodemanager-2.9.2.jar hadoop-yarn-server-resourcemanager-2.9.2.jar hadoop-yarn-server-router-2.9.2.jar hadoop-yarn-server-sharedcachemanager-2.9.2.jar hadoop-yarn-server-timeline-pluginstorage-2.9.2.jar hadoop-yarn-server-web-proxy-2.9.2.jar hadoop-yarn-ui-2.9.2.war hadoop-annotations-2.9.2.jar hadoop-auth-2.9.2.jar hadoop-nfs-2.9.2.jar hamcrest-core-1.3.jar junit-4.11.jar hadoop-mapreduce-client-jobclient-2.9.2.jar mockito-all-1.8.5.jar ojdbc7.jar orai18n.jar hadoop-yarn-common-2.9.2.jar hadoop-yarn-registry-2.9.2.jar hadoop-yarn-server-applicationhistoryservice-2.9.2.jar hadoop-yarn-server-common-2.9.2.jar
Preface
Web logs hold some of a website's most important information. Through log analysis we can learn the site's traffic, which pages get the most visits, which pages are the most valuable, and so on. A typical mid-sized site (more than 100,000 PV per day) generates over 1 GB of web log data per day; large and very large sites can generate 10 GB per hour.
For log data of this scale, Hadoop is an ideal fit for the analysis.
Contents
Web logs are produced by web servers such as Nginx, Apache, or Tomcat. From web logs we can obtain the PV (PageView) of each kind of page and the number of unique IPs; with a bit more work we can compute rankings of the keywords users searched for, the pages where users stayed longest, and so on; more complex analyses can build ad-click models, profile user behavior, and more.
In a web log, each line normally represents one user visit. For example, here is an nginx log entry:
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
It breaks down into the following 8 fields (these are exactly the fields of the KPI class used later):
remote_addr: client IP address
remote_user: client user name ("-" if absent)
time_local: access time and time zone
request: requested URL and HTTP protocol
status: response status code (200 means success)
body_bytes_sent: size of the response body sent to the client
http_referer: the page the request came from
http_user_agent: client browser information
Note: to capture more information than this, other techniques are needed, such as sending extra requests from JavaScript or using cookies to record user visit data.
With this log data we can start digging into the site's secrets.
The small-data case
For small amounts of data (10 MB, 100 MB, even 10 GB), while a single machine can still cope, you can simply use the usual Unix/Linux tools: awk, grep, sort, join and the like are all excellent for log analysis, and combined with Perl, Python, and regular expressions they can handle almost every problem.
For example, getting the top 10 IPs by request count from the nginx log mentioned above is very simple:
~ cat access.log.10 | awk '{a[$1]++} END {for(b in a) print b"\t"a[b]}' | sort -k2 -r | head -n 10
163.177.71.12   972
101.226.68.137  972
183.195.232.138 971
50.116.27.194   97
14.17.29.86     96
61.135.216.104  94
61.135.216.105  91
61.186.190.41   9
59.39.192.108   9
220.181.51.212  9
The massive-data case
When the data grows by 10 GB or 100 GB every day, a single machine can no longer keep up, and we have to add complexity to the system: compute clusters and storage arrays. Before Hadoop appeared, storing and analyzing massive logs was very hard; only a few companies mastered the core technologies of efficient parallel computing, distributed computing, and distributed storage.
Hadoop dramatically lowered the bar for processing massive data, putting it within reach of small companies and even individuals, and it is particularly well suited to log analysis systems.
Below, starting from a company case study, we will walk through how to use Hadoop to analyze massive web logs and extract KPI data.
Case study
An e-commerce website runs an online group-buying business. Daily PV is about 1,000,000 with about 50,000 unique IPs. Traffic peaks on workdays between 10:00-12:00 and 15:00-18:00. During the day most visits come from desktop browsers; on weekends and at night mobile devices dominate. Search pages account for 80% of the site's page views; fewer than 1% of PC users make a purchase, while about 5% of mobile users do.
From this short description we can get a rough picture of how the site is doing, see where the paying users come from, which potential users could still be tapped, and whether the site is at risk of going under.
KPI metric design
Note: due to business confidentiality, the e-commerce site's logs cannot be shared.
The analysis below therefore uses data extracted from my personal website as the example.
Baidu Analytics statistics for my personal website: http://www.fens.me
From a business point of view, a personal site behaves differently from an e-commerce site: there is no conversion rate, and the bounce rate is fairly high. From a technical point of view, however, the KPI metric design is exactly the same.
Parallel algorithm design:
Note: these use the 8 fields defined in the first section.
PV (PageView): page view counts per page
IP: unique-IP counts per page
Time: PV counts per hour
Source: visit counts by referrer domain
Browser: visit counts by access device (user agent)
In the figure above, the left side is the application (business) system and the right side is Hadoop's HDFS and MapReduce.
The second figure shows more clearly how the data flows: the parts on the blue background live inside Hadoop. Our task now is to implement the MapReduce programs.
See the earlier article "用Maven構建Hadoop項目" (Building a Hadoop project with Maven).
The Windows 7 development environment and the Hadoop runtime environment are both described in that article.
We need to upload the log file to the /user/hdfs/log_kpi/ directory in HDFS, using the commands below:
~ hadoop fs -mkdir /user/hdfs/log_kpi
~ hadoop fs -copyFromLocal /home/conan/datafiles/access.log.10 /user/hdfs/log_kpi/
I have already put the entire MapReduce implementation on GitHub:
https://github.com/bsspirit/maven_hadoop_template/releases/tag/kpi_v1
Development workflow:
1). Parsing log lines
Create a new file: org.conan.myhadoop.mr.kpi.KPI.java
package org.conan.myhadoop.mr.kpi;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/*
 * KPI Object
 */
public class KPI {
    private String remote_addr;     // client IP address
    private String remote_user;     // client user name, "-" if absent
    private String time_local;      // access time and time zone
    private String request;         // requested URL and HTTP protocol
    private String status;          // request status; 200 means success
    private String body_bytes_sent; // size of the response body sent to the client
    private String http_referer;    // the page the request came from
    private String http_user_agent; // client browser information
    private boolean valid = true;   // whether the record is valid

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append("valid:" + this.valid);
        sb.append("\nremote_addr:" + this.remote_addr);
        sb.append("\nremote_user:" + this.remote_user);
        sb.append("\ntime_local:" + this.time_local);
        sb.append("\nrequest:" + this.request);
        sb.append("\nstatus:" + this.status);
        sb.append("\nbody_bytes_sent:" + this.body_bytes_sent);
        sb.append("\nhttp_referer:" + this.http_referer);
        sb.append("\nhttp_user_agent:" + this.http_user_agent);
        return sb.toString();
    }

    public String getRemote_addr() { return remote_addr; }
    public void setRemote_addr(String remote_addr) { this.remote_addr = remote_addr; }

    public String getRemote_user() { return remote_user; }
    public void setRemote_user(String remote_user) { this.remote_user = remote_user; }

    public String getTime_local() { return time_local; }

    public Date getTime_local_Date() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US);
        return df.parse(this.time_local);
    }

    public String getTime_local_Date_hour() throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHH");
        return df.format(this.getTime_local_Date());
    }

    public void setTime_local(String time_local) { this.time_local = time_local; }

    public String getRequest() { return request; }
    public void setRequest(String request) { this.request = request; }

    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }

    public String getBody_bytes_sent() { return body_bytes_sent; }
    public void setBody_bytes_sent(String body_bytes_sent) { this.body_bytes_sent = body_bytes_sent; }

    public String getHttp_referer() { return http_referer; }

    public String getHttp_referer_domain() {
        if (http_referer.length() < 8) {
            return http_referer;
        }
        String str = this.http_referer.replace("\"", "").replace("http://", "").replace("https://", "");
        return str.indexOf("/") > 0 ? str.substring(0, str.indexOf("/")) : str;
    }

    public void setHttp_referer(String http_referer) { this.http_referer = http_referer; }

    public String getHttp_user_agent() { return http_user_agent; }
    public void setHttp_user_agent(String http_user_agent) { this.http_user_agent = http_user_agent; }

    public boolean isValid() { return valid; }
    public void setValid(boolean valid) { this.valid = valid; }

    public static void main(String args[]) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
        System.out.println(line);
        KPI kpi = new KPI();
        String[] arr = line.split(" ");

        kpi.setRemote_addr(arr[0]);
        kpi.setRemote_user(arr[1]);
        kpi.setTime_local(arr[3].substring(1));
        kpi.setRequest(arr[6]);
        kpi.setStatus(arr[8]);
        kpi.setBody_bytes_sent(arr[9]);
        kpi.setHttp_referer(arr[10]);
        kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
        System.out.println(kpi);

        try {
            SimpleDateFormat df = new SimpleDateFormat("yyyy.MM.dd:HH:mm:ss", Locale.US);
            System.out.println(df.format(kpi.getTime_local_Date()));
            System.out.println(kpi.getTime_local_Date_hour());
            System.out.println(kpi.getHttp_referer_domain());
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}
The main method takes one line from the log file and runs a simple parsing test.
Console output:
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
valid:true
remote_addr:222.68.172.190
remote_user:-
time_local:18/Sep/2013:06:49:57
request:/images/my.jpg
status:200
body_bytes_sent:19939
http_referer:"http://www.angularjs.cn/A00n"
http_user_agent:"Mozilla/5.0 (Windows
2013.09.18:06:49:57
2013091806
www.angularjs.cn
We can see that the log line is parsed correctly into the fields of the kpi object. Next, the parsing logic is extracted into a standalone method:
private static KPI parser(String line) {
    System.out.println(line);
    KPI kpi = new KPI();
    String[] arr = line.split(" ");
    if (arr.length > 11) {
        kpi.setRemote_addr(arr[0]);
        kpi.setRemote_user(arr[1]);
        kpi.setTime_local(arr[3].substring(1));
        kpi.setRequest(arr[6]);
        kpi.setStatus(arr[8]);
        kpi.setBody_bytes_sent(arr[9]);
        kpi.setHttp_referer(arr[10]);

        if (arr.length > 12) {
            kpi.setHttp_user_agent(arr[11] + " " + arr[12]);
        } else {
            kpi.setHttp_user_agent(arr[11]);
        }

        if (Integer.parseInt(kpi.getStatus()) >= 400) { // status >= 400 means an HTTP error
            kpi.setValid(false);
        }
    } else {
        kpi.setValid(false);
    }
    return kpi;
}
The map method, the reduce method, and the job launcher are implemented in a separate class.
The MapReduce implementation classes are introduced below:
1). PV:org.conan.myhadoop.mr.kpi.KPIPV.java
package org.conan.myhadoop.mr.kpi;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class KPIPV {

    public static class KPIPVMapper extends MapReduceBase
            implements Mapper<Object, Text, Text, IntWritable> {
        private IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            KPI kpi = KPI.filterPVs(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest());
                output.collect(word, one);
            }
        }
    }

    public static class KPIPVReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            result.set(sum);
            output.collect(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        String input = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/";
        String output = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv";

        JobConf conf = new JobConf(KPIPV.class);
        conf.setJobName("KPIPV");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");

        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(KPIPVMapper.class);
        conf.setCombinerClass(KPIPVReducer.class);
        conf.setReducerClass(KPIPVReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        JobClient.runJob(conf);
        System.exit(0);
    }
}
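KPIPV above is written against the old org.apache.hadoop.mapred API. If you run it on Hadoop 2.x and hit the empty-output problem mentioned at the start of this post, one option is to port the driver to the new org.apache.hadoop.mapreduce API. The following is only a minimal sketch, assuming the Mapper and Reducer have been rewritten as new-API classes named KPIPVMapper and KPIPVReducer (as in the earlier sketches); KPIPV2 is an illustrative class name, not part of the original project:

package org.conan.myhadoop.mr.kpi;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KPIPV2 {
    public static void main(String[] args) throws Exception {
        String input = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/";
        String output = "hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv";

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "KPIPV");
        job.setJarByClass(KPIPV2.class);

        // New-API mapper/reducer classes (see the map and reduce sketches earlier in this post).
        job.setMapperClass(KPIPVMapper.class);
        job.setCombinerClass(KPIPVReducer.class);
        job.setReducerClass(KPIPVReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The site configuration files (core-site.xml and so on) can still be added through the Configuration object; the main difference from the JobConf-based driver is that Job.waitForCompletion submits the job and reports success or failure.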
In the mapper, the program calls a method of the KPI class:
KPI kpi = KPI.filterPVs(value.toString());
The filterPVs method gives us finer control over what counts toward PV.
Add the filterPVs method to KPI.java:
/**
 * Filter requests for per-page PV counting.
 * (Requires: import java.util.HashSet; import java.util.Set;)
 */
public static KPI filterPVs(String line) {
    KPI kpi = parser(line);
    Set pages = new HashSet();
    pages.add("/about");
    pages.add("/black-ip-list/");
    pages.add("/cassandra-clustor/");
    pages.add("/finance-rhive-repurchase/");
    pages.add("/hadoop-family-roadmap/");
    pages.add("/hadoop-hive-intro/");
    pages.add("/hadoop-zookeeper-intro/");
    pages.add("/hadoop-mahout-roadmap/");

    if (!pages.contains(kpi.getRequest())) {
        kpi.setValid(false);
    }
    return kpi;
}
In filterPVs we define a set of pages to filter on, so that PV is counted only for those pages.
Now run KPIPV.java; the console output is:
2013-10-9 11:53:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush 信息: Starting flush of map output 2013-10-9 11:53:28 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill 信息: Finished spill 0 2013-10-9 11:53:28 org.apache.hadoop.mapred.Task done 信息: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate 信息: hdfs://192.168.1.210:9000/user/hdfs/log_kpi/access.log.10:0+3025757 2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate 信息: hdfs://192.168.1.210:9000/user/hdfs/log_kpi/access.log.10:0+3025757 2013-10-9 11:53:30 org.apache.hadoop.mapred.Task sendDone 信息: Task 'attempt_local_0001_m_000000_0' done. 2013-10-9 11:53:30 org.apache.hadoop.mapred.Task initialize 信息: Using ResourceCalculatorPlugin : null 2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate 信息: 2013-10-9 11:53:30 org.apache.hadoop.mapred.Merger$MergeQueue merge 信息: Merging 1 sorted segments 2013-10-9 11:53:30 org.apache.hadoop.mapred.Merger$MergeQueue merge 信息: Down to the last merge-pass, with 1 segments left of total size: 213 bytes 2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate 信息: 2013-10-9 11:53:30 org.apache.hadoop.mapred.Task done 信息: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting 2013-10-9 11:53:30 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate 信息: 2013-10-9 11:53:30 org.apache.hadoop.mapred.Task commit 信息: Task attempt_local_0001_r_000000_0 is allowed to commit now 2013-10-9 11:53:30 org.apache.hadoop.mapred.FileOutputCommitter commitTask 信息: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/log_kpi/pv 2013-10-9 11:53:31 org.apache.hadoop.mapred.JobClient monitorAndPrintJob 信息: map 100% reduce 0% 2013-10-9 11:53:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate 信息: reduce > reduce 2013-10-9 11:53:33 org.apache.hadoop.mapred.Task sendDone 信息: Task 'attempt_local_0001_r_000000_0' done. 
2013-10-9 11:53:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob 信息: map 100% reduce 100% 2013-10-9 11:53:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob 信息: Job complete: job_local_0001 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Counters: 20 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: File Input Format Counters 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Bytes Read=3025757 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: File Output Format Counters 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Bytes Written=183 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: FileSystemCounters 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: FILE_BYTES_READ=545 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: HDFS_BYTES_READ=6051514 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: FILE_BYTES_WRITTEN=83472 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: HDFS_BYTES_WRITTEN=183 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Map-Reduce Framework 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Map output materialized bytes=217 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Map input records=14619 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Reduce shuffle bytes=0 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Spilled Records=16 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Map output bytes=2004 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Total committed heap usage (bytes)=376569856 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Map input bytes=3025757 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: SPLIT_RAW_BYTES=110 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Combine input records=76 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Reduce input records=8 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Reduce input groups=8 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Combine output records=8 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Reduce output records=8 2013-10-9 11:53:34 org.apache.hadoop.mapred.Counters log 信息: Map output records=76
Use the hadoop command to view the output file in HDFS:
~ hadoop fs -cat /user/hdfs/log_kpi/pv/part-00000
/about  5
/black-ip-list/ 2
/cassandra-clustor/     3
/finance-rhive-repurchase/      13
/hadoop-family-roadmap/ 13
/hadoop-hive-intro/     14
/hadoop-mahout-roadmap/ 20
/hadoop-zookeeper-intro/        6
This gives us the PV counts for the specified pages from the log file.
The specified pages act like a site map: without such a list every requested URL would show up, whereas by specifying a "site map" we can more easily find exactly the information we need.
The remaining metrics are extracted in much the same way as PV; you can download the source code and run it to see the results. As an illustration, a sketch of the unique-IP-per-page job follows.
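Here is a hedged sketch of how the "IP" metric (unique client IPs per page) from the design list might look, written in the same old-API style as KPIPV above. It assumes a KPI.filterIPs helper analogous to filterPVs; the class and method names are illustrative and not necessarily identical to those in the GitHub repository.

package org.conan.myhadoop.mr.kpi;

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class KPIIPSketch {

    public static class KPIIPMapper extends MapReduceBase
            implements Mapper<Object, Text, Text, Text> {
        private Text word = new Text();
        private Text ip = new Text();

        @Override
        public void map(Object key, Text value, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
            // KPI.filterIPs is assumed to parse the line and mark which requests to count,
            // in the same way KPI.filterPVs does for the PV job.
            KPI kpi = KPI.filterIPs(value.toString());
            if (kpi.isValid()) {
                word.set(kpi.getRequest());
                ip.set(kpi.getRemote_addr());
                output.collect(word, ip); // emit (page, client IP)
            }
        }
    }

    public static class KPIIPReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        @Override
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
            // Collect the distinct client IPs seen for this page and emit the count.
            Set<String> uniqueIps = new HashSet<String>();
            while (values.hasNext()) {
                uniqueIps.add(values.next().toString());
            }
            result.set(String.valueOf(uniqueIps.size()));
            output.collect(key, result);
        }
    }
}

The job setup would mirror KPIPV's main method, except that the map output value type is Text and no combiner is set, since partial counts of distinct IPs cannot simply be summed.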
Later I will upload my own code to GitHub:
https://github.com/blench/