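I wanted to analyze the usage of operating systems, browsers, and browser versions from log data. Hive's built-in functions cannot parse a user-agent string directly, so a UDF can be written to do it. The user agent identifies the user's current operating system and browser version, and looks like this: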
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 180.173.196.29
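The UA can be parsed with an open-source library, useragentutils.jar. However, the jar cannot simply be referenced as an external dependency here, because Hadoop and Hive will not load a third-party jar that the UDF jar merely references; its source code is imported into the project instead. The project structure should look like this:
[Figure: project structure]
The following code outputs the operating system, browser, and browser version: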
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import eu.bitwalker.useragentutils.UserAgent;

public class ParseUserAgent_UDF extends UDF {
    // Returns "operating system<TAB>browser<TAB>browser version" for one user-agent string.
    public Text evaluate(final Text userAgent) {
        if (userAgent == null) {
            return null; // pass NULL input through unchanged
        }
        UserAgent ua = new UserAgent(userAgent.toString());
        StringBuilder builder = new StringBuilder();
        builder.append(ua.getOperatingSystem()).append("\t")
               .append(ua.getBrowser()).append("\t")
               .append(ua.getBrowserVersion());
        return new Text(builder.toString());
    }
}
Usage: package the project as a jar, then in Hive:
add jar xx.jar;
create temporary function ua_parse as 'com.xx.ParseUserAgent_UDF';
select ua_parse(ua) from table_name limit 3;
Result:
WINDOWS_7 CHROME21 21.0.1180.89
WINDOWS_7 CHROME33 33.0.1750.146
WINDOWS_7 CHROME21 21.0.1180.89
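This approach only maps one input row to one output value: the three fields are packed into a single string, so they cannot be grouped or counted individually.
The version below uses a UDTF (User Defined Table Generating Function), which takes one row and emits multiple columns.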
import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import eu.bitwalker.useragentutils.UserAgent;

public class ParseUserAgent_UDTF extends GenericUDTF {

    // Validates the single argument and declares the three output columns.
    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("ua_parse takes exactly one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("ua_parse takes a string as its parameter");
        }
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("system");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("browser");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("version");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // Parses one user-agent string and forwards it as a (system, browser, version) row.
    @Override
    public void process(Object[] arg) {
        try {
            if (arg == null || arg.length == 0 || arg[0] == null) {
                return; // skip rows with no user agent
            }
            String input = arg[0].toString();
            String[] result = ua_parse(input).split("\t");
            forward(result);
        } catch (Exception e) {
            // Rows that fail to parse are skipped; only the stack trace is logged.
            e.printStackTrace();
        }
    }

    @Override
    public void close() throws HiveException {
        // Nothing to clean up.
    }

    // Returns "operating system<TAB>browser<TAB>browser version".
    public String ua_parse(String userAgent) {
        UserAgent ua = new UserAgent(userAgent);
        StringBuilder builder = new StringBuilder();
        builder.append(ua.getOperatingSystem()).append("\t")
               .append(ua.getBrowser()).append("\t")
               .append(ua.getBrowserVersion());
        return builder.toString();
    }
}
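The process() method above relies on split("\t") returning exactly three elements, one per declared column. In Java, split without a limit drops trailing empty strings, so an empty last field would yield a shorter array. The following is a minimal sketch of a more defensive split, assuming a hypothetical helper class UaRowSplitter and a placeholder value "UNKNOWN" for missing fields; neither is part of Hive or UserAgentUtils.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Hive or UserAgentUtils.
class UaRowSplitter {
    static final int COLUMNS = 3; // system, browser, version

    // Splits a tab-joined string into exactly COLUMNS fields.
    static String[] toRow(String joined) {
        // A limit of -1 keeps trailing empty fields, which split("\t") would drop.
        String[] parts = joined.split("\t", -1);
        // copyOf pads with null when there are too few fields and truncates extras.
        String[] row = Arrays.copyOf(parts, COLUMNS);
        for (int i = 0; i < COLUMNS; i++) {
            if (row[i] == null || row[i].isEmpty()) {
                row[i] = "UNKNOWN"; // assumed placeholder for a missing field
            }
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toRow("WINDOWS_7\tCHROME21\t21.0.1180.89")));
        System.out.println(Arrays.toString(toRow("WINDOWS_7\tCHROME21\t")));
    }
}
```

In the UDTF, a call such as forward(UaRowSplitter.toRow(ua_parse(input))) would then always pass a three-element row, whatever the parser returns.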
Usage, with ParseUserAgent_UDTF registered as ua_parse in the same way as the UDF:
select t.browser, count(*) c from (select ua_parse(ua) as (system, browser, version) from table_name) t group by t.browser order by c desc;
Top ten:
CHROME31 987220571
UNKNOWN 708890045
IE8 420021677
IE7 411500373
MOBILE_SAFARI 291920740
IE6 217574865
IE11 179582201
IE9 165160040
CHROME30 158623163
CHROME21 155192489
A lot of user agents are still unrecognized!
References: http://blog.csdn.net/ruidongliu/article/details/8791865
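The aggregation in the query above (group by browser, count, sort by count descending) can be checked locally on a few parsed rows before running it on the full table. This is only a sketch: the class BrowserCount and its method are hypothetical names, the rows are assumed to be (system, browser, version) arrays like those the UDTF forwards, and the tie-break by browser name is an addition that the Hive query does not specify.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical local check; mirrors "group by browser order by count desc".
class BrowserCount {
    // Each row is {system, browser, version}; index 1 is the browser column.
    static List<Map.Entry<String, Long>> countByBrowser(List<String[]> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(r -> r[1], Collectors.counting()))
                .entrySet().stream()
                // Highest count first; ties broken by browser name (an assumption).
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample rows taken from the UDF output shown earlier.
        List<String[]> rows = Arrays.asList(
                new String[]{"WINDOWS_7", "CHROME21", "21.0.1180.89"},
                new String[]{"WINDOWS_7", "CHROME33", "33.0.1750.146"},
                new String[]{"WINDOWS_7", "CHROME21", "21.0.1180.89"});
        for (Map.Entry<String, Long> e : countByBrowser(rows)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```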
http://computerdragon.blog.51cto.com/6235984/1288567