Exporting data with a script, and setting the run queue
bin/beeline -u 'url' --outputformat=tsv -e "set mapreduce.job.queuename=queue_1" -e "select * from search_log where date <= 20150525 and date >= 20150523" > test.txt
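The queue can also be set once at launch with --hiveconf instead of a separate -e statement; a minimal variant of the same command (same placeholder URL and query):
bin/beeline -u 'url' --hiveconf mapreduce.job.queuename=queue_1 --outputformat=tsv -e "select * from search_log where date <= 20150525 and date >= 20150523" > test.txt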
Converting milliseconds to a date
select from_unixtime(cast(createTime/1000 as bigint)) from video_information;
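from_unixtime also takes an optional output format string; a quick sanity check with a hypothetical millisecond timestamp literal:
select from_unixtime(cast(1432483200123/1000 as bigint), 'yyyy-MM-dd HH:mm:ss');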
Parsing fields whose values are JSON. Below, the field data holds JSON, and its type key indicates the log type; the query pulls one search log.
Syntax: get_json_object(field, "$.field")
select * from video where date=20151215 and get_json_object(data, "$.type")="search" limit 1;
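When several keys are needed from the same JSON column, json_tuple parses the document once instead of calling get_json_object per key; a sketch assuming a hypothetical second key query in data:
select t.type, t.query from video LATERAL VIEW json_tuple(data, 'type', 'query') t AS type, query where date=20151215 limit 1;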
Parsing a JSONArray
The table has 3 fields (asrtext array, asraudiourl string, asrvendor string):
asraudiourl | string | https://xxx |
asrtext | array | [{"text":"我是業主","confidence":1.0,"queryvendor":"1009","querydebug":"{\"recordId\":\"92e12fe7\",\"applicationId\":\"\",\"eof\":1,\"result\":{\"rec\":\"我 是 業主\",\"eof\":1}}","isfinal":true}] |
select asr, asraudiourl, asrvendor from aiservice.asr_info LATERAL VIEW explode(asrtext) asrTable AS asr where date=20170523 and asrvendor='AiSpeech' and asr.isfinal=true and asr.text="我是業主" limit 1;
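Each exploded struct's querydebug field is itself a JSON string (see the sample row above), so it can be unpacked further with get_json_object; a sketch building on the same query:
select get_json_object(asr.querydebug, '$.result.rec') from aiservice.asr_info LATERAL VIEW explode(asrtext) asrTable AS asr where date=20170523 and asr.isfinal=true limit 1;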
A distinct pitfall
distinct only counts rows whose fields are all non-null: with count(distinct x, y), any row where either column is null is silently dropped, skewing the result. So we manually map null to a placeholder value:
select count(distinct requestid, CASE WHEN resid is null THEN "1" ELSE resid END)
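A self-contained demo of the pitfall (inline data, no real table; needs a Hive version that allows FROM-less selects). The first count silently drops the two rows with a null second column, while the second keeps them, returning 1 and 3 respectively:
select count(distinct x, y),
       count(distinct x, CASE WHEN y is null THEN "1" ELSE y END)
from (
  select 'a' as x, 'u1' as y
  union all select 'b', cast(null as string)
  union all select 'c', cast(null as string)
) t;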
Job submission examples
$SPARK_HOME/bin/spark-submit --class com.test.SimilarQuery --master yarn-cluster --num-executors 40 --driver-memory 4g --executor-memory 2g --executor-cores 1 similar-query-0.0.1-SNAPSHOT-jar-with-dependencies.jar 20150819 /user/similar-query
hadoop jar game-query-down-0.0.1-SNAPSHOT.jar QueryDownJob -Dmapreduce.job.queuename=sns_default arg1 arg2
Note: the -D generic option is parsed by GenericOptionsParser, so QueryDownJob needs to run through ToolRunner for the queue setting to take effect.
Common MapReduce input/output formats (a driver sketch follows the list):
TextInputFormat: the default; reads the file line by line; key is the line's byte offset (LongWritable), value is the line content (Text)
KeyValueTextInputFormat: parses each line into a key/value pair split at the first tab: key + \t + value
SequenceFileInputFormat/SequenceFileOutputFormat: binary format; key/value types are user-defined, and input and output must be kept consistent
TextOutputFormat: writes plain text, one key + \t + value per line
NullOutputFormat: produces no output; the output data is discarded
MapFileOutputFormat: writes the result into a MapFile; keys in a MapFile must be sorted, so the reducer must emit its keys in order
DBInputFormat/DBOutputFormat: reads from and writes to a relational database via JDBC
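A minimal Java driver sketch showing where these formats plug in. The class name, paths, and the particular combination (KeyValueTextInputFormat in, SequenceFileOutputFormat out) are illustrative assumptions, not from the original notes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical driver: copies tab-separated key/value text into a SequenceFile
public class FormatDemoJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "format-demo");
        job.setJarByClass(FormatDemoJob.class);

        // Input: each line is split at the first \t into key/value (both Text)
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Output: binary SequenceFile; classes must match what the reducer emits
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // No mapper/reducer set, so Hadoop's identity Mapper/Reducer are used
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}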