Our company runs a Hadoop cluster of fewer than 30 nodes with roughly 120 TB of HDFS capacity. Recently the monitoring kept alerting on low disk space (it fires when free space drops below 5%). We had always been too busy with business work to tidy up the cluster. After taking stock, I found about 34 TB of actual files, which with 3x replication comes to roughly 103 TB of HDFS usage. During the original ETL the data had been written out as plain text with no compression at all, so there should be a lot of room for optimization here. One log of users' installed mobile apps takes about 5 TB on its own, so I started with that.
Hive offers three file storage formats: TEXTFILE, SEQUENCEFILE, and RCFILE. The first two are row-oriented; RCFile is a column-oriented format introduced with Hive. It follows a "partition horizontally first, then vertically" design: the data is first split into row groups, and within each row group it is laid out column by column, so columns a query does not reference can be skipped at the I/O level. So I went with RCFILE, plus Gzip compression on top.
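As a quick illustration of what "skipped at the I/O level" means (the table and column names below, app_install and pkg, are placeholders rather than the real schema), a query that only touches a couple of columns only has to read those column groups from each row group:

-- only the pkg column group (plus the day partition pruning) is read;
-- the other column groups of the RCFile are skipped
SELECT pkg, count(*)
FROM app_install
WHERE day = 20140101
GROUP BY pkg;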
Along the way I also made a rather dumb mistake. A colleague (who has since left) had looked into RCFile before, so I used SHOW CREATE TABLE XX to check the existing table's DDL and got:
CREATE EXTERNAL TABLE XX (
  ......
)
PARTITIONED BY (day int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION '/user/hive/data/XX';
I copied that statement as-is, only changing the columns, created an app_install RCFile table, and ran a SQL job to load the old data into it:
-- raise the job priority, merge small output files, and Gzip-compress the output
set mapred.job.priority=VERY_HIGH;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=200000000;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
set mapred.job.name=app_install.$_DAY;

-- rewrite one day's partition from the old text table into the new RCFile table
insert overwrite table app_install1 PARTITION (day=$_DAY)
select XXX from tb1 where day=$_DAY;
The job failed. Looking at the Hadoop task logs, I found:
FATAL ExecReducer: java.lang.UnsupportedOperationException: Currently the writer can only accept BytesRefArrayWritable
    at org.apache.hadoop.hive.ql.io.RCFile$Writer.append(RCFile.java:880)
    at org.apache.hadoop.hive.ql.io.RCFileOutputFormat$2.write(RCFileOutputFormat.java:140)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:588)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:389)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:715)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:697)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:697)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:856)
    at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:265)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:198)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Posts online said this was a Hive bug, and I kept assuming that was the cause. After burning a whole day on it, I finally tried changing the CREATE TABLE statement the way those posts suggested:
CREATE EXTERNAL TABLE XX (
  ......
)
PARTITIONED BY (day int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS RCFILE
LOCATION '/user/hive/data/XX';
This time the job ran normally. But when I checked the table again with SHOW CREATE TABLE XX, the statement had turned back into:
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
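In hindsight, a quicker way to see the real difference between the two tables would have been to look at their storage metadata rather than at the SHOW CREATE TABLE output, along these lines (XX again standing in for the real table name):

DESCRIBE FORMATTED XX;
-- the "Storage Information" section lists the SerDe Library, InputFormat and OutputFormat;
-- the table created with STORED AS RCFILE shows a columnar SerDe there, while the table
-- built from the copied DDL was left on the default LazySimpleSerDe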
Very frustrating. The whole thing came down to the DDL I wrote and what SHOW CREATE TABLE displays not being the same: STORED AS RCFILE also sets a columnar SerDe on the table, whereas the statement I copied only carried the INPUTFORMAT/OUTPUTFORMAT clauses, so the recreated table apparently fell back to the default row SerDe, which hands the RCFile writer plain Text records instead of the BytesRefArrayWritable it expects. A small issue, but it cost quite a bit of effort; hopefully others can avoid tripping over the same thing.
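For reference, spelling out what STORED AS RCFILE expands to would look roughly like the sketch below (the column list and location are placeholders as above, and the exact SerDe class can vary between Hive versions); as far as I can tell, the SerDe clause is exactly what the copied statement was missing:

CREATE EXTERNAL TABLE XX (
  ......
)
PARTITIONED BY (day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.columnar.ColumnarSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION '/user/hive/data/XX';

The FIELDS/COLLECTION ITEMS/LINES TERMINATED BY clauses from the original text table are not needed here, since the columnar SerDe handles its own serialization.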