Requirement: count the PV per hour
ETL: describes the process of extracting data from a source, transforming it, and loading it into a destination
Field filtering
Field completion
Field formatting
Export the data
Hive: open-sourced by Facebook, a project for statistical analysis of massive structured logs
Essence: Hive translates HQL into MapReduce programs (see the EXPLAIN sketch below)
Hive's data is simply directories and files on HDFS
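A quick way to see the HQL-to-MapReduce translation is EXPLAIN; a minimal sketch, assuming the test_db.emp_p table used later in these notes has a job column:
# Print the execution plan Hive compiles for a simple aggregation;
# the plan lists the map-side and reduce-side group-by stages of the MapReduce job.
$ bin/hive -e "explain select job, count(1) from test_db.emp_p group by job"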
Embedded mode: metadata is stored in the built-in Derby database
Only one connection is allowed at a time
Mostly used for demos
Local mode: metadata is stored in a MySQL database
MySQL runs on the same physical machine as Hive
Mostly used for development and testing
Remote mode: metadata is stored in a MySQL database
MySQL runs on a different physical machine than Hive
Used in real production environments
1) Uninstall
$ rpm -qa | grep mysql
$ sudo rpm -e mysql-libs-5.1.71-1.el6.x86_64 --nodeps
2) Install
Optionally seed the yum cache before installing: $ sudo cp -r /opt/software/x86_64/ /var/cache/yum/
$ sudo yum install -y mysql-server mysql mysql-devel
3) Start the mysql service
$ sudo service mysqld start
4) Set the password
$ /usr/bin/mysqladmin -u root password 'new_password'
5) Start on boot
$ sudo chkconfig mysqld on
6) Grant root privileges and enable remote login
Log in
$ mysql -u root -p
Grant
mysql> grant all privileges on *.* to 'root'@'%' identified by 'password';
mysql> grant all privileges on *.* to 'root'@'linux01' identified by 'password';   -- this statement is required; '%' matches all hosts
all privileges: all permissions
*.*: all tables in all databases
'root'@'%': log in as root from any host
'root'@'linux03.ibf.com': log in as root from the host linux03
identified by 'root': use root as the password
7) Refresh the privileges
mysql> flush privileges;
8) Test whether you can log in from Windows
mysql -h linux03.ibf.com -u root -p
HDFS and YARN must be installed first
1) Install:
$ tar -zxvf /opt/software/hive-0.13.1-bin.tar.gz -C /opt/modules/
Rename the hive directory
$ cd /opt/modules
$ mv apache-hive-0.13.1-bin/ hive-0.13.1/
2) Create the tmp directory and the hive warehouse on HDFS
$ bin/hdfs dfs -mkdir -p /user/hive/warehouse
$ bin/hdfs dfs -mkdir /tmp    # may already exist
$ bin/hdfs dfs -chmod g+w /user/hive/warehouse
$ bin/hdfs dfs -chmod g+w /tmp
3) Modify the configuration
$ cd hive-0.13.1/
$ cp conf/hive-default.xml.template conf/hive-site.xml
$ cp conf/hive-log4j.properties.template conf/hive-log4j.properties
$ cp conf/hive-env.sh.template conf/hive-env.sh
3-1) Edit hive-env.sh
JAVA_HOME=/opt/modules/jdk1.7.0_67
# add:
HADOOP_HOME=/opt/modules/hadoop-2.5.0
export HIVE_CONF_DIR=/opt/modules/hive-0.13.1/conf
3-2) Edit hive-site.xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://linux01:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>
3-3) Edit the log configuration hive-log4j.properties
hive.log.dir=/opt/modules/hive-0.13.1/logs
3-4) Copy the JDBC driver into hive's lib directory
$ cp /opt/software/mysql-connector-java-5.1.34-bin.jar /opt/modules/hive-0.13.1/lib/
4) Make sure YARN and HDFS are running
$ jps
6468 ResourceManager
6911 Jps
6300 RunJar
6757 NodeManager
2029 NameNode
2153 DataNode
At this point bin/hive starts the hive CLI
Go to the hive directory
$ cd /opt/modules/hive-0.13.1/
bin/hive
show databases;
create database mydb;
use mydb;
show tables;
create table student (
  id int comment 'id of student',
  name string comment 'name of student',
  age int comment 'age of student',
  gender string comment 'sex of student',
  addr string
)
comment 'this is a demo'
row format delimited fields terminated by '\t';
Tables are created under /user/hive/warehouse by default
The location is configured via hive.metastore.warehouse.dir
desc student;              -- show the table's columns
or
desc formatted student;    -- show the full metadata
State of the metastore database in MySQL at this point:
mysql> select * from TBLS;
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
|      1 | 1556132119  |     6 |                0 | chen  |         0 |     1 | student  | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)
mysql> select * from COLUMNS_V2;
+-------+-----------------+-------------+-----------+-------------+
| CD_ID | COMMENT         | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+-----------------+-------------+-----------+-------------+
|     1 | NULL            | addr        | string    |           4 |
|     1 | age of student  | age         | int       |           2 |
|     1 | sex of student  | gender      | string    |           3 |
|     1 | id of student   | id          | int       |           0 |
|     1 | name of student | name        | string    |           1 |
+-------+-----------------+-------------+-----------+-------------+
5 rows in set (0.00 sec)
load data local inpath '/home/hadoop/student.log' into table student;
load data inpath '/input/student.data' into table student;
These settings are lost when the CLI is restarted:
set hive.cli.print.header=true;        # show column names
set hive.cli.print.current.db=true;    # show the current database in the prompt
reset;    -- reset the settings
These settings survive restarts (put them in hive-site.xml):
<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
!ls
!pwd
dfs -ls /
dfs -mkdir /hive
-e  execute a SQL statement
-f  execute a SQL file
-S  silent mode
hive -e
$ bin/hive -e "select * from test_db.emp_p"
hive -f
$ bin/hive -S -f /home/hadoop/emp.sql > ~/result.txt
drop table user;
truncate table user;
create table emp(
  empId int,
  empString string,
  job string,
  salary float,
  deptId int
)
row format delimited fields terminated by '\t';
load data inpath '/input/dept.txt' into table dept;
# or load from the local filesystem
load data local inpath '/home/hadoop/dept.txt' into table dept;
create external table emp_ex (
  empId int,
  empName string,
  job string,
  salary float,
  deptId int
)
row format delimited fields terminated by '\t'
location '/hive/table/emp';
Move the data to the table's location
hive (mydb)> dfs -mv /input/emp.txt /hive/table/emp/emp.txt
Or load it from the local filesystem on the server
hive (mydb)> load data local inpath '/home/hadoop/emp.data' into table emp;
Or move the data into the hive table's directory directly with a dfs command
hive (mydb)> dfs -put /home/hadoop/emp.data /hive/table/emp;
An external table must be created with the external keyword
Dropping an external table only removes the table's metadata; the data itself is kept
Dropping a managed (internal) table removes both the metadata and the data
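A minimal sketch to check this behavior, assuming the emp_ex external table and the /hive/table/emp directory created above:
# Drop the external table: only the metastore entry disappears.
$ bin/hive -e "use mydb; drop table emp_ex"
# The data files are still on HDFS and could be attached to a new external table.
$ bin/hdfs dfs -ls /hive/table/emp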
create table emp_part(
  empno int,
  empname string,
  empjob string,
  mgrno int,
  birthday string,
  salary float,
  bonus float,
  deptno int
)
partitioned by (province string)
row format delimited fields terminated by '\t';
Load data into the partitioned table
Specify the partition value explicitly
load data local inpath '/home/user01/emp.txt' into table emp_part partition (province='CHICAGO');
show partitions emp_part;
alter table emp_part add partition (province='shanghai');
alter table emp_part drop partition (province='shanghai');
Add data to a partition
load data local inpath '/path/to/local/file' into table emp_part partition (province='shanghai');
Query the data of one partition
select * from emp_part where province='henan';
create table emp_second(
  id int,
  name string,
  job string,
  salary float,
  dept int
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
alter table emp_second add partition (day='20180125',hour='16');
alter table emp_second drop partition (day='20180125');
load data local inpath '/home/hadoop/emp.log' into table emp_second partition (day='20180125',hour='17');
Joining two tables that are bucketed on the same column can be done with a map-side join
Bucketing also makes sampling more efficient
Requires set hive.enforce.bucketing=true
create table bucketed_users(id int, name string) clustered by (id) into 4 buckets;
Which bucket a row lands in is determined by the hash of the bucketing column modulo the number of buckets
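A minimal sketch of filling the bucketed table, assuming a plain users table with the same columns already exists in the default database; bucketed tables are populated with insert ... select so the rows actually get hashed into the 4 files:
# Each row goes to bucket hash(id) % 4, producing 4 output files.
$ bin/hive -e "set hive.enforce.bucketing=true;
insert overwrite table bucketed_users select id, name from users;"
# One file per bucket under the table directory.
$ bin/hdfs dfs -ls /user/hive/warehouse/bucketed_users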
load data local inpath '/path/to/local/file' into table table_name
bin/hdfs dfs -put local_path hdfs_path    (the hive table's location)
load data inpath 'hdfs_path' into table table_name
load data inpath 'hdfs_path' overwrite into table table_name
load data local inpath '/path/to/local/file' overwrite into table table_name
Insert the result of a select into a table with an insert statement
insert into table test_tb select * from emp_p;
Load data when creating the table
create external table test_tb (
  id int,
  name string
)
row format delimited fields terminated by '\t'
location "/hive/test_tb";
bin/hive -e "use test_db;select * from emp_p" > /home/hadoop/result.txt
bin/hive -f script_path >> /home/hadoop/result.txt
insert overwrite local directory '/home/hadoop/data' select * from emp_p;
insert overwrite local directory '/home/hadoop/data' row format delimited fields terminated by '^' select * from emp_p;
hive > insert overwrite directory '/data' select * from emp_p;
hive > export table emp_p to '/input/export' ;
hive > import table emp_imp from 'hdfs_path' ;
Select all columns with * or list specific columns
select id,name from emp;
select * from emp_p where salary > 10000;
select * from emp_p where sal between 10000 and 15000;
select * from user where email is not null;
select * from emp_p where did in (1,2,3);
count max min sum avg
select count(1) personOfDept from emp_p group by job;
select sum(sal) from emp_p;
select distinct id from emp_part;
select distinct name, province from emp_part;
select eid,ename,salary ,did from emp where emp.did in (select did from dept where dname='人事部');
emp.eid  emp.ename  emp.salary  emp.did
1001     jack       10000.0     1
1002     tom        2000.0      2
1003     lily       20000.0     3
1004     aobama     10000.0     5
1005     yang       10000.0     6
dept.did  dept.dname  dept.dtel
1         人事部      021-456
2         財務部      021-234
3         技術部      021-345
4         BI部        021-31
5         產品部      021-232
select * from dept, emp;
select * from emp, dept where emp.did=dept.did;
select t1.eid, t1.ename, t1.salary,t2.did ,t2.dname from emp t1 join dept t2 on t1.did=t2.did;
left join
select eid,ename, salary,t2.did, t2.dname from emp t1 left join dept t2 on t1.did = t2.did;
right join
select eid,ename, salary,t2.did, t2.dname from emp t1 right join dept t2 on t1.did = t2.did;
select eid,ename, salary,t2.did, t2.dname from emp t1 full join dept t2 on t1.did = t2.did;
select * from emp_part order by salary;
Even if the number of reducers is set to 3, order by still produces a single output file
set mapreduce.job.reduces=3;
Under the hood, the sorting is finished before the reduce function runs, so each reducer's output is sorted
Set the number of reducers
set mapreduce.job.reduces=2;
insert overwrite local directory '/home/hadoop/result' select * from emp_part sort by salary;    # with the default of 1 reducer this behaves the same as order by
set mapreduce.job.reduces=3;
Here rows are distributed by department (deptno) and sorted by salary within each reducer
insert overwrite local directory '/home/hadoop/result' select * from emp_part distribute by deptno sort by salary;
Edit hive-site.xml
<property>
  <name>hive.server2.long.polling.timeout</name>
  <value>5000</value>
</property>
<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>bigdata.ibf.com</value>
</property>
1) Create a user
CREATE USER 'hadoop'@'centos01.bigdata.com' IDENTIFIED BY '123456';
2) Grant access to the database that stores hive's metadata
GRANT ALL ON metastore.* TO 'hadoop'@'centos01.bigdata.com' IDENTIFIED BY '123456';
GRANT ALL ON metastore.* TO 'hadoop'@'%' IDENTIFIED BY '123456';
3) Refresh the privileges
flush privileges;
Start the service
$ bin/hiveserver2 &
or
$ bin/hive --service hiveserver2 &
Connect
$ bin/beeline
beeline> !connect jdbc:hive2://bigdata.ibf.com:10000
Enter the MySQL username and password when prompted
Purpose: import and export data between HDFS and an RDBMS
All imports and exports are described relative to HDFS (import means into HDFS, export means out of HDFS)
Data analysis workflow
Data collection: logs; RDBMS; use sqoop to pull the data to be analyzed into HDFS
Data cleaning: field filtering, field completion, field formatting -> load the fields to be analyzed into HDFS
Data analysis: store the results on HDFS, then export them from HDFS to MySQL
Data presentation: read the result data from the RDBMS
sqoop supports: HDFS, hive, hbase
How sqoop works
-> you run a sqoop command whose parameters describe what you need
-> sqoop parses the parameters and fills in an underlying MapReduce template
-> the resulting MapReduce job is packaged as a jar and submitted to yarn
-> this MapReduce job has only map tasks, no reduce tasks
Versions
-> sqoop1
-> sqoop2: adds a server component and a security mechanism
Installation and deployment
Download and extract
tar -zxvf /opt/software/sqoop-1.4.5-cdh5.3.6.tar.gz -C /opt/cdh-5.3.6/
Modify the configuration file
$ pwd
/opt/cdh-5.3.6/sqoop-1.4.5-cdh5.3.6
$ cp conf/sqoop-env-template.sh conf/sqoop-env.sh
Edit sqoop-env.sh
export HADOOP_COMMON_HOME=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6
#Set the path to where bin/hive is available
export HIVE_HOME=/opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6
Copy the MySQL JDBC driver into sqoop's lib directory
$ cp /opt/software/mysql-connector-java-5.1.34-bin.jar /opt/cdh-5.3.6/sqoop-1.4.5-cdh5.3.6/lib/
Test the installation
[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop help
[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop list-databases \
  --connect jdbc:mysql://linux03.ibf.com:3306 \
  --username root \
  --password 123456
In sqoop 1.4.6 you also need to add the java-json jar
$ cp /opt/software/java-json.jar /opt/cdh5.14.2/sqoop-1.4.6-cdh5.14.2/lib/
Fix "hive warehouse not found" problems
$ cp ${HIVE_HOME}/conf/hive-site.xml ${SQOOP_HOME}/conf/
Append hive's dependencies to HADOOP_CLASSPATH
$ sudo vi /etc/profile
#HADOOP_CLASSPATH
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cdh5.14.2/hive-1.1.0-cdh5.14.2/lib/*
$ source /etc/profile
bin/sqoop import --help    # list the import options
Source: a MySQL table
Target: an HDFS path
Create a test table in MySQL
Insert test data in MySQL
use test_db;
create table user(
  id int primary key,
  name varchar(20) not null,
  salary float
)charset=utf8;
insert into user values(1,"張三",9000);
insert into user values(2,"李四",10000);
insert into user values(3,"王五",6000);
Import mysql test_db.user into HDFS; by default the files land under hdfs://linux01:8020/user/hadoop/
[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
> --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
> --username root \
> --password 123456 \
> --table user
When there is no reduce phase, the number of output files equals the number of map tasks
-> specify the HDFS output directory: --target-dir
-> specify the number of map tasks: -m
[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
> --connect jdbc:mysql://linux03.ibf.com:3306/test_db \
> --username root \
> --password root \
> --table user \
> --target-dir /toHdfs \
> -m 1
-> change the output field delimiter: --fields-terminated-by
-> --direct makes the import faster
-> delete the output directory beforehand: --delete-target-dir
[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
> --connect jdbc:mysql://linux03.ibf.com:3306/test_db \
> --username root \
> --password root \
> --table toHdfs \
> --target-dir /toHdfs \
> --direct \
> --delete-target-dir \
> --fields-terminated-by '\t' \
> -m 1
[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
> --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
> --username root \
> --password 123456 \
> --table user \
> --columns name,salary \
> --fields-terminated-by '-' \
> --target-dir /sqoop \
> --delete-target-dir \
> --direct \
> -m 1
Import the result of a SQL query with -e / --query
bin/sqoop import \
  --connect jdbc:mysql://bigdata01.com:3306/test \
  --username root \
  --password 123456 \
  -e 'select * from user where salary>9000 and $CONDITIONS' \
  --target-dir /toHdfs \
  --delete-target-dir \
  -m 1
The query passed to -e must contain where $CONDITIONS
If you want your own filter, write it as: where salary>9000 and $CONDITIONS
You can use a password file instead (replace --password with --password-file)
sqoop reads the entire password file, including spaces and newlines, so use echo -n
to generate it, e.g. echo -n "secret" > password.file
$ echo -n 'root' > /home/hadoop/mysqlpasswd && chmod 400 /home/hadoop/mysqlpasswd
bin/sqoop import \
  --connect jdbc:mysql://bigdata01.com:3306/test \
  --username root \
  --password-file file:///home/hadoop/mysqlpasswd \
  -e 'select * from toHdfs where $CONDITIONS' \
  --target-dir /sqoop \
  --delete-target-dir \
  -m 1
If the specified hive database does not contain the table, it is created
bin/sqoop import \
  --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
  --username root \
  -P \
  --table user \
  --fields-terminated-by '\t' \
  --delete-target-dir \
  -m 1 \
  --hive-import \
  --hive-database test_db \
  --hive-table user
Process:
  MapReduce first imports the data into the user's home directory on HDFS
  the data is then loaded from the home directory into the hive table
Incremental import
  append: decides what to append based on the last imported value of a column
  lastmodified: decides what to import based on a timestamp recording when a row was modified
  --check-column <column>      Source column to check for incremental change
  --incremental <import-type>  Define an incremental import of type 'append' or 'lastmodified'
  --last-value <value>         Last imported value in the incremental check column
If the target does not exist on HDFS it is created
bin/sqoop import \
  --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
  --username root \
  --password 123456 \
  --table user \
  --fields-terminated-by '\t' \
  --target-dir /sqoop/incremental \
  -m 1 \
  --direct \
  --check-column id \
  --incremental append \
  --last-value 3
Create a sqoop job that handles the increment automatically (this may report an error)
There are two sqoop job commands:
bin/sqoop job
bin/sqoop-job
Either one works
Create a job: --create
Delete a job: --delete
Run a job: --exec
Show a job: --show
List jobs: --list
bin/sqoop-job \
  --create your-sync-job \
  -- import \
  --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
  --username root \
  -P \
  --table user \
  -m 1 \
  --target-dir /hive/incremental \
  --incremental append \
  --check-column id \
  --last-value 1
bin/sqoop-job --show your-sync-job
bin/sqoop job --show your-sync-job
bin/sqoop job --exec your-sync-job
bin/sqoop job --list
bin/sqoop job --delete my-sync-job
Export data from hive (i.e. from files and directories on HDFS) or plain HDFS to MySQL
use mydb;
create table user_export(
  id int primary key,
  name varchar(20) not null,
  salary float
);
The target table must already exist in the database
bin/sqoop export \
  --connect jdbc:mysql://linux03.ibf.com:3306/mydb \
  --username root \
  -P \
  --table user_export \
  --export-dir /hive/incremental \
  --input-fields-terminated-by ',' \
  -m 1
Use sqoop --options-file
Edit a file named sqoopScript (one option or value per line):
export
--connect
jdbc:mysql://linux03.ibf.com:3306/test_db
--username
root
-P
--table
emp
-m
1
--export-dir
/input/export
--fields-terminated-by
"\t"
bin/sqoop --options-file ~/sqoopScript
Topics covered
A simple Hive case: requirement analysis and exporting the results
Introduction to and use of dynamic partitions
Using a script to dynamically load data into a hive table
hive functions
1 Requirement and analysis
Requirement
Compute the PV and UV for every hour of every day
Analysis
Create the source table
Create a partitioned table (day, hour) / load the data
Data cleaning
Create a hive table
Field filtering
keep id, url, guid; field completion: none; field formatting: none
Data analysis
pv: count(url)    uv: count(distinct guid)
Save the results
day, hour, PV, UV
Export the results
Export to MySQL
2 Implementation
Source table
1) Create the source table
create database if not exists hive_db;
use hive_db;
create table tracklogs(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
partitioned by (date string,hour string)
row format delimited fields terminated by '\t';
2) Load the data
load data local inpath '/opt/datas/2015082818' into table tracklogs partition(date='20150828',hour='18');
load data local inpath '/opt/datas/2015082819' into table tracklogs partition(date='20150828',hour='19');
Analysis
1) Create the analysis table
create table clear (
id string,
url string,
guid string
)
partitioned by (date string, hour string)
row format delimited fields terminated by '\t';
2) Filter the data
insert into table clear partition(date='20150828',hour='18') select id,url,guid from tracklogs where date='20150828' and hour='18';
insert into table clear partition(date='20150828',hour='19') select id,url,guid from tracklogs where date='20150828' and hour='19';
3) Compute the metrics
pv : select date,hour,count(url) as pv from clear group by date,hour;
uv: select date,hour, count(distinct guid) as uv from clear group by date,hour;
Save the results into a result table
create table result as select date,hour, count(url) pv, count(distinct guid) as uv from clear group by date,hour;
If no delimiter is specified when creating a table this way, the default delimiter is '\001'
Export the results to mysql
# Create the result table in MySQL
create table result(
day varchar(30),
hour varchar(30),
pv varchar(30) not null,
uv varchar(30) not null,
primary key(day,hour)
);
# Export the data
[hadoop@linux03 sqoop-1.4.5-cdh5.3.6]$ bin/sqoop export \
--connect jdbc:mysql://linux03.ibf.com:3306/mydb \
--username root \
--password root \
--table result \
--export-dir /user/hive/warehouse/hive_db.db/result \
--input-fields-terminated-by '\001' \
-m 1
Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
With dynamic partitioning enabled, the mode can be strict or nonstrict: strict requires at least one static partition column, nonstrict does not.
Create the table
create table clear_dynamic (
id string,
url string,
guid string
)
partitioned by (date string, hour string)
row format delimited fields terminated by '\t';
Dynamically load the data
Load all hours of 20180129 in one statement
insert into table clear_dynamic partition(date='20180129',hour) select id,url,guid,hour from tracklogs where date='20180129';
The hour partition is derived automatically from the hour column
Previously this had to be written once per partition:
insert into table clear partition(date='20150828',hour='18') select id,url,guid from tracklogs where date='20150828' and hour='18';
insert into table clear partition(date='20150828',hour='19') select id,url,guid from tracklogs where date='20150828' and hour='19';
20180129/
2018012900
2018012901
2018012902
2018012903
2018012904
2018012905
1) Write a shell script that loads each hourly file (using bin/hive -e ""); a sketch is given after this list
2) Test the script
show partitions tracklogs;    # list the partitions
alter table tracklogs drop partition(date='20150828',hour='18');    -- drop the partition
alter table tracklogs drop partition(date='20150828',hour='19');
select count(1) from tracklogs;    # check the row count
3) The same can be done with a shell script that uses bin/hive -f
4) Test
show partitions tracklogs;    # list the partitions
alter table tracklogs drop partition(date='20150828',hour='18');    -- drop the partition
alter table tracklogs drop partition(date='20150828',hour='19');
select count(1) from tracklogs;    # check the row count
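A minimal sketch of such a script, assuming the hourly files are named <day><hour> under a directory like /opt/datas/20180129 as listed above; the paths and file naming are assumptions:
#!/bin/bash
# Load every hourly file of one day into tracklogs, one partition per file.
HIVE_HOME=/opt/modules/hive-0.13.1
DATA_DIR=/opt/datas/20180129
DAY=20180129
for f in ${DATA_DIR}/*; do
  # File names look like 2018012900 ... 2018012923; the last two characters are the hour.
  HOUR=${f: -2}
  ${HIVE_HOME}/bin/hive -e "load data local inpath '${f}' into table hive_db.tracklogs partition(date='${DAY}',hour='${HOUR}')"
done
With bin/hive -f, the load statements would instead be written to a temporary .sql file inside the loop and executed in one call.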
User-defined functions implement business logic that hive's built-in functions cannot express
Types:
UDF: one row in, one row out
UDAF: many rows in, one row out (sum, count, ...)
UDTF: one row in, many rows out (e.g. row/column conversion)
Writing a UDF:
The class must extend UDF
It must implement at least one evaluate method
It must declare a return type, which may be null
Using hadoop serialization types (Text, IntWritable, ...) is recommended
Requirement: date conversion
31/Aug/2015:00:04:37 +0800 --> 2015-08-31 00:04:37
Implementation steps
1) Write a class that extends UDF
2) Package it into a jar (no main class needs to be specified)
3) Add the jar to hive
Add the hadoop and hive dependencies in maven
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.2</version>
</dependency>
Example implementation
package com.myudf;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class DateFormate extends UDF {

    SimpleDateFormat inputDate = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    SimpleDateFormat outDate = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    // 31/Aug/2015:00:04:37 +0800 --> 2015-08-31 00:04:37
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }
        Date date = null;
        String val = null;
        try {
            date = inputDate.parse(str.toString());
            val = outDate.format(date);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return new Text(val);
    }

    public static void main(String[] args) {
        Text val = new DateFormate().evaluate(new Text("31/Aug/2015:00:04:37 +0800"));
        System.out.println(val);
    }
}
hive (test_db)>add jar /home/hadoop/DDD.jar;
hive (test_db)> CREATE TEMPORARY FUNCTION removequote as 'com.myudf.date.RemoveQuoteUDF';
hive (test_db)> show functions;
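For the DateFormate class shown above, registration and use would look like this; the function name dateformat, the table access_raw, and its raw_time column are hypothetical stand-ins:
hive (test_db)> add jar /home/hadoop/DDD.jar;
hive (test_db)> CREATE TEMPORARY FUNCTION dateformat as 'com.myudf.DateFormate';
-- assuming access_raw.raw_time holds values such as 31/Aug/2015:00:04:37 +0800
hive (test_db)> select dateformat(raw_time) from access_raw limit 10;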
bzip2, gzip, lzo, snappy, etc.
Compression ratio: bzip2 > gzip > lzo
Compression/decompression speed: lzo > gzip > bzip2
bin/hadoop checknative -a
http://google.github.io/snappy/
mvn package -Pdist,native,docs -DskipTests -Dtar -Drequire.snappy
Stop the hadoop processes
Extract cdh5.xxx-snappy-lib-native.tar.gz into $HADOOP_HOME/lib
$ tar -zxvf native-hadoop-cdh5.14.2.tar.gz -C /opt/modules/hadoop-2.6.0-cdh5.14.2/lib
Verify that snappy is now supported:
$ bin/hadoop checknative -a
Configure mapred-site.xml
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Run the pi example:
$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.3.6.jar pi 1 2
Check the compression settings in the job's configuration through the JobHistory UI on port 19888
Enable compression for the shuffle stage
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Compress the result files written by the reduce output
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
create table ( ... ) row format delimited fields terminated by '' STORED AS file_format
TEXTFILE
RCFILE
ORC
PARQUET
AVRO
INPUTFORMAT
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- Parquet implements Dremel's data model and algorithms; it is the most commonly used of these
Row-oriented storage: fast writes
Columnar storage: fast reads
Use the given log file (18.1MB)
Store the same data in the different storage formats and compare the file sizes
Enable compression in the MapReduce shuffle stage (compressing the intermediate data reduces the amount of data transferred between map and reduce tasks; for IO-bound jobs this speeds them up)
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Compress the final output
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Create the table file_text and load the data
create table if not exists file_text(
  t_time string,
  t_url string,
  t_uuid string,
  t_refered_url string,
  t_ip string,
  t_user string,
  t_city string
)
row format delimited fields terminated by '\t'
stored as textfile;
load data local inpath '/home/hadoop/page_views.data' into table file_text;
Compare the size of the default format with file_orc_snappy
create table if not exists file_orc_snappy(
  t_time string,
  t_url string,
  t_uuid string,
  t_refered_url string,
  t_ip string,
  t_user string,
  t_city string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties("orc.compression"="Snappy");
insert into table file_orc_snappy select * from file_text;
-- data cannot be loaded with load here: load is essentially an HDFS put and nothing would be compressed; use insert so the data goes through MapReduce and the compression takes effect
Compare the default format with parquet
create table if not exists file_parquet(
  t_time string,
  t_url string,
  t_uuid string,
  t_refered_url string,
  t_ip string,
  t_user string,
  t_city string
)
row format delimited fields terminated by '\t'
stored as parquet;
insert into table file_parquet select * from file_text;
Compare the default format with parquet plus snappy compression
create table if not exists file_parquet_snappy(
  t_time string,
  t_url string,
  t_uuid string,
  t_refered_url string,
  t_ip string,
  t_user string,
  t_city string
)
row format delimited fields terminated by '\t'
stored as parquet tblproperties("parquet.compression"="Snappy");
insert into table file_parquet_snappy select * from file_text;
hive (mydb)> dfs -du -s -h /user/hive/warehouse/mydb.db/file_parquet_snappy;
hive (mydb)> dfs -du -s -h /user/hive/warehouse/mydb.db/file_parquet;
Load log files with a complex format by matching them with a regular expression
1 The regular expression
2 Load the data according to the log format
Log sample
"27.38.5.159" "-" "31/Aug/2015:00:04:53 +0800" "GET /course/view.php?id=27 HTTP/1.1" "200" "7877" - "http://www.ibf.com/user.php?act=mycourse&testsession=1637" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibf.com"
Create the table
CREATE TABLE apachelog (
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_set string,
  request_body string,
  http_referer string,
  http_user_agent string,
  http_x_forwarded_for string,
  host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\]]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (\"-|[^ ]*\") (\"[^ ]*\")"
)
STORED AS TEXTFILE;
load data local inpath '/home/hadoop/moodle.ibf.access.log' into table apachelog;
//Whether to execute jobs in parallel
set hive.exec.parallel=true;
//How many jobs at most can be executed in parallel
set hive.exec.parallel.thread.number=8;    # can be increased for more parallelism
set mapreduce.job.reduces=1
mapreduce.job.jvm.numtasks=1    // JVM reuse; the default is 1
Speculative execution: the hive setting is true by default
set hive.mapred.reduce.tasks.speculative.execution=true;
The hadoop settings:
mapreduce.map.speculative true
mapreduce.reduce.speculative true
Size of merged files at the end of the job
Merge small files, so hdfs does not lose performance by storing large numbers of small files
set hive.merge.size.per.task=256000000;
set hive.mapred.mode=strict;    -- the default is nonstrict
In strict mode:
queries on a partitioned table must include a filter on a partition column
order by must be used together with limit
Cartesian-product queries are rejected (joins written without an on clause, filtering only in where)
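A minimal sketch of a query that strict mode accepts, reusing the emp_part table from earlier and assuming it is in the current database:
$ bin/hive -e "set hive.mapred.mode=strict;
-- allowed: a partition filter is present and the order by is limited
select * from emp_part where province='CHICAGO' order by salary limit 10;"
# Without the partition filter or the limit, the same query is rejected at compile time.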
map join
If one of the two joined tables is small, hive uses a map join by default and loads the small table into memory
hive.mapjoin.smalltable.filesize=25000000    -- default size threshold for the small table
hive.auto.convert.join=true    -- enabled by default
If automatic map join is not enabled, request it with a hint that names the small table
select /*+ MAPJOIN(time_dim) */ count(1) from
store_sales join time_dim on (ss_sold_time_sk = t_time_sk)
reduce join
Used when joining two large tables
The rows are grouped by the join key
smb join
Sort-Merge-Bucket join
Addresses the slowness of joining two large tables
Rows are assigned to buckets by taking the hash of the bucketing column modulo the number of buckets
set hive.enforce.bucketing=true;
create table table_name (
  columns
)
clustered by (bucket_column) into N buckets;
For example
create table student(
  id int,
  age int,
  name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';
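A minimal sketch of a sort-merge-bucket map join over two tables bucketed and sorted on the join key; the table names big_a and big_b and their DDL (clustered by (id) sorted by (id) into 4 buckets) are assumptions for illustration:
$ bin/hive -e "set hive.enforce.bucketing=true;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
select count(1) from big_a a join big_b b on a.id = b.id;"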
// Maximum input size per map task (this determines the number of files after merging)
set mapred.max.split.size=256000000;
// Minimum split size on a node (determines whether files on different DataNodes are merged)
set mapred.min.split.size.per.node=100000000;
// Minimum split size per rack (determines whether files on different racks are merged)
set mapred.min.split.size.per.rack=100000000;
// Merge small files before the map phase
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
// Merge the map-side output files, true by default
set hive.merge.mapfiles = true
// Merge the reduce-side output files, false by default
set hive.merge.mapredfiles = true
// Size of the merged files
set hive.merge.size.per.task = 256000000
// When the average size of the output files is below this value, start an extra merge job; only effective when hive.merge.mapfiles or hive.merge.mapredfiles is true
set hive.merge.smallfiles.avgsize=16000000
Root cause: a skewed key distribution
Partial aggregation on the map side, equivalent to a Combiner
hive.map.aggr=true
Load balancing when the data is skewed
hive.groupby.skewindata=true
When this option is set to true, the query plan contains two MR jobs. In the first job, the map output is distributed randomly to the reducers and each reducer does a partial aggregation; rows with the same Group By key may end up in different reducers, which balances the load. The second MR job then distributes the pre-aggregated results by the Group By key (guaranteeing that rows with the same key reach the same reducer) and performs the final aggregation.
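A minimal sketch of running a skewed aggregation with both settings enabled, reusing the tracklogs table from the case study; the query itself is only illustrative:
$ bin/hive -e "set hive.map.aggr=true;
set hive.groupby.skewindata=true;
-- a very popular url would otherwise overload a single reducer
select url, count(1) from hive_db.tracklogs group by url;"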
Terminology
1) UV: count(distinct guid)
One client machine visiting the site counts as one visitor; the same client is counted only once between 00:00 and 24:00.
2) PV: Page View --- count(url)
The number of page views or clicks; every refresh by a user is counted once.
3) Logged-in users:
The number of logged-in (member) visitors, i.e. records where endUserId has a value
4) Guests:
The number of visitors who are not logged in, i.e. records where endUserId is empty
5) Average visit duration:
The average time visitors stay on the site; from trackTime --> max - min
6) Second-jump rate: visits with pv > 1 / total visits
Number of users who viewed 2 or more pages (pv > 1) / total number of users (distinct guid)
The "second jump" is the first click a user makes after a page has finished loading; the number of such clicks is the second-jump count, and the ratio of second-jump count to page views is the page's second-jump rate.
count(case when pv >= 2 then guid else null end) / count(distinct guid)
7) Unique IPs: count(distinct ip)
A unique IP is counted for each distinct IP address that visits the site. Because this is easy to compute and fairly reliable, most organizations use it as a key traffic metric. For example, with ADSL dial-up you are assigned a new IP every time you dial in: visiting the site after one dial-up counts as one IP, and reconnecting with a new IP (without clearing cookies) and visiting again counts as another IP, yet the UV stays at one because both visits came from the same visitor.
date | uv | pv | logged-in users | guests | avg visit duration | second-jump rate | unique IPs |
---|---|---|---|---|---|---|---|
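A sketch of how most of these daily metrics could be computed from the case study's tracklogs table; the average visit duration and the second-jump rate need a per-guid aggregation first and are omitted here, and the empty-string check on endUserId is an assumption:
$ bin/hive -e "
use hive_db;
select date,
       count(distinct guid) as uv,
       count(url) as pv,
       count(distinct case when endUserId != '' then endUserId end) as login_users,
       count(distinct case when endUserId = '' then guid end) as guests,
       count(distinct ip) as ip_num
from tracklogs
group by date;"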
Prepare the test data
hive (db_analogs)> create database ts;
hive (db_analogs)> use ts;
hive (ts)> create table testscore(gender string,satscore int, idnum int) row format delimited fields terminated by '\t';
hive (ts)> load data local inpath '/opt/datas/TESTSCORES.csv' into table testscore;
OVER with standard aggregates: COUNT、SUM、MIN/MAX、 AVG
Requirement 1:
Group by gender, sort by satscore descending, and show the group's highest score as the last column
Female 1000 37070397 1590
Female 970  60714297 1590
Female 910  30834797 1590
Male   1600 39196697 1600
Male   1360 44327297 1600
Male   1340 55983497 1600
Answer SQL:
hive (ts)> select gender,satscore,idnum,max(satscore) over(partition by gender order by satscore desc) maxs from testscore;
Note:
partition by does the grouping
Requirement: top N
Group by gender, sort by satscore descending, and show the rank within the group as the last column
Requirement 1:
Equal scores get different ranks; the rank keeps increasing with the row number
Female 1590 23573597 1
Female 1520 40177297 2
Female 1520 73461797 3
Female 1490 9589297  4
Female 1390 99108497 5
Female 1380 23048597 6    # same score
Female 1380 81994397 7    # same score
Requirement 2:
Equal scores get the same rank; the next distinct score skips ahead by the number of tied rows
Female 1590 23573597 1
Female 1520 40177297 2
Female 1520 73461797 2
Female 1490 9589297  4
Female 1390 99108497 5
Female 1380 23048597 6    # same score
Female 1380 81994397 6    # same score
Requirement 3:
Equal scores get the same rank; the ranks increase consecutively with no gaps
Female 1590 23573597 1
Female 1520 40177297 2
Female 1520 73461797 2
Female 1490 9589297  3
Female 1390 99108497 4
Female 1380 23048597 5
Female 1380 81994397 5
SQL
sql1
hive (ts)> select gender,satscore,idnum,row_number() over(partition by gender order by satscore desc) maxs from testscore;
-- ROW_NUMBER() starts at 1 and numbers the rows of each group sequentially
sql2
select gender,satscore,idnum,rank() over(partition by gender order by satscore desc) maxs from testscore;
-- RANK() ranks rows within the group; ties share a rank and leave gaps in the numbering
sql3
select gender,satscore,idnum,dense_rank() over(partition by gender order by satscore desc) maxs from testscore;
-- DENSE_RANK() ranks rows within the group; ties share a rank and leave no gaps
# When there is an order by but no window clause, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (from the start of the partition to the current row)
# When there is neither an order by nor a window clause, the window defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (the whole partition)
UNBOUNDED PRECEDING
UNBOUNDED FOLLOWING
1 PRECEDING
1 FOLLOWING
CURRENT ROW
Window comparison
select gender,satscore,idnum,sum(satscore) over(partition by gender order by satscore desc RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) sums from testscore;
select gender,satscore,idnum,sum(satscore) over(partition by gender order by satscore desc RANGE BETWEEN UNBOUNDED PRECEDING AND unbounded following) sums from testscore;
select gender,satscore,idnum,sum(satscore) over(partition by gender order by satscore desc RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) sums from testscore;
Only rows whose value lies within 1 of the current row's value, up to the current row, fall inside this window
LAG: the value n rows back; with no offset given it defaults to the previous row (earlier in the display order)
Use case: analyzing the order in which a user browsed pages
sql
hive (ts)> select gender,satscore,idnum, lag(satscore) over(partition by gender order by satscore desc) as lastvalue from testscore;
Expected output
gender satscore idnum    lastvalue
Female 1590     23573597 NULL    # NULL here; a default value can be supplied
Female 1520     40177297 1590    # the satscore of the previous row
Female 1520     73461797 1520    # the satscore of the previous row
Female 1490     9589297  1520
Female 1390     99108497 1490
LEAD is the opposite of LAG: it returns the value n rows ahead; usage is the same, and it defaults to one row ahead (later in the display order)
sql
hive (ts)> select gender,satscore,idnum, lead(satscore, 1, 0) over(partition by gender order by satscore desc) as nextvalue from testscore;
Result
gender satscore idnum    nextvalue
...
Female 1060     59149297 1060
Female 1060     46028397 1000
Female 1000     37070397 970
Female 970      60714297 910
Female 910      30834797 0