Big Data: MapReduce Programming Details, Hive Installation, and Basic Operations

Date: 2019-11-30


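Big Data course, day 5

MapReduce programming details

A MapReduce job can run without a Reduce phase. A job that only cleans data needs no Reducer, and the Map output is written directly as the job output.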
```
  job.setNumReduceTasks(0);
```

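Setting multiple Reducers

// By default a MapReduce job runs with 1 Reduce task
job.setNumReduceTasks(3);

// Why allow more than one Reducer?
// Memory: each Reducer handles only part of the data.
// Parallelism: the Reducers run at the same time.

// With several Reducers the job writes one independent result file per Reducer, all in the same output directory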

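Partitioning

Purpose of partitioning: decide which Reducer processes each key emitted by Map, so that the keys are spread sensibly across the Reducers.
Default partitioning strategy:

key.hashCode() % reduceNum, implemented as: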
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    public HashPartitioner() {
    }

    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & 2147483647) % numReduceTasks;
    }
}
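
To see why the default partitioner masks the hash with 2147483647 (Integer.MAX_VALUE) before taking the remainder, here is a minimal sketch that uses only the Java standard library. The class name HashPartitionSketch, its helper methods, and the sample keys are assumptions made for illustration; the code mirrors the arithmetic of getPartition above and does not call Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; mirrors the arithmetic of HashPartitioner.getPartition.
final class HashPartitionSketch {

    // Clear the sign bit so the remainder is never negative, then map into [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Group keys into one bucket per Reducer, the way the shuffle would route them.
    static List<List<Object>> route(List<?> keys, int numReduceTasks) {
        List<List<Object>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Object key : keys) {
            buckets.get(partitionFor(key, numReduceTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the int value itself, so the key -7 has a negative hash.
        System.out.println("naive  -7 % 3 = " + (Integer.valueOf(-7).hashCode() % 3));
        System.out.println("masked -7     = " + partitionFor(-7, 3));
        System.out.println(route(List.of(1, 2, 3, 4, 5, 6, -7), 3));
    }
}
```

In Java, -7 % 3 evaluates to -1, which is not a valid partition index, while the masked version yields 1. That is the reason for the bitwise AND in the default implementation.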

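Custom partitioning strategy
public class MyPartitioner<K,V> extends Partitioner<K,V>{
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Placeholder: put the custom routing rule here; the result must be in [0, numReduceTasks)
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
job.setPartitionerClass(MyPartitioner.class);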
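
As an illustration of what a custom rule might look like, the following sketch sends every key that starts with "error" to Reducer 0 and hashes all other keys across the remaining Reducers. Both the rule and the names (PrefixPartitionerSketch, the nested Partitioner stand-in, the sample keys) are hypothetical. The nested abstract class only imitates the getPartition signature of Hadoop's Partitioner so that the example runs without Hadoop on the classpath.

```java
// Hypothetical sketch of a custom partitioning rule, runnable without Hadoop.
final class PrefixPartitionerSketch {

    // Stand-in with the same method signature as org.apache.hadoop.mapreduce.Partitioner.
    abstract static class Partitioner<K, V> {
        public abstract int getPartition(K key, V value, int numReduceTasks);
    }

    // Assumed rule: "error*" keys get Reducer 0 to themselves,
    // every other key is hashed over Reducers 1 .. numReduceTasks - 1.
    static final class PrefixPartitioner extends Partitioner<String, Integer> {
        @Override
        public int getPartition(String key, Integer value, int numReduceTasks) {
            if (numReduceTasks == 1 || key.startsWith("error")) {
                return 0;
            }
            return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
        }
    }

    public static void main(String[] args) {
        PrefixPartitioner partitioner = new PrefixPartitioner();
        for (String key : new String[] {"error.disk", "info.start", "warn.cpu"}) {
            System.out.println(key + " -> partition " + partitioner.getPartition(key, 1, 3));
        }
    }
}
```

Whatever rule is chosen, the returned index has to stay within [0, numReduceTasks), so the sketch falls back to partition 0 when the job runs with a single Reducer.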

Map output compression (configured in the following files)
```
1. core-site.xml
2. mapred-site.xml
```

Combiner

A Combiner is a Reduce step that runs on the Map side.
job.setCombinerClass(MyReduce3.class);
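
The effect of a Combiner is easiest to see by counting the records that leave each map task. The sketch below models word count with plain Java collections; the class CombinerSketch, its methods, and the two sample splits are assumptions for illustration, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stdlib-only model of word count with a Combiner.
final class CombinerSketch {

    // Map phase: emit (word, 1) for every token of one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Combiner: the Reducer's summing logic, applied to the output of a single map task.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            sums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return sums;
    }

    // Reducer: add up the partial sums received from every map task.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> task1 = map("a b a a");
        List<Map.Entry<String, Integer>> task2 = map("b a b");
        Map<String, Integer> combined1 = combine(task1);
        Map<String, Integer> combined2 = combine(task2);

        System.out.println("records shuffled without combiner: " + (task1.size() + task2.size()));
        System.out.println("records shuffled with combiner:    " + (combined1.size() + combined2.size()));
        System.out.println("final counts: " + reduce(List.of(combined1, combined2)));
    }
}
```

With these two splits, 7 records would be shuffled without a Combiner and 4 with one, and the final counts are the same either way. Reusing the Reducer as the Combiner works here because summation is commutative and associative; Hadoop may run a Combiner zero, one, or several times, so an operation such as a plain average cannot be reused this way.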

Counters

 Counter counter = context.getCounter("baizhiCounter", "mapCount");
 counter.increment(1L);

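Hive

Concept: Hive is an open-source data warehouse framework from the Apache Software Foundation. It was originally developed at Facebook.

1. Data warehouse
Database (DataBase): stores a small volume of data with high value per record
Data warehouse (DataWareHouse): stores a large volume of data with low value per record

2. Hive is built on top of Hadoop

3. Hive runs MR jobs through a SQL-like language (HQL, Hive Query Language) to operate on data stored in HDFS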

How Hive works

Hive: SQL on Hadoop
Spark SQL: SQL on Spark
Similar SQL engines: Presto, Impala, Kylin

Setting up a basic Hive environment

1. Set up Hadoop
2. Install Hive by extracting the archive
3. Configure hive-env.sh
   # Set HADOOP_HOME to point to a specific hadoop install directory
   HADOOP_HOME=/opt/install/hadoop-2.5.2

   # Hive Configuration Directory can be controlled by:
   export HIVE_CONF_DIR=/opt/install/apache-hive-0.13.1-bin/conf
4. Create these directories on HDFS: /tmp, and /user/hive/warehouse (where database tables are stored)
5. Start Hive
   bin/hive

Basic Hive usage

1. Databases
   show databases;
   create database if not exists baizhi_140;
   use baizhi_140;
2. Table operations
   show tables;
   create table if not exists t_user(
    id int,
    name string
    )row format delimited fields terminated by '\t';
3. Inserting data: load a file from the local operating system into a Hive table
   load data local inpath '/root/data3' into table t_user;
4. SQL statements
   select * from t_user;
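
Because t_user is declared with fields terminated by '\t', each line of the loaded file is expected to hold an id and a name separated by one tab character. The sketch below shows how such a line breaks into the two columns. Hive does this parsing itself; the class TabDelimitedRowSketch, the UserRow type, and the sample lines are hypothetical and only illustrate the expected file layout. Modelling an unparsable id as null follows my understanding that Hive returns NULL for a field it cannot convert to the declared type.

```java
// Hypothetical sketch of the file layout behind t_user; Hive performs this parsing internally.
final class TabDelimitedRowSketch {

    // One row of t_user (id int, name string); a null id models Hive's NULL.
    static final class UserRow {
        final Integer id;
        final String name;

        UserRow(Integer id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public String toString() {
            return "UserRow(id=" + id + ", name=" + name + ")";
        }
    }

    // Split one line of the data file on the tab delimiter declared in the table definition.
    static UserRow parseLine(String line) {
        String[] fields = line.split("\t", -1);
        Integer id;
        try {
            id = Integer.valueOf(fields[0]);
        } catch (NumberFormatException e) {
            id = null; // the first field is not a valid int
        }
        String name = fields.length > 1 ? fields[1] : null;
        return new UserRow(id, name);
    }

    public static void main(String[] args) {
        // Sample lines are made up for illustration; "\t" is a real tab character.
        for (String line : new String[] {"1\tzhangsan", "2\tlisi", "x\twangwu"}) {
            System.out.println(parseLine(line));
        }
    }
}
```

A file that uses spaces or commas instead of tabs would leave the whole line in the first field, which is a common reason for a table full of NULL values after load data.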

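How Hive maps to HDFS

1. A database corresponds to an HDFS directory (by default Hive names it after the database with a .db suffix)
   mydb  /user/hive/warehouse/mydb.db
2. A table corresponds to an HDFS directory inside the database directory
   /user/hive/warehouse/mydb.db/t_user
3. The data in a table corresponds to files on HDFS, so loading a local file has the same effect as copying it into the table directory
   load data local inpath '/root/data3' into table t_user;
   bin/hdfs dfs -put /root/data3 /user/hive/warehouse/mydb.db/t_user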
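
The mapping from names to directories can be written down as a small helper. This is a sketch under stated assumptions: the warehouse root is /user/hive/warehouse, no explicit LOCATION was given when the database or table was created, and Hive's default layout applies, in which a non-default database gets a lowercase directory named <db>.db while tables of the default database sit directly under the warehouse root. The class WarehousePathSketch and its methods are hypothetical and do not query the metastore.

```java
import java.util.Locale;

// Hypothetical helper that reproduces Hive's default directory layout; it does not contact Hive.
final class WarehousePathSketch {

    // Assumed value of hive.metastore.warehouse.dir.
    static final String WAREHOUSE_ROOT = "/user/hive/warehouse";

    // Directory of a database: the root itself for "default", otherwise <root>/<db>.db.
    static String databaseDir(String db) {
        String name = db.toLowerCase(Locale.ROOT);
        return "default".equals(name) ? WAREHOUSE_ROOT : WAREHOUSE_ROOT + "/" + name + ".db";
    }

    // Directory of a table: a subdirectory of its database directory.
    static String tableDir(String db, String table) {
        return databaseDir(db) + "/" + table.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(databaseDir("mydb"));
        System.out.println(tableDir("mydb", "t_user"));
        System.out.println(tableDir("default", "t_user"));
    }
}
```

The location Hive actually uses can be checked from the Hive shell with describe formatted t_user; which is the safer source when a path is needed for hdfs dfs -put.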