Hive中mapjoin优化例子_hive使用mapjoin实例-程序员宅基地

技术标签: Hive  hive hive优化 hive中mapjoin hive中  

1 基本信息

3个表,1个事实表,2个维度表
事实表 test_fact (mid string,sex_id string,age_id string )
维度表dim_user_demography_age (age_id  string,age_name string )
维度表dim_user_demography_sex (sex_id string,sex_name  string

测试的sql

select mid,sex_name,age_name from test_fact   f
join dim_user_demography_age d1 on d1.age_id=f.age_id
join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;

d1 ,d2是维度表,仅有几行数据,数据量<10k

2 测试用到的参数及说明

测试用到的参数

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

2.1 hive.auto.convert.join

启用该参数后,hive会基于表的size自动的将普通join转换成mapjoin

2.2 hive.auto.convert.join.noconditionaltask

启用该参数后,hive会基于表的size自动的将普通join转换成mapjoin.对于一个n路连接来说,如果其中n-1个表或者分区的数据大小小于特定值,将会把一个普通join转换成mapjoin
测试用的sql,可以看做是一个3路连接

select mid,sex_name,age_name from test_fact   f
join dim_user_demography_age d1 on d1.age_id=f.age_id
join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;

2.3 hive.auto.convert.join.noconditionaltask.size

默认为10M,以上述sql举例,就是要求d1 d2 2个表的数据大小<10M。只有 hive.auto.convert.join.noconditionaltask=true的时候才生效

3 测试说明

下面分情况讨论下上上面参数对任务的影响。

3.1 启用hive.auto.convert.join.noconditionaltask

如果该参数启用,该sql只有有1个job。
日志如下,维度表d1,d2会被本地被转成成1个hash表,上传到hdfs。在mr任务中,用于做mapjoin。任务速度最快。

Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:17:09     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:17:10     Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
2018-04-24 16:17:10     Uploaded 1 File to: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable (308 bytes)
2018-04-24 16:17:10     Dump the side-table for tag: 1 with group count: 5 into file: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile11--.hashtable
2018-04-24 16:17:10     Uploaded 1 File to: file:/tmp/hdp-xxx/7cc4814c-60ef-4a3d-8b2b-402bb9cc317f/hive_2018-04-24_16-17-03_272_6992344947047046417-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile11--.hashtable (387 bytes)

3.2 不启用hive.auto.convert.join.noconditionaltask

日志如下,会有两个mapjoin的过程,性能差。
1)将d1本地生成map表,上传hdfs
2)f与d1做一次mapjoin,保留结果
3)将d2本地生成map表,上传hdfs
4)与d2的map表做一次mapjoin
5)查询结果

hive> select mid,sex_name,age_name from test_fact   f
    > join dim_user_demography_age d1 on d1.age_id=f.age_id
    > join dim_user_demography_sex d2 on d2.sex_id=f.sex_id ;
Query ID = hdp-ads-audit_20180424161919_0663afcb-af1f-4b0f-b17b-23117c257969
Total jobs = 5
Stage-13 is selected by condition resolver.
Stage-1 is filtered out by condition resolver.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:19:25     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:19:26     Dump the side-table for tag: 1 with group count: 5 into file: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10008/HashTable-Stage-8/MapJoin-mapfile21--.hashtable
2018-04-24 16:19:26     Uploaded 1 File to: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10008/HashTable-Stage-8/MapJoin-mapfile21--.hashtable (387 bytes)
2018-04-24 16:19:26     End of local task; Time Taken: 0.984 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1524125616287_138024, Tracking URL = http://mhdp12.namenodetest:8888/proxy/application_1524125616287_138024/
Kill Command = /usr/bin/hadoop/software/yarn//bin/hadoop job  -kill job_1524125616287_138024
Hadoop job information for Stage-8: number of mappers: 1; number of reducers: 0
2018-04-24 16:19:40,865 Stage-8 map = 0%,  reduce = 0%
2018-04-24 16:19:59,570 Stage-8 map = 100%,  reduce = 0%, Cumulative CPU 15.58 sec
MapReduce Total cumulative CPU time: 15 seconds 580 msec
Ended Job = job_1524125616287_138024
Moved to trash: /home/dddddd/hive/scratchdir/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-mr-10013/bd115834-c22a-4d5e-b960-d09ddaf5a361/map.xml
Moved to trash: /home/dddddd/hive/scratchdir/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/_task_tmp.-mr-10003
Stage-11 is selected by condition resolver.
Stage-12 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xitong/software/hadoop-2.7.2U7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xitong/software/spark-1.6.0-U22-bin-2.7.2U6/lib/spark-assembly-1.6.0-U22-hadoop2.7.2U6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-24 16:20:05     Starting to launch local task to process map join;      maximum memory = 514523136
2018-04-24 16:20:06     Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable
2018-04-24 16:20:06     Uploaded 1 File to: file:/tmp/dddddd/1363bd2b-f5bf-44f8-a3cb-c9427c70cbab/hive_2018-04-24_16-19-19_514_7098521400798527233-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile01--.hashtable (308 bytes)
2018-04-24 16:20:06     End of local task; Time Taken: 0.95 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1524125616287_138033, Tracking URL = http://mhdp12.namenodetest:8888/proxy/application_1524125616287_138033/
Kill Command = /usr/bin/hadoop/software/yarn//bin/hadoop job  -kill job_1524125616287_138033
Interrupting... Be patient, this might take some time.
Press Ctrl+C again to kill JVM
版权声明:本文为博主原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。
本文链接:https://blog.csdn.net/wisgood/article/details/80066475

智能推荐

对每个边缘求最小外接矩形,通过最小矩形提取每个边缘_边缘的最小外接矩形-程序员宅基地

文章浏览阅读4.9k次。#include #include using namespace std;using namespace cv;int main(){Mat src; //源图像Mat tmp; //临时图像Mat dst_bw; //去掉背景后的目标二值图像Mat dst_contours;//轮廓图像src=imread("E:\\单板图片\\求孔洞数_边缘的最小外接矩形

【设计模式】中介者-程序员宅基地

文章浏览阅读865次。中介者,说白了跟市面上黑中介类似。当然这个中介,开发者是可以控制其行为的。也是在一定的信任关系上建立的。该模式要解决的问题是,一堆对象之间交叉耦合问题。网上看过群聊的例子。如果没有任何一个平台,多人之间的会话会是什么样的呢?不举多人,就三个吧A想把一句话说给BC,那么他首先要知道B和C在哪儿,然后分别告诉对方,自己想说的事情。如果再加一个人呢?问题很明显,此时各种群聊工具应运而生。我写

Mysql列自增是怎么实现的_mysql 自增序列生成原理-程序员宅基地

文章浏览阅读1.8k次。AUTO_INCREMENT两种情况1、在载入语句执行前,已经不确定要插入多少条记录。在执行插入语句时在表级别加一个auto-inc锁,然后为每条待插入记录的auto-increment修饰的列分配递增的值,语句执行结束后,再把auto-inc锁释放掉。一个事务再持有auto-inc锁的过程中,其他事务的插入语句都要被阻塞,可以保证一个语句中分配的递增值是连续的。AUTO-INC锁的..._mysql 自增序列生成原理

半导体器件基础_掺杂半导体的带隙-程序员宅基地

文章浏览阅读3.5k次,点赞2次,收藏17次。半导体能带结构示意图:上方两条白色带为没有电子填充的带,下面三条灰色带为充满电子的带,其中最高一条灰色带为价带,它与最低一条白色带之间的空隙为能隙空穴又称电洞(Electron hole),在固体物理学中指共价键上流失一个电子,最后在共价键上留下空位的现象导带(英语:conduction band),又名传导带,是指半导体或是绝缘体材料中,一种电子所具有能量的范围。这个能量的范围高..._掺杂半导体的带隙

基于C++和OpenCV的中心线提取算法_图像中心线提取c++-程序员宅基地

文章浏览阅读3.5k次,点赞2次,收藏26次。基于C++和OpenCV的中心线提取算法加权平方灰度重心法介绍算法演示加权平方灰度重心法介绍详情见 https://blog.csdn.net/u010518385/article/details/101015604算法演示下面展示 函数-输入图像和阈值,输出点。void get_median_line(Mat& src, int thresh, vector<Point2d>& points){ if (src.empty()) return; // 一、_图像中心线提取c++

HTML/CSS 常用单词整理_css上下外边距单词-程序员宅基地

文章浏览阅读8.3k次,点赞27次,收藏93次。页面布局(layout)header 头部/页眉;index 首页/索引;logo 标志;nav/sub_nav 导航/子导航;banner 横幅广告;main/content 主体/内容;container/con 容器;wrapper/wrap 包裹(类似于container);menu 菜单;sub_menu/second_menu 子菜单/二级菜单;..._css上下外边距单词

随便推点

Idea打包失败|Failed to execute goal org.springframework.boot:spring-boot-maven-plugin:2.3.4.RELEASE-程序员宅基地

文章浏览阅读4.4k次,点赞2次,收藏5次。解决idea打包失败问题。_failed to execute goal org.springframework.boot:spring-boot-maven-plugin:2.3

Servlet.service() for servlet [dispatcherServlet] in context with path []-程序员宅基地

文章浏览阅读1.2k次。请看:(436条消息) JdbcTemplate空指针异常_刘贵庆的博客-程序员宅基地https://blog.csdn.net/xysxlgq/article/details/124323258_servlet.service() for servlet [dispatcherservlet] in context with path [/min

超全的莫队算法一遍过-程序员宅基地

文章浏览阅读3.6k次,点赞17次,收藏63次。莫队的基本思想、回滚莫队、树上莫队、二次离线,一文搞定。什么?你还不会莫队?看完这篇文章你就会了_莫队

vmware 安装 win10_vmware 退出iso-程序员宅基地

文章浏览阅读3.3k次。1. 下载安装vm软件,地址:VM下载2. 下载win10 iso镜像文件 (1) 下载win10安装工具:win10安装工具下载 (2) 下载完成后,选择给其他电脑安装ios文件即可得到win10的iso镜像文件(考验网速的时刻到了)。 ps:试过各种百度出来的镜像文件,发现没有能用的,还是乖乖通过官方方式下载吧。3. 安装win10 ios镜像文件 参..._vmware 退出iso

PTA-求函数的值_7-2 求函数的值编写程序,求s(x)前10项的和,x从键盘输入。-程序员宅基地

文章浏览阅读1.7k次,点赞2次,收藏3次。求函数的值:已知 s(x) = x -x^3/(3x1!)+x^5/(5x2!)-x^7/(7x3!)+... 。编写程序,求s(x)前10项的和,x从键盘输入。_7-2 求函数的值编写程序,求s(x)前10项的和,x从键盘输入。

(6.1)各种USB接口简介_常用usb接口-程序员宅基地

文章浏览阅读3.6w次,点赞7次,收藏76次。/* AUTHOR: Pinus* Creat on : 2018-11-3* REFS : Type-C与Type-A、Type-B接口 Linux USB驱动学习总结(一)---- USB基本概念及驱动架构*/USB概念介绍USB,Universal Serial Bus(通用串行总线),是一个外部总线标准,用于规范电脑与外部设备的连接和通..._常用usb接口

推荐文章

热门文章

相关标签