Hive Usage Experience (2)

2019-09-01 19:14

Select 'shop' as type, Shopid as type_name, Uid From log ) y
Group by type, type_name, uid;

Insert into t2
Select type, type_name, sum(1)
From t1
Group by type, type_name;

From t2
Insert into t3
Select type, type_name, uv Where type='page'
Insert into t4
Select type, type_name, uv Where type='shop';

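The final output is two result tables: t3, the page UV table, and t4, the shop UV table. In terms of I/O, log is read only once, and there are fewer HDFS writes than in option 2 (multi-insert sometimes adds an extra map-stage HDFS write). The number of jobs drops by one, to 3, and the number of jobs with a reduce phase drops from 4 to 2. The third step is only a map pass over a small table that splits it into the output tables, so it consumes little compute. In option 2, by contrast, every job is a large-scale deduplicate-and-aggregate computation.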

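The main idea behind this optimization is that a MapReduce job takes a fairly long time to initialize, so once a job is up it should do more work: the same job also handles deduplicating pages by uid. This saves one read of log and one job initialization, and it saves network shuffle I/O, at the cost of more local disk reads and writes. The efficiency gain is considerable.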

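This approach suits multi-granularity UV calculations in which the granularities are peers and do not need to be rolled up level by level. The more granularities there are, the more resources it saves, and it is fairly general.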
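
The tag, union, and single-pass deduplication idea can be mimicked in a few lines of plain Python. This is only a sketch of the logic, not of how Hive executes it; the sample rows and the pageid column name are assumptions made for illustration, while t1 to t4 follow the table names used in the SQL.

```python
from collections import Counter

# Hypothetical log rows: (pageid, shopid, uid). The data is made up.
log = [
    ("p1", "s1", "u1"),
    ("p1", "s1", "u1"),  # repeat visit by the same user
    ("p1", "s2", "u2"),
    ("p2", "s1", "u2"),
    ("p2", "s2", "u2"),
]


def flat_multi_granularity_uv(rows):
    # t1: emit one tagged record per granularity and deduplicate them all
    # together (the "union all ... group by type, type_name, uid" step).
    t1 = set()
    for pageid, shopid, uid in rows:
        t1.add(("page", pageid, uid))
        t1.add(("shop", shopid, uid))
    # t2: count distinct uids per (type, type_name).
    t2 = Counter((typ, name) for typ, name, _uid in t1)
    # t3 / t4: split the small result by type (the multi-insert step).
    t3 = {name: uv for (typ, name), uv in t2.items() if typ == "page"}
    t4 = {name: uv for (typ, name), uv in t2.items() if typ == "shop"}
    return t3, t4


page_uv, shop_uv = flat_multi_granularity_uv(log)
print(page_uv)
print(shop_uv)
```

The input is scanned once and every granularity shares the same deduplication pass, which is the saving the text describes.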

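Problem 9: multi-granularity UV calculation with level-by-level roll-up. For example, with four dimensions a, b, c, d, compute the result tables (a, b, c, d, uv), (a, b, c, uv), (a, b, uv), (a, uv), and total UV. This can be done with option 2 of problem 8. However, because of the special nature of this UV scenario, with multiple granularities that roll up level by level, a method is available in which a single sort benefits every UV calculation.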

Case: the mm_log log currently has 2.5 billion+ PVs per day. UV and IP UV must be computed from the mm log, producing result tables at three granularities:

R_TABLE_4 (memberid, siteid, adzoneid, province, uv, ipuv)
R_TABLE_3 (memberid, siteid, adzoneid, uv, ipuv)
R_TABLE_2 (memberid, siteid, uv, ipuv)

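Step 1: deduplicate with group by on memberid, siteid, adzoneid, province to produce a temporary table. Cookies and IPs are each tagged with a type label and put together so that they are deduplicated in one pass. The temporary table is called T_4.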

Select memberid, siteid, adzoneid, province, type, user
From (
-- type 'a': one record per cookie
Select memberid, siteid, adzoneid, province, 'a' type, cookie as user from mm_log where ds=20101205
Union all
-- type 'i': one record per ip
Select memberid, siteid, adzoneid, province, 'i' type, ip as user from mm_log where ds=20101205
) x
group by memberid, siteid, adzoneid, province, type, user;
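
To make step 1 concrete, here is a small pure-Python sketch of the same tagging and deduplication. The sample rows are invented; the 'a' and 'i' tags and the column order follow the SQL.

```python
# Hypothetical mm_log rows:
# (memberid, siteid, adzoneid, province, cookie, ip). The data is made up.
mm_log = [
    ("m1", "s1", "z1", "zj", "c1", "10.0.0.1"),
    ("m1", "s1", "z1", "zj", "c1", "10.0.0.1"),  # repeat page view
    ("m1", "s1", "z1", "zj", "c2", "10.0.0.1"),  # new cookie, same ip
    ("m1", "s1", "z2", "bj", "c1", "10.0.0.2"),
]


def build_t4(rows):
    # Emit one 'a' (cookie) record and one 'i' (ip) record per log row,
    # then deduplicate both kinds in the same pass via a set.
    t4 = set()
    for memberid, siteid, adzoneid, province, cookie, ip in rows:
        dims = (memberid, siteid, adzoneid, province)
        t4.add(dims + ("a", cookie))
        t4.add(dims + ("i", ip))
    return sorted(t4)


t4 = build_t4(mm_log)
for row in t4:
    print(row)
```

Because cookie and ip records differ only in the type tag, one group by handles both, so the large log is read and shuffled once rather than once per metric.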

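Step 2: ranking, producing table T_4_NUM. Hadoop's most powerful core capability is partition and sort. Rows are partitioned by type, acookie (the user column of T_4) and ranked in the order type, acookie, memberid, siteid, adzoneid, province.

Select *,
row_number(type,user,memberid,siteid,adzoneid) as adzone_num,
row_number(type,user,memberid,siteid) as site_num,
row_number(type,user,memberid) as member_num,
row_number(type,user) as total_num
from (select * from T_4 distribute by type,user sort by type,user,memberid,siteid,adzoneid) x;

The row_number(...) calls here take arguments, unlike Hive's built-in windowing row_number(), so this is presumably a user-defined function that restarts at 1 whenever its arguments differ from those of the previous row. The query yields each user's rank at every granularity level: within each group at a given level, the same user id has exactly one record whose rank equals 1. Summing over the records with rank 1 is therefore equivalent to a group by user deduplication followed by a sum.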
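
The ranking can be imitated in plain Python to check the "exactly one rank-1 record per group" claim. This sketch assumes the row_number function numbers consecutive sorted rows that share the same argument values; that behavior, the helper names, and the sample rows are assumptions for illustration.

```python
# Hypothetical T_4 rows: (memberid, siteid, adzoneid, province, type, user).
t4_rows = [
    ("m1", "s1", "z1", "zj", "a", "c1"),
    ("m1", "s1", "z1", "bj", "a", "c1"),  # same user and adzone, other province
    ("m1", "s1", "z2", "zj", "a", "c1"),
    ("m1", "s2", "z3", "zj", "a", "c1"),
    ("m2", "s9", "z9", "zj", "a", "c1"),
    ("m1", "s1", "z1", "zj", "a", "c2"),
]


def sort_key(row):
    # Mirrors "sort by type, user, memberid, siteid, adzoneid".
    memberid, siteid, adzoneid, _province, typ, user = row
    return (typ, user, memberid, siteid, adzoneid)


def rank_rows(rows):
    # Each rank column compares a prefix of the sort key of this length.
    prefix_len = {"total_num": 2, "member_num": 3, "site_num": 4, "adzone_num": 5}
    counters = dict.fromkeys(prefix_len, 0)
    prev = None
    ranked = []
    for row in sorted(rows, key=sort_key):
        key = sort_key(row)
        ranks = {}
        for name, n in prefix_len.items():
            same_group = prev is not None and key[:n] == prev[:n]
            # Continue numbering inside a group, restart at 1 on a new group.
            counters[name] = counters[name] + 1 if same_group else 1
            ranks[name] = counters[name]
        ranked.append((row, ranks))
        prev = key
    return ranked


def count_rank_one(ranked, name):
    return sum(1 for _row, ranks in ranked if ranks[name] == 1)


ranked = rank_rows(t4_rows)
for name in ("adzone_num", "site_num", "member_num", "total_num"):
    print(name, count_rank_one(ranked, name))
```

Each count of rank-1 rows matches the number of distinct (user, dimension prefix) combinations at that level, which is why one sort can serve all the roll-up levels.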

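Step 3: UV statistics at the different granularities. Start from the finest granularity, producing result table R_TABLE_4; at this point the result set is only on the order of 100,000 rows.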

For example, UV at the memberid, siteid, adzoneid, province granularity is computed as follows:

Select memberid, siteid, adzoneid, province,
sum(case when type = 'a' then cast(1 as bigint) end) as province_uv,
sum(case when type = 'i' then cast(1 as bigint) end) as province_ip,
sum(case when adzone_num = 1 and type = 'a' then cast(1 as bigint) end) as adzone_uv,
sum(case when adzone_num = 1 and type = 'i' then cast(1 as bigint) end) as adzone_ip,
sum(case when site_num = 1 and type = 'a' then cast(1 as bigint) end) as site_uv,
sum(case when site_num = 1 and type = 'i' then cast(1 as bigint) end) as site_ip,
sum(case when member_num = 1 and type = 'a' then cast(1 as bigint) end) as member_uv,
sum(case when member_num = 1 and type = 'i' then cast(1 as bigint) end) as member_ip,
sum(case when total_num = 1 and type = 'a' then cast(1 as bigint) end) as total_uv,
sum(case when total_num = 1 and type = 'i' then cast(1 as bigint) end) as total_ip
from T_4_NUM
group by memberid, siteid, adzoneid, province;
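
The conditional sums can be sketched in Python as follows. The ranked input rows are hand-written sample data, and a Counter stands in for the SQL sums, so a column with no matching rows reads as 0 here where SQL would return NULL.

```python
from collections import Counter

# Hypothetical T_4_NUM rows; the rank values are written by hand for the example.
t4_num = [
    {"memberid": "m1", "siteid": "s1", "adzoneid": "z1", "province": "zj",
     "type": "a", "adzone_num": 1, "site_num": 1, "member_num": 1, "total_num": 1},
    {"memberid": "m1", "siteid": "s1", "adzoneid": "z1", "province": "bj",
     "type": "a", "adzone_num": 2, "site_num": 2, "member_num": 2, "total_num": 2},
    {"memberid": "m1", "siteid": "s1", "adzoneid": "z1", "province": "zj",
     "type": "a", "adzone_num": 1, "site_num": 1, "member_num": 1, "total_num": 1},
    {"memberid": "m1", "siteid": "s1", "adzoneid": "z1", "province": "zj",
     "type": "i", "adzone_num": 1, "site_num": 1, "member_num": 1, "total_num": 1},
]


def build_r_table_4(rows):
    result = {}
    for r in rows:
        key = (r["memberid"], r["siteid"], r["adzoneid"], r["province"])
        acc = result.setdefault(key, Counter())
        suffix = "uv" if r["type"] == "a" else "ip"
        # Rows are already distinct at province level, so every row counts.
        acc["province_" + suffix] += 1
        # At coarser levels only the rank-1 row of each group counts.
        for level in ("adzone", "site", "member", "total"):
            if r[level + "_num"] == 1:
                acc[level + "_" + suffix] += 1
    return result


r4 = build_r_table_4(t4_num)
for key, acc in sorted(r4.items()):
    print(key, dict(acc))
```

In this sample the user seen in both provinces contributes to adzone_uv only once, in the row where its rank is 1, so summing adzone_uv over provinces does not double count.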

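For UV at the ad-zone granularity, aggregate from R_TABLE_4; the source table for this step is only on the order of 100,000 rows:

Select memberid, siteid, adzoneid, sum(adzone_uv), sum(adzone_ip)
From R_TABLE_4
Group by memberid, siteid, adzoneid;

The memberid, siteid UV, the memberid UV, and the total UV are likewise all rolled up from R_TABLE_4.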
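
As a rough illustration of the roll-up, the sketch below sums pre-computed columns of a tiny, made-up R_TABLE_4. The roll_up helper is hypothetical; the column and table names follow the text.

```python
# Hypothetical R_TABLE_4 rows holding only the columns needed here.
r_table_4 = [
    {"memberid": "m1", "siteid": "s1", "adzoneid": "z1", "province": "zj",
     "adzone_uv": 2, "adzone_ip": 1, "site_uv": 2, "site_ip": 1},
    {"memberid": "m1", "siteid": "s1", "adzoneid": "z1", "province": "bj",
     "adzone_uv": 0, "adzone_ip": 0, "site_uv": 0, "site_ip": 0},
    {"memberid": "m1", "siteid": "s1", "adzoneid": "z2", "province": "zj",
     "adzone_uv": 1, "adzone_ip": 1, "site_uv": 0, "site_ip": 0},
]


def roll_up(rows, dims, uv_col, ip_col):
    # Group by the coarser dimensions and sum the columns that were
    # counted at that level in step 3.
    out = {}
    for r in rows:
        key = tuple(r[d] for d in dims)
        uv, ip = out.get(key, (0, 0))
        out[key] = (uv + r[uv_col], ip + r[ip_col])
    return out


r_table_3 = roll_up(r_table_4, ("memberid", "siteid", "adzoneid"),
                    "adzone_uv", "adzone_ip")
r_table_2 = roll_up(r_table_4, ("memberid", "siteid"), "site_uv", "site_ip")
print(r_table_3)
print(r_table_2)
```

Each coarser table is a plain sum over the small R_TABLE_4, so none of these steps has to touch the large log again.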

