Select ‘shop’ as type,
Shopid as type_name, Uid From log ) y
Group by type,type_name,uid; Insert into t2
Select type,type_name,sum(1) From t1
Group by type,type_name; From t2
Insert into t3
Select type,type_name,uv Where type=’page’
Select type,type_name,uv Where type=’shop’ ;
最终得到两个结果表t3,页面uv表,t4,店铺结果表。从io上来说,log一次读。但比方案2少次hdfs写(multi insert有时会增加额外的map阶段hdfs写)。作业数减少1个到3,有reduce的作业数由4减少到2,第三步是一个小表的map过程,分下表,计算资源消耗少。但方案2每个都是大规模的去重汇总计算。
这个优化的主要思路是,map reduce作业初始化话的时间是比较长,既然起来了,让他多干点活,顺便把页面按uid去重的活也干了,省下log的一次读和作业的初始化时间,省下网络shuffle的io,但增加了本地磁盘读写。效率提升较多。
这个方案适合平级的不需要逐级向上汇总的多粒度uv计算,粒度越多,节省资源越多,比较通用。
问题9:多粒度,逐层向上汇总的uv结算。比如4个维度,a,b,c,d,分别计算a,b,c,d,uv; a,b,c,uv;a,b,uv;a;uv,total uv4个结果表。这可以用问题8的方案二,这里由于uv场景的特殊性,多粒度,逐层向上汇总,就可以使用一次排序,所有uv计算受益的计算方法。
案例:目前mm_log日志一天有25亿+的pv数,要从mm日志中计算uv,与ipuv,一共计算 三个粒度的结果表
(memberid,siteid,adzoneid,province,uv,ipuv) R_TABLE_4 (memberid,siteid,adzoneid,uv,ipuv) R_TABLE_3 (memberid,siteid,uv,ipuv) R_TABLE_2
第一步:按memberid,siteid,adzoneid,province,使用group去重,产生临时表,对cookie,ip 打上标签放一起,一起去重,临时表叫T_4;
Select memberid,siteid,adzoneid,province,type,user From(
Select memberid,siteid,adzoneid,province,‘a’ type ,cookie as user from mm_log where ds=20101205 Union all
Select memberid,siteid,adzoneid,province,‘i’ type ,ip as user from mm_log where ds=20101205 ) x group by memberid,siteid,adzoneid,province,type,user ;
第二步:排名,产生表T_4_NUM.Hadoop最强大和核心能力就是parition 和 sort.按type,acookie分组,
Type,acookie,memberid,siteid,adzoneid,province排名。 Select * ,
row_number(type,user,memberid,siteid,adzoneid ) as adzone_num , row_number(type,user,memberid,siteid ) as site_num, row_number(type,user,memberid ) as member_num, row_number(type,user ) as total_num
from (select * from T_4 distribute by type,user sort by type,user, memberid,siteid,adzoneid ) x; 这样就可以得到不同层次粒度上user的排名,相同的user id在不同的粒度层次上,排名等于1的记录只有1条。取排名等于1的做sum,效果相当于Group by user去重后做sum操作。
第三步:不同粒度uv统计,先从最细粒度的开始统计,产生结果表R_TABLE_4,这时,结果集只有10w的级别。
如统计memberid,siteid,adzoneid,provinceid粒度的uv使用的方法就是 Select memberid,siteid,adzoneid, provinceid,
sum(case when type =’a’ then cast(1) as bigint end ) as province_uv , sum(case when type =’i’ then cast(1) as bigint end ) as province_ip ,
sum(case when adzone_num =1 and type =’a’ then cast(1) as bigint end ) as adzone_uv , sum(case when adzone_num =1 and type =’i’ then cast(1) as bigint end ) as adzone_ip , sum(case when site_num =1 and type =’a’ then cast(1) as bigint end ) as site_uv , sum(case when site_num =1 and type =’i’ then cast(1) as bigint end ) as site_ip ,
sum(case when member_num =1 and type =’a’ then cast(1) as bigint end ) as member_uv , sum(case when member_num =1 and type =’i’ then cast(1) as bigint end ) as member_ip , sum(case when total_num =1 and type =’a’ then cast(1) as bigint end ) as total_uv , sum(case when total_num =1 and type =’i’ then cast(1) as bigint end ) as total_ip , from T_4_NUM
group by memberid,siteid,adzoneid, provinceid ;
广告位粒度的uv的话,从R_TABLE_4统计,这是源表做10w级别的统计 Select memberid,siteid,adzoneid,sum(adzone_uv),sum(adzone_ip) From R_TABLE_4
Group by memberid,siteid,adzoneid; memberid,siteid的uv计算 , memberid的uv计算,
total uv 的计算也都从R_TABLE_4汇总。