hadoop - How to parse for occurrences based on inputs in the same file

- July 15, 2013

event1           foo_id1
event1           foo_id2
event1           foo_id4
event1           foo_id6
event1           foo_id7
event1           foo_id8
event1           foo_id8
event1           foo_id1
event1           foo_id4

event2           foo_id1
event2           foo_id2
event2           foo_id3
event2           foo_id4
event2           foo_id5
event2           foo_id6
event2           foo_id8
event2           foo_id9
event2           foo_id11

The above information is available in a file in S3 under a bucket (say s3://hadoop.mycompany.com/bucket1/foo1.txt).

All events have foo_ids. For the foo_ids that appear in "event2", I want to know how many times each one occurs in "event1".

E.g., in the above case:

foo_id1=2
foo_id2=1
foo_id3=0
foo_id4=2
foo_id5=0
foo_id6=1
foo_id8=2
foo_id9=0
foo_id11=0

How do I write a Hive script to return the data in the expected format?

Hi, this can be accomplished using the following Hive script.

First, you need to create a Hive external table over the S3 location using this command:

create external table events (event string, foo string)
row format delimited fields terminated by '\t'
location 's3n://hadoop.mycompany.com/bucket1/';

Then run the following query:

select e2.foo, count(e1.foo)
from events e2
left outer join events e1
  on e1.foo = e2.foo and e1.event = 'event1'
where e2.event = 'event2'
group by e2.foo;

You should get the results you need, like this:

foo_id1  2
foo_id11 0
foo_id2  1
foo_id3  0
foo_id4  2
foo_id5  0
foo_id6  1
foo_id8  2
foo_id9  0
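
If you want to sanity-check the join logic locally before pointing Hive at S3, here is a rough sketch that runs the same kind of query against an in-memory SQLite table from Python. SQLite is only a stand-in for Hive here (an assumption made for local testing, not part of the original setup), the `order by` is added just to get stable output, and the helper name `count_event1_occurrences` is made up for illustration. The rows are the sample data from the question.

```python
import sqlite3

# Sample rows copied from the question, as (event, foo) pairs.
ROWS = [
    ("event1", "foo_id1"), ("event1", "foo_id2"), ("event1", "foo_id4"),
    ("event1", "foo_id6"), ("event1", "foo_id7"), ("event1", "foo_id8"),
    ("event1", "foo_id8"), ("event1", "foo_id1"), ("event1", "foo_id4"),
    ("event2", "foo_id1"), ("event2", "foo_id2"), ("event2", "foo_id3"),
    ("event2", "foo_id4"), ("event2", "foo_id5"), ("event2", "foo_id6"),
    ("event2", "foo_id8"), ("event2", "foo_id9"), ("event2", "foo_id11"),
]

# Self left outer join: keep every event2 row, match only event1 rows.
# The event1 filter sits in the ON clause so unmatched ids still yield 0.
QUERY = """
select e2.foo, count(e1.foo)
from events e2
left outer join events e1
  on e1.foo = e2.foo and e1.event = 'event1'
where e2.event = 'event2'
group by e2.foo
order by e2.foo
"""


def count_event1_occurrences(rows):
    """Hypothetical helper: load rows into SQLite and run the join query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table events (event text, foo text)")
    conn.executemany("insert into events values (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for foo, n in count_event1_occurrences(ROWS):
        print(f"{foo}\t{n}")
```

The condition on e1.event is placed in the join's ON clause rather than the WHERE clause on purpose: filtering e1 in WHERE would discard the unmatched rows and the ids with a count of 0 would disappear from the result.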

Hope this solves your problem.
