hadoop - How to parse for occurrences based on inputs in the same file
event1 foo_id1
event1 foo_id2
event1 foo_id4
event1 foo_id6
event1 foo_id7
event1 foo_id8
event1 foo_id8
event1 foo_id1
event1 foo_id4
event2 foo_id1
event2 foo_id2
event2 foo_id3
event2 foo_id4
event2 foo_id5
event2 foo_id6
event2 foo_id8
event2 foo_id9
event2 foo_id11
The above information is available in a file in S3 under a bucket (say s3://hadoop.mycompany.com/bucket1/foo1.txt).
All events have foo_ids. For the events in "event2", I want to know how many times the foo_id(s) occur in event1.
E.g. in the above case:
foo_id1=2
foo_id2=1
foo_id3=0
foo_id4=2
foo_id5=0
foo_id6=1
foo_id8=2
foo_id9=0
foo_id11=0
How do I write a Hive script to return the data in the expected format?
Hi, this can be accomplished using the following Hive script.
First you need to create a Hive external table using this command:
create external table events (event string, foo string) row format delimited fields terminated by '\t' location 's3n://hadoop.mycompany.com/bucket1/';
Then run the following query:
select e2.foo, count(e1.foo) from events e2 left outer join events e1 on e1.foo = e2.foo and e1.event = 'event1' where e2.event = 'event2' group by e2.foo;
You should get the results you need, like this:
foo_id1 2
foo_id11 0
foo_id2 1
foo_id3 0
foo_id4 2
foo_id5 0
foo_id6 1
foo_id8 2
foo_id9 0
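If you want to sanity-check these counts without a cluster, here is a minimal Python sketch of the same counting logic run on the sample rows from the question. It is not Hive code, only a local cross-check; the names `ROWS` and `count_event1_occurrences` are made up for illustration, and it assumes each foo_id appears at most once under event2 (as in the sample), since duplicates there would inflate the join's counts.

```python
from collections import Counter

# Sample rows copied from the question: (event, foo_id) pairs.
ROWS = [
    ("event1", "foo_id1"), ("event1", "foo_id2"), ("event1", "foo_id4"),
    ("event1", "foo_id6"), ("event1", "foo_id7"), ("event1", "foo_id8"),
    ("event1", "foo_id8"), ("event1", "foo_id1"), ("event1", "foo_id4"),
    ("event2", "foo_id1"), ("event2", "foo_id2"), ("event2", "foo_id3"),
    ("event2", "foo_id4"), ("event2", "foo_id5"), ("event2", "foo_id6"),
    ("event2", "foo_id8"), ("event2", "foo_id9"), ("event2", "foo_id11"),
]


def count_event1_occurrences(rows):
    """For each foo_id seen under event2, count its rows under event1."""
    # How often each foo_id shows up in event1 rows.
    event1_counts = Counter(foo for event, foo in rows if event == "event1")
    # Distinct event2 foo_ids, sorted as strings (so foo_id11 precedes foo_id2).
    event2_ids = sorted({foo for event, foo in rows if event == "event2"})
    # A Counter yields 0 for missing keys, which plays the role of
    # count(e1.foo) over the NULLs a left outer join produces.
    return {foo: event1_counts[foo] for foo in event2_ids}


if __name__ == "__main__":
    for foo, n in count_event1_occurrences(ROWS).items():
        print(foo, n)
```

Note that foo_id7 occurs only under event1, so it does not show up in the output at all, which is the same effect as restricting the left side of the join to event2 rows.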
Hope this solves your problem.