hadoop - How to parse for occurrences based on inputs in the same file


    event1           foo_id1
    event1           foo_id2
    event1           foo_id4
    event1           foo_id6
    event1           foo_id7
    event1           foo_id8
    event1           foo_id8
    event1           foo_id1
    event1           foo_id4
    event2           foo_id1
    event2           foo_id2
    event2           foo_id3
    event2           foo_id4
    event2           foo_id5
    event2           foo_id6
    event2           foo_id8
    event2           foo_id9
    event2           foo_id11

The above information is available in a file in S3 under a bucket (say s3://hadoop.mycompany.com/bucket1/foo1.txt).

All events have foo_ids. For each foo_id that appears in "event2", I want to know how many times it occurs in "event1".

E.g., in the above case:

    foo_id1=2
    foo_id2=1
    foo_id3=0
    foo_id4=2
    foo_id5=0
    foo_id6=1
    foo_id8=2
    foo_id9=0
    foo_id11=0

How can I write a Hive script to return the data in the expected format?

Hi, this can be accomplished using the following Hive script:

  1. First, you need to create a Hive external table using this command:

    create external table events (event string, foo string) row format delimited fields terminated by '\t' location 's3n://hadoop.mycompany.com/bucket1/';

  2. Then run the following query:

    select e2.foo, count(e1.foo) from events e2 left outer join events e1 on e1.foo = e2.foo and e1.event = 'event1' where e2.event = 'event2' group by e2.foo;

You should get the results you need, like this:

    foo_id1  2
    foo_id11 0
    foo_id2  1
    foo_id3  0
    foo_id4  2
    foo_id5  0
    foo_id6  1
    foo_id8  2
    foo_id9  0
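If you want to sanity-check the join logic locally before running it on a cluster, here is a minimal sketch that runs the same kind of self-join against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in, so this is SQLite and not Hive; the in-memory table, the `ROWS` list, the `count_event1_occurrences` helper, and the added `order by` are all assumptions made for illustration.

```python
import sqlite3

# Sample (event, foo) rows copied from the question.
ROWS = [
    ("event1", "foo_id1"), ("event1", "foo_id2"), ("event1", "foo_id4"),
    ("event1", "foo_id6"), ("event1", "foo_id7"), ("event1", "foo_id8"),
    ("event1", "foo_id8"), ("event1", "foo_id1"), ("event1", "foo_id4"),
    ("event2", "foo_id1"), ("event2", "foo_id2"), ("event2", "foo_id3"),
    ("event2", "foo_id4"), ("event2", "foo_id5"), ("event2", "foo_id6"),
    ("event2", "foo_id8"), ("event2", "foo_id9"), ("event2", "foo_id11"),
]

# Same self-join as the Hive query; "order by" is added only to make
# the output order deterministic in this local check.
QUERY = """
select e2.foo, count(e1.foo)
from events e2
left outer join events e1
  on e1.foo = e2.foo and e1.event = 'event1'
where e2.event = 'event2'
group by e2.foo
order by e2.foo
"""


def count_event1_occurrences(rows):
    """Load rows into an in-memory SQLite table and run the join."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table events (event text, foo text)")
        conn.executemany("insert into events values (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for foo, occurrences in count_event1_occurrences(ROWS):
        print(foo, occurrences)
```

Note that the `e1.event = 'event1'` condition sits in the `on` clause rather than the `where` clause. That placement is what keeps foo_ids with no event1 rows (such as foo_id3) in the output with a count of 0; filtering on e1 in `where` would drop them.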

Hope this solves your problem.

