sql - Mapreduce Table Diff -

- September 15, 2012

I have two versions (old/new) of a database table with about 100,000,000 records. They are in files:

    trx-old
    trx-new

The structure is:

    id  date  amount  memo
    1   5/1   100     slacks
    2   5/1   50      wine

id is a simple primary key; the other fields are non-key. I want to generate 3 files:

    trx-removed  (ids of records present in trx-old but not in trx-new)
    trx-added    (records from trx-new whose ids are not present in trx-old)
    trx-changed  (records from trx-new whose non-key values have changed since trx-old)
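To pin down what the three outputs mean, here is a minimal in-memory sketch in Python. It only illustrates the definitions and would not scale to 100,000,000 records. The function name `diff_tables` and the rows in `new` are made up for illustration; only the two `old` rows come from the example above.

```python
def diff_tables(old, new):
    """Compare two table versions given as dicts of id -> tuple of non-key fields."""
    # ids present in old but missing from new
    removed = sorted(k for k in old if k not in new)
    # full records whose ids appear only in new
    added = {k: new[k] for k in new if k not in old}
    # full new records whose id exists in both but whose non-key fields differ
    changed = {k: new[k] for k in new if k in old and old[k] != new[k]}
    return removed, added, changed


# The old rows are from the example; the new rows are hypothetical.
old = {1: ("5/1", 100, "slacks"), 2: ("5/1", 50, "wine")}
new = {2: ("5/1", 55, "wine"), 3: ("5/2", 20, "socks")}

removed, added, changed = diff_tables(old, new)
```

With this data, id 1 ends up in trx-removed, id 3 in trx-added, and id 2 in trx-changed because its amount differs.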

I need to run this operation every day in a short batch window. And actually, I need to do this for multiple tables and across multiple schemas (generating 3 files for each), so the actual app is a bit more involved. But I think this example captures the crux of the problem.

This feels like an obvious application for MapReduce. Having never written a MapReduce application, my questions are:

Is there an EMR application that does this?
Is there an obvious Pig or maybe Cascading solution lying about?
Is there some other open source example that is very close to this?

PS: I saw the "diff between tables" question, but the solutions over there didn't look scalable.

PPS: Here is a little Ruby toy that demonstrates the algorithm: ruby dbdiff

I think it would be easiest to write your own job, mostly because you'll want to use MultipleOutputs to write to the 3 separate files from a single reduce step, whereas the typical reducer writes to only one file. You'd also need to use MultipleInputs to specify a mapper for each table.
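A rough sketch of that job's shape, simulated in plain Python rather than written against the Hadoop API. The two mapper functions stand in for the per-table mappers you would register with MultipleInputs, and the dict of lists stands in for the three named outputs you would write with MultipleOutputs. All function names here are hypothetical, and the sample rows for the new table are made up. The idea is that each mapper tags its record with the table it came from, the shuffle groups by id, and the reducer sees at most one old and one new record per id.

```python
from collections import defaultdict


def map_old(record):
    """Mapper for trx-old: emit (id, tagged non-key fields)."""
    rid, *fields = record
    yield rid, ("old", tuple(fields))


def map_new(record):
    """Mapper for trx-new: same key, different tag."""
    rid, *fields = record
    yield rid, ("new", tuple(fields))


def shuffle(pairs):
    """Group mapper output by key, as the framework's shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_diff(rid, values, outputs):
    """Reducer: route each id to one of three outputs, or to none if unchanged."""
    by_tag = dict(values)  # id is a primary key, so at most one record per tag
    old, new = by_tag.get("old"), by_tag.get("new")
    if new is None:
        outputs["trx-removed"].append(rid)
    elif old is None:
        outputs["trx-added"].append((rid, *new))
    elif old != new:
        outputs["trx-changed"].append((rid, *new))


def run_job(old_records, new_records):
    """Drive map, shuffle, and reduce in-process for demonstration."""
    pairs = [kv for r in old_records for kv in map_old(r)]
    pairs += [kv for r in new_records for kv in map_new(r)]
    outputs = {"trx-removed": [], "trx-added": [], "trx-changed": []}
    for rid, values in sorted(shuffle(pairs).items()):
        reduce_diff(rid, values, outputs)
    return outputs


old_records = [(1, "5/1", 100, "slacks"), (2, "5/1", 50, "wine")]
new_records = [(2, "5/1", 55, "wine"), (3, "5/2", 20, "socks")]  # hypothetical
result = run_job(old_records, new_records)
```

Because the reducer only ever compares the two records sharing one id, the work partitions cleanly by key, which is what lets it spread across reducers.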
