sql - MapReduce Table Diff


I have 2 versions (old/new) of a database table with about 100,000,000 records. They are in files:

trx-old
trx-new

The structure is:

id date amount memo
1  5/1     100 slacks
2  5/1      50 wine

id is a simple primary key; the other fields are non-key. I want to generate 3 files:

trx-removed (ids of records present in trx-old but not in trx-new)
trx-added   (records from trx-new whose ids are not present in trx-old)
trx-changed (records from trx-new whose non-key values have changed since trx-old)
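To pin down what the three files should contain, here is a minimal in-memory sketch in Python. It is only a reference for small samples, not something that would handle 100,000,000 records. The function name `diff_tables` and the rows in `new` are made up for illustration; only ids 1 and 2 in `old` come from the sample above.

```python
def diff_tables(old, new):
    """Compare two tables given as dicts mapping id -> tuple of non-key fields.

    Returns (removed ids, added records, changed records).
    """
    # ids present in the old table but missing from the new one
    removed = sorted(k for k in old if k not in new)
    # full new records whose id did not exist before
    added = {k: new[k] for k in new if k not in old}
    # full new records whose id existed but whose non-key fields differ
    changed = {k: new[k] for k in new if k in old and old[k] != new[k]}
    return removed, added, changed


# Hypothetical data: the old rows mirror the sample table, the new rows are invented.
old = {1: ("5/1", 100, "slacks"), 2: ("5/1", 50, "wine")}
new = {2: ("5/1", 55, "wine"), 3: ("5/2", 20, "socks")}

removed, added, changed = diff_tables(old, new)
print(removed, added, changed)
```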

I need to run this operation every day in a short batch window. Actually, I need to do this for multiple tables across multiple schemas (generating the 3 files for each), so the actual app is a bit more involved. But I think this example captures the crux of the problem.

This feels like an obvious application for MapReduce. Having never written a MapReduce application, my questions are:

  1. Is there an EMR application that does this?
  2. Is there an obvious Pig or maybe Cascading solution lying about?
  3. Is there some other open source example that is close to this?

PS I saw the "diff between tables" question, but the solutions on there didn't look scalable.

PPS Here is a little Ruby toy that demonstrates the algorithm: ruby dbdiff

I think it would be easiest to write your own job, mostly because you'll want to use MultipleOutputs to write to the 3 separate files from a single reduce step, when the typical reducer only writes to 1 file. You'd also need to use MultipleInputs to specify a mapper for each table.
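The shape of such a job can be sketched without Hadoop. The Python below only simulates the map, shuffle and reduce phases in memory, so it is an illustration of the data flow rather than a real job. The assumptions: the `source` tag stands in for having one mapper per table (what MultipleInputs would give you), the file name returned by the reducer stands in for a MultipleOutputs named output, and records are whitespace-separated lines with the id first. The names `map_record`, `reduce_id` and `run_job` are hypothetical, as are the rows in `new_lines`.

```python
from collections import defaultdict


def map_record(source, line):
    """Mapper: key each record by id and tag it with the table it came from."""
    fields = tuple(line.split())
    return fields[0], (source, fields[1:])


def reduce_id(rec_id, tagged_values):
    """Reducer: sees every version of one id and picks the output file, if any."""
    versions = dict(tagged_values)  # at most {"old": fields, "new": fields}
    old, new = versions.get("old"), versions.get("new")
    if new is None:
        return "trx-removed", rec_id
    if old is None:
        return "trx-added", " ".join((rec_id,) + new)
    if old != new:
        return "trx-changed", " ".join((rec_id,) + new)
    return None  # unchanged records produce no output


def run_job(old_lines, new_lines):
    # "Shuffle": group the tagged values from both inputs by id.
    groups = defaultdict(list)
    for source, lines in (("old", old_lines), ("new", new_lines)):
        for line in lines:
            key, value = map_record(source, line)
            groups[key].append(value)
    # One reduce call per id, each writing to at most one of the three outputs.
    outputs = {"trx-removed": [], "trx-added": [], "trx-changed": []}
    for key in sorted(groups):
        result = reduce_id(key, groups[key])
        if result is not None:
            name, row = result
            outputs[name].append(row)
    return outputs


# Old rows are the sample from the question; new rows are invented for the demo.
old_lines = ["1  5/1     100 slacks", "2  5/1      50 wine"]
new_lines = ["2  5/1      55 wine", "3  5/2      20 socks"]
outputs = run_job(old_lines, new_lines)
for name, rows in outputs.items():
    print(name, rows)
```

Because each id is handled in a single reduce call, the three-way decision needs no state shared between reducers, which is what lets one reduce step feed all three files.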

