sql - MapReduce Table Diff

- September 15, 2012

I have 2 versions (old/new) of a database table with 100,000,000 records. They are in files:

trx-old
trx-new

The structure is:

id  date  amount  memo
1   5/1   100     slacks
2   5/1   50      wine

id is a simple primary key; the other fields are non-key. I want to generate 3 files:

trx-removed (ids of records present in trx-old but not in trx-new)
trx-added   (records from trx-new whose ids are not present in trx-old)
trx-changed (records from trx-new whose non-key values have changed since trx-old)
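To make the three outputs concrete, here is a minimal in-memory sketch in Python. It is only a reference for the expected results, not something that would scale to 100,000,000 records. The function name `diff_tables`, the tuple layout, and the third sample row are assumptions made for illustration.

```python
def diff_tables(old_rows, new_rows):
    """Classify rows by primary key. Each row is (id, date, amount, memo)."""
    # Index both versions by id; the value is the tuple of non-key fields.
    old = {row[0]: row[1:] for row in old_rows}
    new = {row[0]: row[1:] for row in new_rows}

    # trx-removed: ids present in old but not in new.
    removed = sorted(k for k in old if k not in new)
    # trx-added: full records from new whose id is not in old.
    added = sorted((k,) + new[k] for k in new if k not in old)
    # trx-changed: full records from new whose non-key values differ from old.
    changed = sorted((k,) + new[k] for k in new
                     if k in old and old[k] != new[k])
    return removed, added, changed


# Illustrative data: ids 1 and 2 come from the example above,
# the changed amount and id 3 are made up.
old_rows = [(1, "5/1", 100, "slacks"), (2, "5/1", 50, "wine")]
new_rows = [(2, "5/1", 55, "wine"), (3, "5/2", 20, "socks")]
removed, added, changed = diff_tables(old_rows, new_rows)
```

With this data, id 1 lands in the removed list, id 3 in the added list, and id 2 in the changed list.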

I need to run this operation every day in a short batch window. Actually, I need to do it for multiple tables across multiple schemas (generating the 3 files for each), so the actual app is a bit more involved. But I think this example captures the crux of the problem.

This feels like an obvious application for MapReduce. Having never written a MapReduce application, my questions are:

Is there an EMR application for this?
Is there an obvious Pig or maybe Cascading solution lying about?
Is there some other open source example that is close to this?

PS: I saw the "diff between tables" question, but the solutions there didn't look scalable.

PPS: Here is a little Ruby toy that demonstrates the algorithm: ruby dbdiff

I think it would be easiest to write your own job, mostly because you'll want to use MultipleOutputs to write the 3 separate files from a single reduce step, when the typical reducer writes only 1 file. You'd also need to use MultipleInputs to specify a mapper for each table.
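The shape of such a job can be sketched in plain Python, without Hadoop. This is a simulation under stated assumptions: the two mapper functions stand in for the per-table mappers that MultipleInputs would wire up, the `groups` dictionary stands in for the shuffle, and the `outputs` dictionary stands in for the named files MultipleOutputs would write. All function names here are hypothetical and none of this is the Hadoop API.

```python
from collections import defaultdict


def map_old(row):
    # Mapper for trx-old: key on id, tag the non-key fields with their source.
    return row[0], ("old", row[1:])


def map_new(row):
    # Mapper for trx-new: same key, different tag.
    return row[0], ("new", row[1:])


def reduce_diff(key, tagged_values):
    """Return (output_name, record), or None when the record is unchanged."""
    versions = dict(tagged_values)  # at most one "old" and one "new" per id
    old, new = versions.get("old"), versions.get("new")
    if new is None:
        return "trx-removed", (key,)
    if old is None:
        return "trx-added", (key,) + new
    if old != new:
        return "trx-changed", (key,) + new
    return None


def run_job(old_rows, new_rows):
    # Map phase plus a stand-in for the shuffle: group tagged values by id.
    groups = defaultdict(list)
    for mapper, rows in ((map_old, old_rows), (map_new, new_rows)):
        for row in rows:
            key, value = mapper(row)
            groups[key].append(value)

    # Reduce phase: one call per id, routed to one of three named outputs.
    outputs = {"trx-removed": [], "trx-added": [], "trx-changed": []}
    for key in sorted(groups):
        result = reduce_diff(key, groups[key])
        if result is not None:
            name, record = result
            outputs[name].append(record)
    return outputs
```

Because every record for a given id arrives at the same reduce call, the reducer only ever holds at most two rows in memory, which is what lets the approach scale with the number of reducers rather than with table size.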
