sql - MapReduce Table Diff
I have two versions (old/new) of a database table with 100,000,000 records. They are in files:
trx-old trx-new
The structure is:
id  date  amount  memo
1   5/1   100     slacks
2   5/1   50      wine
id is a simple primary key; the other fields are non-key. I want to generate three files:
- trx-removed (ids of records present in trx-old but not in trx-new)
- trx-added (records from trx-new whose ids are not present in trx-old)
- trx-changed (records from trx-new whose non-key values have changed since trx-old)
I need to run this operation every day in a short batch window. Actually, I need to do it for multiple tables and across multiple schemas (generating the three files for each), so the actual app is a bit more involved. But I think this example captures the crux of the problem.
This feels like an obvious application for MapReduce. Having never written a MapReduce application, my questions are:
- Is there an EMR application that does this?
- Is there an obvious Pig or maybe Cascading solution lying about?
- Is there some other open source example that is close to this?
PS: I saw the "diff between tables" question, but the solutions there didn't look scalable.
PPS: Here is a little Ruby toy that demonstrates the algorithm: ruby dbdiff
I think it would be easiest to write your own job, because you'll want to use MultipleOutputs to write three separate files from a single reduce step, whereas the typical reducer writes only one file. You'd also need to use MultipleInputs to specify a mapper for each table.
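To make the shape of that job concrete, here is a rough in-memory sketch in Python of the map and reduce logic. It is not Hadoop code and not the Ruby toy from the question. It assumes whitespace-separated lines with the id first, and the names `map_record`, `reduce_id` and `table_diff` are made up for illustration. The two calls to the mapper with different source tags stand in for what MultipleInputs would do, and the three named output lists stand in for MultipleOutputs.

```python
from collections import defaultdict


def map_record(source, line):
    """Mapper: key each record by id and tag it with the table it came from.

    Assumes every line has an id followed by at least one other field.
    """
    rec_id, rest = line.strip().split(None, 1)
    return rec_id, (source, rest)


def reduce_id(rec_id, tagged_values):
    """Reducer: given all versions of one id, pick the output it belongs to.

    Returns (output_name, payload), or None when the record is unchanged.
    """
    old = new = None
    for source, rest in tagged_values:
        if source == "old":
            old = rest
        else:
            new = rest
    if new is None:
        return "removed", rec_id              # only the id is written
    if old is None:
        return "added", f"{rec_id} {new}"     # full new record
    if old != new:
        return "changed", f"{rec_id} {new}"   # full new record
    return None


def table_diff(old_lines, new_lines):
    """Simulate map -> shuffle (group by id) -> reduce in memory."""
    groups = defaultdict(list)
    for source, lines in (("old", old_lines), ("new", new_lines)):
        for line in lines:
            key, value = map_record(source, line)
            groups[key].append(value)

    outputs = {"removed": [], "added": [], "changed": []}
    for rec_id in sorted(groups):  # string order, just to be deterministic
        result = reduce_id(rec_id, groups[rec_id])
        if result is not None:
            name, payload = result
            outputs[name].append(payload)
    return outputs


if __name__ == "__main__":
    # Made-up sample data in the layout from the question.
    old = ["1 5/1 100 slacks", "2 5/1 50 wine"]
    new = ["1 5/1 100 slacks", "2 5/1 55 wine", "3 5/2 20 socks"]
    for name, rows in table_diff(old, new).items():
        print(name, rows)
```

In a real job the grouping by id is done by the shuffle, so each reducer call only ever holds the (at most two) versions of one record, which is why this should scale to large tables.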