Just getting started writing Hadoop MR jobs. Hopefully we'll be switching over to Spark soon, but we're stuck doing MR for now.
I'd like to group records by a hash of their value. But I'd like to sort them by something totally unrelated--a timestamp in their value. I'm confused about how to best do that. I see two options:
1) Have a first MR job that computes the hash for each value in its mapper, and then reduces all the records with that hash however it wants (I actually have this much working just as we need right now). Then chain a second MR job that re-sorts the first job's output by the timestamp in the value. But is chaining a whole second job just for the sort inefficient?
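To make option 1 concrete, here's a plain-Java sketch of the two phases as I understand them. This is not actual Hadoop code — the class and method names are all mine, and the hash-bucketing just stands in for the shuffle/partition step:

```java
import java.util.*;

// Plain-Java sketch of the two-job approach (option 1); names are hypothetical.
public class TwoPhaseSketch {
    static class Rec {
        final String value;
        final long timestamp;
        Rec(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    // Job 1's shuffle: group records by a hash of their value.
    static Map<Integer, List<Rec>> groupByHash(List<Rec> input) {
        Map<Integer, List<Rec>> groups = new HashMap<>();
        for (Rec r : input) {
            // 4 stands in for the number of reduce tasks; floorMod avoids negative buckets.
            int h = Math.floorMod(r.value.hashCode(), 4);
            groups.computeIfAbsent(h, k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    // Job 2: re-sort everything globally by the timestamp in the value.
    static List<Rec> sortByTimestamp(Collection<List<Rec>> groups) {
        List<Rec> all = new ArrayList<>();
        for (List<Rec> g : groups) all.addAll(g);
        all.sort(Comparator.comparingLong(r -> r.timestamp));
        return all;
    }
}
```

The point of the sketch is just that the grouping criterion (hash of value) and the final output order (timestamp) are computed in two independent passes.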
2) I've read some blog posts about composite keys, so maybe I could accomplish it all in one job? In the mapper I'd create some kind of composite key containing both the hash for grouping and the timestamp for sorting. But I'm unclear whether this is even possible: can it still group correctly when the sort is completely unrelated to the grouping? I'm also unsure which interfaces I'd need to implement, which classes I'd need to create, and how to configure the job.
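For option 2, the kind of composite key I have in mind would compare roughly like this. In real Hadoop it would implement WritableComparable (with write()/readFields()), and the job would need a custom partitioner and grouping comparator that look at the hash component only; this plain-Java sketch (names all mine) just shows the intended compareTo ordering, hash first, then timestamp:

```java
// Sketch of the composite-key idea; in Hadoop this would be a
// WritableComparable. All names here are hypothetical.
public class CompositeKey implements Comparable<CompositeKey> {
    final long hash;      // grouping component
    final long timestamp; // sorting component

    CompositeKey(long hash, long timestamp) {
        this.hash = hash;
        this.timestamp = timestamp;
    }

    // Hash first so grouped records stay adjacent, then timestamp.
    // Note this only orders by timestamp *within* a hash group, which
    // is exactly my doubt about getting a global sort out of one job.
    @Override
    public int compareTo(CompositeKey other) {
        int c = Long.compare(hash, other.hash);
        return c != 0 ? c : Long.compare(timestamp, other.timestamp);
    }
}
```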
To be clear, I'm not talking about a secondary sort. I don't care about the order of the values in the Iterator for each reduce call; I'm concerned with the order things get emitted from the reducer, which needs to be a global sort by timestamp.
What's the recommended way to do something like this?