pydoop

A Python MapReduce and HDFS API for Hadoop

pydoop Ranking & Summary

  • License: The Apache License 2.0
  • Price: Free
  • Publisher Name: Simone Leo, Gianluigi Zanetti and Luca Pireddu

pydoop Description

pydoop is a Python MapReduce and HDFS API for Hadoop. Built as a wrapper around the C++ API, pydoop allows you to develop full-fledged MapReduce applications with HDFS access.

Here is how you write a basic Python wordcount with pydoop:

    from pydoop.pipes import Mapper, Reducer, Factory, runTask

    class WordCountMapper(Mapper):

        def map(self, context):
            words = context.getInputValue().split()
            for w in words:
                context.emit(w, "1")

    class WordCountReducer(Reducer):

        def reduce(self, context):
            s = 0
            while context.nextValue():
                s += int(context.getInputValue())
            context.emit(context.getInputKey(), str(s))

    runTask(Factory(WordCountMapper, WordCountReducer))

Or, for simple tasks such as word counting, you can try the pydoop_script tool. Then your code would become:

    def mapper(k, text, writer):
        for word in text.split():
            writer.emit(word, 1)

    def reducer(word, count, writer):
        writer.emit(word, sum(map(int, count)))
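To see what the pydoop_script-style mapper and reducer compute without a Hadoop cluster, the following is a minimal local sketch of the map, group-by-key, and reduce steps. It does not use pydoop at all: `ListWriter` and `run_local` are hypothetical helpers invented for this illustration, and the only thing assumed about the real writer object is that it exposes `emit(key, value)` as in the description above.

```python
from collections import defaultdict


class ListWriter:
    """Hypothetical stand-in for the writer object: collects emitted pairs."""

    def __init__(self):
        self.pairs = []

    def emit(self, key, value):
        self.pairs.append((key, value))


def mapper(k, text, writer):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in text.split():
        writer.emit(word, 1)


def reducer(word, count, writer):
    # `count` is an iterable of the values emitted for `word`.
    writer.emit(word, sum(map(int, count)))


def run_local(lines):
    """Hypothetical local driver: map each line, group by key, reduce."""
    map_out = ListWriter()
    for offset, line in enumerate(lines):
        mapper(offset, line, map_out)

    # Group mapper output by key, imitating the shuffle step.
    groups = defaultdict(list)
    for key, value in map_out.pairs:
        groups[key].append(value)

    reduce_out = ListWriter()
    for key in sorted(groups):
        reducer(key, groups[key], reduce_out)
    return dict(reduce_out.pairs)


if __name__ == "__main__":
    print(run_local(["the quick brown fox", "the lazy dog", "the end"]))
```

Running the functions this way is only a quick check of the counting logic; on a real cluster, input splitting, grouping, and output are handled by Hadoop rather than by a driver like `run_local`.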

