Wiki Dataset Word Counter

Note: The code and the end result cannot be publicly displayed because they are under copyright by The Data Incubator; however, they can be sent privately upon request.
Using MapReduce, I counted how many times each of the 100 most frequent words appears in the text section of the Wikipedia dataset. After stripping out the HTML strings and metadata, “the” turned out to be the most frequent word, with 1,431,080 appearances. Finally, I computed aggregate statistics on the number of links per page: the number of pages analyzed, the mean number of links per page, the standard deviation, and quantiles. In total, 188,418 Wikipedia pages were analyzed.
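Since the original code cannot be shown, here is only a minimal sketch of the two steps described above. It uses plain Python as a stand-in for the mapper and reducer phases that a framework such as MRJob would run, so it needs no extra packages. The tokenizer regex, the function names (`map_words`, `reduce_counts`, `top_words`, `link_stats`), the choice of population standard deviation, and the use of quartiles are all assumptions made for illustration, not the actual implementation.

```python
import re
import statistics
from collections import Counter
from heapq import nlargest

# Hypothetical tokenizer: lowercase runs of letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def map_words(text):
    """Mapper phase: emit (word, 1) for every token in one page's text."""
    for word in WORD_RE.findall(text.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reducer phase: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def top_words(pages, n=100):
    """Run map then reduce over all pages and keep the n most frequent words."""
    pairs = (pair for text in pages for pair in map_words(text))
    totals = reduce_counts(pairs)
    return nlargest(n, totals.items(), key=lambda kv: kv[1])


def link_stats(link_counts):
    """Aggregate per-page link counts into summary statistics."""
    counts = sorted(link_counts)
    return {
        "pages": len(counts),
        "mean": statistics.mean(counts),
        # Assumption: population standard deviation over all analyzed pages.
        "stdev": statistics.pstdev(counts),
        # Assumption: quartiles; other quantile cuts work the same way.
        "quartiles": statistics.quantiles(counts, n=4),
    }


if __name__ == "__main__":
    sample_pages = ["The cat and the hat", "The dog"]  # made-up input
    print(top_words(sample_pages, n=1))
    print(link_stats([1, 2, 3, 4]))  # made-up link counts
```

In an actual MapReduce job the mapper and reducer would run in parallel across input splits, and picking the top 100 would typically be a second step after the counts are summed. The sketch collapses that into one in-memory pass to keep the logic visible.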
Learning Points: MapReduce, MRJob, MRStep, lxml.etree