Hadoop is the de-facto standard for Big Data, that is, large-scale data processing and storage. Hadoop is a relatively young platform that still suffers from fundamental scalability issues: all the file system metadata must be stored in the memory heap of a single node, the name node, and all scheduling decisions are taken by a single node, the scheduler. At KTH/SICS, we are developing Hadoop Open Platform-as-a-Service (Hops), a new distribution of Apache Hadoop with scalable, highly available, customizable metadata and scheduling.
How do you share R code? What does a package look like? What tools are there for testing?
And what is Reproducible Research? Finally, we solve a problem in R.