This page collects links to public datasets that I have come across.
- http://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html
- http://www.researchpipeline.com/mediawiki/index.php?title=Main_Page
- http://www.scaleunlimited.com/datasets/public-datasets/
- http://kevinchai.net/datasets (good)
- http://snap.stanford.edu/data/
- http://labrosa.ee.columbia.edu/millionsong/
- http://blogs.msdn.com/b/avkashchauhan/archive/2012/04/12/processing-million-songs-dataset-with-pig-scripts-on-apache-hadoop-on-windows-azure.aspx
- https://stackoverflow.com/questions/10843892/download-large-data-for-hadoop
- http://www.hadooplessons.info/2013/06/data-sets-for-practicing-hadoop.html
- http://archive.ics.uci.edu/ml/datasets.html
- http://lemurproject.org/clueweb09/
- http://stackoverflow.com/questions/2674421/free-large-datasets-to-experiment-with-hadoop
- http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free (good)
- http://rishavrohitblog.blogspot.in/2013/02/sample-datasets.html
- http://blog.cloudera.com/blog/2013/02/how-to-resample-from-a-large-data-set-in-parallel-with-r-on-hadoop/
- http://www.hadoopinrealworld.com/using-million-song-dataset-in-hadoop/
- http://www.datawrangling.com/some-datasets-available-on-the-web/ (good)
- http://spatialhadoop.cs.umn.edu/datasets.html
- https://www.datadr.org/doc/airline.html
- https://ibmhadoop.challengepost.com/details/data
- http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/