3 clustering algorithms implemented and tested on 4 datasets.
Parallel Implementation of K Means on Hadoop Streaming.
Dataset | Objects | Number of Clusters |
---|---|---|
cho | 386 | 5 |
iyer | 517 | 10 |
demo-dataset-1 | 150 | 3 |
demo-dataset-2 | 6 | 2 |
Each row represents a gene:
- First column is gene_id.
- Second column is the ground truth cluster. "-1" represents outliers.
- Rest of the columns represent gene's expression values (attributes).