This is an online repository of high-dimensional biomedical data sets, including gene expression data, protein profiling data, and genomic sequence data, that are related to classification and were recently published in Science, Nature, and other prestigious journals. These biomedical applications also pose challenging problems for the machine learning and data mining community. Because the file formats of the original raw data differ from those commonly used by most machine learning software, we have transformed the data sets into the standard .data and .names format and stored them in this repository. We also provide the data in .arff format, which is used by Weka (a machine learning software package developed at the University of Waikato in New Zealand). Weka is written in Java and is open source software issued under the GNU General Public License. Detailed documentation of Weka can be found at http://www.cs.waikato.ac.nz/~ml/weka/.
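To illustrate how the two formats relate, here is a minimal Python sketch that turns a .names/.data pair into ARFF text. It assumes the C4.5-style convention (the first entry in .names lists the class values, each later entry is "attribute: continuous." or "attribute: v1, v2.", and each .data row is comma-separated with the class label last). The actual layout of the files in this repository may differ, and the function names, the relation name "toy", and the sample values below are hypothetical.

```python
def parse_names(names_text):
    """Parse C4.5-style .names text into (class_values, attributes).

    Assumed layout (not confirmed for this repository): the first entry lists
    the class values; each following entry is 'name: continuous.' or
    'name: v1, v2, ...'.
    """
    entries = []
    for raw in names_text.splitlines():
        line = raw.split("|", 1)[0].strip()  # '|' starts a comment in C4.5 files
        if line:
            entries.append(line.rstrip("."))  # drop the terminating period
    class_values = [v.strip() for v in entries[0].split(",")]
    attributes = []
    for entry in entries[1:]:
        name, spec = entry.split(":", 1)
        spec = spec.strip()
        # None marks a numeric attribute; otherwise keep the nominal values.
        values = None if spec == "continuous" else [v.strip() for v in spec.split(",")]
        attributes.append((name.strip(), values))
    return class_values, attributes


def to_arff(relation, names_text, data_text):
    """Build ARFF text from .names/.data text, with the class as last attribute.

    Assumes attribute names contain no spaces (ARFF would require quoting).
    """
    class_values, attributes = parse_names(names_text)
    out = ["@relation " + relation, ""]
    for name, values in attributes:
        kind = "numeric" if values is None else "{" + ",".join(values) + "}"
        out.append("@attribute " + name + " " + kind)
    out.append("@attribute class {" + ",".join(class_values) + "}")
    out += ["", "@data"]
    for raw in data_text.splitlines():
        row = raw.strip()
        if row:
            # Normalise spacing; '?' (missing value) is the same in both formats.
            out.append(",".join(field.strip() for field in row.split(",")))
    return "\n".join(out) + "\n"


# Hypothetical two-gene example, for illustration only.
names = "tumor, normal.\ngene1: continuous.\ngene2: continuous.\n"
data = "1.5, -0.2, tumor\n0.3, ?, normal\n"
arff_text = to_arff("toy", names, data)
print(arff_text)
```

The sketch works on strings so that it can be tried without any files; in practice the two inputs would be read from the downloaded .names and .data files.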
We hope this reformatting will save considerable data pre-processing time for users who want to evaluate their algorithms on a group of benchmark high-dimensional biomedical data sets. Here, we also introduce our CS4 algorithm for the efficient discovery of many diversified decision trees.