Kent Ridge Bio-medical Dataset

This is an online repository of high-dimensional biomedical data sets, including gene expression, protein profiling, and genomic sequence data, that relate to classification and were recently published in Science, Nature, and other prestigious journals. These biomedical applications also pose challenging problems for the machine learning and data mining community. Because the file formats of the original raw data differ from those used by most machine learning software, we have transformed the data sets into the standard .data and .names format and stored them in this repository. We also provide the data in .arff format, which is used by Weka (a machine learning software package developed at the University of Waikato in New Zealand). Weka is written in Java and is open source software issued under the GNU General Public License. Detailed documentation of Weka can be found at http://www.cs.waikato.ac.nz/~ml/weka/.
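To illustrate how the two formats relate, the following is a minimal sketch of converting a .names/.data pair into ARFF text. It assumes the C4.5-style conventions (class labels on the first line of the .names file, one "name: type." line per attribute, "|" starting a comment, and the class label as the last field of each .data row). The function names, the attribute names, and the sample values are hypothetical and are not taken from the repository files.

```python
def names_to_arff_header(names_text, relation="dataset"):
    """Translate a C4.5-style .names description into an ARFF header (assumed layout)."""
    lines = []
    for raw in names_text.splitlines():
        # Drop "|" comments, surrounding whitespace, and the terminating period.
        line = raw.split("|", 1)[0].strip().rstrip(".").strip()
        if line:
            lines.append(line)
    classes = [c.strip() for c in lines[0].split(",")]
    out = [f"@relation {relation}"]
    for line in lines[1:]:
        name, spec = (part.strip() for part in line.split(":", 1))
        if spec == "continuous":
            out.append(f"@attribute {name} numeric")
        else:
            values = ",".join(v.strip() for v in spec.split(","))
            out.append(f"@attribute {name} {{{values}}}")
    # In this layout the class label is the last column of every data row.
    out.append("@attribute class {" + ",".join(classes) + "}")
    return "\n".join(out)


def to_arff(names_text, data_text, relation="dataset"):
    """Join the generated header with the comma-separated rows of the .data file."""
    rows = [ln.strip() for ln in data_text.splitlines() if ln.strip()]
    return names_to_arff_header(names_text, relation) + "\n@data\n" + "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Hypothetical two-sample example, not real repository data.
    names = "ALL, AML.  | class labels\ngene1: continuous.\ntissue: BM, PB.\n"
    data = "0.5,BM,ALL\n1.7,PB,AML\n"
    print(to_arff(names, data, relation="leukemia"))
```

The sketch does not quote attribute names, so names containing spaces or special characters would need extra handling before Weka could read the result.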

We hope this reformatting saves users considerable pre-processing time when they evaluate their algorithms on a group of benchmark high-dimensional biomedical data sets. Here we also introduce our CS4 algorithm for the efficient discovery of many diversified decision trees.


List of data sets


CS4: A new ensemble method based on decision trees

CS4 stands for Cascading-and-Sharing for ensembles of decision trees. It is our new approach to constructing committees of decision trees. Unlike widely used ensemble approaches such as Bagging or Boosting, where bootstrapped training data are used, our CS4 algorithm always keeps the original training data unchanged. The main idea is to build a committee of trees by using a different top-ranked feature as the root node of each tree. Two of our recent publications describe the use of CS4 in bioinformatics and data mining: one is in the Bioinformatics supplementary issue for the European Conference on Computational Biology (ECCB) 2003, and the other is in the proceedings of the Third IEEE International Conference on Data Mining (ICDM'03). Please contact us if you require the source code of CS4.
Contact: Jinyan Li or Huiqing Liu
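
The root-forcing idea described above can be sketched in a few lines of Python. This is not the authors' implementation: the page does not specify the ranking criterion, the tree learner, or how the committee combines its predictions, so the sketch assumes information gain on numeric threshold splits, a plain recursive tree, and simple majority voting. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels, feature):
    """Return (gain, threshold) of the best binary split on one numeric feature."""
    values = sorted(set(r[feature] for r in rows))
    base, best = entropy(labels), (0.0, None)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[0]:
            best = (gain, t)
    return best


def build_tree(rows, labels, forced_feature=None):
    """Grow a tree on the unchanged data; a forced feature applies to the root only."""
    if len(set(labels)) == 1:
        return labels[0]
    candidates = [forced_feature] if forced_feature is not None else range(len(rows[0]))
    gain, t, f = max((best_split(rows, labels, c) + (c,) for c in candidates),
                     key=lambda item: item[0])
    if t is None:  # no useful split: fall back to a majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    # Subtrees are grown without any forced feature.
    return (f, t, build_tree(*zip(*left)), build_tree(*zip(*right)))


def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def cs4_committee(rows, labels, k):
    """Build k trees; the i-th tree is rooted at the i-th top-ranked feature."""
    ranked = sorted(range(len(rows[0])),
                    key=lambda f: best_split(rows, labels, f)[0], reverse=True)
    return [build_tree(rows, labels, forced_feature=f) for f in ranked[:k]]


def vote(committee, row):
    # Assumption: unweighted majority vote over the committee members.
    return Counter(predict(tree, row) for tree in committee).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: three numeric features, two classes.
    rows = [(1.0, 5.0, 0.3), (1.2, 4.8, 0.9), (3.1, 1.0, 0.4), (2.9, 1.2, 0.8)]
    labels = ["ALL", "ALL", "AML", "AML"]
    committee = cs4_committee(rows, labels, k=2)
    print(vote(committee, (1.1, 5.0, 0.5)))
```

Every tree in the committee is trained on the same, unmodified rows; the diversity comes only from the choice of root feature, which is the contrast with bootstrapping drawn in the description above.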