Yan Ma


This thesis explores the usability and efficiency of tree ensemble learning in three research areas: multi-modal biometric information fusion, software defect prediction, and microarray data analysis. Data sets from these areas differ in sample size, number of features, and class label distribution. For example, microarray experiments produce high-dimensional data with at least thousands of variables (genes), while data sets from biometrics and software engineering studies are large, with hundreds to thousands of observations, the majority of which share the same class label. Each data structure places its own requirements on the learning algorithm. Whatever those requirements are, the ultimate goal is to achieve the highest possible prediction accuracy. This thesis centers on making the best use of tree ensemble learning methodologies in biometrics, software engineering, and microarray research. Tree ensembles can also be used to analyze time-course microarrays. When a toxicological microarray experiment involves both treatment and time, it is of interest to find genes whose expression differs significantly among toxins (or over time), but the principal interest is in the toxin-based interactions of gene response patterns over time. It is possible to decompose the toxin levels, time points, and toxin-time interactions into contrasts and to apply tree ensembles as a supervised learning technique to the groups defined by each contrast. The variables that are important in distinguishing between the groups specified by the treatment-by-time interaction contrast are consistent with the genes showing a significant interaction effect. The procedure is most interpretable when the factors have two levels, that is, in a 2^k factorial design. This approach is tested on a simulated data set based on data obtained from a propanil study.
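
The contrast decomposition described above can be illustrated with a minimal sketch for a two-level factorial design. It is not the thesis's implementation: the function name `factorial_contrasts`, the factor names, and the -1/+1 level coding are assumptions made for illustration. The sketch only builds the two-group labels that each contrast defines; those labels would then serve as the class variable for a tree ensemble classifier.

```python
from itertools import combinations, product
from math import prod


def factorial_contrasts(factors):
    """Map each contrast of a full two-level design to a sign per design cell.

    `factors` is a list of factor names (hypothetical). Each design cell is a
    tuple of -1/+1 levels, one per factor. A main-effect contrast takes the
    sign of that factor's level; an interaction contrast takes the product of
    the signs of the factors involved.
    """
    k = len(factors)
    cells = list(product((-1, 1), repeat=k))
    contrasts = {}
    for order in range(1, k + 1):
        for idx in combinations(range(k), order):
            name = ":".join(factors[i] for i in idx)
            contrasts[name] = {cell: prod(cell[i] for i in idx) for cell in cells}
    return contrasts


# Two factors (toxin, time), each at two levels: a 2^2 design.
contrasts = factorial_contrasts(["toxin", "time"])

# One sample per design cell, written as (toxin level, time level).
samples = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# Class labels under the interaction contrast: the two groups a classifier
# would be trained to separate when screening for interaction genes.
interaction_labels = [contrasts["toxin:time"][cell] for cell in samples]
```

Under this coding, the interaction contrast groups the cells in which both factors sit at the same level against the cells in which they differ, so genes that separate those two groups are candidates for a toxin-by-time interaction.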