Eberly College of Arts and Sciences


Forensic and Investigative Science

Committee Chair

Glen P. Jackson

Committee Member

Jacqueline Speir

Committee Member

Stephen Valentine


Current search algorithms that identify substances from their electron ionization mass spectra alone rank the correct compound as the top result approximately 80% of the time. One contributing factor to the ~20% shortfall in the first-hit recognition rate is that traditional algorithms compare the unknown spectrum to an ‘ideal’ or consensus spectrum of each reference compound. Including replicate reference spectra in a database has been shown to improve the probability of ranking the correct identity first, but the variance in ion abundances caused by different conditions or different instruments remains an intractable problem and the major source of uncertainty in mass spectral identification.

To assess the relative contributions of different factors to the spectral variance of replicate spectra, this study initially considered the repeller voltage, focus lens voltage, and ion energy as primary parameters. A three-factor, three-level, full-factorial design of experiments was conducted using cocaine as a model compound. A library of cocaine spectra was collected with a gas chromatography-electron ionization-mass spectrometer (GC-EI-MS) by extracting each spectrum across the eluting peak. The abundances of the 20 most abundant ions in the library of cocaine spectra were extracted, and multivariate analysis of variance (MANOVA) was performed to assess the contribution of each instrument parameter to the variance in ion abundances. Results showed that these instrument parameters were responsible for only ~3% of the total variance in the normalized abundances. This initial finding prompted a subsequent study that monitored the branching ratios of cocaine during random fluctuations in the vacuum chamber pressure. Random changes in vacuum pressure accounted for ~90% of the natural variance in the relative ion abundances of the two most abundant peaks of cocaine (not including the base peak).
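
As an illustration of the experimental design described above, the following minimal sketch enumerates the runs of a three-factor, three-level, full-factorial design. The voltage levels and the names `levels` and `full_factorial` are hypothetical placeholders chosen for illustration; the abstract does not state the actual instrument settings.

```python
from itertools import product

# Hypothetical level settings in volts. The abstract names the three factors
# but does not give the values that were actually used.
levels = {
    "repeller_V": (20.0, 25.0, 30.0),
    "focus_lens_V": (80.0, 90.0, 100.0),
    "ion_energy_V": (4.0, 5.0, 6.0),
}


def full_factorial(levels):
    """Return every combination of factor levels as a list of dicts."""
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


runs = full_factorial(levels)
print(len(runs))  # 3 levels ** 3 factors = 27 runs
print(runs[0])
```

Each of the 27 settings would then be run (typically with replicates and in randomized order) before the ion abundances are passed to a MANOVA.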

The database of 389 cocaine spectra was then used to compare the traditional consensus approaches to spectral matching with two variants of a novel algorithm called the Expert Algorithm for Substance Identification (EASI). EASI uses multivariate linear modeling to predict the abundances of 20 ions in each spectrum, assuming that each of the 20 ion abundances is continuously dependent on the other 19 ion abundances. One variant of this model includes intercepts in the linear models, and the other does not. To assess the effect of spectral variance on spectral identifications, traditional measures of spectral similarity or dissimilarity were calculated between each query spectrum and the consensus cocaine spectrum, including the Pearson product-moment correlation (PPMC) coefficients, mean absolute residuals (MARs), Euclidean distances, and NIST scores. These metrics were then used as binary classifiers to obtain true positives, true negatives, false positives, and false negatives at a range of decision thresholds. The models were tested on a database of spectra that included more than 300 cocaine spectra from different laboratories, more than 700 spectra of 5 common drugs, and 10 spectra of cocaine diastereomers: allococaine, pseudococaine, and pseudoallococaine. The EASI models outperformed the consensus approach on every metric. EASI coupled with the PPMC values, MARs, and Euclidean distances had accuracies greater than 90% with zero false positives, including spectra of cocaine diastereomers and cocaine collected on different instruments. The Mahalanobis distances to the training set were also assessed as a binary classifier, and they were found to be as good as or better than EASI at discriminating between cocaine and non-cocaine spectra.
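
The modeling idea described above can be sketched as follows. This is not the thesis implementation: it is a minimal illustration that assumes ordinary least squares for each per-ion model, and it uses synthetic abundances in place of real cocaine spectra. The function names (`fit_easi`, `predict_easi`, `similarity`) and all data values are hypothetical.

```python
import numpy as np


def fit_easi(train, intercept=True):
    """Fit one least-squares model per ion, predicting it from the other ions."""
    n_spectra, n_ions = train.shape
    models = []
    for j in range(n_ions):
        X = np.delete(train, j, axis=1)          # the other n_ions - 1 abundances
        if intercept:
            X = np.column_stack([np.ones(n_spectra), X])
        coef, *_ = np.linalg.lstsq(X, train[:, j], rcond=None)
        models.append(coef)
    return models


def predict_easi(models, spectrum, intercept=True):
    """Predict every ion abundance of one spectrum from its remaining ions."""
    predicted = np.empty(spectrum.size)
    for j, coef in enumerate(models):
        x = np.delete(spectrum, j)
        if intercept:
            x = np.concatenate([[1.0], x])
        predicted[j] = x @ coef
    return predicted


def similarity(measured, predicted):
    """Return (PPMC, mean absolute residual, Euclidean distance)."""
    residual = measured - predicted
    ppmc = np.corrcoef(measured, predicted)[0, 1]
    return ppmc, np.mean(np.abs(residual)), np.linalg.norm(residual)


# Synthetic stand-in data: 20 ion abundances that co-vary with one latent
# factor (loosely analogous to a drifting source condition) plus noise.
rng = np.random.default_rng(0)
base = rng.uniform(0.05, 1.0, 20)
slope = rng.normal(0.0, 0.1, 20)


def simulate(pattern, n):
    t = rng.normal(size=(n, 1))
    return pattern + t * slope + rng.normal(0.0, 0.002, (n, 20))


train = simulate(base, 200)
same_class = simulate(base, 1)[0]          # same fragmentation pattern
other_class = simulate(base[::-1], 1)[0]   # a different pattern

for use_intercept in (True, False):
    models = fit_easi(train, intercept=use_intercept)
    _, mar_same, _ = similarity(
        same_class, predict_easi(models, same_class, use_intercept))
    _, mar_other, _ = similarity(
        other_class, predict_easi(models, other_class, use_intercept))
    print(use_intercept, mar_same, mar_other)
```

In this sketch the query is compared with its own model-predicted spectrum rather than with a fixed consensus spectrum, so a spectrum whose ions co-vary as in the training set yields small residuals even when its absolute abundances have drifted.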

Each measure of spectral similarity was used to build receiver operating characteristic (ROC) curves and calculate the area under the ROC curve (AUC). When taking only the cocaine diastereomers as known negatives, EASI without a constant (intercept) had the highest AUC (0.925), followed by EASI including a constant (AUC=0.907), and lastly the consensus model (AUC=0.829). This work shows that random variations in vacuum pressure are responsible for most of the short-term variance in replicate mass spectra and that a model (EASI) that accounts for cross-correlations between the different fragment ions allows superior compound identification to traditional algorithms.
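
The classification and ROC steps described above can be illustrated with a short sketch. It assumes that a higher score means a better match (a dissimilarity such as MAR or Euclidean distance would be negated first), and it computes the AUC through its rank interpretation: the probability that a randomly chosen positive scores higher than a randomly chosen negative. The scores below are made-up numbers, not results from the thesis, and the function names are hypothetical.

```python
import numpy as np


def confusion_counts(scores_pos, scores_neg, threshold):
    """Count (TP, TN, FP, FN) when 'score >= threshold' means 'identified'."""
    pos = np.asarray(scores_pos, dtype=float)
    neg = np.asarray(scores_neg, dtype=float)
    tp = int(np.sum(pos >= threshold))
    fp = int(np.sum(neg >= threshold))
    return tp, neg.size - fp, fp, pos.size - tp


def roc_auc(scores_pos, scores_neg):
    """AUC as P(positive score > negative score), counting ties as one half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


# Made-up similarity scores for known positives and known negatives.
positives = [0.99, 0.97, 0.95, 0.90]
negatives = [0.96, 0.80, 0.70]

print(confusion_counts(positives, negatives, threshold=0.94))  # (3, 2, 1, 1)
print(roc_auc(positives, negatives))  # 10 of 12 pairs ranked correctly
```

Sweeping `threshold` over the observed scores and plotting the true positive rate against the false positive rate traces the ROC curve whose area `roc_auc` returns.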