

An Introduction to Chemometrics

Good afternoon. The title of my talk today is An Introduction to Chemometrics. My agenda begins with this introduction, followed by definitions of some of the specialized terms used in chemometrics. Then, I'll discuss the goals of chemometrics, followed by a general methodology and a practical example from the literature (Saxberg, Duewer, Booker and Kowalski, 1978). I'll finish with some pitfalls that a chemometrics user might be vulnerable to and a summary.

A natural place to begin an introduction to chemometrics is with a few definitions. This will be useful for two reasons: to make clear some key terms which occur frequently in the literature and in this talk, and to give you a brief look at a few of the concepts which chemometrics frequently uses.

Chemometrics itself can be defined as the application of mathematical, statistical, graphical or symbolic methods to maximize the chemical information which can be extracted from data. Chemometric procedures can prove useful at any point in an analysis, from the first conception of an experiment, until the data is discarded (if ever).

The term multivariate analysis, as usually applied by chemometricians, defines any statistical, mathematical or graphical approach which considers multiple variables simultaneously. This is slightly different from the standard statisticians' definition of multivariate analysis, which requires that multiple dependent variables be considered simultaneously in the analysis. Chemists occasionally perform classical multivariate analysis, but usually, when a chemist refers to multivariate analysis, he is referring to the more general definition. A few researchers have suggested that chemists should refer to what they do as multivariable analysis. While this would clear up occasional confusion, this suggestion has not been widely adopted. For this presentation, the chemometric definition of multivariate analysis will be used.

Pattern recognition approaches seek to identify similarities and regularities present in data. A major subset of pattern recognition is cluster analysis. Cluster analysis seeks to perceive natural classifications, often called clusters, in data. Pattern recognition and cluster analysis problems are usually trivial in two or three dimensions, since people are excellent at such discriminations when they can plot the data. When more dimensions are involved, computers are usually used to assist.

The concept of n-dimensional space, also known as hyperspace, is used when dealing with multidimensional systems. Hyperspace is just the logical extension of three-dimensional space, to reflect more complex multivariable situations. The prefix hyper- is used extensively in chemometric discussions to refer to anything which extends beyond 3-dimensional space. Reference might be made to a 10-dimensional hyperplane, for instance, which exists in 11-dimensional hyperspace, much like a 2-dimensional plane exists in 3-dimensional space. Practical problems which cannot be solved in 3 dimensions can often be handled using the information provided by additional dimensions, with the help of cluster analysis and pattern recognition.

The concept of hyperspace disturbs some potential users of chemometric techniques. They are often uncomfortable dealing with situations that they cannot directly visualize. Many cluster analysis techniques address this problem by attempting to bring the problem back to a three-dimensional environment that can be directly perceived. This cannot always be done, but any simplification obtained helps the researcher to better understand his data. Another important point is that the mathematical and statistical techniques used for chemometrics can almost always be extended into hyperspace. There is no mathematical distinction between space and hyperspace. The only reason for such a distinction is that people can perceive space directly, but cannot perceive hyperspace. So, although computers aren't especially good at cluster analysis, they can deal with hyperspace. By using computers to bring the problem back down into normal space, we cooperate with them to solve the problem efficiently.

One of the primary goals of chemometrics is to reduce the number of dimensions needed to accurately portray the characteristics of the data set. There are a wide variety of methods available to do this, either by selecting an important subset of the original variables, or by creating a set of new variables, which are more efficient than the original variables in describing the data. The creation of new variables can be approached in several ways; two of these are projection and mapping. Projection is the more common technique and involves using weighted linear combinations of the original variables to define a new, smaller set of variables which contain nearly the same information content as the original variables. The most frequently used projection technique is principal components analysis, which I'll discuss later. Mapping is similar to projection, but the transformations considered in this case are non-linear. These non-linear methods often seek to preserve certain properties in the data, such as interpoint distances, while performing a dimensional reduction. Mapping results can be difficult to interpret, however, and aren't as important as the projection methods.

Traditionally, the most frequently used chemometric techniques have been in the area of cluster analysis and pattern recognition, which I have already briefly mentioned. But there are other techniques fitting the definition of chemometrics that have been of great use in chemical applications. At this point, I want to present a rough classification scheme for chemometric analysis. It contains four areas that chemists are concerned with. Under each of these, I have included the subclasses pertinent to them. There is a fair amount of crossover in this scheme, but it helps to delineate what chemometrics has to offer to chemists.

One area in this classification scheme contains those procedures which help the researcher gather good data. The second set of techniques, which has some methods in common with the previous set, assists extraction of valuable information from good data. Notice the underline under "good". Bad data has little or no valuable information associated with it and chemometrics cannot help such data. A third class of growing importance to chemists is spectral library matching and comparison. Finally, artificial intelligence applications in chemistry have been considered by many to be chemometric techniques.

Scientists want to gather good data. When considering experiments, they want to learn something and generally want to learn it as economically in time and effort as possible. Chemometrics can assist in several ways.

In some sense, most chemometric techniques are statistical in nature; a separate statistical category serves as a catch-all for those techniques which don't comprise a category of their own in chemometrics. An example of a technique, very important to chemists, which I would place here is the statistical design of experiments. Designed experiments offer a maximum amount of information for a minimum amount of effort and should always be considered when an experiment is being conceived. The experimenter plans carefully the measurements he will make, consistent with known experimental designs. By conducting an analysis of variance (ANOVA) on these results, he can draw inferences using at or near the minimum number of experiments. Except for the most trivial of problems, it is worthwhile to consider using designed experiments.
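As a rough illustration of what a designed experiment looks like in practice, here is a minimal Python sketch of a two-level full factorial design and the main effects estimated from it. It stops short of a full analysis of variance. The function names, the two factors and the response values are all invented for illustration; they do not come from any study mentioned in this talk.

```python
from itertools import product


def full_factorial(n_factors):
    """All runs of a two-level full factorial design, coded as -1 / +1."""
    return [list(levels) for levels in product((-1, 1), repeat=n_factors)]


def main_effects(design, responses):
    """Main effect of each factor: mean response at +1 minus mean at -1."""
    effects = []
    for j in range(len(design[0])):
        high = [y for run, y in zip(design, responses) if run[j] == 1]
        low = [y for run, y in zip(design, responses) if run[j] == -1]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects


# Hypothetical two-factor experiment (say, temperature and flow rate).
# The responses are made-up numbers, listed in the same order as the runs.
design = full_factorial(2)
responses = [10.0, 12.0, 20.0, 22.0]
effects = main_effects(design, responses)
print(design)   # the four runs to perform
print(effects)  # estimated effect of each factor on the response
```

With only four runs, both factors are varied together, yet the effect of each one can still be estimated separately. That is the economy the design buys.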

Optimization of experimental parameters is often performed to improve the sensitivity and precision of chemical analyses. Manipulation of the parameters can either be guided by assumed mathematical models of system behavior or by iterative methods, such as a simplex algorithm. By finding the optimum experimental conditions under which to run the experiment, the quality of the data gathered is improved.
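To give a feel for how an iterative simplex search proceeds, here is a minimal Python sketch of a simplified Nelder-Mead procedure. The choice of Nelder-Mead is my assumption, since several simplex variants exist, and the response surface is a made-up function with its optimum at settings (2, 1). In a real application each function evaluation would be an actual experiment.

```python
def nelder_mead(f, start, step=0.5, tol=1e-6, max_iter=500):
    """Minimize f with a simplified Nelder-Mead simplex search."""
    n = len(start)
    # Initial simplex: the start point plus one vertex offset along each axis.
    simplex = [list(start)]
    for i in range(n):
        vertex = list(start)
        vertex[i] += step
        simplex.append(vertex)

    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Stop once every vertex has closed in on the best one.
        size = max(abs(x - b) for p in simplex[1:] for x, b in zip(p, best))
        if size < tol:
            break
        # Centroid of all vertices except the worst.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            # Reflection beat the best vertex: try going further.
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            # Pull the worst vertex toward the centroid, or shrink everything.
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)


def negative_response(settings):
    """Hypothetical instrument response, negated so that minimizing maximizes it."""
    x, y = settings
    return -(100.0 - (x - 2.0) ** 2 - (y - 1.0) ** 2)


optimum = nelder_mead(negative_response, [0.0, 0.0])
print(optimum)  # settings close to the assumed optimum
```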

Calibration techniques relate instrumental response to chemical concentrations. Improper calibration techniques will yield improper solutions to problems. Linear, non-linear and multivariate calibration algorithms are available, depending on the instrument involved and the analysis requirements.
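The simplest case, a straight-line calibration fitted by ordinary least squares, can be sketched in a few lines of Python. The standards and responses below are made-up numbers, and the function names are my own. A real calibration would also need to check that the linear model is appropriate over the working range.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_concentration(response, slope, intercept):
    """Invert the calibration line to estimate a concentration from a response."""
    return (response - intercept) / slope


# Made-up calibration standards: concentration versus instrument response.
concentrations = [0.0, 1.0, 2.0, 3.0, 4.0]
responses = [0.1, 2.1, 4.1, 6.1, 8.1]

slope, intercept = fit_line(concentrations, responses)
unknown = predict_concentration(5.1, slope, intercept)
print(slope, intercept, unknown)
```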

Resolution techniques deal with distinguishing between various component parts of the system. Resolution problems most often occur when peaks overlap in chromatography and can be dealt with using methods such as least squares peak and curve resolution and Fourier spectral deconvolution.

Signal processing methods are very similar to those used for resolution. Signal processing seeks to distinguish between signal and noise, removing the latter, while resolution attempts to distinguish between multiple signal components present in data. If noise is considered as one of the components of a signal, signal processing is just a subclass of resolution. It is usually considered a separate entity in the literature, however. Besides using many of the same techniques as resolution, signal processing also uses various forms of spectral filtering such as least squares polynomial and Kalman filtering.
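Least squares polynomial filtering can be illustrated with a short Python sketch. It uses the well-known five-point quadratic smoothing weights (-3, 12, 17, 12, -3)/35, usually attributed to Savitzky and Golay. The window width and the test signals are my own choices, and a real application would pick the width to suit the peak shapes involved.

```python
# Five-point quadratic least-squares smoothing weights and their normalizer.
WEIGHTS = (-3, 12, 17, 12, -3)
NORM = 35


def smooth5(signal):
    """Smooth a signal with a 5-point least-squares polynomial filter.

    The two points at each end are left unchanged, because a full window
    is not available there.
    """
    out = list(signal)
    for i in range(2, len(signal) - 2):
        window = signal[i - 2:i + 3]
        out[i] = sum(w * v for w, v in zip(WEIGHTS, window)) / NORM
    return out


# A single noise spike is cut roughly in half...
print(smooth5([0, 0, 35, 0, 0]))
# ...while a smooth quadratic curve passes through unchanged.
print(smooth5([0, 1, 4, 9, 16, 25, 36]))
```

The second call shows why this kind of filter is attractive: it suppresses isolated noise without distorting features that a low-order polynomial describes well.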

Once good data has been gathered, a researcher wants to do the best job he can to pull information from it. If he is working with good, pertinent data, chemometrics can often help him answer the questions that made him take it in the first place.

Statistics is also used here as a catch-all category. Two methods that fit here are error propagation and distributional property assessment. Error propagation techniques allow the quality of the results an analysis generates to be monitored. They can keep an analyst from wasting time on data that doesn't merit it. Distributional property assessment allows the analyst to begin learning the nature of his data. The question which is usually of interest here is "How close do the individual variables come to following a Gaussian distribution?" When the variables of interest are nearly Gaussian, a researcher can feel comfortable about using techniques which work poorly when the data is not normally distributed.
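As one concrete form of error propagation, here is a minimal Python sketch of the usual first-order rule, in which the variance of a result is the sum of the squared partial derivatives times the input variances. It assumes the input errors are independent and small, and it estimates the derivatives numerically. The concentration example and its uncertainties are invented for illustration.

```python
import math


def propagate_error(f, values, sigmas, rel_step=1e-6):
    """First-order error propagation, assuming independent input errors.

    sigma_f**2 = sum((df/dx_i * sigma_i)**2), with each partial derivative
    estimated by a central finite difference.
    """
    variance = 0.0
    for i, (value, sigma) in enumerate(zip(values, sigmas)):
        h = rel_step * (abs(value) if value != 0 else 1.0)
        up = list(values)
        down = list(values)
        up[i] = value + h
        down[i] = value - h
        derivative = (f(*up) - f(*down)) / (2.0 * h)
        variance += (derivative * sigma) ** 2
    return math.sqrt(variance)


def concentration(mass, volume):
    """Hypothetical result: concentration computed as mass over volume."""
    return mass / volume


# Made-up measurements: each input carries a 1% standard deviation.
sigma_c = propagate_error(concentration, [10.0, 2.0], [0.1, 0.02])
print(sigma_c)  # about 1.4% of the concentration of 5.0
```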

Modeling and parameter estimation defines a system in terms that are more easily interpreted than are tables of raw data. Many chemometric methods are concerned with modeling and parameter estimation. Calibration, resolution and signal processing can all be considered to be modeling and parameter estimation techniques, although it is more useful for our purposes to separate them. Such general parameters as means and standard deviations fall into this category, as do all forms of regression analysis. Simulation techniques also can be included here.

A special set of methods is used to estimate structure-property and structure-activity relationships. The goal here is to predict physical, chemical or biological properties of a substance, based on its chemical composition. Techniques used include molecular connectivities, topological distance calculations and autocorrelation functions.

Principal components analysis is among the most versatile of all chemometric methods. It seeks to capture the variance information present in a data set in as few new dimensions as possible. Graphically, principal components analysis twists the axes of the data to conform to axes which contain a maximum amount of variance information; this can be thought of as looking at the data from a different perspective in hyperspace. A two-dimensional example is shown here. The first new dimension falls on the longest axis of variance, while the second new dimension contains the rest. As much variance as possible is forced into x-prime. The rest is spanned by y-prime. Mathematically, this is a simple linear algebra transformation for any reasonable number of dimensions. Factor analysis modifies the results of principal components analysis in various ways in an effort to come up with more interpretable factors. Both principal components analysis and factor analysis help to reduce the number of variables which need to be considered in an analysis, and often yield a new set of variables which describe an important, but unmeasurable, property of the system.
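The two-dimensional case is small enough to work out directly. The Python sketch below rotates the centered data onto the axis of greatest variance, using the closed-form rotation angle for a 2-by-2 covariance matrix. The data points are invented, and for more than two variables one would use an eigenvalue or singular value routine from a numerical library instead.

```python
import math


def pca_2d(points):
    """Rotate 2-D data onto its principal axes.

    Returns the rotation angle, the scores (x-prime, y-prime) of each point,
    and the variance along each new axis.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the major axis of the covariance ellipse.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    scores = [(c * x + s * y, -s * x + c * y) for x, y in centered]
    var1 = sum(a * a for a, _ in scores) / (n - 1)
    var2 = sum(b * b for _, b in scores) / (n - 1)
    return theta, scores, (var1, var2)


# Made-up, strongly correlated measurements of two variables.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8), (5.0, 5.0)]
theta, scores, variances = pca_2d(points)
share = variances[0] / (variances[0] + variances[1])
print(math.degrees(theta), share)  # rotation angle and share of variance in x-prime
```

When the first axis carries nearly all of the variance, as it does for data this strongly correlated, the second axis can often be dropped with little loss of information.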

Pattern recognition techniques seek to identify regularities and similarities which are present in data. Mathematical pattern recognition is often confused with optical pattern recognition, which seeks to teach machines to evaluate optical data. While these techniques do have some common bases, they are quite different in application, methodology and terminology. Many techniques are included in the category of pattern recognition. Among these are direct two and three dimensional plots, projection and mapping, cluster and discriminant analysis.

A third major class of techniques, of great interest in recent years, is spectral library matching and comparison. These techniques seek to efficiently elucidate chemical structures from spectral data. They include nearest neighbor and distance measures, correlation analysis, probability matching, Fourier and principal components analysis.

A final class, just beginning to catch hold, is the field of artificial intelligence. Artificial intelligence aims to provide silicon assistants, colleagues or experts to assist in solving chemical problems. The coming barrage of smart instruments, data bases, and robots will be some of the results of continuing research in AI. Expect self-optimizing instruments, automated structural elucidation and perhaps even automatic chemometric analysis from efforts in this field.

For chemometrics to be of much use to an analyst, it must be more than a hit or miss venture. Some sort of methodology is usually needed to prevent wasted time and effort. Although chemometrics is a broad field, a general strategy can be given. This methodology gives you a feel for what sorts of considerations you would have in a chemometric application.

First, plan the experiment yourself, if possible. The best results can be obtained when you have an intimate knowledge of the conditions under which the experiment was run. Try to use statistically designed experiments to efficiently gather the most information for the least effort. As part of this process, go to the literature and see if anyone else has tried a similar kind of experiment. Specific, well-defined methodologies exist for some classes of problems. Two prominent examples are the books of Parsons and Wolff in pattern recognition and Malinowski and Howery in factor analysis. You should also consider carefully what you want to learn when you decide what chemical methods to use and what variables you will measure. Ignoring critical variables throws away information that no technique can recover.

Next, begin to take data. In the early stages, in particular, pay close attention to data quality. Depending on the kind of data you are gathering, you may be able to optimize machine response, increase resolution and signal process to reduce noise. You should also watch for unusual events. In other words, take any step possible to maintain and improve the quality of your data. As part of this, it is vital that you observe and record all the data for every sample that you consider. Missing values severely degrade the performance of multivariate techniques; in most cases, it might be best to remove the whole observation from the data set, rather than attempt some kind of missing value correction.
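Removing incomplete observations is simple to do in code. This small Python sketch assumes that each observation is a list of values and that a missing value is recorded either as None or as a floating-point NaN; both conventions are my assumptions.

```python
def drop_incomplete(rows):
    """Keep only the observations with no missing values.

    A value counts as missing if it is None or NaN (NaN never equals itself).
    """
    return [row for row in rows if all(v is not None and v == v for v in row)]


# Made-up data set: two of the four observations have a missing measurement.
rows = [
    [1.0, 2.0],
    [None, 3.0],
    [4.0, float("nan")],
    [5.0, 6.0],
]
complete = drop_incomplete(rows)
print(complete)
```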

Once the data has been taken, it is often necessary to transfer it somewhere else, usually to a mainframe computer. If possible, avoid manual data entry, since this method is more prone to error. Always verify transferred data carefully. Unusual data values should be investigated, to eliminate any errors in data transfer, since many of the chemometric algorithms give poor results for data that contains outliers.

A step which fits well with data verification involves developing a basic knowledge of the character of the data set. Means, standard deviations, ranges and other statistical measures should be computed for each of the variables in the data set. Again look for outliers and see if the results you obtain are consistent with what you expected. Check to see if each variable roughly conforms to a Gaussian distribution. Also, look at the correlation coefficients between the variables, to see if there are strong linear relationships present. This process of getting familiar with your data will usually be of value later in the analysis. Try to gather any information that might prove pertinent.
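These first-look statistics take only a few lines of Python. The sketch below computes the basic summary measures for one variable and the correlation coefficient between two. The skewness is included as one rough, informal indicator of departure from a Gaussian shape, which is my choice of check. The numbers are made up.

```python
import math


def describe(values):
    """Basic summary statistics for one variable."""
    n = len(values)
    mean = sum(values) / n
    squares = sum((v - mean) ** 2 for v in values)
    cubes = sum((v - mean) ** 3 for v in values)
    # Moment-based skewness; it is near zero for symmetric (e.g. Gaussian) data.
    skewness = (cubes / n) / (squares / n) ** 1.5
    return {
        "mean": mean,
        "sd": math.sqrt(squares / (n - 1)),
        "min": min(values),
        "max": max(values),
        "skewness": skewness,
    }


def pearson_r(x, y):
    """Pearson correlation coefficient between two variables."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up measurements of two variables on the same five samples.
peak_a = [1.0, 2.0, 3.0, 4.0, 5.0]
peak_b = [2.1, 3.9, 6.2, 7.8, 10.1]
summary = describe(peak_a)
r = pearson_r(peak_a, peak_b)
print(summary)
print(r)  # close to 1: a strong linear relationship
```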

Once the exploratory data analysis is completed, the inquiry phase can begin. This is usually an iterative process. A pertinent routine is run, the information it provides is evaluated and used to further understand the problem. Based on the new knowledge, a complementary technique might be chosen, the questions of interest might be reformulated or a new set of questions might be generated. A complete analysis can involve many of these steps, with much refining of knowledge, using a broad array of techniques.

Finally, the analyst should review carefully what has been observed. When a result is obtained that doesn't match with knowledge of the system, the techniques or the data are probably at fault. Perhaps critical assumptions of some of the methods were violated, or the data has flaws that prevent a straightforward solution. In any case, don't abandon knowledge of how a system works. When an open-minded consideration of results yields a conclusion that conflicts with common sense, stick with common sense. Chemometric methods come with no guarantee of success.

I chose what I considered to be a very interesting practical example to present to you today. This isn't work that I've done personally; it has been taken from the chemometric literature. It involves forensic analysis of whiskeys using pattern recognition and cluster analysis techniques. Substitution of low priced whiskeys for better-known and more expensive brands is a common violation of state and federal liquor laws. To obtain a conviction, it must be proved that a suspected whiskey sample, sold as one brand, is not truly that brand. To develop this technique, a typical premium whiskey was chosen as a target brand. A set of less expensive whiskeys was chosen as mock counterfeits of the premium whiskey. It was desired that the determination be performed using a single whiskey sample, and that a simple, unmodified gas chromatograph, available in virtually any crime lab, be used. The ideal analysis would be fast, simple and accurate. Saxberg et al. measured the peak heights and half-widths to estimate the peak areas of 17 prominent GC peaks common to all the samples. The peak areas were used as data for the multivariate analysis. This analysis illustrates a common chemometric goal: to find out what variables can be used to separate one group of samples from another group. Here the goal is to distinguish between one brand of premium whiskey and all other whiskey samples. Variations of this idea include the discrimination of "good" versus "bad" samples, "high" versus "low" samples, "in-spec" versus "out-of-spec" product, etc. An additional complication exists in the whiskey example. An opened bottle of whiskey changes in composition with time. The method should not separate old from fresh premium whiskey samples in this analysis.

In the early stages, the researchers didn't apply any constraints to the analysis. They merely looked at the data in various ways, trying to understand the natural structure of the data. As part of this process, they used principal components analysis. A plot of the result shows that the first two principal component axes are sufficient to achieve the desired separation. All the premium whiskey samples fall to the right of the line drawn, while all the non-premium samples fall to the left. Although it was not desired, some tendency can be seen for old and fresh premium whiskey samples to separate. This does not interfere with the desired separation, however. If a separation between old and fresh premium whiskey samples were desired, additional effort could probably achieve it.

The analysis isn't complete at this point, since the original goals of the problem have not been satisfied. Principal components analysis showed that there is reason to believe that a true, natural separation exists between premium and non-premium whiskey samples. It is possible, although not likely, that this solution is an artifact, a spurious result. At least one confirmatory technique should be used to verify results such as this. A more serious problem is that, although the solution appears accurate, it isn't really either easy or fast. This is true because the new dimensions created by the principal components analysis are linear combinations of the original 17 variables. For each sample run, 17 peak areas must be calculated, entered and processed to provide a decision. A simpler solution is more desirable.

The next step performed was a K-nearest neighbors analysis. This technique predicts group membership of a sample based on the group membership of its K nearest neighbors. This procedure involves calculating the distance between each pair of points and choosing one or several values of K. The group identities of each of the K nearest neighbors for each of the samples of interest are tallied. The group with the largest number of "votes" for each sample is the group that that sample is assigned to. In this analysis, all of the cases were considered to be samples of interest. The analysis was therefore a test, to see if the premium/non-premium groupings would hold up. Usually, they did. Until nine nearest neighbors were reached, only one of the 68 samples was misclassified. Two techniques, not closely related, showed that some sort of natural separation seemed to exist in the data, but again, the criteria of simplicity, ease and accuracy had not been met.
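The voting procedure is easy to sketch in Python. In the version below each sample is classified by its K nearest neighbors among all the other samples, which is one plausible way to treat every case as a sample of interest; I am assuming that detail, not quoting it from the paper. The two-variable data are invented and are not the whiskey measurements.

```python
import math
from collections import Counter


def knn_predict(samples, labels, query, k):
    """Assign query to the group holding the most votes among its k nearest samples."""
    order = sorted(range(len(samples)), key=lambda i: math.dist(samples[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_errors(samples, labels, k):
    """Count samples misclassified when each is judged by all the others."""
    errors = 0
    for i in range(len(samples)):
        others = samples[:i] + samples[i + 1:]
        other_labels = labels[:i] + labels[i + 1:]
        if knn_predict(others, other_labels, samples[i], k) != labels[i]:
            errors += 1
    return errors


# Made-up two-variable data: four "premium" and four "other" samples.
samples = [
    (5.0, 5.2), (5.5, 4.8), (6.0, 5.5), (5.2, 6.0),
    (1.0, 1.2), (1.5, 0.8), (2.0, 1.5), (1.2, 2.0),
]
labels = ["premium"] * 4 + ["other"] * 4

print(leave_one_out_errors(samples, labels, k=3))    # misclassified samples
print(knn_predict(samples, labels, (5.4, 5.1), k=3))  # group for a new sample
```

With two groups, an odd value of K avoids tied votes.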

The next procedure run was a linear discriminant analysis. This is one of a family of techniques which seek to find hyperplanes, planes in n-dimensional space, which separate one category from another. Some of these techniques are iterative, seeking to use as few variables as possible. In this example, a stepwise discriminant analysis was used. This technique steps important variables into the process one at a time. Using stepwise discriminant analysis, a good solution was found using only two of the original 17 variables. A plot of feature 1, which was later found to be acetaldehyde, versus feature 9, found to be isoamyl alcohol, was sufficient to separate premium samples from non-premium samples.
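Once two variables have been chosen, the separating line itself can be computed in a few lines. The Python sketch below fits a plain two-class Fisher linear discriminant to two variables. It does not perform the stepwise variable selection used in the study, and the data are invented stand-ins, not the acetaldehyde and isoamyl alcohol peak areas.

```python
def mean_point(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n


def scatter(points, center):
    """Sums of squares and cross products about a center: (sxx, sxy, syy)."""
    sxx = sum((x - center[0]) ** 2 for x, _ in points)
    syy = sum((y - center[1]) ** 2 for _, y in points)
    sxy = sum((x - center[0]) * (y - center[1]) for x, y in points)
    return sxx, sxy, syy


def fisher_discriminant(group_a, group_b):
    """Weights w and threshold t such that w . x > t suggests group A."""
    mean_a, mean_b = mean_point(group_a), mean_point(group_b)
    axx, axy, ayy = scatter(group_a, mean_a)
    bxx, bxy, byy = scatter(group_b, mean_b)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy  # pooled within-group scatter
    det = sxx * syy - sxy * sxy
    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    # w = inverse(pooled scatter) times (mean_a - mean_b), written out for 2 x 2.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    mid_x, mid_y = (mean_a[0] + mean_b[0]) / 2.0, (mean_a[1] + mean_b[1]) / 2.0
    return w, w[0] * mid_x + w[1] * mid_y


def classify(point, w, threshold):
    """Report which side of the separating line a sample falls on."""
    return "premium" if w[0] * point[0] + w[1] * point[1] > threshold else "other"


# Made-up values of two variables for the two groups.
premium = [(5.0, 5.2), (5.5, 4.8), (6.0, 5.5), (5.2, 6.0)]
other = [(1.0, 1.2), (1.5, 0.8), (2.0, 1.5), (1.2, 2.0)]

w, threshold = fisher_discriminant(premium, other)
print(classify((5.4, 5.1), w, threshold))
```

The line where the weighted sum equals the threshold is the two-dimensional version of the separating hyperplane: a new sample is assigned according to the side on which it falls.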

Here's the sample chromatogram again so that you can look at peaks 1 and 9. Peak 9 is one of the more prominent peaks, but peak 1 is not particularly large. There was no reason to think that this peak would be important before the analysis was completed; the analysis determined this result. You might wonder why the discriminant analysis was not run first, to avoid the effort of the other steps. Discriminant analysis is a particularly tricky technique. Given enough computer time, it can always achieve a targeted separation, even if the separation is chosen randomly. Verification of results should always be a part of chemometric analysis, since the possibility of spurious results very frequently exists.

So, a problem originally comprising seventeen dimensions was reduced to a problem in two dimensions. A forensic analyst would now only need to find the peak areas of the acetaldehyde and isoamyl alcohol peaks and determine which side of the line the plotted sample falls on, to know whether the whiskey sample was the premium sample or some other whiskey. The solution allows fast, simple and accurate results to be obtained, meeting the desired goals.

To end my talk, I'll spend a few minutes on some of the pitfalls that a chemometrics user might fall (or be pushed) into.

There seem to be two general classes of traps in chemometric analysis. The first type of trap involves expecting something for, or from, nothing. The second kind of trap is due to not giving proper attention to what is being done. One of the perceptions some chemists have is that chemometric techniques are magical, surefire methods to cull what they want from whatever data is available. That perception isn't true. If good data exists, chemometric techniques can often allow a researcher to extract information from it, but sometimes little or nothing of value comes out. Just as not all chemical experiments work, chemometric analysis can also fail. The chance of success goes up greatly when the analyst has a strong knowledge of the chemometric techniques he is using. Chemometric techniques must not be considered to be "black boxes". Just as a poor choice of chemical methods would usually not satisfy the requirements of an analysis, a haphazard choice of chemometric techniques will usually yield poor results. This is especially true when the analyst violates the assumptions on which a technique is based. The basic goals, assumptions and weaknesses of chemometric methods should be understood. An experienced chemist knows when he can stretch the limits of a technique and when he cannot. Chemometric techniques demand equal expertise of the analyst. You can stretch the limits of these techniques if you are aware of what is going on. These precautions parallel good chemical practice.

Finally, as I've already stated, you can't get useful information from bad data. There are two situations where data problems frequently occur. When critical variables are left out of the data gathering process, the data will not properly describe the system of interest. Another way of stating this is that you can't get information out of the data that isn't there. Consider, for instance: if the acetaldehyde peak areas had been left out of the whiskey database, would the experiment still have yielded acceptable results? Another, more subtle situation occurs when you don't have much control over the data you are asked to look at. Data gathered by others has often been filtered and controlled in various ways that you may never be told of. Since most statistical techniques require random samples of the population, such uncontrolled manipulation can interfere with the multivariate analysis. Basically, you can't get at information that's been hidden from you. When he has not been involved in gathering the data himself, the analyst must be as fully informed as possible about all of the details concerning the data. Unless someone has given careful consideration to what the data will be used for, the analysis will probably not succeed.

Another set of traps involves not watching and thinking about what is being observed. In other words, an analyst must keep his eyes and mind open. He should understand the subject he is dealing with, as well as the chemometric techniques he is using. At the very least, he should have very close access to someone who has an in-depth knowledge of the principles behind the research being done. Decision processes in the inquiry phase usually require both chemical input and statistical knowledge. This is especially true in the rubber industry. Often, only the knowledge of the system he is dealing with will keep the researcher on the right track.

The results of a chemometric analysis assume that all unmeasured variables involved with a process are either held constant or are unimportant. Should either of these assumptions fail, the original results of the study may become invalid. In the whiskey example, a change in the blending or aging strategies of the premium whiskey could invalidate the results obtained. For on-going projects, intermittent confirmatory testing should be performed, to guarantee that model validity is maintained.

One final hint which should be considered is to avoid becoming too fond of any one technique. Each technique has unique strengths and weaknesses. Used in concert with each other, these methods complement each other. Overreliance on any one technique makes the analysis vulnerable to its weaknesses.

To end, I'd like to credit one of the major sources that I used in preparing this talk and which anyone curious about chemometrics should be familiar with. In April of even numbered years, Analytical Chemistry prints a special Fundamental Reviews issue, which is a bibliography of recent works in areas considered fundamental to analytical chemistry. Since 1980, there has been a section devoted specifically to Chemometrics. Before 1980, the section was called Statistical and Mathematical Methods in Analytical Chemistry. I paged through the last several of these while I was preparing this presentation; they helped me decide how to classify the various techniques which comprise chemometrics and also helped me find recent references with which I could refresh myself on topics I was familiar with and learn about those of which I had little or no knowledge.

I hope that I've given you a glimpse of the depth of the field of chemometrics and that, if it seems to fit your needs, you'll pursue your interest further.

General References

General Chemometrics

M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics, (John Wiley & Sons, New York, 1986).

Cluster Analysis

D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, (John Wiley & Sons, 1983).

Factor Analysis

E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry, (John Wiley & Sons, New York, 1980).

Pattern Recognition

K. Varmuza, Pattern Recognition in Chemistry, (Springer-Verlag, Berlin, 1980).

D.D. Wolff and M.L. Parsons, Pattern Recognition Approach to Data Interpretation, (Plenum Press, New York, 1983).

Illustrative Example

B.E.H. Saxberg, D.L. Duewer, J.L. Booker and B.R. Kowalski, Anal. Chim. Acta, 103, 201-212 (1978).

An Introduction to Chemometrics - Text / B A Rock, 469D, 7-1119 / Jun 5 1997

