Reflectance Spectroscopy Reveals the Variety and Sweetness of Apples
Chemometrics is a powerful tool for analyzing optical spectra of complex chemical systems such as foods. This primer introduces the basics of chemometric analysis, showing how it can be applied to reflectance measurements of apples for quality control. From quantitative measurement of sugars to identification of the variety, chemometrics is a solution that is ripe for the picking.
Limitations of the Beer-Lambert Law
Optical spectroscopy has a long history as a powerful workhorse in analytical chemistry. It is capable of measuring even minute concentrations of the compound of interest, typically in a solution. Most chemists have applied the Beer-Lambert law at one point or another in their careers to determine an unknown concentration from a solution’s absorbance at a specific wavelength, usually based on an independently recorded calibration curve. However, few are aware of the limitations of this tried-and-true law. It is no longer valid for even moderately concentrated solutions, for samples that scatter or emit light, if stray light is generated in the setup, in the presence of concentration-dependent chemical equilibria, for inhomogeneous media, or if the spectral resolution is too low. It is also very difficult to apply to measurements made in reflectance mode, as the depth of penetration and impact of scattering are difficult to assess.
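As a minimal sketch of the single-wavelength calibration described above, the following Python snippet fits a straight line through the absorbances of a few standards and inverts it for an unknown sample. It relies on the linear relationship between absorbance and concentration that the Beer-Lambert law predicts (A = ε·l·c). The function names and all numbers are illustrative assumptions, not data from this study, and the approach only holds within the linear range discussed above.

```python
import numpy as np

def fit_calibration(concentrations, absorbances):
    """Least-squares line A = slope * c + intercept through the standards."""
    slope, intercept = np.polyfit(concentrations, absorbances, deg=1)
    return slope, intercept

def predict_concentration(absorbance, slope, intercept):
    """Invert the calibration line to estimate an unknown concentration."""
    return (absorbance - intercept) / slope

# Hypothetical standards (arbitrary concentration units) and their absorbances
standards = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
readings = 0.25 * standards + 0.01   # made-up, perfectly linear response

slope, intercept = fit_calibration(standards, readings)
unknown = predict_concentration(0.51, slope, intercept)
print(f"estimated concentration: {unknown:.2f}")
```

With real data the standards scatter around the line, and any of the effects listed above (scattering, stray light, high concentration) shows up as curvature that a single straight line cannot capture.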
Even when a simple Beer-Lambert relationship at a single wavelength cannot be used to determine the concentration of interest, the information needed to extract that component is still present within the recorded spectrum. The more sophisticated mathematical approaches used to determine an unknown concentration in these complex cases belong to a field known as chemometrics. Chemometric tools have gained popularity in recent years thanks to their powerful capabilities and application in a wide range of fields.
The Need for Chemometrics
In today’s global food network, fruit is often picked well before maturity and allowed to ripen during transport, while other fruit is stored from season to season to provide a year-round supply. In the effort to provide convenience, food quality often takes a back seat to appearance, leaving both consumers and resellers guessing as to the ripeness of the fruit. Chemometric tools, however, can predict the sweetness of an apple, as well as its nutritional value, by analyzing near-infrared diffuse reflectance spectra obtained without opening the fruit.
Chemometrics shines where traditional spectroscopic analysis fails, as in the example of NIR diffuse reflectance spectroscopy of an apple. Why would the trusted Beer-Lambert law not apply in this application? For one, diffuse reflectance spectra depend on a number of factors that are hard to control, such as the measurement geometry, the size of the scattering particles inside the fruit, or the surface properties of the apple’s skin. Second, near-infrared spectroscopy is sensitive to the vibrational overtones of all O-H, N-H, and C-H bonds. Most organic compounds absorb in this spectral range, which leads to broad, seemingly featureless NIR spectra due to the countless overlapping absorption bands.
NIR absorbance spectra, unlike their mid-infrared cousins, cannot readily be assigned to individual chemical bonds, and thus this region of the electromagnetic spectrum was largely ignored for decades. The development of chemometrics, however, has opened up a treasure trove of spectroscopic information that is easily accessible with sensitive optical spectrometers and affordable light sources.
What, Exactly, is Chemometrics?
Chemometrics is the interdisciplinary application of multivariate statistics through powerful software tools to extract qualitative and quantitative answers from scientific measurements. Here we will focus on the interpretation of optical spectroscopic measurements, which is an important (but by no means the only) application of the computational technique.
Three types of questions are commonly answered with chemometrics (see Figure 1):
1. Quantification: How much of a certain substance is in a sample?
- Prediction of the octane number of gasoline fuels
- Measuring organic matter content in soils
- Detecting moisture content of paper
2. Classification: What is the identity of the sample?
- Identification of raw materials
- Authentication of high-quality wines or whiskeys
3. Discrimination: Is the sample similar to a quality standard?
- Identification of out-of-specification samples
- Monitoring progress in batch processing
- Early detection of unusual events in a continuous process
The Chemometrics Three-Step
Just as the Beer-Lambert law requires creation of a calibration curve, chemometrics requires its own calibration, created using a large number of spectra from samples for which the analyte concentration (in the case of quantification) or group membership (in the case of classification or discrimination) is known. Chemometric software tools utilize machine learning algorithms to develop a model to describe the concentration or group membership, and thus the calibration process is usually referred to as “training.” Training of the chemometric model often requires some tweaking of parameters to optimize the analysis. This is done using a process called cross-validation, whereby the model is repeatedly trained on part of the original data set and tested on the spectra that were held out from it.
It is also important to validate the model through prediction from spectra of additional known samples, with this “test set” being distinct from the original “training set.” If the error of the test set predictions is within the desired precision and accuracy, the model is ready to be used with the spectrum collected from a new sample to predict the answer of interest (concentration or group membership). A simple NIR reflectance spectrum of an apple, for example, when processed using a well-developed chemometric model, can now be used to predict a complex quantity, such as the apple’s sweetness.
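The bookkeeping behind these steps can be sketched in a few lines of Python. The snippet below sets aside a test set and then generates cross-validation folds from the remaining training samples. The function names, the 30 samples, the one-third test fraction and the 5 folds are all illustrative assumptions rather than settings prescribed by any chemometrics package.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction, rng):
    """Shuffle the sample indices and set aside a fraction as the test set."""
    order = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return order[n_test:], order[:n_test]

def k_fold_indices(train_idx, n_folds):
    """Yield (fit, validate) index pairs; each training sample is validated once."""
    folds = np.array_split(train_idx, n_folds)
    for k in range(n_folds):
        fit = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield fit, folds[k]

rng = np.random.default_rng(0)   # fixed seed so the split is reproducible
train_idx, test_idx = train_test_split_indices(30, test_fraction=1 / 3, rng=rng)
splits = list(k_fold_indices(train_idx, n_folds=5))

for fit, validate in splits:
    # A model would be trained on `fit` and scored on `validate` here;
    # the test set stays untouched until the parameters are final.
    print(len(fit), "spectra to fit,", len(validate), "to validate")
```

Keeping the test indices out of the cross-validation loop is what makes the final test-set error an honest estimate of performance on new samples.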
While the effort to train the model might seem like a large up-front investment in time, the payoff is huge in terms of allowing fast, nondestructive, cheap predictions in the field, thus eliminating the cost of sending samples of goods for time-consuming and expensive lab analysis. NIR diffuse reflection spectroscopy in particular allows for very fast measurements, as little to no sample preparation is required.
The Math Under the Hood
Chemometrics borrows heavily from multivariate statistics, the science of using many observables to predict an unknown parameter when their relationship is not known. In spectroscopy, the observables are the absorbance at a large number of wavelengths (the spectrum), but could possibly also include additional measurements, such as the temperature. Mathematically, the process is similar to the common (univariate) linear regression algorithm used to predict the unknown y from the measured variable x. In the case of multivariate statistics, however, instead of a single independent variable, x, now there is a large number of variables. Not surprisingly, linear algebra takes center stage in these calculations.
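Written out, the step from univariate to multivariate regression looks as follows. The symbols are notation chosen for this illustration (b for the regression coefficients, e for the residual error, p for the number of wavelengths, n for the number of samples) and do not come from any particular software package.

```latex
\begin{align}
  y &= b_0 + b_1 x + e
     && \text{(univariate: one measured variable)} \\
  y &= b_0 + \sum_{j=1}^{p} b_j x_j + e
     && \text{(multivariate: absorbance at $p$ wavelengths)} \\
  \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{e}
     && \text{(matrix form for $n$ samples)}
\end{align}
```

In the matrix form, each row of X holds one spectrum, and the intercept is either absorbed into X as a column of ones or removed by mean-centering. When there are more wavelengths than samples (p > n), or when neighboring wavelengths are strongly correlated, ordinary least squares has no unique or stable solution for b, which is one reason the methods discussed next work with a small number of components instead of the individual wavelengths.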
How, exactly, is this done? Instead of a single accepted approach, as in the “normal” linear regression, we now have to choose from an entire alphabet soup of methods: PLS, SVM, PCA and more. Fortunately, the details of these individual methods need not be understood in order to get started. In fact, even though their approaches can be very different, the results are often similar.
One popular method for quantification is Partial Least Squares Regression (PLS), which determines sets of spectra (the “components”) that can most effectively explain the variations in the concentration of the analyte. Another popular method used for classification is called Support Vector Machine (SVM), while discrimination, the comparison against a standard, is often done with Principal Component Analysis (PCA).
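To make the idea of PLS components more concrete, here is a minimal sketch of PLS regression for a single analyte (often called PLS1) in the NIPALS style, written with NumPy. It is an illustration under stated assumptions, not the implementation used by any of the software packages mentioned in this primer: the function names are hypothetical, and the "spectra" are synthetic mixtures of three made-up pure-component spectra.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """NIPALS-style PLS1: returns regression coefficients and intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # mean-centered working copies
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                        # direction most covariant with y
        w /= np.linalg.norm(w)
        t = Xk @ w                           # scores of all samples on it
        tt = t @ t
        p = Xk.T @ t / tt                    # spectral loading of the component
        qk = yk @ t / tt                     # regression of y on the scores
        Xk = Xk - np.outer(t, p)             # deflate: remove what was explained
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients in wavelength space
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def pls1_predict(X, coef, intercept):
    """Predict the analyte value for each spectrum (row) in X."""
    return np.asarray(X, dtype=float) @ coef + intercept

# Synthetic data: spectra are noise-free mixtures of three made-up components
rng = np.random.default_rng(1)
pure_spectra = rng.random((3, 50))           # 3 components, 50 wavelengths
conc_train = rng.random((30, 3))             # 30 training samples
conc_new = rng.random((10, 3))               # 10 samples not used for training

coef, intercept = pls1_fit(conc_train @ pure_spectra, conc_train[:, 0], 3)
y_new_pred = pls1_predict(conc_new @ pure_spectra, coef, intercept)
print(np.max(np.abs(y_new_pred - conc_new[:, 0])))   # prediction error
```

Because the synthetic spectra contain no noise and exactly three sources of variation, three components recover the first component's concentration essentially exactly. Real spectra are noisy, so the number of components has to be chosen by cross-validation.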
The software tools available for developing the trained model are just as numerous as the mathematical methods. Besides chemometric packages for programs like the statistics software “R,” or toolboxes for MATLAB, there are also dedicated commercial chemometrics programs, such as Analyze IQ, GRAMS (Thermo Fisher), Unscrambler (Camo) or Pirouette (Infometrix), just to name a few. These programs offer a large set of tools to build, optimize and test chemometric models and to combine them into decision trees, thus allowing the development of very sophisticated analyses.
Case Study: Predicting Apple Sweetness
The sugar content of fruit (primarily fructose, glucose and sucrose) is commonly measured in sum as the soluble solids content (SSC) in the expressed fruit juice with a refractometer and reported as degrees Brix (°Bx), or grams of sucrose equivalent per 100 g of solution. Typical values range from 10°Bx to 16°Bx, depending on the apple variety, with unripe and ripe apples of the same variety differing by up to 4°Bx.
A Brix measurement is time-consuming and requires sacrificial sampling from each batch to perform laboratory analysis of the fruit. Chemometrics using near-infrared reflectance spectra offers a rapid and nondestructive alternative. In this first case study we will review the typical steps required to develop and test a chemometric model for a complex analyte, such as °Bx.
Figure 2a shows the diffuse reflection measurement setup and the recorded NIR reflectance spectra for 76 Ginger Gold apples, collected using a Flame-NIR spectrometer (950-1650 nm) and tungsten halogen light source. Both ripe and unripe apples were used, and their Brix values were determined in a separate lab analysis. Spectra from 5 locations across the “equator” were averaged for each apple (Figure 2b). Due to slight differences in the measurement geometry and the shape of the apple, the spectra appear shifted and scaled relative to each other, which is corrected for in a “pre-processing” step using the SNV (standard normal variate) method (Figure 2c). As only the differences between the spectra contain the information about the varying sweetness, we subtract the average of all spectra (a process called mean-centering, shown in Figure 2d).
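The two pre-processing steps can be sketched with NumPy as follows. SNV treats each spectrum on its own (subtract its mean, divide by its standard deviation), while mean-centering subtracts the average spectrum of the whole data set. The function names and the made-up spectra are assumptions for illustration, and the use of the sample standard deviation (ddof=1) is one common convention.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def mean_center(spectra):
    """Subtract the average spectrum of the data set from every spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    average = spectra.mean(axis=0)
    return spectra - average, average

# Made-up example: one spectral shape recorded with three offsets and scalings,
# mimicking the effect of small changes in measurement geometry
wavelengths = np.linspace(950, 1650, 8)
shape = np.sin(wavelengths / 200.0)
raw = np.vstack([1.0 * shape + 0.0, 1.5 * shape + 0.2, 0.7 * shape - 0.1])

scaled = snv(raw)                        # the three rows now coincide
centered, average = mean_center(scaled)  # what remains are the differences
print(np.round(centered, 12))
```

In this artificial case the three spectra differ only by shift and scale, so nothing is left after both steps. For real apples, what remains is the variation that carries the chemical information. The average spectrum returned by `mean_center` should be stored so that the same value can be subtracted from the spectra of new samples.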
We randomly split the data set, holding out one-third as the test set and using the remaining two-thirds as the training set, on which the model for the Brix value is optimized in repeated cross-validation. Cross-validation helps to walk the fine line between poor prediction in general (underfitting) and poor prediction on unknowns despite good performance on the training set (overfitting). In this case study the best model performance is achieved by including 5 components (basis vectors or dimensions) in the model. The quality of the prediction is evident in Figure 3, which compares each apple’s sweetness predicted by the model from the NIR reflectance spectrum with the actual Brix value measured in the lab for both training and test sets. The deviation between predicted and actual values is summarized in the “standard error of prediction” (SEP), a measure of the quality of the model, which is better than 0.3°Bx in this investigation.
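How such an error figure is computed can be sketched as follows. Definitions vary between texts and software packages, so this snippet reports two common variants: the root-mean-square error of prediction (RMSEP) and a bias-corrected SEP. Which variant underlies the value quoted above is not stated, and the Brix values below are invented for illustration.

```python
import numpy as np

def prediction_errors(y_true, y_pred):
    """Return bias, RMSEP and bias-corrected SEP of a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    bias = residuals.mean()                        # systematic offset
    rmsep = np.sqrt(np.mean(residuals ** 2))       # overall error
    sep = np.sqrt(np.sum((residuals - bias) ** 2) / (residuals.size - 1))
    return bias, rmsep, sep

# Hypothetical lab values and model predictions in degrees Brix
lab_brix = [12.0, 13.0, 14.0, 15.0]
predicted_brix = [12.2, 12.8, 14.2, 14.8]

bias, rmsep, sep = prediction_errors(lab_brix, predicted_brix)
print(f"bias = {bias:.3f}, RMSEP = {rmsep:.3f}, SEP = {sep:.3f}")
```

Computing these figures separately for the training and test sets is a quick overfitting check: a test-set error much larger than the training-set error suggests the model has too many components.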
Case Study: Identifying Apple Variety
As an example of a classification task we demonstrate how to identify an apple’s variety using chemometric tools based on the visible diffuse reflectance spectra recorded in a setup similar to that depicted in Figure 2. The only difference was that our visible Flame spectrometer was used to collect the reflectance spectrum from 480-920 nm in order to gather information about pigments like carotenoids (420-500 nm), anthocyanins (540-550 nm) and chlorophyll (600-700 nm).
The recorded spectra were pre-processed in the same manner as described above: scaled with SNV and mean-centered to remove any commonalities and isolate the differences between the spectra (and hence the apples). To visualize the differences we perform a principal component analysis, a mathematical procedure that identifies those parts of the spectrum (the principal components) that explain the largest share of the variation between the different apple spectra.
Mathematically speaking, the principal components are directions in the multidimensional space defined by all wavelengths. In the case of this apple study they represent combinations of the individual spectra for the main pigments in the apple skin, namely carotenoids (orange), anthocyanins (red) and chlorophyll a and b (green). The principal components allow us to represent the highly multidimensional data set in terms of just a few important dimensions, separating the interesting information from meaningless noise.
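In matrix notation, this decomposition can be written as shown below. The symbols are chosen for this illustration: X_c is the mean-centered data matrix with one spectrum per row, the columns of P are the principal components (loadings), T holds the scores of each sample, E is the residual left after keeping only the first few components, and λ_k is the variance of the data along component k.

```latex
\begin{align}
  \mathbf{X}_c &= \mathbf{T}\mathbf{P}^{\mathsf{T}} + \mathbf{E},
  \qquad \mathbf{P}^{\mathsf{T}}\mathbf{P} = \mathbf{I} \\
  \mathbf{T} &= \mathbf{X}_c \mathbf{P} \\
  \text{fraction of variation explained by component } k
     &= \frac{\lambda_k}{\sum_{j} \lambda_j}
\end{align}
```

The second line is what makes the approach useful for new samples: once P is known from the training data, the scores of any new spectrum follow from a single matrix product.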
For the current data set we found that 97% of the variation between the spectra can be explained with just two principal components. Different apples “score” different amounts in the direction of these two principal components, i.e., their spectrum can be described to varying degrees as a combination of those two components, as shown in Figure 4. It becomes immediately obvious that the apple varieties fall into three groups: green apples, yellow-green apples and red apples, as indicated. With this plot (the result of the training) in hand, it is possible to measure the spectrum of an unknown apple in the same way, obtain scores for the existing principal components, and then infer the apple’s variety based on the location of its principal component scores in the plot.
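The procedure described above can be sketched with NumPy: compute the principal components from the training spectra, check how much variation they explain, and project an unknown spectrum onto them. To turn "location in the plot" into a decision, the sketch assigns the unknown to the group with the nearest centroid in score space. That rule is a simple stand-in chosen for this illustration, not the method used in the study, and the spectra are synthetic mixtures of two made-up absorption bands.

```python
import numpy as np

def pca_fit(X, n_components):
    """PCA via SVD: returns mean spectrum, components, explained fractions."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return mean, Vt[:n_components], explained[:n_components]

def pca_scores(X, mean, components):
    """Project spectra (rows of X) onto the existing principal components."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

def nearest_centroid(scores, train_scores, train_labels):
    """Assign each sample to the group whose mean score is closest."""
    groups = np.unique(train_labels)
    centroids = np.array([train_scores[train_labels == g].mean(axis=0)
                          for g in groups])
    dist = np.linalg.norm(scores[:, None, :] - centroids[None, :, :], axis=2)
    return groups[np.argmin(dist, axis=1)]

# Synthetic training data: three groups mixing two made-up bands differently
rng = np.random.default_rng(2)
wl = np.linspace(480, 920, 60)
band_a = np.exp(-((wl - 550) / 40.0) ** 2)
band_b = np.exp(-((wl - 680) / 30.0) ** 2)
group_mix = {"green": (0.2, 1.0), "yellow-green": (0.5, 0.5), "red": (1.0, 0.2)}

spectra, labels = [], []
for name, (a, b) in group_mix.items():
    for _ in range(20):
        spectra.append(a * band_a + b * band_b + rng.normal(0, 0.01, wl.size))
        labels.append(name)
spectra, labels = np.array(spectra), np.array(labels)

mean, components, explained = pca_fit(spectra, n_components=2)
train_scores = pca_scores(spectra, mean, components)

unknown = 0.95 * band_a + 0.25 * band_b          # hypothetical new apple
unknown_scores = pca_scores(unknown[None, :], mean, components)
predicted = nearest_centroid(unknown_scores, train_scores, labels)
print(predicted[0], "| variation explained:", round(float(explained.sum()), 3))
```

Note that the unknown spectrum is centered with the mean of the training set, not its own, so that its scores land in the same coordinate system as the training plot.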
Identifying apples as green, yellow-green or red based on their visible reflection seems trivial and is meant to serve as an illustration of the general approach in a classification task. “Support Vector Machine” (SVM) is a popular algorithm used to classify samples successfully even in much less clear-cut applications. To demonstrate, we randomly split our data set and trained an SVM classification model on 80% of the visible apple spectra (576 spectra) to recognize the apple variety. We determined the optimum values for the two model parameters in a cross-validation grid search and tested the performance of the final model on the remaining 20% of the data (our test set). How did the model do? In cross-validation the classification error was 0.3%; the model made no mistakes in the separate test set of 144 spectra. How do you like them apples?
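A workflow of this kind can be sketched with the scikit-learn library, which is an assumption of this example and not necessarily the software used in the study. The two tuned parameters are taken here to be the penalty C and the width gamma of a radial basis function kernel, which is a common choice but is not confirmed by the text. The data are synthetic clusters standing in for pre-processed spectra of three hypothetical varieties.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 3 hypothetical varieties, 40 "spectra" each,
# 20 wavelengths, scattered around a different center per variety
rng = np.random.default_rng(3)
centers = rng.normal(0.0, 1.0, size=(3, 20))
X = np.vstack([c + rng.normal(0.0, 0.2, size=(40, 20)) for c in centers])
y = np.repeat(["variety A", "variety B", "variety C"], 40)

# Random 80/20 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Grid search over the two assumed parameters, scored by 5-fold cross-validation
param_grid = {"C": [0.1, 1.0, 10.0, 100.0],
              "gamma": [0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

cv_error = 1.0 - search.best_score_                 # cross-validation error
test_error = 1.0 - search.score(X_test, y_test)     # error on held-out test set
print(search.best_params_, cv_error, test_error)
```

The test set plays no part in the grid search, so `test_error` is an independent check of the parameters chosen by cross-validation.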
The combination of fast, modular spectrometers with powerful chemometrics tools creates new opportunities for rapid, on-site testing of foods and other samples. From quick and inexpensive measurements in complex situations, to sample identification rivaling expert abilities or online quality control – the possibilities are endless. Modern chemometrics software packages make this tool set available for practitioners in all disciplines. Chemometrics might just be the tool you were looking for.