Deep learning (DL) algorithms have become indispensable tools in artificial intelligence tasks associated with computer vision, natural language processing, and many other research domains. DL techniques achieve a high level of abstraction during training and are thus able to ignore insignificant data features while focusing on those that matter for a particular practical task. This potential has not gone unnoticed in chemistry, where the application of deep learning algorithms to data mining and processing is growing rapidly. A good example is the recent paper in Nature [1], in which the authors trained a neural network that was able to “rediscover” the periodic law and to predict several materials for functional applications. Research in analytical chemistry has also been influenced by this trend, and there are already several reports exploring deep learning methods in vibrational spectroscopy [2], mass spectrometry [3], chromatography [4] and sensors [5]. The present project aims to take several important steps in this direction and to study the potential of deep learning in two promising yet largely unexplored research domains: two-dimensional chromatography and Mossbauer spectroscopy. Both fields have their own data processing problems, which we plan to address using deep learning algorithms.
The evolution of chromatographic systems for the separation, identification and quantification of chemical compounds in complex natural matrices, with minimal sample preparation and short analysis times, has been one of the most important developments in analytical chemistry in recent years. In spite of numerous technological advances, the complexity of many naturally occurring mixtures still exceeds the capacity of any single method, even one optimized to resolve them. Great efforts have been concentrated on coupling separation methods together to increase resolution, in parallel with advances in coupling these separation methods with different spectroscopic detection methods (hyphenated methods such as GC-MS, LC-MS, LC-DAD, LC-IR, etc.). Multi-dimensional chromatography methods, e.g. comprehensive two-dimensional gas chromatography (GC×GC) and comprehensive two-dimensional liquid chromatography (LC×LC), have emerged as powerful techniques for the separation of complex mixtures. However, one of the main challenges in chromatography, and especially in multi-dimensional chromatography, is the difficulty of analyzing and interpreting the enormous amount of data obtained. These huge data sets are difficult to manage and analyze using conventional statistical methodologies and software tools because of problems such as baseline/background contribution, noise, retention time shift, low signal-to-noise ratio (S/N) and co-elution (overlapped and embedded peaks) in their chromatographic or spectroscopic dimensions.
Recently, the convolutional neural network (CNN), a deep learning method with hierarchical feature learning capabilities, has led to major breakthroughs in signal analysis. CNNs have been applied to two- and three-dimensional chemical data, including hyperspectral images, vibrational spectroscopic data and electrochemical data [6]. The CNN approach uses sparse local connections to learn local patterns from the raw data and reduces the risk of overfitting through weight sharing. Given the emergence of DL algorithms in chemistry, and in analytical chemistry in particular, they have potential for handling different types of chromatographic problems and for improving the selectivity and sensitivity of chromatographic measurements. It is important to note that although modern chromatographic instruments produce large data sets, the number of samples is normally too small for the reliable training that DL algorithms require.
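The local connectivity and weight sharing mentioned above can be illustrated with a minimal sketch of a single one-dimensional convolution. The signal, the kernel values and the function name `conv1d_valid` are hypothetical and chosen only for illustration; in a real CNN the kernel weights would be learned during training.

```python
def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical 1D trace with two identical peaks at different positions.
signal = [0, 0, 1, 3, 1, 0, 0, 0, 1, 3, 1, 0]
# Hypothetical 3-point kernel (assumption: fixed here, learned in a real CNN).
kernel = [-1, 2, -1]

response = conv1d_valid(signal, kernel)

# Weight sharing: the layer has len(kernel) weights regardless of signal length,
# whereas a fully connected layer with the same output size would need one
# weight per (input, output) pair.
shared_weights = len(kernel)
dense_weights = len(signal) * len(response)
```

Because the same three weights are reused at every position, the two identical peaks produce identical responses wherever they occur, and the number of trainable parameters does not grow with the length of the chromatogram or spectrum.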
The CNN algorithm can be explored for different purposes in advanced chromatographic data analysis including (i) noise reduction, (ii) elution time shift correction, (iii) precise peak detection, (iv) prediction of retention time, (v) feature extraction for determination of the significant compounds especially in metabolomics studies, (vi) classification of chromatographic peaks based on their spectral profiles and (vii) quantitation of target analyte(s).
The aim of this project is to explore the potential of DL algorithms, and especially CNNs, to solve data analysis problems in high-dimensional chromatographic data (GC-MS and GC×GC-MS data in particular) as follows: 1) The lack of sufficient samples for DL training will be compensated for with modeled chromatograms, and the DL models trained on them will then be applied to real chromatographic data. If this proves successful, it will provide a useful tool for further development of DL in chemometrics; 2) The potential of DL algorithms for solving fundamental chromatographic data problems such as distorted baseline, elution time shift, peak detection and peak overlap will be explored; 3) Complex real chromatographic datasets involving environmental and biomedical samples will be studied to check the applicability of the developed data mining strategies.
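As an illustration of item 1, the following is a minimal sketch of how modeled chromatograms might be generated, assuming Gaussian peak shapes, a linear baseline drift, a uniform retention time shift and Gaussian noise. The function name, parameters and all numerical values are hypothetical and are not taken from the project itself.

```python
import math
import random


def simulate_chromatogram(peaks, n_points=500, shift=0.0,
                          baseline_slope=0.0, noise_sd=0.0, seed=None):
    """Return a simulated 1D chromatogram as a list of intensities.

    peaks: list of (retention_time, height, width) tuples (Gaussian peaks assumed).
    shift: retention time shift applied to every peak, in time-axis points.
    """
    rng = random.Random(seed)
    trace = []
    for t in range(n_points):
        value = baseline_slope * t  # linear baseline drift
        for rt, height, width in peaks:
            value += height * math.exp(-0.5 * ((t - rt - shift) / width) ** 2)
        value += rng.gauss(0.0, noise_sd)  # additive detector noise
        trace.append(value)
    return trace


# Two co-eluting peaks plus one isolated peak (hypothetical values).
peaks = [(150, 1.0, 8.0), (170, 0.6, 8.0), (350, 0.8, 10.0)]
clean = simulate_chromatogram(peaks)
distorted = simulate_chromatogram(peaks, shift=5.0, baseline_slope=0.001,
                                  noise_sd=0.02, seed=0)
```

A pair such as (`distorted`, `clean`) could serve as an input and target for training a model on noise reduction, baseline correction or shift correction, since the true underlying signal is known by construction.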
Another attractive research field where DL methods can bring significant progress is Mossbauer spectroscopy, a useful technique for studying the chemical state of resonance atoms in condensed matter. It has found broad application in materials science, corrosion studies, biology and various other fields. Traditional processing of Mossbauer spectra involves decomposing the spectral data into separate multiplets that correspond to hypotheses about particular non-equivalent states of the resonance atom [7]. Typically, Mossbauer spectra are fitted with a superposition of Lorentzian functions using standard algorithms such as Levenberg-Marquardt. The fitting yields the hyperfine parameters of the spectral components: isomer shift, electric quadrupole splitting and hyperfine magnetic splitting. These parameters are then used to interpret the forms of the resonance atom in the samples. Quite often the chemical states of the probe atom are numerous (especially in complex samples), and the resulting Mossbauer spectra are complex, with many overlapping signals. This complicates spectral processing, as the investigator needs to separate the contributions from individual chemical states. When multiple phases are simultaneously present in a spectrum, fitting hypotheses to the experimental data involves considerable uncertainty, subjectivity and error. A very high level of expertise is required to interpret such spectra, and additional data from other instrumental methods are often employed. Automated tools for processing Mossbauer spectra would therefore be highly attractive.
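The fitting model described above can be written compactly as follows. This is a sketch under stated assumptions: a transmission geometry, an unweighted least-squares objective, and notation introduced here only for illustration ($T_0$ for the baseline, $A_k$, $\Gamma_k$ and $v_k$ for the depth, full width at half maximum and position of line $k$, $y_i$ for the measured intensity at velocity $v_i$, and $\theta$ for the set of all fitted parameters).

```latex
T(v;\theta) = T_0 - \sum_{k=1}^{K} A_k \,
    \frac{(\Gamma_k/2)^2}{(v - v_k)^2 + (\Gamma_k/2)^2},
\qquad
\chi^2(\theta) = \sum_{i=1}^{N} \bigl[\, y_i - T(v_i;\theta) \,\bigr]^2
```

For a quadrupole doublet with isomer shift $\delta$ and quadrupole splitting $\Delta$, the two line positions are $v_{1,2} = \delta \mp \Delta/2$. An algorithm such as Levenberg-Marquardt searches for the $\theta$ that minimizes $\chi^2$, which is why the result depends on the number of multiplets assumed in the initial hypothesis.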
CNN methodology may provide a valuable tool in this respect. Human experts do the initial “processing” of Mossbauer spectra visually, perceiving the spectra as images rather than as a row of numbers (the conventional chemometric interpretation of spectral data). One can therefore hypothesize that image-processing tools that allow a certain level of abstraction would be useful for spectra deconvolution. CNN methods have been shown to be very effective for image analysis in the machine learning context, but they have not yet been explored in the chemometric context of Mossbauer spectral processing. One of the primary tasks in Mossbauer spectroscopy is identification of the initial hypothesis, i.e. the number of multiplets present in a particular complex spectrum. We assume that a CNN properly trained on a sufficiently large number of modeled Mossbauer spectra could identify this hypothesis and determine the multiplet parameters in the same way as human experts do by eye.
The particular tasks in this work package are: 1) modelling of Mossbauer spectra with different numbers of multiplets overlapping in different ways; 2) training a CNN model to identify the number of singlets, doublets and sextets in the spectra along with their parameters; 3) assessment of the predictive power of the CNN model with real data from complex samples containing multiple phases.
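Task 1 can be sketched as a generator of labelled synthetic spectra. The example below is a simplified illustration, not the project's modelling procedure: it assumes Lorentzian lines of fixed depth and width, covers only singlets and doublets (sextets are omitted for brevity), and uses hypothetical parameter ranges, function names and a hypothetical velocity grid.

```python
import random


def lorentzian(v, center, depth, width):
    """Lorentzian absorption line; width is the full width at half maximum."""
    half = width / 2.0
    return depth * half ** 2 / ((v - center) ** 2 + half ** 2)


def random_labelled_spectrum(velocities, rng, max_singlets=2, max_doublets=2,
                             depth=0.05, width=0.3, noise_sd=0.002):
    """Return (spectrum, label) for one simulated transmission spectrum."""
    multiplets = []
    for _ in range(rng.randint(0, max_singlets)):
        multiplets.append({"type": "singlet",
                           "isomer_shift": rng.uniform(-1.0, 1.0),
                           "quadrupole_splitting": 0.0})
    for _ in range(rng.randint(0, max_doublets)):
        multiplets.append({"type": "doublet",
                           "isomer_shift": rng.uniform(-1.0, 1.0),
                           "quadrupole_splitting": rng.uniform(0.3, 3.0)})

    # A singlet contributes one line; a doublet contributes two lines
    # placed symmetrically around its isomer shift.
    centers = []
    for m in multiplets:
        if m["type"] == "singlet":
            centers.append(m["isomer_shift"])
        else:
            half_split = m["quadrupole_splitting"] / 2.0
            centers += [m["isomer_shift"] - half_split,
                        m["isomer_shift"] + half_split]

    spectrum = [
        1.0 - sum(lorentzian(v, c, depth, width) for c in centers)
        + rng.gauss(0.0, noise_sd)
        for v in velocities
    ]
    label = {
        "singlets": sum(m["type"] == "singlet" for m in multiplets),
        "doublets": sum(m["type"] == "doublet" for m in multiplets),
        "multiplets": multiplets,
    }
    return spectrum, label


rng = random.Random(42)
velocities = [-4.0 + 0.05 * i for i in range(161)]  # mm/s, hypothetical grid
dataset = [random_labelled_spectrum(velocities, rng) for _ in range(100)]
```

Each element of `dataset` pairs a spectrum with the multiplet counts and hyperfine parameters used to generate it, which is the kind of ground truth a CNN would need for task 2.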
Besides the research activities of this collaboration, the educational component of the project is also important. Following the two successful Winter Schools of Chemometrics (WSC) held in 2019 and 2020 at Sharif University of Technology (SUT), which taught chemometrics to around 150 participants with backgrounds including chemistry, pharmaceutical science, environmental science and engineering, we plan to organize the Third Winter School of Chemometrics (WSC-2021) in February 2021. The Russian team is expected to attend the school and give lectures on advanced chemometric tools used in modern chemistry.
The expected results of the project are:
1) A methodology for applying deep learning techniques to complex Mossbauer and chromatographic data;
2) Results of performance comparison for CNN models and conventional chemometric tools in processing of real datasets;
3) Recommendations on the use of deep learning techniques in analytical chemistry;
4) Publication of the project results in a Q1 analytical chemistry journal;
5) A report on the activities performed in the framework of WSC in Tehran (February 2021).
The project is a thoroughly planned and coordinated collaborative research effort carried out by an international team of experienced scientists who already have a proven joint research track record [8] in the field of data processing for analytical chemistry. The collaborators have already submitted a joint proposal on data handling in multisensor systems to the RFBR-INSF project call. Implementing this DL initiative will broaden the research scope of the collaboration and provide an opportunity to prepare further, more ambitious projects with larger funding.
References
[1] Tshitoyan, V., Dagdelen, J., Weston, L. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019).
[2] Yang, J., Xu, J., Zhang, X. et al. Deep learning for vibrational spectral analysis: recent progress and a practical guide. Analytica Chimica Acta 1081, 6–17 (2019).
[3] Melnikov, A.D., Tsentalovich, Y.P., Yanshole, V.V. Deep learning for the precise peak detection in high-resolution LC-MS data. Analytical Chemistry 92, 588–592 (2020).
[4] Ma, C., Ren, Y., Yang, J. et al. Improved peptide retention time prediction in liquid chromatography through deep learning. Analytical Chemistry 90, 10881–10888 (2018).
[5] Cho, S.Y., Lee, Y., Lee, S. et al. Finding hidden signals in chemical sensors using deep learning. Analytical Chemistry 92, 6529–6537 (2020).
[6] Dhillon, A., Verma, G.K. Convolutional neural network: a review of models, methodologies and applications to object detection. Progress in Artificial Intelligence 9, 85–112 (2020).
[7] Debus, B., Panchuk, V., Gusev, B. et al. On the potential of multivariate curve resolution in Mossbauer spectroscopic studies. Chemometrics and Intelligent Laboratory Systems 198, 103941–103948 (2020).
[8] Parastar, H., Kirsanov, D. Analytical figures of merit for multisensor arrays. ACS Sensors 5, 580–587 (2020).