Identifying which control sources contribute to a test soil sample that is mixed, contaminated, improperly stored or damaged is an important problem in soil forensics, soil monitoring and other types of soil analysis. The problem reduces to determining whether two soil samples, a test and a control, share the same origin or source. Here, we propose an algorithm that addresses this problem using 16S rRNA gene libraries of the test and control soil samples and does not rely on OTU clustering. The algorithm first extracts Library-SPECific sets of sequences (LSPECs) for the alternative control libraries and then quantifies the signal of each LSPEC in a test library. Heavy use of suffix arrays for sequence comparison accelerates the algorithm significantly. To evaluate its performance, we collected a control set of 29 soil samples and created two test sets (real and simulated) containing mixed, contaminated and extremely small single-source soil samples (the latter resemble forensic samples). We then carried out 16S rRNA amplicon sequencing of total soil DNA isolated from both test and control soil samples. The algorithm successfully identified the origin of all single-source soil samples and the composition of mixed samples and of samples with low or high levels of contamination. It also remained robust when the control set was enlarged from 9 to 29 samples. We believe the proposed algorithm is suitable for identification problems of varying complexity and is flexible enough to handle other molecular markers and microbiological samples from non-soil sources.
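The two-step idea described above (extract library-specific sequences, then measure their signal in a test library) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an LSPEC is the set of sequences occurring in exactly one control library and that the signal is the fraction of test reads matching that set exactly. It also swaps the suffix-array comparison for plain Python set lookups. The function names, site labels and toy reads are hypothetical.

```python
from collections import Counter


def extract_lspecs(control_libraries):
    """Map each control library to the sequences found in no other control library.

    Assumption: "library-specific" means exact sequences unique to one library.
    """
    # Count in how many libraries each distinct sequence appears.
    library_count = Counter()
    for reads in control_libraries.values():
        library_count.update(set(reads))
    return {
        name: {seq for seq in set(reads) if library_count[seq] == 1}
        for name, reads in control_libraries.items()
    }


def lspec_signals(test_reads, lspecs):
    """Fraction of test reads that exactly match each library-specific set."""
    total = len(test_reads)
    if total == 0:
        return {name: 0.0 for name in lspecs}
    return {
        name: sum(1 for read in test_reads if read in specific) / total
        for name, specific in lspecs.items()
    }


# Toy data (hypothetical): three control libraries and one mixed test library.
controls = {
    "site_A": ["ACGTAC", "TTGACA", "GGCATC"],
    "site_B": ["ACGTAC", "CCATGA", "GATTAC"],
    "site_C": ["ACGTAC", "TGCATG"],
}
test = ["TTGACA", "GGCATC", "CCATGA", "ACGTAC"]

lspecs = extract_lspecs(controls)
signals = lspec_signals(test, lspecs)
```

In this toy case the read shared by all three controls carries no signal, so the test library shows a stronger signal for site_A than for site_B and none for site_C. Real reads would need tolerance for sequencing errors and length differences, which exact set membership does not provide.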
Scopus subject areas
- Decision Sciences(all)
- Ecology, Evolution, Behavior and Systematics