A global atlas of subsurface microbiomes reveals phylogenetic novelty, large scale biodiversity gradients, and a marine-terrestrial divide

Subsurface environments constitute one of Earth’s largest habitats for microbial life, yet differences between surface and subsurface microbiomes and between marine and terrestrial microbiomes remain unclear. We analyzed 478 archaeal and 1054 bacterial amplicon sequence datasets and 147 metagenomes from diverse and globally distributed surface and subsurface environments. Taxonomic analyses show that archaeal and bacterial diversity is similar in marine and terrestrial microbiomes at local to global scales. However, community composition greatly differs between marine and terrestrial microbiomes, suggesting a horizontal phylogenetic divide between sea and land that mirrors patterns in plant and animal diversity. In contrast, community composition overlapped from surface to subsurface environments indicating vertical diversity continua rather than a discrete subsurface biosphere. Diversity of terrestrial microbiomes decreases from surface to subsurface environments. Marine subsurface diversity and phylogenetic novelty rivaled (bacteria) or even exceeded (archaea) that of surface environments, suggesting that the marine subsurface holds a considerable and underestimated fraction of Earth’s archaeal and bacterial diversity. Teaser Subsurface ecosystems within Earth’s crust and seafloor harbor diverse and distinct microbiomes featuring many unknown lineages.

While subsurface ecosystems harbor substantial biomass, diversity, and phylogenetic novelty, the degree to which the biodiversity of subsurface ecosystems directly compares with that of surface ecosystems remains unclear.Moreover, although standardized global meta-analyses exist for the marine (16) and the terrestrial subsurface (54), a standardized global dataset encompassing both marine and terrestrial subsurface environments was not available to date.Here, we provide a comparison between surface and subsurface as well as between marine and terrestrial environments.We investigate environments that span a broad range of depths from surface environments (under the influence of relatively fresh photosynthesis-derived organic matter) to very deep and isolated environments (cut off from photosynthetic primary production for at least centuries).We analyze and compare 1532 globally distributed taxonomic marker gene datasets and 147 metagenomes to investigate (i) whether microbial communities of marine and terrestrial biomes and of surface and subsurface environments fundamentally differ, (ii) whether subsurface environments are generally less diverse than surface ecosystems, (iii) whether subsurface environments harbor distinct clades, and iv) whether marine and terrestrial subsurface communities share a core microbiome.

Results and Discussion
The Census of Deep Life atlas of subsurface biodiversity The Census of Deep Life (CoDL) under the auspices of the Deep Carbon Observatory organized a decadal effort (2010-2020) to characterize microbial diversity and function in subsurface ecosystems worldwide, carried out by more than two dozen groups and hundreds of researchers.After the completion of the CoDL, we compiled, re-analyzed, and compared 1054 bacterial and 478 archaeal 16S rRNA gene amplicon datasets from 35 individual globally distributed CoDL projects together with 17 additional non-CoDL projects that originated mainly from surface ecosystems (Fig. 1A, B; Table S1, S2).Archaeal and bacterial 16S rRNA gene amplicon sequence variants (ASVs, quasi strain-level microbial lineages) were amplified using domain-specific primers.We obtained 31099 archaeal and 415423 bacterial ASVs from 4.9 × 10 7 archaeal and 13 × 10 7 bacterial reads.On average we found 155 archaeal and 1257 bacterial ASVs per sample (Table S3, S4).Analyses of 147 metagenomes from 49 globally distributed CoDL subsurface projects complement the amplicon datasets.The metagenomes were used for taxonomic analyses based on two different taxonomic markers, 16S rRNA gene sequences retrieved by phyloFlash (pF16S; before read assembly) and ribosomal protein S3 genes (rpS3; after read assembly), as well as for a community analysis based on all predicted genes.We obtained a total of 1443 different archaeal and 22556 different bacterial pF16S genes, as well as 150 different archaeal and 1575 different bacterial rpS3 genes (Table S3, S4).We analyzed the communities first by grouping the amplicon datasets based on the biome from which the samples originated; marine biome (nArchaea=304 datasets, nBacteria=505 datasets) and terrestrial biome (nArchaea=174, nBacteria=549).Within those biomes we grouped samples based on the depth realm they originated from: surface, interface and subsurface.Surface datasets (nArchaea=183, nBacteria=687) include water samples from oceans and lakes (at various depths in the water column), shallow (<0.1 mbsf) sediment samples from oceans, estuaries and lakes, animal-associated microbiomes, and wastewater.Surface ecosystems represent environments under the influence of relatively fresh photosynthesis-derived organic matter.Subsurface datasets (nArchaea=85, nBacteria=122) originated from ecosystems that have likely been cut-off from photosynthetic primary production for a timescale of at least centuries.These ecosystems were accessed via boreholes and include deep sediments, aquifers, and fracture fluids.Marine subsurface sediments were further sub-divided based on depositional setting: shelf, slope, and abyssal domains.The location of each domain is defined by water depth (55)(56)(57).Shelf environments roughly correspond to water depths <200 m, except the Antarctic region where shelf area corresponds to water depths <500 m.The abyssal plain corresponds to areas of water depths >3500m.Sediments under other water depths are referred to as slopes.Interface datasets (nArchaea=210, nBacteria=245) originated from environments that are subjected to influences from surface and subsurface environments and processes.Interface environments can be surface sites that are driven by energy from the subsurface, or vice versa, and included water and sediment samples from caves, hot springs, hydrothermal vents, and cold seeps.We also grouped and analyzed the samples based on the 13 major environments they originated from, as listed in Fig. 1C.To minimize batch effects and biases, we sequenced and analyzed all samples using the same primers, the same chemistry, the same Illumina instrument, and the same bioinformatic pipeline.We minimized the potential impact of contamination by including blanks and controls, and removing notorious contaminants (58,59).We minimized composition bias by normalizing template concentration and standardizing workflows (60)(61)(62).

Substantial differences between marine and terrestrial microbiomes
We found substantial taxonomic differences in archaeal and bacterial community structure between marine and terrestrial biomes.For this analysis we grouped all datasets solely based on whether they originated from a marine or terrestrial environment, regardless of their association with a surface, interface, or subsurface environment.Alpha diversity (community diversity per locality/sample) and beta diversity (community diversity between localities/samples) of marine and terrestrial microbiomes showed the same trends across ASV, rpS3 and pF16S-based analyses.Richness, estimated richness (Chao1) and evenness (Shannon entropy, Inverse Simpson diversity) were similar for archaea but significantly higher for bacteria in marine relative to terrestrial microbiomes (Fig. 2A-F, Fig. S2).After subsampling the datasets to account for the unequal sampling effort, i.e., different sequencing depth, we found on average 75 and 79 archaeal and 513 and 419 bacterial ASV in marine and terrestrial samples, respectively (Table S5, S6).Microbial community structure and composition was very different between the marine and terrestrial biomes.On average only 2-6 % of ASVs, 2 % of rpS3 genes and 8-9 % of pF16S genes were shared between any marine and terrestrial dataset (Fig. 2G, H; Fig. S3; S4).Gamma diversity (total community diversity per biome/environment) tended to be higher in the terrestrial biome (Fig. S3), with exception of bacterial ASVs which were higher in the marine biome.Analyses of all gene families found in the metagenomes, support a higher functional richness and evenness in the terrestrial microbiome (Fig S5A, B), yet the dissimilarity in functional genes between biomes was less pronounced, likely due to housekeeping genes that are found ubiquitously in most microorganisms (Fig. S5C, D).
A multivariate association analysis (63) of our dataset suggests that most bacterial phyla are significantly more abundant and prevalent in the marine biome (Fig. 2I, S6A).Phyla that are more common in marine environments include well-known Cyanobacteria, Planctomycetota, Verrucomicrobiota, and Desulfobacterota, but also less prominent phyla such as Marinimicrobia, Nitrospinota, Fermentibacterota, Caldatribacteriota, Latescibacterota, Aerophobota and Gemmatimonadota.In terrestrial environments we found more commonly Firmicutes, Acidobacteriota, Actinobacteriota, Nitrospirota, and Patescibacteria, as well as Methylomirabilota, Armatimonadota and Elusimicrobiota among others.A third group of bacterial phyla occupy both biomes at similar relative sequence abundances and prevalences.These included Proteobacteria, Bacteroidota, Chloroflexi, Spirochaetota, Myxococcota, Campilobacterota and Thermotogota.In contrast, most archaeal phyla were more abundant and prevalent in the terrestrial biome (Fig. S6B).Only Thermoplasmatota were significantly more common in marine environments, and Crenarchaeota and Asgardarchaeota were found in both biomes at similar abundances and prevalence.All other archaeal phyla including Euryarchaeota, Halobacterota, Hadarchaeota, Nanoarchaeota, Aenigmarchaeota, and Altiarchaeota were significantly more abundant and prevalent in the terrestrial biome.Relative sequence abundances of archaeal and bacterial phyla (Fig. S6C, D) support the trend revealed by the association analysis.Interestingly, some phyla contribute disproportionally to ASV-level diversity including Nitrososphaeria, Bathyarchaeia, Chloroflexi and Planctomycetota (Fig. S6E, F), being responsible for a larger part of the ASV-level diversity than their relative abundance suggests.Archaeal (A-C) and bacterial (D-F) alpha diversity (per sample community richness and evenness) in marine and terrestrial biomes using 16S rRNA gene amplicon sequence variants (ASVs; A, D), as well as metagenome-derived ribosomal protein S3 genes (rpS3; B, E) and 16S rRNA gene sequences detected by phyloFlash (pF16S).To allow comparison the datasets were subsampled to the same number of reads.Pairwise comparisons were performed using a Wilcoxon rank sum test.Significance: ***: p<0.001.The number of datasets is shown below the boxplots.Community dissimilarity between marine and terrestrial communities was shown by non-metric multidimensional scaling ordinations of ASV-based dissimilarity matrices using 469 archaeal (G) and 1105 bacterial datasets (H).Each circle represents the community structure of a dataset and is connected to the group centroid (weighted average mean of within-group distances), the ellipses depict one standard deviation of the centroid.Statistical testing using ANOSIM showed that the groups are overlapping but significantly different (R~0.5, p<0.001).(I) Differential abundance analyses of marine vs terrestrial bacterial phyla.The phyla are ordered from top to bottom based on increasing phylum level MaAsLin2 coefficient, i.e. likeliness of their occurrence in terrestrial-derived samples ("terrestrial-ness"). Boxplots summarize order level MaAsLin2 coefficients, i. e., "terrestrial-ness", within the listed phyla.Note: due to ease of visualization boxplots are also shown for very small number (n) of orders.Significance levels are: *: p<0.05, **: p<0.01, ***: p<0.001, ****: p<0.0001.Additional phyla are shown in Fig. S6.

Subsurface microbiomes can be as diverse as surface microbiomes
Within each biome, differences in alpha diversity between surface, interface, and subsurface environments, i.e., depth realms, were less pronounced than between the biomes (Fig. S7, S8).Across depth realms, communities of both biomesmarine and terrestrialmainly had a similar richness and shared many ASVs (Fig. S9, S10, Video S1, S2).Archaeal and bacterial communities in the three depth realms were distinct but with considerable overlaps.The larger differences in community structure between marine and terrestrial biomes than between depth realms, indicates that the divide between microbial life on land and in the sea is more pronounced than the divide between surface and subsurface communities.We found that species richness and diversity in many subsurface environments rival those in surface environments.This finding was consistent across the 13 investigated environment typeswhere levels of microbial diversity are often comparable across the surface, interface, and subsurface (Fig. 3A-D).Archaeal alpha diversity was overall highest in samples from cold seeps and brines, caves, springs, and the deep marine subsurface (Fig. 3A).Bacterial alpha diversity was very high in the cave samples and sediments, but also in seeps and vents.Marine subsurface bacterial diversity rivaled the diversity found in marine surface water, wastewater, or host-associated microbiomes (Fig. 3B).Total archaeal diversity (gamma diversity) was significantly higher in marine interface and subsurface environments than in the marine surface (Fig. 3C, S7A-D).Similarly, bacterial gamma diversity was highest in marine interface environments (Fig. 3D).In the terrestrial biome, archaeal gamma diversity was comparable across all depth realms (Fig. 3C), whereas bacterial diversity was highest in the surface (Fig. 3D).Environments are grouped based on biome (marine, terrestrial) and depth realm (surface, interface, subsurface).Note: Archaeal terrestrial water and sediment samples contain too few datapoints for robust visualization using boxplots, yet the plots were retained for completeness.Total archaeal (C) and bacterial richness (D) using a subsampling approach to account for different group sizes (100 iterations using the smallest group size).Normalized gamma diversity corroborates that archaeal diversity is highest in marine interface and subsurface ecosystems even when group sizes are considered.All pairwise comparisons were significantly different except for gamma diversity between terrestrial interface and subsurface biomes (archaea and bacteria), and marine and terrestrial surface biomes (bacteria).
Interestingly, Thermoplasmata, Nitrososphaeria, and Bathyarchaeia together comprised more than half of the relative sequence abundance in any investigated environment, except the terrestrial subsurface (Fig. 4A; S11).The contribution of these three classes to global ASV richness was similarly outsized (Fig. 4B), while other clades that can occur at relatively high abundances contributed very little to overall richness, including Hadarchaea, ANME-1, Methanobacteria, Thermoprotei and Methanococci (Fig. 4A, B; S11).This suggests that a large part -if not most -of the global archaeal abundance and richness can be attributed to three archaeal classes Thermoplasmata, Nitrososphaeria, and Bathyarchaeia.Bacterial communities were largely dominated by Proteobacteria and Bacteroidota, except those of caves and the subseafloor where Chloroflexi played an important role (Fig. 4C).The contribution of the lineages to overall bacterial richness was often not proportional to their abundance.Proteobacteria, Bacteroidota, Firmicutes, Cyanobacteria and Desulfobacterota contributed less, while Chloroflexi, Planctomycetota, and Verrucomicrobia contributed more richness than expected based on their abundance (Fig. 4D).We found increasing archaeal richness and evenness with depth, supporting that archaea are well suited for life in subsurface and subsurface influenced, i.e. interface, environments.The importance of archaea in marine subsurface ecosystems has previously been demonstrated by studies assessing community structure using lipid analyses (8), cell abundances (15), and meta-omics (13,64).The high diversity of archaea in the terrestrial subsurface, rivaling that of terrestrial surface ecosystems is less well documented.Similarly surprising is the high diversity of bacteria in the marine subsurface in comparison to surface ecosystems, supporting previous findings that focused on marine sediments (16).This work collectively improves our understanding of the contribution of archaea and bacteria to global microbial diversity and ecosystem function.To further investigate the global trend and corroborate the high diversity of archaeal and bacterial diversity in marine subsurface ecosystems, future studies should include datasets of additional surface environments such as soils and freshwater sediments, as those were underrepresented in our datasets.Interface and subsurface microbiomes are phylogenetically more diverse in marine than in terrestrial biomes Microbial species-level diversity was significantly higher in marine interface and subsurface environments than in terrestrial interface and subsurface environments (Fig. 3C, D, S7).This was the case for archaea and bacteria, and for all tested alpha diversity indices concerning richness, evenness, and estimated richness, as well as gamma diversity.We corroborated this ASV-based trend through comparisons of the diversity of rpS3 and pF16S genes (Fig. S7), for which marine interface and subsurface microbiomes were also either significantly more diverse or statistically indistinguishable.Beta diversity between the marine and terrestrial subsurface also substantially differed based on ASVs (Fig. S9A, B), showing that analogous to surface ecosystems marine and terrestrial subsurface environments harbor many unique microbial species and represent fundamentally different habitats.Generally, the average community dissimilarity between terrestrial samples was greater than between marine samples.The processes that lead to lower richness and greater dissimilarity in the terrestrial subsurface cannot be explained with our dataset but could be due to fundamental properties of the environment, including high habitat heterogeneity, high disturbance, or low dispersal.The diversity trends were also supported by rpS3 and pF16S gene-based analyses (Fig. S9C, D).

Abundant microbial lineages in the subsurface
Archaeal and bacterial community composition also differed between the surface and subsurface environments in terms of lineage identity, relative sequence abundance and prevalence, i.e., the percentage of samples in which a lineage occurred (Fig. 5, 6, S11-13).Certain archaeal and bacterial lineages were significantly more prevalent in samples from subsurface than surface ecosystems.Most archaeal phyla predominantly occurred in the subsurface (Fig. 5B), except for Thermoplasmatota, which due to their abundance in the ocean are surface-associated, and Crenarchaeota, which have low statistical support for a preferred surface occurrence, likely caused by the class Bathyarchaeia which occurs in almost all marine and terrestrial subsurface samples (Fig. 6, S11).Among the Bacteria the phyla Firmicutes, Caldatribacteriota, Nitrospirota, Patescibacteria, Aerophobota, Chloroflexi, Desulfobacterota, Methylomirabilota, and Spirochaetota are among those that are more likely found in the subsurface (Fig. 5A, Fig. S12, S14).This finding supports previous reports of microbes that are abundant in deep subsurface environments including the sulfate-reducing Candidatus Desulforudis audaxviator (65)(66)(67), Spirochaeta (68) or organisms affiliating with (Cald)atribacteriota (69)(70)(71)(72)(73)(74)(75).In contrast, Cyanobacteria, Bdellovibrionota, Fusobacteriota, Verrucomicrobiota, Planctomycetota, Bacteroidota, Campilobacterota, and Proteobacteria are more widespread in surface ecosystems.We have grouped the most abundant subsurface class and order level lineages based on their relative sequence abundances (RSAs) and prevalence.The first group comprises lineages with high RSAs and a high prevalence, e.g., Bathyarchaeia in the marine subsurface or Gammaproteobacteria in the terrestrial subsurface.Both lineages occur in all samples (prevalence = 1) at up to 2 % RSA (Fig. 6A,  D).Bathyarchaeia still occur at more than half of all marine samples (prevalence >0.5) with RSAs of >40 % (Fig. 6A).The second group comprises lineages that have a high prevalence and low RSAs, i.e., they occur at almost all sites but are mostly rare.This group comprises terrestrial subsurface Bathyarchaeia and Methanosarciniales, as well as Bacilli, Clostridia, Bacteroidia, Alphaproteobacteria, and Actinobacteria in both the marine and terrestrial subsurface (Fig. 6C, D).The third group features those with low prevalence and high abundance, which means they do not occur at all sites, but when they occur, they are very abundant or even dominant.This group includes marine-associated Lokiarchaeia and ANME-1a, as well as Hadarchaeia and ANME-1b, which exhibit this distribution in both the marine and terrestrial subsurface.In the bacterial domain, examples include JS1 Caldatribacteriota, which constitute up to 20 % of the community in about half of the marine subsurface samples (Fig. 6C), as well as marine Dehalococcoidia, Aminicenantia, Aerophophia, Phycisphaerae, and Desulfobacteria.The fourth group represents organisms with a low prevalence and low abundance.These lineages are generally rare, and their occurrence may be stochastic.Depending on an emphasis on prevalence or abundance, the lineages in the first three groups can be interpreted as a core microbiome of the marine or terrestrial subsurface.Although there is little community overlap at species-level, there are lineages at higher phylogenetic levels that occur in both biomes, including Bathyarchaeia, Hadarchaeia, ANME-1b, Alpha-and Gammaproteobacteria, Bacteroidia, Bacilli, and Clostridia which can be considered a subsurface core microbiome.

Phylogenetic novelty in the subsurface
Many lineages that are abundant in subsurface environments belong to branches in the tree of life that have few or no cultured representatives, e.g., Loki-, Bathy-, and Hadarchaeia, and Caldatribacteriota.Together with their often outsized contribution to richness (Fig 3F ) it is likely that the subsurface holds a substantial degree of phylogenetic and likely genomic novelty.To test whether the described high richness in interface and subsurface environments indeed represents phylogenetic novelty, we analyzed the phylogenetic proximity of the detected lineages to their closest isolated relatives.For this, we mapped ASV against a database of cultivated isolates.Our analyses show that most archaeal ASVs in the marine subsurface are only 80-90 % sequence identical to the closest isolated relative (Fig. 7A), on average 88 % (Fig. S15A), indicating that novel family-, order-or potentially even class-level clades can be expected (76).In the marine surface and interface and in terrestrial environments archaeal ASVs are on average more closely related to a cultured isolate (95-97 % sequence identity, Fig. 7A, S15), nevertheless representing potential novel lineages.In the case of bacteria, average sequence identity to the closest isolate is also lowest in the marine subsurface (88 % on average), supporting the notion of extensive phylogenetic novelty in the seafloor (16).However, bacterial phylogenetic novelty is similarly high in all other environments, with most ASV being only ~90 % sequence identical to the closest isolate (Fig. 7B), ranging on average from 89 % -92 % (Fig. S15).Notably, many of the ASVs that are phylogenetically distant from their closest cultured relative have high relative sequence abundances (Fig. 7).This is especially true for Archaea, for which several phylogenetically distant ASV had relative sequence abundances between 1-10 % (Fig. 7A).These findings support the notion of a largely untapped reservoir of uncultivated archaeal diversity in the subsurface (77).But our findings highlight extensive novelty in most other environments, and in the bacterial domain.

Global continua and great divides
Overall, we demonstrate that communities in the marine and terrestrial subsurface can be as phylogenetically diverse as those on the surface, harboring a substantial part of Earth's biodiversity.This finding confirms previous insights from global marine sediments ( 16) and the terrestrial subsurface (54) yet bridges the two distinct biomes in one global analysis.Our comparison of marine and terrestrial microbiomes and depth realms shows that microbial diversity is particularly high in interface environments that are influenced by surface and subsurface processes.The communities at interface ecosystems shared community membership with those of surface and subsurface environments, which reflects the location, as well as the exchange of fluids and other materials between the depth realms.Mud volcanoes and methane seeps have been shown to connect surface and subsurface environments (31)(32)(33), as have marine crustal fluids (78,79).Archaeal interface communities shared more genus-level clades with the subsurface (Fig. 8A, B), whereas bacterial interface communities shared more genuslevel clades with the surface (Fig. 8C, D), suggesting that different ecological processes shape archaeal and bacterial communities.
Despite harboring many distinct microbial lineages, the microbiomes of surface, interface, and subsurface environments showed a relatively high overlap in community composition, suggesting a diversity continuum with depth (Fig. S9, Video S1, S2) that connects surface and subsurface communities across environments and deep time.Moreover, many of the environments we studied cannot be easily defined as surface or subsurface environment.Subsurface or subsurface-impacted ecosystems are not necessarily energy depleted and, in fact, can be relatively energy rich -such as hydrocarbon seeps (80), brines (81), hydrothermal vents (82), oil, coal and shale deposits (51,83,84), serpentinizing systems (85)(86)(87), and caves (88).The subsurface is often replete with methane, hydrogen, and other energy and carbon sources (45,(89)(90)(91)(92).An absence of oxidants, nutrients, trace metals, or extreme physical or chemical conditions, however, can shape the specific community structures and hinder microbial growth or activity in relatively energy-rich ecosystems (93,94).Depth and oxygen content are also not always reliable proxies for subsurface conditions, e.g., in marine sediments with very low organic matter content, oxic conditions can prevail throughout the sediment column (95).Sediment depth and sediment age can be vastly different between locations (56), with past surface conditions or depositional environments (Fig. S16) apparently being imprinted into a given subsurface sample (24,25).Defining environments and samples by a few select parameters such as depth, oxygen content, or age is useful, but to fully understand the subsurface geobiosphere we may need to consider large-scale gradients and continua, analogous to the concept of the critical zone in soil science (96,97).
Our analyses further reveal a major divide between marine and terrestrial microbiomes, showing significant differences in local and overall richness.In contrast to the studied depth realms, the differences between marine and terrestrial microbiomes were striking, with little overlap in community structure (Fig. 2, S2).This divide was also found based on the analysis of metagenome-derived 16S rRNA and rpS3 genes (Fig. 2, S4).The taxonomic differentiation between marine and terrestrial microbial communities mirrors the differences between animal communities in these biomes (98) and is likely caused by the very different chemical and physical properties that shape these environments (99).The environmental drivers responsible for the gradients, continua, and divides remain undefined but they likely include factors such as salinity (100,101), energy availability (57), geological activity (102,103), hydrologic conditions and recharge rates (104,105), and constrictions of void space within which microbial cells might exist (94).Improvements in global scale modeling and increasing availability of data can provide the tools to describe the surface and subsurface on large scales, refine our insights into microbial provinces in the subsurface (106), and potentially define regions of similar subsurface biogeochemical regimes analogous to Longhurst provinces for the surface ocean (107).A, C) and terrestrial biome (B, D).Each circle is a genus-level clade, circle size is average relative sequence abundance in the respective biome, circle colors represent to which of the top 5 phyla the lineage belongs.The location of the circle shows the average relative sequence abundance in each depth realm (surface, interface and subsurface).For example, if a circle lies exactly in the center of the plot, the respective lineage is equally abundant (33.3 %) in each of the three depths, if it lies at one of the corners (e.g., surface) it means it occurs only in the surface, circles that lie on the sides of the plot occur in only two of the three depths.

Dataset specifications
To investigate the microbial diversity and composition of surface, interface, and subsurface microbiomes, we used 478 archaeal and 1054 bacterial 16S rRNA gene amplicon datasets sequenced at the W. M. Keck Ecological and Evolutionary Genetics Facility of the Marine Biological Laboratory (Woods Hole, MA, USA).Most archaeal and bacterial datasets from subsurface and interface environments were part of the Census of Dep Life (CoDL), and most datasets from surface ecosystems were not collected during the CoDL but were kindly shared by numerous other investigators (see Acknowledgements).All sequencing datasets were prepared using identical primers, nearly identical library preparation protocols, identical sequencing chemistries, and bioinformatic analyses to minimize batch effects, biases and ensure comparability of communities based on amplicon sequence variants (ASVs).We removed datasets from enrichment samples, blanks, and controls, as well as datasets that we considered as failed runs (< 2000 bacterial reads or <1000 archaeal reads) and subsurface datasets that contained lineages belonging to known contaminants of subsurface samples (59).Contextual data for each sample/sequence dataset is listed in Table S2.Dataset 1 and 2, representing the archaeal and bacterial analysis logs using the "biome" grouping as a representative workflow).We are aware that certain environments are overrepresented (e.g., marine surface), while others are underrepresented (e.g., caves) or missing (e.g., soils).However, the highlighted global trends are likely reliable due to the size and breadth of the dataset, the comparability of environments and robustness of the results (e.g., towards the removal of reads, Fig. S8).

16S rRNA gene amplicon-based community analyses
Raw sequences were analyzed using DADA2 (110) following the DADA2 Pipeline Tutorial v1.16 (https://benjjneb.github.io/dada2/tutorial.html).Each Illumina run was analyzed separately to use runspecific error profiles, as is recommended best practice for big data (https://benjjneb.github.io/dada2/bigdata.html).In brief, forward and reverse reads were quality-trimmed to 275 bp and 205 bp, respectively, and primer sequences (17 bp forward, 21 bp reverse) were removed.Reads with more than two expected errors were discarded and paired reads were merged.All runs were combined, and chimeric sequences were removed.Species level taxonomy was assigned with the silva_nr_v138_train_set and silva_species_assignment_v138 based on the Silva small subunit reference database SSURef v138 (release date: 16-Dec-2019; (111)).After quality control and the removal of enrichments, blanks, biological and technical replicates, we obtained 478 archaeal and 1054 bacterial amplicon datasets.Archaeal datasets contained a total of 4.9 × 10 7 sequence reads belonging to 31099 unique amplicon sequence variants (ASVs).Archaeal samples had on average 1.02  10 5 reads and 155 ASVs (Table S3).Bacterial datasets comprised a total of 13  10 7 sequence reads belonging to 415423 unique ASVs.Bacterial samples had on average 1.2  10 5 reads and 1257 unique ASVs (Table S4).Amplicon datasets may not quantitatively represent the sampled community yet are reliable to compare community structure and make ecological interpretations (62).Alpha diversity (richness, Shannon entropy, Inverse Simpson Diversity and Chao1 estimated richness) was calculated from the ASV-bysample table using a subsampling approach to account for unequal sampling effort.We used 1142 and 2216 randomly chosen reads from each archaeal and bacterial sample, respectively.Even when using very stringent subsampling conditions of 50000 randomly chosen archaeal and bacterial sequences, respectively, the trends in alpha and beta diversity did not change substantially, hence we decided to include as many samples as possible.Bray-Curtis dissimilarities (112) between all samples were calculated and used for two-dimensional nonmetric multidimensional scaling (NMDS) ordinations with 20 random starts (113).All analyses were carried out with VisuaR (https://github.com/EmilRuff/VisuaR)-a workflow based on the R statistical environment, custom R scripts and several R packages including vegan (114) and ggplot2.Exemplary analysis logs (DatasetS1, S2), the used DADA2 script (DatasetS3), the VisuaR v40 script (DatasetS4) and an example VisuaR user input file (DatasetS5) for this study are available as supplementary materials.

Gene sequence identity analyses
To determine the relatedness of detected 16S rRNA gene sequences to those of the closest isolated strain, we performed a BLASTn search of each 16S rRNA gene amplicon sequence against a database of sequences from cultured archaeal and bacterial isolates, selected to yield only the single best hit for each amplicon.The isolate database was created with makeblastdb using sequences from cultured isolates in the SILVA NR Ref database v123 available at (https://www.arb-silva.de/).We included recently cultured (Cald)atribacteria (73).Only alignment lengths > 350 (97 % of bacterial amplicons and 64 % of archaeal amplicons) were analyzed.
For taxonomic and gene count analyses, cleaned paired-end reads were subsampled to a depth of 2M reads per sample using reformat.shin BBMap (with options: samplereadstarget=2000000 sampleseed=121) and then mapped back to the annotated genes using BBMap with read alignment of ≥95% for the unigene (with options: minid=0.95idfilter=0.95ambiguous=random) and ≥99% for rpS3 genes (with options: minid=0.99 idfilter=0.99 ambiguous=random).Gene count tables were normalized to gene per million (e.g., equivalent to transcript per million, TPM), following (121).Metagenomic short reads were mapped to the SILVA SSU reference database (111) to assign nearest taxonomic units, as well as full-length 16S/18S rRNA gene sequences were reconstructed from metagenomes using phyloFlash v3.4 (122).Metagenome-derived rpS3 and 16S rRNA genes were subsampled (Table S3, S4), analyzed and visualized analogous to the amplicon-derived 16S rRNA genes using the same VisuaR community analysis workflow.

Statistical analyses
Differences in alpha diversity metrics between conditions were tested using the Wilcoxon signed ranktest (ggsignif) as implemented in ggplot2 (123).P values were corrected stringently using the Bonferroni method.We used Analysis of Similarity (124) as implemented in vegan to test whether community structure between conditions was significantly different, as visualized in NMDS plots.Multivariate associations were tested as implemented in the MaAsLin2 method (63).

Fig. 1 .
Fig.1.Geographic location and origin of samples.Maps of samples used for metabarcoding of 16S rRNA gene amplicon taxonomic marker genes (A) and shotgun metagenomic analyses of unassembled and de novo assembled taxonomic marker genes (B).Each symbol represents one project, which comprises multiple individual samples.Both terrestrial as well as marine samples may contain rock, sediment, or water samples.Further maps showing sample material, pH and temperature are included in the Supplementary Material (Fig.S1).(C) Overview of sample origins derived from marine and terrestrial biomes, depth realms, and environments.

Fig. 2 .
Fig. 2. Microbial diversity in marine and terrestrial biome.Archaeal (A-C) and bacterial (D-F) alpha diversity (per sample community richness and evenness) in marine and terrestrial biomes using 16S rRNA gene amplicon sequence variants (ASVs; A, D), as well as metagenome-derived ribosomal protein S3 genes (rpS3; B, E) and 16S rRNA gene sequences detected by phyloFlash (pF16S).To allow comparison the datasets were subsampled to the same number of reads.Pairwise comparisons were performed using a Wilcoxon rank sum test.Significance: ***: p<0.001.The number of datasets is shown below the boxplots.Community dissimilarity between marine and terrestrial communities was shown by non-metric multidimensional scaling ordinations of ASV-based dissimilarity matrices using 469 archaeal (G) and 1105 bacterial datasets (H).Each circle represents the community structure of a dataset and is connected to the group centroid (weighted average

Fig. 3 .
Fig. 3. Observed archaeal and bacterial richness across environments based on 16S rRNA gene amplicon sequence variants.Archaeal (A) and bacterial (B) richness found in the 14 studied environments (n: number of included samples).Environments are grouped based on biome (marine, terrestrial) and depth realm (surface, interface, subsurface).Note: Archaeal terrestrial water and sediment samples contain too few datapoints for robust visualization using boxplots, yet the plots were retained for completeness.Total archaeal (C) and bacterial richness (D) using a subsampling approach to account for different group sizes (100 iterations using the smallest group size).Normalized gamma diversity corroborates that archaeal diversity is highest in marine interface and subsurface ecosystems even when group sizes are considered.All pairwise comparisons were significantly different except for gamma diversity between terrestrial interface and subsurface biomes (archaea and bacteria), and marine and terrestrial surface biomes (bacteria).

Fig. 4
Fig. 4. Relative sequence abundance and richness of most important lineages across environments.Relative sequence abundance of top 10 most abundant archaeal classes (A) and bacterial phyla (C) in the 13 studied environments.Contribution of most abundant lineages to average number of archaeal (B) and bacterial (D) ASVs per sample (richness).

Fig. 5 .
Fig. 5. Multivariate association analyses of microbial lineages.Analyses compare the occurrence of bacterial (A) and archaeal (B) phyla in surface vs subsurface realms.The phyla are ordered from top to bottom based on increasing likeliness of their occurrence in subsurface-derived samples (increasing MaAsLin2 coefficients; "subsurface-ness").Boxplots summarize order level MaAsLin2 coefficients within the listed phyla.Note: due to ease of visualization boxplots are even shown for very small number (n) of orders.The significance of the MaAsLin2 phylum level coefficient is shown in the column denoted "signif".Significance levels are: *: p<0.05, **: p<0.01, ***: p<0.001, ****: p<0.0001.Additional bacterial phyla are shown in Fig. S14.

Fig. 6 .
Fig. 6.Subsurface core microbiomes.The heatmaps show potential archaeal order level core microbiomes in the marine (A) and terrestrial subsurface (B), and bacterial class level core microbiomes in the marine (C) and terrestrial subsurface (D).The 20 most relevant lineages, regarding their prevalence and relative sequence abundances, are shown as rows, each column color represents the lineages prevalence at a certain RSA threshold.Uncultured/unclassified lineages are denoted by "_unc" and the closest known phylogenetic level or the closest phylogenetic level with an isolated representative is shown.For example, Bathyarchaeia_unc are all uncultured/unclassified/ order level clades in the class Bathyarchaeia.

Fig. 7
Fig. 7. Microbial phylogenetic novelty.Percent identity values (PIVs) of archaeal (A) and bacterial (B) amplicon sequence variants (ASVs) relative to their closest cultured relative for the studied biomes and realms.Each square in the density plot represents one ASV and depicts its PIV (y-axis) and relative sequence abundance (RSA, x-axis, logarithmic).The color gradient shows how many ASVs have identical PIV and RSA.The more yellow, the more ASVs are represented by the square.

Fig. 8 .
Fig. 8. Genus-level diversity within and between depth realms.Ternary plots show archaeal (A, B) and bacterial genuslevel diversity (C, D) in the marine (A, C) and terrestrial biome (B, D).Each circle is a genus-level clade, circle size is average relative sequence abundance in the respective biome, circle colors represent to which of the top 5 phyla the lineage belongs.The location of the circle shows the average relative sequence abundance in each depth realm (surface, interface and subsurface).For example, if a circle lies exactly in the center of the plot, the respective lineage is equally abundant (33.3 %) in each of the three depths, if it lies at one of the corners (e.g., surface) it means it occurs only in the surface, circles that lie on the sides of the plot occur in only two of the three depths.