TaxonTableTools - A comprehensive, platform-independent graphical user interface software to explore and visualise DNA metabarcoding data

DNA metabarcoding is increasingly used in research and application to assess biodiversity. Powerful analysis software exists to process raw data. However, when it comes to translating sequence read data into biological information, many end users with limited bioinformatic expertise struggle with the downstream analysis and explore their data only to a minor extent. Thus, there is a growing need for easy-to-use graphical user interface (GUI) software to analyse and visualise DNA metabarcoding data. We here present TaxonTableTools (TTT), a new platform-independent GUI software that aims to fill this gap by providing simple and reproducible analysis and visualisation workflows. TTT uses a so-called “TaXon table” as input. This format can easily be generated within TTT from two input files: a read table and a taxonomy table, both of which can be obtained from various published metabarcoding pipelines. TTT analysis and visualisation modules include, for example, Venn diagrams to compare taxon overlap among replicates, samples or different analysis methods. TTT analyses and visualises basic statistics such as read proportion per taxon and offers more sophisticated visualisations such as interactive Krona charts for taxonomic data exploration. Various ecological analyses, such as alpha or beta diversity estimates, rarefaction analyses and ordination plots, can be produced directly. Data can also be explored in formats required by traditional taxonomy-based analyses of regulatory bioassessment programs. TTT comes with a manual and tutorial, is free and publicly available through GitHub (https://github.com/TillMacher/TaxonTableTools) and the Python package index (https://pypi.org/project/taxontabletools/).

for large-scale biomonitoring programs. As a direct consequence, the analysis of this growing amount of data and its translation into biologically meaningful results increasingly become the bottleneck, which limits the uptake of the methods by non-experts and bioinformatics beginners. However, it is the biologists who need to work with the data and interpret them.

To address the clear need to analyse increasing amounts of data in a user-friendly way, we developed the software TaxonTableTools (TTT in the following). TTT was developed as part of the GeDNA project, which tests the implementation of eDNA metabarcoding as part of regulatory biomonitoring. The program provides easy-to-use tools for biologists and non-bioinformaticians to analyse and visualize their metabarcoding data quickly and reproducibly via a GUI. It unites commonly used data processing steps for metabarcoding data with a set of modules used for taxonomic exploration of the results, ecological analyses as well as options to use the data as part of regulatory biomonitoring applications.

Implementation
TaxonTableTools is written in Python and available on GitHub (https://github.com/TillMacher/TaxonTableTools). Python currently runs on all three major operating systems: Windows, macOS and Linux-based distributions (e.g. Ubuntu). Program installation requires only minimal user input. When Python and pip are properly installed, the required Python packages can be easily installed via pip. To improve user-friendliness, TTT comes with a mouse-driven graphical user interface (GUI), which allows the user to easily execute the various modules, as well as a detailed manual and a tutorial with a test data set (figure 1).

A key advantage of TTT is a comprehensive data management structure. New projects are created within a dedicated project folder.
All generated files are stored in the respective project directory, which drastically increases clarity and structure when working with different data sets or projects. When launched, TTT asks to either create a new project folder or load an existing one. This removes the need to name output files explicitly. Newly generated files are named according to their input file and the conducted application.

A major goal of TTT was to offer a rapid and easy tool to visualize the data for reports or publications. Thus, the standard output format for most plots is PDF, which retains vector graphics. This allows post-processing of plots created with TTT with any vector graphics software.

Input formats and data conversion

Input format requirements
TTT requires two input files, a read table and a taxonomy table. A read table is a data frame that contains the read abundances for each OTU (operational taxonomic unit) or ASV (amplicon sequence variant) per sample and its respective sequence. Read tables are generated by various published DNA metabarcoding pipelines and wrappers, e.g. JAMP, DADA2, QIIME2 or OBITOOLS. Since the output layout differs between pipelines, TTT requires a specific input layout. This can easily be created from the various other formats; mostly only the header requires minor adjustments. A taxonomy table is a data frame that holds taxonomic information for each OTU of the read table. The layout and informational content often differ drastically, as there is currently no consensus on a standard format. Taxonomy tables can, for example, be created using QIIME2, SLIM, blast+ or BOLDigger, or even be compiled manually. As the standard input format, TTT uses the output format of the BOLDigger tool.
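The combination of a read table and a taxonomy table into one table keyed by OTU can be sketched as follows. This is a minimal illustration in plain Python, not the TTT implementation: the dictionary layout, the column names and the function name `build_taxon_table` are assumptions made for the example, and the real TaXon table layout is defined in the TTT manual.

```python
def build_taxon_table(read_table, taxonomy_table):
    """Join taxonomy and read counts per OTU; fail if the OTU sets differ."""
    # OTUs that occur in only one of the two tables (symmetric difference).
    mismatched = set(read_table) ^ set(taxonomy_table)
    if mismatched:
        raise ValueError(f"OTUs not present in both tables: {sorted(mismatched)}")
    return {
        otu: {**taxonomy_table[otu], **read_table[otu]}
        for otu in sorted(read_table)
    }


# Hypothetical toy inputs: read counts per sample and taxonomy per OTU.
read_table = {
    "OTU_1": {"sample_A": 120, "sample_B": 0},
    "OTU_2": {"sample_A": 15, "sample_B": 340},
}
taxonomy_table = {
    "OTU_1": {"Family": "Baetidae", "Genus": "Baetis", "Species": "Baetis rhodani"},
    "OTU_2": {"Family": "Gammaridae", "Genus": "Gammarus", "Species": ""},
}

taxon_table = build_taxon_table(read_table, taxonomy_table)
for otu, row in taxon_table.items():
    print(otu, row)
```

Raising an error on mismatched OTU sets, rather than silently dropping rows, is a design choice of this sketch: it makes an inconsistent pair of input files visible before any analysis is run.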
As a requirement, the same OTUs have to be present in both the taxonomy [...] recognized via the sample names, which have to be marked with a trailing underscore and a user-defined symbol at the end (e.g. the commonly used "_rep1", "_rep2" or "_a", "_b"). Here, the first module allows filtering OTUs by keeping only those that are present in all replicates of one sample. In detail, OTUs that are not present in all replicates of one sample are excluded by setting their read counts to zero. When the research interest is focused on low-abundance or rare OTUs, this module is not recommended, since it might exclude real but rare OTUs. Afterwards, replicates can be merged into one representative sample by calculating the sum of reads for each OTU across the replicates. This will drastically reduce redundancy in the TaXon table and [...] implemented. However, the TTT roadmap includes the analysis of metadata for OTUs in a future version.

Conversion to incidence data
The use of read abundances as a proxy for specimen counts or biomass estimates has been a subject of discussion since the development of DNA metabarcoding. Due to PCR stochasticity, varying primer binding efficiency and sequencing bias, there is often only a weak correlation between read abundances and specimen counts or [...] it is recommended to convert the read abundance data to incidence data for biodiversity analyses. However, this conversion comes with a downside, since incidence data limits the pool of appropriate diversity estimate analyses.

TaXon table analysis

Getting first insights
To get a first overview of the data set, it is helpful to visualize the number of reads, the number of OTUs and the number of OTUs assigned to species level in a plot (figure 2a). This plot allows investigation of the overall quality of the data set. Generally, negative controls should represent only a fraction of the overall reads. Furthermore, samples that have drastically lower read counts, and thus often also fewer OTUs and fewer OTUs at species level, should be considered for removal from the data set. They are prone to create outliers in statistical analyses or to distort comparisons between locations or categories.
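The numbers behind such an overview plot can be sketched in a few lines of plain Python. The table layout, the sample names and the helper names are hypothetical, and the cut-off used to flag low-read samples (10 % of the median read count) is an arbitrary assumption for illustration, not a threshold taken from TTT.

```python
from statistics import median


def sample_overview(taxon_table, samples):
    """Per sample: total reads, number of OTUs, and OTUs with a species name."""
    overview = {}
    for sample in samples:
        present = [otu for otu in taxon_table if otu["reads"][sample] > 0]
        overview[sample] = {
            "reads": sum(otu["reads"][sample] for otu in taxon_table),
            "otus": len(present),
            "species_level_otus": sum(1 for otu in present if otu["species"]),
        }
    return overview


def low_read_samples(overview, field_samples, fraction=0.1):
    """Flag samples whose read count is below `fraction` of the median (assumed cut-off)."""
    typical = median(overview[s]["reads"] for s in field_samples)
    return [s for s in field_samples if overview[s]["reads"] < fraction * typical]


# Hypothetical toy data: three field samples and one negative control (NC).
taxon_table = [
    {"species": "Baetis rhodani", "reads": {"S1": 500, "S2": 20, "S3": 400, "NC": 0}},
    {"species": "", "reads": {"S1": 300, "S2": 0, "S3": 250, "NC": 2}},
    {"species": "Gammarus pulex", "reads": {"S1": 200, "S2": 10, "S3": 150, "NC": 0}},
]

overview = sample_overview(taxon_table, ["S1", "S2", "S3", "NC"])
flagged = low_read_samples(overview, ["S1", "S2", "S3"])
print(overview)
print("candidates for removal:", flagged)
```

The negative control is kept out of the median calculation on purpose, since it is expected to contain very few reads.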

Read proportions
Read proportions can be illustrated in a scatter plot. In contrast to the commonly used bar plot, a scatter plot remains readable even for larger data sets (figure 2b). Each taxon is represented by its own entity, so the plot no longer depends on color schemes.
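The values shown in such a plot are simple percentages of the reads within one sample. A minimal sketch, assuming a hypothetical mapping from taxon name to read count for a single sample:

```python
def read_proportions(reads_per_taxon):
    """Return the share of reads per taxon in percent for one sample."""
    total = sum(reads_per_taxon.values())
    if total == 0:
        # An empty sample has no meaningful proportions; report zeros.
        return {taxon: 0.0 for taxon in reads_per_taxon}
    return {taxon: 100 * reads / total for taxon, reads in reads_per_taxon.items()}


# Hypothetical read counts for one sample, summed per family.
sample_reads = {"Baetidae": 600, "Gammaridae": 300, "Chironomidae": 100}
proportions = read_proportions(sample_reads)
for taxon, share in proportions.items():
    print(f"{taxon}: {share:.1f} %")
```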

Taxonomic richness and resolution
Measuring the taxonomic richness of a sample assemblage is an essential objective in every biodiversity analysis. In classical ecology terms, the species richness is defined as the number of species in an ecological community, landscape or region. The most straightforward computation of species richness is to count the number of OTUs or species in the data set. The species richness can either be calculated for the whole data set or for each sample itself.

However, identification to species level is often not possible. Many species remain undescribed and there is a lack of reference sequences for a vast number of species. Still, higher taxonomic levels (e.g. genus or family) can hold information for assessing biodiversity. The overall taxonomic resolution of a data set can be visualized in a bar chart, which plots the number of OTUs assigned to the respective level as lowermost rank. The taxonomic resolution can be used as an indicator for several potential sources of bias, like varying primer binding efficiency or bioinformatics process bias (e.g. remaining primer sequences). These can act as sources for a reduced taxonomic resolution, as OTUs are often not assigned to species level in consequence.
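Both quantities described above, richness and the number of OTUs per lowermost assigned rank, can be sketched as follows. The rank list, the dictionary layout and the convention that an unassigned rank is an empty string are assumptions of this example.

```python
RANKS = ["Phylum", "Class", "Order", "Family", "Genus", "Species"]


def lowest_rank(taxonomy):
    """Return the most specific rank with a non-empty assignment, or None."""
    for rank in reversed(RANKS):
        if taxonomy.get(rank):
            return rank
    return None


def taxonomic_resolution(otus):
    """Count OTUs by the lowermost rank they are assigned to."""
    counts = {rank: 0 for rank in RANKS}
    for taxonomy in otus.values():
        rank = lowest_rank(taxonomy)
        if rank is not None:
            counts[rank] += 1
    return counts


# Hypothetical taxonomy assignments; "" marks a rank without assignment.
otus = {
    "OTU_1": {"Phylum": "Arthropoda", "Class": "Insecta", "Order": "Ephemeroptera",
              "Family": "Baetidae", "Genus": "Baetis", "Species": "Baetis rhodani"},
    "OTU_2": {"Phylum": "Arthropoda", "Class": "Insecta", "Order": "Ephemeroptera",
              "Family": "Baetidae", "Genus": "Baetis", "Species": "Baetis rhodani"},
    "OTU_3": {"Phylum": "Arthropoda", "Class": "Malacostraca", "Order": "Amphipoda",
              "Family": "Gammaridae", "Genus": "Gammarus", "Species": ""},
    "OTU_4": {"Phylum": "Arthropoda", "Class": "Insecta", "Order": "Diptera",
              "Family": "Chironomidae", "Genus": "", "Species": ""},
}

otu_richness = len(otus)
species_richness = len({t["Species"] for t in otus.values() if t["Species"]})
resolution = taxonomic_resolution(otus)
print(otu_richness, species_richness, resolution)
```

The toy data also shows why OTU richness and species richness can differ: two OTUs assigned to the same species count twice as OTUs but once as a species.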

Diversity analyses and ordination methods
DNA metabarcoding data is often used for diversity analyses. TTT offers calculation of alpha and beta diversity and ordination analyses. The implemented tools mostly depend on the Python package scikit-bio (http://scikit-bio.org/). All diversity analyses require an incidence data TaXon table. The alpha diversity calculation is based on the number of OTUs per sample, which is displayed as a scatter plot. Beta diversity is calculated as Jaccard distances, which are illustrated in a distance matrix. Furthermore, a Jaccard distance based principal coordinate analysis (PCoA) can be performed (figure 2c). A canonical-correlation analysis (CCA) tool is also implemented. For both ordination analyses, it is possible to choose two axes from all available axes for plotting.
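The underlying calculations can be illustrated without any external package. The sketch below converts read counts to incidence data, counts OTUs per sample as alpha diversity and computes pairwise Jaccard distances, defined as 1 - |A ∩ B| / |A ∪ B| for the OTU sets A and B of two samples. It is written in plain Python for clarity and is not the scikit-bio based code used in TTT; the data layout and function names are assumptions, and the ordination step is left out.

```python
from itertools import combinations


def to_incidence(read_counts):
    """Convert read abundances to presence (1) / absence (0)."""
    return {otu: 1 if reads > 0 else 0 for otu, reads in read_counts.items()}


def jaccard_distance(incidence_a, incidence_b):
    """Jaccard distance between two samples based on their shared OTUs."""
    a = {otu for otu, present in incidence_a.items() if present}
    b = {otu for otu, present in incidence_b.items() if present}
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(a & b) / len(union)


# Hypothetical read counts per sample and OTU.
reads = {
    "S1": {"OTU_1": 120, "OTU_2": 0, "OTU_3": 30, "OTU_4": 5},
    "S2": {"OTU_1": 80, "OTU_2": 10, "OTU_3": 0, "OTU_4": 0},
    "S3": {"OTU_1": 0, "OTU_2": 40, "OTU_3": 0, "OTU_4": 0},
}

incidence = {sample: to_incidence(counts) for sample, counts in reads.items()}
alpha = {sample: sum(values.values()) for sample, values in incidence.items()}
distances = {
    (a, b): jaccard_distance(incidence[a], incidence[b])
    for a, b in combinations(sorted(incidence), 2)
}
print(alpha)
print(distances)
```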

Report and taxa list
A taxa list can be created from the TaXon table. This table includes all taxa found in the input TaXon table and reduces redundant hits. Optionally, for each hit that was identified to species level, a link to the Global Biodiversity Information Facility (GBIF) website is created. The GBIF database (https://www.gbif.org/) is accessed via its application programming interface (API). These links allow for quick investigation of the taxon list, particularly for checking unfamiliar taxa. Furthermore, statistics can be calculated for each taxon. These include the absolute number of reads per taxon and the relative proportion within the data set, the occupancy across all samples, the number of OTUs identified as the respective taxon and the intraspecific distances for taxa with multiple OTUs. In addition to the taxon list, a report file is created. This report file includes relevant information on how the data was processed, from the wet lab to bioinformatics processing. This information can be entered in the GUI and improves the documentation of the data in hindsight.
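Most of the per-taxon statistics listed above can be derived by collapsing OTUs with the same taxon name. The following sketch covers reads, relative read proportion, occupancy and OTU count; intraspecific distances are omitted because they require the OTU sequences. The table layout and the function name `taxon_statistics` are hypothetical.

```python
def taxon_statistics(taxon_table, samples):
    """Collapse OTUs per taxon and compute reads, proportion, occupancy and OTU count."""
    collected = {}
    for row in taxon_table:
        entry = collected.setdefault(
            row["taxon"], {"reads": 0, "otus": 0, "present_in": set()}
        )
        entry["otus"] += 1
        for sample in samples:
            reads = row["reads"][sample]
            entry["reads"] += reads
            if reads > 0:
                entry["present_in"].add(sample)

    total_reads = sum(entry["reads"] for entry in collected.values())
    return {
        taxon: {
            "reads": entry["reads"],
            "read_proportion": 100 * entry["reads"] / total_reads if total_reads else 0.0,
            "occupancy": 100 * len(entry["present_in"]) / len(samples),
            "otus": entry["otus"],
        }
        for taxon, entry in collected.items()
    }


# Hypothetical TaXon table rows: two OTUs share the same species name.
taxon_table = [
    {"taxon": "Baetis rhodani", "reads": {"S1": 100, "S2": 0}},
    {"taxon": "Baetis rhodani", "reads": {"S1": 50, "S2": 50}},
    {"taxon": "Gammarus pulex", "reads": {"S1": 0, "S2": 200}},
]

stats = taxon_statistics(taxon_table, ["S1", "S2"])
for taxon, values in stats.items():
    print(taxon, values)
```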

Conversion to regulatory assessment programs
One additional aim of TTT is to convert metabarcoding data sets into formats required by tools used for regulatory frameworks such as the European Water Framework Directive (WFD, Directive 2000/60/EC). Monitoring activities for the WFD, and for counterparts in other areas and for other ecosystems, aim to provide standardized assessments of the ecological quality of waterbodies derived from biota. In the initial version, we provide the opportunity to convert metabarcoding lists into a format that can be used as input to the German Water Framework Directive analysis tool. This online tool (www.gewaesser-bewertung-berechnung.de) is designed to allow the upload of taxa lists from monitoring activities, from which the ecological quality according to the German river assessment scheme is calculated. In addition, many supporting metrics (such as feeding types or habitat preferences of macroinvertebrates) are calculated. Upload requires a species-station table in an Excel or ASCII format with species in rows and stations in columns, giving the abundance (or alternatively the presence/absence) of the recorded taxa. Each taxon is accompanied by an ID that allows for linking the taxon to its specific autecological characteristics. For comparability reasons, the system standardizes the taxonomy to an operational taxa list, which defines for each taxon the taxonomic level achievable by identification in routine water management. As an alternative to the direct upload, the system offers a batch mode, allowing large data sets to be automatically read from databases of water authorities and the assessment results to be returned. TTT provides species-station tables in the format required by this system, including the taxon ID, that can be directly uploaded and used for river assessment.
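The general shape of such a species-station table, with species in rows, stations in columns and presence/absence values, can be sketched as follows. The taxon IDs, the column headers, the semicolon delimiter and the rule of skipping taxa without an ID are all assumptions made for this illustration; the exact layout expected by the assessment tool has to be taken from its documentation.

```python
import csv
import io


def species_station_table(records, taxon_ids, stations):
    """Build rows of [ID, taxon, presence per station] from (taxon, station, reads) records."""
    presence = {}
    for taxon, station, reads in records:
        # Taxa without an entry in the (hypothetical) ID list are skipped.
        if taxon in taxon_ids and reads > 0:
            presence.setdefault(taxon, set()).add(station)
    rows = [["ID", "Taxon", *stations]]
    for taxon in sorted(presence):
        flags = [1 if station in presence[taxon] else 0 for station in stations]
        rows.append([taxon_ids[taxon], taxon, *flags])
    return rows


def to_csv(rows):
    """Serialize the table as semicolon-separated text (assumed ASCII layout)."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=";", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical IDs; real IDs come from the operational taxa list.
taxon_ids = {"Baetis rhodani": 1001, "Gammarus pulex": 1002}
records = [
    ("Baetis rhodani", "Station_1", 120),
    ("Baetis rhodani", "Station_2", 0),
    ("Gammarus pulex", "Station_2", 40),
    ("Unknown taxon", "Station_1", 10),
]

rows = species_station_table(records, taxon_ids, ["Station_1", "Station_2"])
csv_text = to_csv(rows)
print(csv_text)
```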
Correlations between samples can be investigated by performing a principal coordinate analysis (PCoA), which is based on Jaccard distances (C). An analysis of similarities (ANOSIM) and p-test is performed automatically. Taxa overlaps of up to three samples can be visualized with Venn diagrams (D).