Contents
Sample | Raw | Filtered | Merged | In OTUs | shannon | simpson | invsimpson |
---|---|---|---|---|---|---|---|
F3D144_S210_L001_001 | 4827 | 4809.0 | 3107.0 | 2968.0 | 3.126734 | 0.924098 | 13.174933 |
F3D142_S208_L001_001 | 3183 | 3168.0 | 2185.0 | 2092.0 | 3.269211 | 0.934444 | 15.254106 |
F3D141_S207_L001_001 | 5958 | 5939.0 | 4270.0 | 4110.0 | 3.434652 | 0.944073 | 17.880536 |
F3D143_S209_L001_001 | 3178 | 3172.0 | 2255.0 | 2149.0 | 3.441376 | 0.942087 | 17.267208 |
Sample1 **OMITTED** | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
Samples | OTUs | OTU Total Count | OTU Table Density |
---|---|---|---|
4 | 137 | 11319.0 | 0.844891 |
Samples are ordered from least to most diverse.
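For orientation, the diversity columns follow their standard definitions, and "OTU Table Density" is typically the fraction of non-zero cells in the OTU table. A minimal Python sketch of these quantities (the input counts are illustrative, not taken from the workflow output):

```python
import numpy as np

def diversity_indices(counts):
    """Shannon (H'), Simpson (1 - D), and inverse Simpson (1 / D) for one sample."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]        # ignore OTUs absent from the sample
    p = counts / counts.sum()          # relative abundances
    shannon = -np.sum(p * np.log(p))   # H' = -sum(p_i * ln(p_i))
    d = np.sum(p ** 2)                 # Simpson's concentration index, D
    return shannon, 1.0 - d, 1.0 / d

def table_density(otu_table):
    """Fraction of non-zero cells in a samples-by-OTUs count table."""
    otu_table = np.asarray(otu_table)
    return np.count_nonzero(otu_table) / otu_table.size
```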
hundo wraps a number of functions, and its exact steps depend upon a given user's configuration.
Paired-end sequence reads are quality trimmed and can optionally be trimmed of adapters and filtered of contaminant sequences using BBDuk2 from the BBTools package. Passing reads are merged using VSEARCH, then aggregated into a single FASTA file with headers describing the origin and count of each sequence.
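As an illustration only (hundo drives VSEARCH through Snakemake rules; the file names below are hypothetical and the flags mirror the merge settings in the configuration shown later), the merge step could be invoked like this:

```python
import subprocess

# Hypothetical per-sample inputs; hundo locates these under fastq_dir.
subprocess.run(
    ["vsearch",
     "--fastq_mergepairs", "sample_R1.fastq",
     "--reverse", "sample_R2.fastq",
     "--fastq_minovlen", "16",   # fastq_minovlen in the configuration
     "--fastq_maxdiffs", "5",    # fastq_maxdiffs in the configuration
     "--fastqout", "sample_merged.fastq"],
    check=True,
)
```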
Prior to creating clusters, reads are filtered again based on their expected error rate.
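A read's expected error is the sum of the per-base error probabilities implied by its Phred quality scores, which corresponds to the maximum_expected_error setting (1.0) in the configuration below. A minimal sketch:

```python
def expected_errors(phred_scores):
    """Expected number of errors in a read, given its Phred quality scores."""
    return sum(10 ** (-q / 10) for q in phred_scores)

# A read passes the filter when its expected error does not exceed the threshold.
keep = expected_errors([30, 30, 28, 35, 20]) <= 1.0
```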
To create clusters, the aggregated, merged reads are dereplicated, removing singletons by default, using VSEARCH. Sequences are preclustered into centroids using VSEARCH to accelerate chimera filtering. Chimera filtering is completed in two steps: de novo, then reference based; the reference is, by default, the entire annotation database. Following chimera filtering, sequences are placed into clusters using distance-based, greedy clustering with VSEARCH at the allowable percent difference set in the configuration.
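A rough sketch of this sequence of VSEARCH calls (file names are hypothetical, the preclustering identity is illustrative, and the exact flags and ordering used by hundo may differ; a percent difference of 3.0 corresponds to an identity of 0.97):

```python
import subprocess

def vsearch(*args):
    subprocess.run(["vsearch", *args], check=True)

# Dereplicate and drop singletons (minimum_sequence_abundance: 2).
vsearch("--derep_fulllength", "merged.fasta",
        "--minuniquesize", "2", "--sizeout",
        "--output", "derep.fasta")

# Precluster into centroids to speed up chimera detection (identity is illustrative).
vsearch("--cluster_size", "derep.fasta", "--id", "0.99",
        "--sizein", "--sizeout", "--centroids", "preclustered.fasta")

# Chimera filtering: de novo first, then against the reference database.
vsearch("--uchime_denovo", "preclustered.fasta",
        "--sizein", "--sizeout", "--nonchimeras", "denovo_nonchimeras.fasta")
vsearch("--uchime_ref", "denovo_nonchimeras.fasta",
        "--db", "reference.fasta",
        "--sizein", "--sizeout", "--nonchimeras", "nonchimeras.fasta")

# Greedy, distance-based clustering of the surviving sequences into OTUs.
vsearch("--cluster_size", "nonchimeras.fasta", "--id", "0.97",
        "--sizein", "--centroids", "OTU.fasta")
```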
After OTU sequences have been determined, BLAST or VSEARCH is used to align sequences to the reference database. Reference databases for 16S were curated by the CREST team and hundo incorporates the CREST LCA method. ITS databases are maintained by UNITE.
Counts are assigned to OTUs using the global alignment method of VSEARCH, which outputs the final OTU table as a tab-delimited text file. The biom command-line tool is used to convert the tab-delimited table to BIOM format.
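Again only as a sketch (file names are hypothetical; read_identity_requirement is 0.97 in the configuration), the counting and conversion steps might look like:

```python
import subprocess

# Map quality-controlled reads back onto the OTU representatives and
# write per-sample counts as a tab-delimited OTU table.
subprocess.run(
    ["vsearch",
     "--usearch_global", "all_sequences.fasta",
     "--db", "OTU.fasta",
     "--id", "0.97",              # read_identity_requirement
     "--otutabout", "OTU.txt"],
    check=True,
)

# Convert the tab-delimited table to HDF5 BIOM with the biom CLI.
subprocess.run(
    ["biom", "convert",
     "-i", "OTU.txt", "-o", "OTU.biom",
     "--table-type", "OTU table", "--to-hdf5"],
    check=True,
)
```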
Multiple alignment of sequences is completed using MAFFT. A tree based on the aligned sequences is built using FastTree2.
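The corresponding alignment and tree-building commands, again with hypothetical file names (the FastTree binary may be installed as FastTree rather than fasttree depending on the package):

```python
import subprocess

# Multiple sequence alignment of the OTU representatives with MAFFT.
with open("OTU_aligned.fasta", "w") as aln:
    subprocess.run(["mafft", "--auto", "OTU.fasta"], stdout=aln, check=True)

# Approximate maximum-likelihood tree from the nucleotide alignment with FastTree2.
with open("OTU.tree", "w") as tree:
    subprocess.run(["fasttree", "-nt", "OTU_aligned.fasta"], stdout=tree, check=True)
```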
This workflow is built using Snakemake and makes use of Bioconda to install its dependencies.
```yaml
fastq_dir: /Users/brow015/devel/hundo/example/mothur_sop_data
filter_adapters: None
filter_contaminants: None
allowable_kmer_mismatches: 1
reference_kmer_match_length: 27
reduced_kmer_min: 8
minimum_passing_read_length: 100
minimum_base_quality: 10
minimum_merge_length: 150
fastq_allowmergestagger: False
fastq_maxdiffs: 5
fastq_minovlen: 16
maximum_expected_error: 1.0
reference_chimera_filter: True
minimum_sequence_abundance: 2
percent_of_allowable_difference: 3.0
reference_database: silva
aligner: blast
blast_minimum_bitscore: 125
blast_top_fraction: 0.95
read_identity_requirement: 0.97
```
```yaml
name: hundo_env
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python>=3.6
  - bbmap=37.17
  - biom-format=2.1.6
  - biopython
  - blast=2.6.0
  - bzip2=1.0.6
  - click=6.7
  - docutils=0.14
  - mafft=7.313
  - fasttree=2.1.10
  - numpy>=1.14.0
  - pandas>=0.23.0
  - pigz=2.3.4
  - pyyaml>=3.12
  - vsearch=2.6.0
  - zip=3.0
  - pip
  - pip:
    - relatively
    - plotly>=3.0
```
Not all files written by the workflow are contained within the Downloads section of this page, in order to minimize the size of this document. Other output files are described below and are written to the results directory.
The zip archive contains the following files:
- Biom table with raw counts per sample and their associated taxonomic assignments, formatted to be compatible with downstream tools like phyloseq.
- Representative DNA sequences of each OTU.
- Newick tree representation of the aligned OTU sequences.
- Tab-delimited text table with a column of OTU IDs, a column for each sample, and the taxonomy assignment in the final column as a comma-delimited list (a parsing sketch follows this list).
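For downstream work in Python, the tab-delimited table described in the last item can be read with pandas; a minimal sketch assuming a hypothetical file name:

```python
import pandas as pd

# First column holds the OTU IDs; the final column holds the comma-delimited taxonomy.
otu = pd.read_csv("OTU.txt", sep="\t", index_col=0)
taxonomy = otu.pop(otu.columns[-1]).str.split(",")  # lineage per OTU
counts = otu                                        # remaining columns are samples

# Example: total assigned counts per sample.
print(counts.sum(axis=0))
```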
Other files that may be needed but are not included in the attached archive include:

- OTU sequences after alignment using MAFFT.
- Quality-controlled, dereplicated DNA sequences of all samples; the header of each record identifies the sample of origin and the count resulting from dereplication.
- The BLAST assignments per OTU sequence.
The Downloads section of this page provides the archive and the per-sample files.