Contents
Sample | Raw | Filtered | Merged | In OTUs | shannon | simpson | invsimpson |
---|---|---|---|---|---|---|---|
F3D144_S210_L001_001 | 4827 | 4809.0 | 3107.0 | 2968.0 | 3.126734 | 0.924098 | 13.174933 |
F3D142_S208_L001_001 | 3183 | 3168.0 | 2185.0 | 2092.0 | 3.269211 | 0.934444 | 15.254106 |
F3D141_S207_L001_001 | 5958 | 5939.0 | 4270.0 | 4110.0 | 3.434652 | 0.944073 | 17.880536 |
F3D143_S209_L001_001 | 3178 | 3172.0 | 2255.0 | 2149.0 | 3.441376 | 0.942087 | 17.267208 |
Sample1 **OMITTED** | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
Samples | OTUs | OTU Total Count | OTU Table Density |
---|---|---|---|
4 | 137 | 11319.0 | 0.844891 |
Samples are ordered from least to most diverse.
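For orientation, the diversity columns follow their standard definitions, and "OTU Table Density" is typically the fraction of non-zero cells in the OTU table. A minimal Python sketch of these quantities (the input counts are illustrative, not taken from the workflow output):

```python
import numpy as np

def diversity_indices(counts):
    """Shannon (H'), Simpson (1 - D), and inverse Simpson (1 / D) for one sample."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]        # ignore OTUs absent from the sample
    p = counts / counts.sum()          # relative abundances
    shannon = -np.sum(p * np.log(p))   # H' = -sum(p_i * ln(p_i))
    d = np.sum(p ** 2)                 # Simpson's concentration index, D
    return shannon, 1.0 - d, 1.0 / d

def table_density(otu_table):
    """Fraction of non-zero cells in a samples-by-OTUs count table."""
    otu_table = np.asarray(otu_table)
    return np.count_nonzero(otu_table) / otu_table.size
```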
hundo wraps a number of functions, and its exact steps depend upon a given user's configuration.
Paired-end sequence reads are quality trimmed and can optionally be trimmed of adapters and filtered of contaminant sequences using BBDuk2 from the BBTools package. Passing reads are merged using VSEARCH, then aggregated into a single FASTA file with headers describing the origin and count of each sequence.
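As an illustration only (hundo drives VSEARCH through Snakemake rules; the file names below are hypothetical and the flags mirror the merge settings in the configuration shown later), the merge step could be invoked like this:

```python
import subprocess

# Hypothetical per-sample inputs; hundo locates these under fastq_dir.
subprocess.run(
    ["vsearch",
     "--fastq_mergepairs", "sample_R1.fastq",
     "--reverse", "sample_R2.fastq",
     "--fastq_minovlen", "16",   # fastq_minovlen in the configuration
     "--fastq_maxdiffs", "5",    # fastq_maxdiffs in the configuration
     "--fastqout", "sample_merged.fastq"],
    check=True,
)
```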
Prior to creating clusters, reads are filtered again based on their expected error rate.
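A read's expected error is the sum of the per-base error probabilities implied by its Phred quality scores, which corresponds to the maximum_expected_error setting (1.0) in the configuration below. A minimal sketch:

```python
def expected_errors(phred_scores):
    """Expected number of errors in a read, given its Phred quality scores."""
    return sum(10 ** (-q / 10) for q in phred_scores)

# A read passes the filter when its expected error does not exceed the threshold.
keep = expected_errors([30, 30, 28, 35, 20]) <= 1.0
```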
To create clusters, the aggregated, merged reads are dereplicated, removing singletons by default, using VSEARCH. Sequences are preclustered into centroids using VSEARCH to accelerate chimera filtering. Chimera filtering is completed in two steps: de novo, then reference based; the reference is, by default, the entire annotation database. Following chimera filtering, sequences are placed into clusters using distance-based, greedy clustering with VSEARCH at the allowable percent difference set in the configuration.
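A rough sketch of this sequence of VSEARCH calls (file names are hypothetical, the preclustering identity is illustrative, and the exact flags and ordering used by hundo may differ; a percent difference of 3.0 corresponds to an identity of 0.97):

```python
import subprocess

def vsearch(*args):
    subprocess.run(["vsearch", *args], check=True)

# Dereplicate and drop singletons (minimum_sequence_abundance: 2).
vsearch("--derep_fulllength", "merged.fasta",
        "--minuniquesize", "2", "--sizeout",
        "--output", "derep.fasta")

# Precluster into centroids to speed up chimera detection (identity is illustrative).
vsearch("--cluster_size", "derep.fasta", "--id", "0.99",
        "--sizein", "--sizeout", "--centroids", "preclustered.fasta")

# Chimera filtering: de novo first, then against the reference database.
vsearch("--uchime_denovo", "preclustered.fasta",
        "--sizein", "--sizeout", "--nonchimeras", "denovo_nonchimeras.fasta")
vsearch("--uchime_ref", "denovo_nonchimeras.fasta",
        "--db", "reference.fasta",
        "--sizein", "--sizeout", "--nonchimeras", "nonchimeras.fasta")

# Greedy, distance-based clustering of the surviving sequences into OTUs.
vsearch("--cluster_size", "nonchimeras.fasta", "--id", "0.97",
        "--sizein", "--centroids", "OTU.fasta")
```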
After OTU sequences have been determined, BLAST or VSEARCH is used to align sequences to the reference database. Reference databases for 16S were curated by the CREST team and hundo incorporates the CREST LCA method. ITS databases are maintained by UNITE.
Counts are assigned to OTUs using the global alignment method of VSEARCH, which outputs the final OTU table as a tab-delimited text file. The biom command-line tool is used to convert the tab-delimited table to BIOM format.
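Again only as a sketch (file names are hypothetical; read_identity_requirement is 0.97 in the configuration), the counting and conversion steps might look like:

```python
import subprocess

# Map quality-controlled reads back onto the OTU representatives and
# write per-sample counts as a tab-delimited OTU table.
subprocess.run(
    ["vsearch",
     "--usearch_global", "all_sequences.fasta",
     "--db", "OTU.fasta",
     "--id", "0.97",              # read_identity_requirement
     "--otutabout", "OTU.txt"],
    check=True,
)

# Convert the tab-delimited table to HDF5 BIOM with the biom CLI.
subprocess.run(
    ["biom", "convert",
     "-i", "OTU.txt", "-o", "OTU.biom",
     "--table-type", "OTU table", "--to-hdf5"],
    check=True,
)
```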
Multiple alignment of sequences is completed using MAFFT. A tree based on the aligned sequences is built using FastTree2.
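The corresponding alignment and tree-building commands, again with hypothetical file names (the FastTree binary may be installed as FastTree rather than fasttree depending on the package):

```python
import subprocess

# Multiple sequence alignment of the OTU representatives with MAFFT.
with open("OTU_aligned.fasta", "w") as aln:
    subprocess.run(["mafft", "--auto", "OTU.fasta"], stdout=aln, check=True)

# Approximate maximum-likelihood tree from the nucleotide alignment with FastTree2.
with open("OTU.tree", "w") as tree:
    subprocess.run(["fasttree", "-nt", "OTU_aligned.fasta"], stdout=tree, check=True)
```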
This workflow is built using Snakemake and makes use of Bioconda to install its dependencies.
```yaml
fastq_dir: /Users/brow015/devel/hundo/example/mothur_sop_data
filter_adapters: None
filter_contaminants: None
allowable_kmer_mismatches: 1
reference_kmer_match_length: 27
reduced_kmer_min: 8
minimum_passing_read_length: 100
minimum_base_quality: 10
minimum_merge_length: 150
fastq_allowmergestagger: False
fastq_maxdiffs: 5
fastq_minovlen: 16
maximum_expected_error: 1.0
reference_chimera_filter: True
minimum_sequence_abundance: 2
percent_of_allowable_difference: 3.0
reference_database: silva
aligner: blast
blast_minimum_bitscore: 125
blast_top_fraction: 0.95
read_identity_requirement: 0.97
```
```yaml
name: hundo_env
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python>=3.6
  - bbmap=37.17
  - biom-format=2.1.6
  - biopython
  - blast=2.6.0
  - bzip2=1.0.6
  - click=6.7
  - docutils=0.14
  - mafft=7.313
  - fasttree=2.1.10
  - numpy>=1.14.0
  - pandas>=0.23.0
  - pigz=2.3.4
  - pyyaml>=3.12
  - vsearch=2.6.0
  - zip=3.0
  - pip
  - pip:
    - relatively
    - plotly>=3.0
```
Not all files written by the workflow are contained within the Downloads section of this page, in order to minimize the size of this document. Other output files are described below and are written to the results directory.
The zip archive contains the following files:
- Biom table with raw counts per sample and their associated taxonomic assignments, formatted to be compatible with downstream tools like phyloseq.
- Representative DNA sequences of each OTU.
- Newick tree representation of the aligned OTU sequences.
- Tab-delimited text table with a column of OTU IDs, a column for each sample, and the taxonomy assignment in the final column as a comma-delimited list (a parsing sketch follows this list).
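For downstream work in Python, the tab-delimited table described in the last item can be read with pandas; a minimal sketch assuming a hypothetical file name:

```python
import pandas as pd

# First column holds the OTU IDs; the final column holds the comma-delimited taxonomy.
otu = pd.read_csv("OTU.txt", sep="\t", index_col=0)
taxonomy = otu.pop(otu.columns[-1]).str.split(",")  # lineage per OTU
counts = otu                                        # remaining columns are samples

# Example: total assigned counts per sample.
print(counts.sum(axis=0))
```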
Other files that may be needed but are not included in the attached archive include:

- OTU sequences after alignment using MAFFT.
- Quality-controlled, dereplicated DNA sequences of all samples; the header of each record identifies the sample of origin and the count resulting from dereplication.
- The BLAST assignments per OTU sequence.
The Downloads section of this page provides the archive and the per-sample files.