hundo - Experiment Summary

Contents

Summary

Sequence Counts

[Figure: Counts Per Sample By Stage. Bar chart of Raw, Filtered, Merged, and In OTUs read counts (0 to 6000) for samples F3D141_S207_L001_001 through F3D144_S210_L001_001.]

Sequence Quality

[Figure: Mean Quality Scores for R1 and R2. Phred quality score by read position for the forward and reverse reads.]
[Figure: Mean Expected Error Rate Across Merged Reads. Expected error by base position.]

OTUs

Taxonomy

Samples are ordered from least diverse to most.

[Figure: Assigned Taxonomy Per Sample. Counts and relative abundance (%) per sample at the Phylum level.]

Methods

hundo wraps a number of tools, and its exact steps depend on the user's configuration.

Paired-end sequence reads are quality trimmed, and can optionally be trimmed of adapters and filtered of contaminant sequences, using BBDuk2 from the BBTools package. Passing reads are merged using VSEARCH, then aggregated into a single FASTA file with headers describing the origin and count of each sequence.

Prior to creating clusters, reads are filtered again based on their expected error rates.
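The expected error of a read is conventionally defined as the sum of its per-base error probabilities implied by the Phred scores. A minimal sketch of that calculation (a standalone illustration, not hundo's implementation):

```python
def expected_error(quals):
    """Sum of per-base error probabilities implied by Phred scores.

    A Phred score Q corresponds to an error probability of 10**(-Q/10).
    A read passes the filter when its expected error does not exceed
    the configured maximum (1.0 in this run's configuration).
    """
    return sum(10 ** (-q / 10) for q in quals)

# A 4-base read with Phred 20 at every position (1% error chance per base)
# has an expected error of 0.04 and would pass a 1.0 threshold.
print(expected_error([20, 20, 20, 20]))
```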

To create clusters, the aggregated, merged reads are dereplicated with VSEARCH, removing singletons by default. Sequences are preclustered into centroids using VSEARCH to accelerate chimera filtering. Chimera filtering is completed in two steps: de novo, then reference based; by default the reference is the entire annotation database. Following chimera filtering, sequences are placed into clusters using distance-based, greedy clustering with VSEARCH, based on the allowable percent difference set in the configuration.

After OTU sequences have been determined, BLAST or VSEARCH is used to align sequences to the reference database. Reference databases for 16S were curated by the CREST team and hundo incorporates the CREST LCA method. ITS databases are maintained by UNITE.
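The LCA idea can be sketched as taking the longest taxonomy prefix shared by the retained hits for an OTU (hits within the configured top fraction of the best bitscore, 0.95 in this run). This is a toy illustration of the concept, not CREST's implementation; the lineages are hypothetical.

```python
def lca(lineages):
    """Lowest common ancestor: the longest shared rank prefix
    among the taxonomy lineages of an OTU's retained hits."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            shared.append(ranks[0])  # all hits agree at this rank
        else:
            break                    # disagreement; stop descending
    return shared

# Two hits agreeing through phylum but not class resolve to phylum level.
print(lca([["Bacteria", "Bacteroidetes", "Bacteroidia"],
           ["Bacteria", "Bacteroidetes", "Flavobacteriia"]]))
```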

Counts are assigned to OTUs using the global alignment method of VSEARCH, which outputs the final OTU table as a tab-delimited text file. The biom command line tool is then used to convert this table to BIOM format.

Multiple alignment of sequences is completed using MAFFT. A tree based on the aligned sequences is built using FastTree2.

This workflow is built using Snakemake and makes use of Bioconda to install its dependencies.

Configuration

fastq_dir: /Users/brow015/devel/hundo/example/mothur_sop_data
filter_adapters: None
filter_contaminants: None
allowable_kmer_mismatches: 1
reference_kmer_match_length: 27
reduced_kmer_min: 8
minimum_passing_read_length: 100
minimum_base_quality: 10
minimum_merge_length: 150
fastq_allowmergestagger: False
fastq_maxdiffs: 5
fastq_minovlen: 16
maximum_expected_error: 1.0
reference_chimera_filter: True
minimum_sequence_abundance: 2
percent_of_allowable_difference: 3.0
reference_database: silva
aligner: blast
blast_minimum_bitscore: 125
blast_top_fraction: 0.95
read_identity_requirement: 0.97

Execution Environment

name: hundo_env
channels:
    - bioconda
    - conda-forge
    - defaults
dependencies:
    - python>=3.6
    - bbmap=37.17
    - biom-format=2.1.6
    - biopython
    - blast=2.6.0
    - bzip2=1.0.6
    - click=6.7
    - docutils=0.14
    - mafft=7.313
    - fasttree=2.1.10
    - numpy>=1.14.0
    - pandas>=0.23.0
    - pigz=2.3.4
    - pyyaml>=3.12
    - vsearch=2.6.0
    - zip=3.0
    - pip
    - pip:
      - relatively
      - plotly>=3.0

Output Files

Not all files written by the workflow are included in the Downloads section of this page, to minimize the size of this document. The other output files are described below and are written to the results directory.

Attached

The zip archive contains the following files:

OTU.biom

Biom table with raw counts per sample and their associated taxonomic assignment formatted to be compatible with downstream tools like phyloseq.

OTU.fasta

Representative DNA sequences of each OTU.

OTU.tree

Newick tree representation of aligned OTU sequences.

OTU.txt

Tab-delimited text table with an OTU ID column, a column for each sample, and the taxonomy assignment in the final column as a comma-delimited list.
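A table with this layout can be read with the standard library. The sketch below uses a hypothetical two-sample table with made-up OTU IDs, counts, and shortened sample names; the header label "#OTU ID" is an assumption about the exact column name.

```python
import csv
import io

# Hypothetical table in the layout described above (tab-delimited,
# one column per sample, comma-delimited taxonomy in the last column).
otu_txt = (
    "#OTU ID\tF3D141\tF3D142\ttaxonomy\n"
    "OTU_1\t120\t80\tk__Bacteria,p__Bacteroidetes\n"
    "OTU_2\t5\t0\tk__Bacteria,p__Firmicutes\n"
)

rows = list(csv.DictReader(io.StringIO(otu_txt), delimiter="\t"))
# Per-OTU counts for one sample, and the taxonomy split back into ranks.
counts = {r["#OTU ID"]: int(r["F3D141"]) for r in rows}
taxonomy = {r["#OTU ID"]: r["taxonomy"].split(",") for r in rows}
```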

Other Result Files

Other files that may be needed but are not included in the attached archive:

OTU_aligned.fasta

OTU sequences after alignment using MAFFT.

all-sequences.fasta

Quality-controlled, dereplicated DNA sequences of all samples. The header of each record identifies the sample of origin and the count resulting from dereplication.

blast-hits.txt

The BLAST assignments per OTU sequence.

Downloads

Joe Brown // hundo, version 1.2.1 // 2018-09-21