Hybrid metagenome assemblies link carbohydrate structure with function in the human gut microbiome

A schematic overview of the workflow and experimental design is displayed in Fig. 1: a human faecal sample is used in a fermentation experiment with carbohydrate substrates, using a colon model to investigate carbohydrate degradation and its effect on the gut microbiota. Samples obtained from the colon model are used for short- and long-read metagenomics sequencing. Bioinformatics analyses include hybrid assembly, genome annotation, phylogenetic analysis and CAZyme annotation to characterise changes in the microbiome for different types of carbohydrate across time. A more detailed overview of the tools and bioinformatic analyses is shown in Supplementary Fig. 1.


Native maize starch (catalogue no. S4126), native potato starch (catalogue no. 2004), Avicel PH-101 (catalogue no. 11365) and chicory inulin (catalogue no. I2255) were purchased from Sigma-Aldrich, (Gillingham, UK). Hylon VII® was kindly provided as a gift by Ingredion Incorporated (Manchester, UK).

Retrograded maize starch was prepared from 40 g of native maize starch in 400 mL of deionized water. The slurry was stirred continuously at 95 °C in a water bath for 20 min. The resulting gel was cooled to room temperature for 60 min, transferred to aluminium pots (150 mL, Ampulla, Hyde UK), and stored at 4 °C for 48 h. The retrograded gel was then frozen at −80 °C for 12 h and freeze-dried (LyoDry, MechaTech Systems Ltd, Bristol, UK) for 72 h.

Each substrate (0.500 ± 0.005 g, dry weight) was weighed in sterilised fermentation bottles (100 mL) prior to the start of the experiment.

Inoculum collection and preparation

A single human faecal sample was obtained from one adult (≥18 years old), a free-living, healthy donor who had not taken antibiotics in the 3 months prior to donation and was free from gastrointestinal disease. Ethical approval was granted by Human Research Governance Committee at the Quadram Institute (IFR01/2015) and London—Westminster Research Ethics Committee (15/LO/2169) and the trial was registered on (NCT02653001). Signed informed consent was obtained from the participant prior to donation. The stool sample was collected by the participant, stored in a closed container under ambient conditions, transferred to the laboratory and prepared for inoculation within 2 hours of excretion. The faecal sample was diluted 1:10 with pre-warmed, anaerobic, sterile phosphate buffer saline (0.1 M, pH 7.4) in a double meshed stomacher bag (500 mL, Seward, Worthing, UK) and homogenised using a Stomacher 400 (Seward, Worthing, UK) at 200 rpm for two cycles, each of 60 s length.

Batch fermentation in the colon model

Fermentation vessels were established with media adapted from Williams et al.43. In brief, each vessel (100 mL) contained an aliquot (3.0 mL) of filtered faecal slurry, 82 mL of sterilised growth medium, and one of the six substrates for experimental evaluation: native Hylon VII or native potato starch (highly recalcitrant starches); native maize starch or gelatinised, retrograded maize starch (accessible starches); Avicel PH-101 (insoluble fibre; negative control); and chicory inulin (fermentable soluble fibre; positive control). There was also a media-only control with no inoculum (blank), making a total of seven fermentation vessels.

For each fermentation vessel, the growth medium contained 76 mL of basal solution, 5 mL vitamin phosphate and sodium carbonate solution, and 1 mL reducing agent. The composition of the various solutions used in the preparation of the growth medium is described in detail in Supplementary Note 2. A single stock (7 l) of growth medium was prepared for use in all vessels. Vessel fermentations were pH controlled and maintained at pH 6.8 to 7.2 using 1 N NaOH and 1 N HCl regulated by a Fermac 260 (Electrolab Biotech, Tewkesbury, UK). A circulating water jacket maintained the vessel temperature at 37 °C. Magnetic stirring was used to keep the mixture homogenous and the vessels were continuously sparged with nitrogen (99% purity) to maintain anaerobic conditions. Samples were collected from each vessel at 0 (5 min), 6, 12 and 24 h after inoculation. The biomass from two 1.8 mL aliquots from each sample were concentrated by refrigerated centrifugation (4 °C; 10,000 × g for 10 min), the supernatant removed, and the pellets stored at −80 °C prior to bacterial enumeration and DNA extraction; one pellet was used for enumeration and one for DNA extraction.

Bacterial cell enumeration

All materials used for bacterial cell enumeration were purchased from Sigma-Aldrich (Gillingham, UK) unless specified otherwise. To each frozen pellet, 400 µL of Phosphate Buffered Saline (PBS) and 1100 µL of 4% paraformaldehyde were added and gently thawed at 20 °C for 10 min with gentle mixing. Once thawed, each resuspension was thoroughly mixed and incubated overnight at 4 °C for fixation to occur. The resuspensions were then centrifuged for 10 min at 8000 × g, the supernatant removed, and the residual pellet washed with 1 mL 0.1% Tween-20. This pellet then underwent two further washes in PBS to remove any residual paraformaldehyde and was then resuspended in 600 µL PBS: ethanol (1:1).

The fixed resuspensions were centrifuged for 3 min at 16,000 × g, the supernatant removed, and the pellet resuspended in 500 µL 1 mg/mL lysozyme (100 µL 1 M Tris HCl at pH 8, 100 µL 0.5 M EDTA at pH 8, 800 μL water, and 1 mg lysozyme, catalogue no. L6876) and incubated at room temperature for 10 min. After thorough mixing and centrifugation for 3 min at 16,000 × g, the supernatant was removed, and the pellet was washed with PBS. The resulting pellet was then resuspended in 150 µL of hybridisation buffer (HB, per mL: 180 µL 5 M NaCl, 20 µL 1 M Tris HCl at pH 8, 300 µL Formamide, 499 µL water, 1 µL 10% SDS), centrifuged, the supernatant removed, and the remaining pellet resuspended again in 1500 µL of HB and stored at 4 °C prior to enumeration. For bacterial enumeration, 1 µL of Invitrogen SYTO 9 (catalogue no. S34854, Thermo Fisher Scientific, Loughborough, UK) was added to 1 mL of each fixed and washed resuspension. Within 96-well plate resuspensions were diluted to 1:1000 and the bacterial populations within them enumerated using flow cytometry (Luminex Guava easyCyte 5) at wavelength of 488 nm and Guava suite software, version 3.3.

DNA extraction

Each pellet was resuspended in 500 μL (samples collected at 0 and 6 h) or 650 μL (samples collected at 12 and 24 h) with chilled (4 °C) nuclease-free water (Sigma-Aldrich, Gillingham, UK). The resuspensions were frozen overnight at −80 °C, thawed on ice and an aliquot (400 μL) used for bacterial genomic DNA extraction. FastDNA® Spin Kit for Soil (MP Biomedical, Solon, US) was used according to the manufacturer’s instructions which included two bead beating steps of 60 s at a speed of 6.0 m/s (FastPrep24, MP Biomedical, Solon, USA). DNA concentration was determined using the Quant-iT™ dsDNA Assay Kit, high sensitivity kit (Invitrogen, Loughborough, UK) and quantified using a FLUOstar Optima plate reader (BMG Labtech, Aylesbury, UK).

Illumina NovaSeq library preparation and sequencing

Genomic DNA was normalised to 5 ng/µL with elution buffer (10 mM Tris HCl). A miniaturised reaction was set up using the Nextera DNA Flex Library Prep Kit (Illumina, Cambridge, UK). 0.5 µL Tagmentation Buffer 1 (TB1) was mixed with 0.5 µL Bead-Linked Transposomes (BLT) and 4.0 µL PCR-grade water in a master mix and 5 µL was added to each well of a chilled 96-well plate. About 2 µL of normalised DNA (10 ng total) was pipette-mixed with each well of tagmentation master mix and the plate heated to 55 °C for 15 min in a PCR block. A PCR master mix was made up using 4 µL kapa2G buffer, 0.4 µL dNTP’s, 0.08 µL Polymerase and 4.52 µL PCR-grade water, from the Kap2G Robust PCR kit (Sigma-Aldrich, Gillingham, UK) and 9 µL added to each well in a 96-well plate. About 2 µL each of P7 and P5 of Nextera XT Index Kit v2 index primers (catalogue No. FC-131-2001 to 2004; Illumina, Cambridge, UK) were also added to each well. Finally, the 7 µL of Tagmentation mix was added and mixed. The PCR was run at 72 °C for 3 min, 95 °C for 1 min, 14 cycles of 95 °C for 10 s, 55 °C for 20 s and 72 °C for 3 min. Following the PCR reaction, the libraries from each sample were quantified using the methods described earlier and the high sensitivity Quant-iT dsDNA Assay Kit. Libraries were pooled following quantification in equal quantities. The final pool was double-SPRI size selected between 0.5 and 0.7X bead volumes using KAPA Pure Beads (Roche, Wilmington, US). The final pool was quantified on a Qubit 3.0 instrument and run on a D5000 ScreenTape (Agilent, Waldbronn, DE) using the Agilent Tapestation 4200 to calculate the final library pool molarity. qPCR was done on an Applied Biosystems StepOne Plus machine. Samples quantified were diluted 1 in 10,000. A PCR master mix was prepared using 10 µL KAPA SYBR FAST qPCR Master Mix (2X) (Sigma-Aldrich, Gillingham, UK), 0.4 µL ROX High, 0.4 µL 10 μM forward primer, 0.4 µL 10 μM reverse primer, 4 µL template DNA, 4.8 µL PCR-grade water. The PCR programme was: 95 °C for 3 min, 40 cycles of 95 °C for 10 s, 60 °C for 30 s. Standards were made from a 10 nM stock of Phix, diluted in PCR-grade water. The standard range was 20, 2, 0.2, 0.02, 0.002, 0.0002 pmol. Samples were then sent to Novogene (Cambridge, UK) for sequencing using an Illumina NovaSeq instrument, with sample names and index combinations used. Demultiplexed FASTQ’s were returned on a hard drive.

Nanopore library preparation and PromethION sequencing

Library preparation was performed using SQK-LSK109 (Oxford Nanopore Technologies, Oxford, UK) with barcoding kits EXP-NBD104 and EXP-NBD114. The native barcoding genomic DNA protocol by Oxford Nanopore Technologies (ONT) was followed with slight modifications. Starting material for the End-Prep/FFPE reaction was 1 μg per sample in 48 μL volume. About 3.5 µL NEBNext FFPE DNA Repair Buffer (NEB, New England Biolabs, Ipswich, USA), 3.5 µL NEB Ultra II End-prep Buffer, 3 µL NEB Ultra II End-prep Enzyme Mix and 2 µL NEBNext FFPE DNA Repair Mix (NEB) were added to the DNA (final volume 60 µL), mixed slowly by pipetting and incubated at 20 °C for 5 min and then 65 °C for 5 min. After a 1X bead wash with AMPure XP beads (Agencourt, Beckman Coulter, High Wycombe, UK), the DNA was eluted in 26 µL of nuclease-free water. About 22.5 µL of this was taken forward for native barcoding with the addition of 2.5 µL barcode and 25 µL Blunt/TA Ligase Master Mix (NEB) (final volume 50 µL). This was mixed by pipetting and incubated at room temperature for 10 min. After another 1X bead wash (as above), samples were quantified using Qubit dsDNA BR Assay Kit (Invitrogen, Loughborough, UK). In the first run, samples were equimolar pooled to a total of 900 ng in a volume of 65 µL. In the second run, samples were pooled to 1700 ng followed by a 0.4X bead wash to achieve the final volume of 65 µL. About 5 µL Adaptor Mix II (ONT), 20 µL NEBNext Quick Ligation Reaction Buffer (5X) and 10 µL Quick T4 DNA Ligase (NEB) were added (final volume 100 µL), mixed by flicking, and incubated at room temperature for 10 min. After bead washing with 50 µL of AMPure XP beads and two washes in 250 µL of Long Fragment Buffer (ONT), the library was eluted in 25 µL of Elution Buffer and quantified with Qubit dsDNA BR and TapeStation 2200 using a Genomic DNA ScreenTape (Agilent Technologies, Edinburgh, UK). About 470 ng of DNA was loaded for sequencing in the first run and 400 ng in the second run. The final loading mix was 75 µL SQB, 51 µL LB and 24 µL DNA library.

Sequencing was performed on a PromethION Beta using FLO-PRO002 PromethION Flow Cells (R9 version). The sequencing runtime was 57 h for Run 1 and 64 h for Run 2. Flow cells were refuelled with 0.5X SQB (75 µL SQB and 75 µL nuclease-free water) 40 h into both runs.

Bioinformatics analysis

The bioinformatics analysis was performed using default options unless specified otherwise.

Nanopore basecalling

Basecalling was performed using Guppy version 3.0.5 + 45c3543 (ONT) in high accuracy mode (model dna_r9.4.1_450bps_hac), and demultiplexed with qcat version 1.1.0 (Oxford Nanopore Technologies,

Sequence quality

For Nanopore, sequence metrics were estimated by Nanostat version 1.1.244. In total, 22 million sequences were generated with a median read length of 4500 bp and median quality of 10 (phred). Quality trimming and adaptor removal was performed using Porechop version 0.2.3 ( For Illumina, quality control was done for paired-end reads using fastp, version 0.20.045 to remove adaptor sequences and filter out low-quality (phred quality <30) and short reads (length <60 bp). After quality control, the average number of reads in the samples was over 26.1 million reads, with a minimum of 9.7 million reads; the average read length was 148 bp.

Taxonomic profiling

Trimmed and high-quality short reads are processed using MetaPhlAn3 version 3.0.246 to estimate both microbial composition to species level and also the relative abundance of species from each metagenomic sample. MetaPhlAn3 uses the latest marker information dataset, CHOCOPhlAn 2019, which contains ~1 million unique clade-specific marker genes identified from ~100,000 reference genomes; this includes bacterial, archaeal, and eukaryotic genomes. Hclust2 was used to plot the hierarchical clustering of the different taxonomic profiles at each time point []. The results of the microbial taxonomy were analysed in RStudio Version 1.1.453 (

Gene abundance was calculated using the HMP Unified Metabolic Analysis Network programme HUMAnN347. The output of the HUMAnN3 pipeline was normalised to copies per million (cpm) to facilitate comparisons between samples with different read depths and regrouped into MetaCyc reaction abundances for analysis. Fold changes were calculated between the cpm counts at time 0 h to the corresponding counts at 6, 12 and 24 h using gtools R package version 3.5.0. A threshold of 1.5x was regarded as an active metabolic gene pathway.

Hybrid assembly

Trimmed and high-quality Illumina reads were merged for each treatment, and then used in a short-read-only assembly using Megahit version 1.1.348,49. Then OPERA-MS version 0.8.214 was used to combine the short-read-only assembly with high-quality long reads, to create high-quality hybrid assemblies. By combining these two technologies, OPERA-MS overcomes the issue of low-contiguity of short-read-only assemblies and the low base-pair quality of long-read-only assemblies.

Genome binning, quality, dereplication and comparative genomics of hybrid assemblies

The hybrid co-assemblies from Opera-MS14 were used for binning. Here, Illumina reads for each time period were mapped to the co-assembled contigs to obtain a coverage map. Bowtie2 version was used for mapping, and samtools to convert SAM to BAM format. MaxBin2 version 2.2.650 and MetaBat2 version 2.12.151 which uses sequence composition and coverage information, were used to bin probable genomes using default parameters. The binned genomes and co-assembled contigs were integrated into Anvi’o version 6.152 for manual refinement and visual inspection of problematic genomes. We used the scripts: ‘anvi-interactive’ to visualise the genome bins; ‘anvi-run-hmms’ to estimate genome completeness and contamination; ‘anvi-profile’ to estimate coverage and detection statistics for each sample; and ‘anvi-refine’ to manually refine the genomes. All scripts were run using default parameters. Additionally, DAS tool version 1.1.253 was used to aggregate high-quality genomes from each treatment by using single copy gene-based scores and genome quality metrics to produce a list of good-quality genomes for every treatment. CheckM version 1.0.1854 was used on all final genomes to confirm completion and contamination scores. In general, genomes with a ‘quality satisfying completeness – 5*contamination >50 score’ and/or with a ‘>60% completion and <10% contamination’ score according to CheckM, were selected for downstream analyses.

Dereplication into representative clusters

In order to produce a dereplicated set of genomes across all treatments, dRep version 2.5.055 was used. Pairwise genome comparisons or Average Nucleotide Identity (ANI) was used for clustering. dRep clusters genomes with ANIs of 97% were regarded as primary clusters. Further the genomes are clustered to an ANI of 99%. These unique genomes are regarded as secondary clusters. A representative genome is provided for each of the secondary clusters.

Metagenomic assignment and phylogenetic analyses

Genome bins that passed quality assessment were analysed for their closest taxonomic annotation. To assign taxonomic labels, the genome set was assigned into the microbial tree of life using GTDB version 0.3.5 and database R95 to identify the closest ancestor and obtain a putative taxonomy assignment for each genome bin. For genomes where the closest ancestor could not be determined, the Relative Evolutionary Distance (RED) to the closest ancestor and taxa names were provided to taxa not previously identified in NCBI. Using these genome bins, a phylogenetic tree was constructed using Phylophlan version 0.99 and visually inspected using iTOL version 4.3.1 and ggtree from package The R packages ggplot2 version 3.3.2, dplyr version 1.0.2, aplot, ggtree version 2.2.4 and inkscape version 1.0.1 were used for illustrations.

Relative abundance of genomes

Since co-assemblies were used for binning, it was possible to calculate the proportion of reads recruited to the MAG across all time periods for each treatment. The relative abundance provides an estimate of which time point recruited the most proportion of reads. To provide this estimate in relative terms, the value was normalised to the total number of reads that was recruited for that genome. The time 0 h sample for Avicell was not available, so a mean relative abundance from the other samples (representing the same starting inoculum) at time 0 h was used to represent 0 h for Avicell. The relative abundance scores were provided by ‘anvi-summarise’ (from the Anvi’o package) as relative abundance. Further, fold changes were calculated between the relative abundance at time 0 h to the corresponding relative abundance at 6, 12 and 24 h using gtools R package version 3.5.0. Fold changes provide an estimate of change in MAG abundance, which might be a result of utilisation of a particular carbohydrate. Fold changes were converted to log ratios. MAGs with a fold change of 2x (log2 fold change = 1) were regarded as an active carbohydrate utiliser.

Carbohydrate metabolism analyses

All representative genome clusters were annotated for CAZymes using dbCAN56. The genome’s nucleotide sequences were processed with Prodigal to predict protein sequences, and then three tools were used for automatic CAZyme annotation: (a) HMMER57 to search against the dbCAN HMM (Hidden Markov Model) database; (b) DIAMOND58 to search against the CAZy pre-annotated CAZyme sequence database; and (c) Hotpep59 to search against the conserved CAZyme PPR (peptide pattern recognition) short peptide library. To improve annotation accuracy, a filtering step was used to retain only hits to CAZy families found by at least two tools. PULs were identified in MAGs using the tool PULpy29 with the default settings. The R packages ggplot2, dplyr, ComplexHeatmap version 2.4.3 and inkscape were used for illustrations.

Statistics and reproducibility

Principle Coordinate analyses using the pcoa function in the ape package version 5.3 ( and the vegan package were used to identify differences in microbiome profiles amongst treatments using the diversity analyses. To calculate the distances between the microbiomes at different time points, the permdisp test was performed using the betadisper function from the taxonomy profiles from vegan. Significance testing between the distances was calculated using ANOVA.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Related Articles