Sunday, September 23

Kate Baker, University of Liverpool: Finding Private Cable and a Century of Dysentery
Slides:
100 years since WW1
WW1 was a turning point: for the first time, soldiers were more likely to die in combat than of infectious disease
dysentery is caused by Shigella and enteroinvasive E. coli, which invade the epithelium in the gut
incidence is now highest in low/middle-income countries, in children under the age of 5
four different species of Shigella; the main culprit is S. flexneri
S. flexneri is traditionally separated by serotype; immunity is serotype-specific
no vaccine available yet
Shigella is acquiring AMR; new antimicrobials are urgently needed
Shigella infections are described as one of the four great epidemic diseases in the world
two types had been described by WW1, no genus coined; Koch's postulates almost, but not quite, satisfied
almost under control before WW1; however, trench warfare was perfect for transmission
typhoid and cholera were controlled by vaccine; control levels were so good that there was no incentive to develop a vaccine
1919: report comparing 21 strains to define 5 serological "races"
this gave rise to the NCTC in 1920; the first 200 strains in the collection were the dysentery strains from WW1
NCTC1 is a dysentery bug that is living and breathing
story of how they traced the patient NCTC1 came from: finding Private Cable
found him in a hospital in France; he was a soldier from Surrey, and he died in France
Why is dysentery still a problem 100 years later?
Shigella is really just a specialized E. coli; whether it should be reclassified has been discussed (see Sims and Kim, PNAS 2011, for more)
Shigella genomics paper: NCTC1 looks like a standard S. flexneri
it has a lot of insertion sequences that make things challenging
AMR was already there back then: NCTC1 was resistant to penicillin around 25 years before penicillin was used in the clinic; similar with erythromycin
compared many S. flexneri with NCTC1 in 2014; it's the differences that tell the story
only five complete S. flexneri genomes in 2014
NCTC1 is very similar to the 2a lineage as well as serotype 5
compared it to genomes in the 2a lineage, spread roughly evenly across the time scale since WW1
compared to 1954: function gain over time, added virulence, AMR, and some other functions
1984: these things are being retained
2002: adding more virulence and AMR, and also serotype conversion
tailored evolution
hygiene breakdown, lack of specific therapies, and problems with bacterial diagnosis were drivers for rampant bacillary dysentery during WW1
"Species-wide whole genome sequencing reveals historical global spread and recent local persistence in Shigella flexneri":
NCTC3000: joint project with Sanger and PacBio to sequence complete genomes of 3000 bacterial strains from the collection
Sample of an Unknown Soldier - Sanger Institute:

Peter Daszak, EcoHealth Alliance: Pandemic Forecasting
Allen et al. 2017, Nature Communications
EID = emerging infectious disease
emerging diseases are out to get us; the speed of emergence is increasing with time
most pandemics have a wildlife origin
understanding emergence requires a systematic approach
Key questions for pandemic emergence:
  where will the next one originate?
  what will cause its emergence?
  what's the wildlife source?
  how many unknown viruses exist that could infect us?
  how do we reduce future emerging infectious disease risk, and can we afford it?
Jones et al. 2008, Nature
using more ecological methods: mapping all origins of EIDs together with ecological events
frequency of EIDs is increasing: 5 each year, of which 3 will be zoonotic
Allen et al. 2017, Nature Communications
EIDR repository, continually updated
population density, mammal density, changing populations and changing land use are major drivers
Olival et al. 2017, Nature
the phylogenetically closer a species is to us, the more likely it is to carry potentially zoonotic agents
bats turned out to account for the highest proportion of zoonotic emerging diseases
the European red fox is another bad guy
USAID PREDICT: a global surveillance project for emerging pathogens
when a virus found in a species is similar to something that infects humans, what's the real risk of spillover?
a good indicator that a new virus is likely to infect people is the breadth of species it can already infect
Li et al. 2005, Science: reanalyzing SARS
Ge et al. 2013, Nature:
Zhou et al. 2018, Nature: SADS-CoV, a bat-origin CoV causing lethal swine disease
Anthony et al. 2013, mBio: viral accumulation curves, an attempt to estimate unknown viral diversity
estimate ~1.7 million unknown viruses; expect ~650,000-840,000 to have zoonotic potential
started the Global Virome Project (Carroll et al. 2018, Science)
10-year effort to collect and sample unknown viruses from approx. 50 countries
estimated need: ~2000 samples per species
How to maximise returns:
  species with high numbers of missing zoonoses
  sites in EID hot spots
  syndromic surveillance
  sites at wildlife-livestock-human interfaces
  RNA viruses as a priority
  one more that I didn't catch
in startup phase now, getting data into the public domain
reach steady state 2021-2030, post-discovery state after that
not cheap, but can get money from alternative sources such as international development aid and the private sector; predicts strong return on investment and public health benefit
http://data.

Monday, September 24: Session 1

Varvara Kozyreva, California Department of Public Health: CLIA-Compliant Validation of WGS for Clinical Microbiological Applications: Experience of a State Public Health Reference Laboratory
Clinical Laboratory Improvement Amendments (CLIA): federal regulatory quality standards for clinical lab testing performed on human samples for diagnostic purposes
MDL: Microbial Diseases Laboratory of the California Dept of Public Health, part of PulseNet
Workflow: MiSeq (Nextera XT) > trimming > de novo assembly/mapping > 16S ID, ABR, in silico MLST, also SNPs, phylogenetic tree, and more
using WGS for diagnostics is not approved by the FDA yet
Several challenges for diagnostic use:
  guidelines not in place yet, but parts of the process have been worked through in sets of publications
  no established performance parameters: what do accuracy and specificity actually mean in this context?
  also disagreement: should non-clinical tests such as bacterial typing actually comply with CLIA standards?
Sources of reference materials: PulseNet, NIST, FDA-ARGOS
Objectives of validation:
  performance specifications
  validation set
  optimal conditions
  develop QA/QC measures
  one more
Had to establish their own validation set:
  ATCC strains with WGS in NCBI
  strains sequenced by the CDC
  Interestingly there is no Listeria in the validation set? (H8394, according to the JCM paper. IIRC, it's a PulseNet reference strain CDC-03-H8394)
  34 bacterial isolates, 19 species
  genome sizes: 1.8-6.7 Mb
  wide range of GC content: 32.1-66.1%
Conflicting and incomplete guidance from CLSI and CAP when it comes to the validation process
"Methods-based validation", because you can never exhaustively test all true positives
Have developed a good validation setup to define accuracy:
  platform accuracy: SNP differences with reference
    found anywhere from 0 to 184 SNP differences with some references
    question is: is the previous reference sequence wrong, or the new one?
    tested that by 5x replicate sequencing, judged SNPs present in all replicates as real
  assay accuracy: agreement of assay results for a strain with the original results for that strain, for instance clustering results
    genotyping: accuracy of genotyping assay = congruence of phylogenetic trees built using reference sequences and validation sequences
    Paper: A novel algorithm and web-based tool for comparing two alternative phylogenetic trees
    MLST
    16S rRNA gene ID
    antibiotic resistance gene detection
  pipeline accuracy: validate against previously published studies, do we get the same results?
    compare different pipelines between laboratories: CDC pipeline and Microbial Diseases Lab CA
    "Laboratory Investigation of Salmonella enterica serovar Poona Outbreak in California: Comparison of Pulsed-Field Gel Electrophoresis (PFGE) and Whole Genome Sequencing (WGS) Results."
Reproducibility and repeatability:
  triplicates going through the pipeline
  evaluate accuracy based on differences in SNP calling
  judging between-run vs within-run
Analytical sensitivity and specificity:
  analytical sensitivity = limit of detection (LOD): minimum genome coverage that allows accurate SNP detection (downsampling data to detect this)
  analytical specificity = interference and cross-reactivity: testing pure E. coli genome and contamination (25% and 50%, in silico)
  sensitivity in genotyping assay: "likelihood that all SNPs differing between isolates will be detected"
  specificity in genotyping assay: "likelihood that variation between isolates (SNPs) will not be detected when none are present"
Ideal would be a collection of highly curated, authenticated organisms with accurate (reference-grade) whole-genome sequence available, perhaps bundled together
Sequence same isolate 5 times, call SNPs independently, and only those that occur in at least 4 of 5 are considered real SNPs relative to the CDC reference genome (I think)
Formula for accuracy of SNPs called: (genome bases covered - SNPs differing from reference genome) / genome bases covered
this metric will always be like 0.99999x? It is essentially ANI (average nucleotide identity)
How were SNPs actually called? This is a bigger source of variability than anything else
In addition, SNPs that differed from the ref in all 5 replicates were considered not to differ from the ref, since the assumption was that the strain sequenced had accumulated SNPs compared to the sequence in GenBank. Could it also be consistent errors in the SNP calling pipeline? If there is a consistent error and the SNP calling pipeline is deterministic, then all 5 reps will produce the error.
False-positive SNPs did not affect the tree topology, only the branch lengths (despite what speaker said)
WGS quality control scheme: DNA template QC, library QC, run QC, raw data QC, analysis QC
also developed a reporting scheme to ease interpretation for those receiving results
have to consider that the only constant is that things change; have to have processes in place to allow re-validation of the setup when changes happen
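The replicate rule and the accuracy formula in the notes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lab's pipeline: the function names, the `(position, base)` representation of a SNP, the example calls and the 5 Mb covered length are all hypothetical. `min_support` defaults to 4 to match the "at least 4 of 5" rule; set it to 5 for the "all replicates" variant also mentioned in the notes.

```python
from collections import Counter

def consensus_snps(replicate_calls, min_support=4):
    """Keep SNPs seen in at least `min_support` replicates.

    replicate_calls: list of sets of (position, base) tuples,
    one set per independently sequenced replicate.
    """
    counts = Counter()
    for calls in replicate_calls:
        counts.update(set(calls))  # count each SNP once per replicate
    return {snp for snp, n in counts.items() if n >= min_support}

def snp_accuracy(bases_covered, snps_vs_reference):
    """(genome bases covered - SNPs vs reference) / genome bases covered."""
    if bases_covered <= 0:
        raise ValueError("bases_covered must be positive")
    return (bases_covered - snps_vs_reference) / bases_covered

# Hypothetical calls from five replicates of one isolate.
replicates = [
    {(1042, "A"), (88310, "T"), (250001, "G")},
    {(1042, "A"), (88310, "T")},
    {(1042, "A"), (88310, "T"), (7, "C")},
    {(1042, "A"), (88310, "T")},
    {(1042, "A")},
]
real_snps = consensus_snps(replicates)

# 184 is the largest difference quoted in the notes; 5 Mb is an assumed genome size.
accuracy = snp_accuracy(5_000_000, 184)
print(sorted(real_snps), accuracy)
```

Even with 184 differences the metric stays above 0.9999, which illustrates the note-taker's point that it behaves much like an average nucleotide identity and barely moves.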

for more on validation:

Questions
Where is the biggest area for improvement? Better isolates, higher-quality genomes generated with multiple methods
CLIA wants +/- controls for every prep; how do you deal with pooling libraries? The negative control is not the main source of contamination; it's more the amplification that introduces errors
What does the report look like? How do you determine whether isolates are related? Threshold? Try to take a holistic view; not always easy to tell
Reference-based validation: what if you see a new/divergent strain? How to validate? Difficult problem; closest relative, incorporate PacBio (I think? Misheard?)

Luke Meredith, University of Cambridge: Mobile Genomics Laboratories for Viral Microbial Outbreak Response
What triggered their interest in field sequencing: Ebola outbreak in Sierra Leone
  decision to sequence in March 2015
  contact tracing was inefficient
  need for real-time sequencing (to identify epi links and atypical transmission events, investigate virus evolution/repeat human-to-human transmission)
  established a research tent containing an Ion Torrent
What was the value of sequencing?
  ~650 whole genomes from the tent lab during 9 months of operation
  identification of ongoing transmission chains: reliable contact tracing and epidemiology (Nature, 2015, 2016)
The challenges of real-time sequencing with Ion Torrent or Illumina:
  logistics (moving equipment, moving data, need semi-permanent location, library prep is slow, Ion PGM parts can break)
  suppliers and funding
  timing a big issue (+1-2 months to deliver to field)
  costs inflated 2-fold due to licensing issues (described on slide as "an excuse")... field delivery in Africa no doubt entails extra cost, so not sure this is fair, but not sure it's 2-fold either
When to deploy mobile sequencing?
  earlier could help define size of outbreak, detection and source attribution
  logistics of shipping first-gen sequencers mean that early deployment is not realistic => enter Nanopore
Mobile lab considerations: portable, robust, rapid, fieldable (work with intermittent power and internet, no cold chain), simple, accurate, interpretable, cheap-ish
MinION can be very useful for this purpose
  pros: portable/compact, rapid, cost effective (with multiplexing)
  cons: challenging bioinformatics, error rate, too flexible, still need lab skills for library prep
Necrotising cellulitis in Sao Tome
  team of 3 deployed to provide lab support and sequencing to identify pathogens causing the ulcerative illness
  Why use sequencing for this when there are so many cheaper, faster and simpler methods? Sao Tome had no positive results for any pathogen using basic micro/qPCR
  372 kg of stuff!
  set up lab in the hospital
  tests: qPCR assays (150 each), 16S sequencing (all bacterial spp.), full DNA with nanopore
  sample extraction when dealing with unknown pathogens: designed around EM labs and PHE Ebola labs
  extraction kit: ZymoBIOMICS Expedition DNA/RNA Extraction Kit (TerraLyzer bead beater); not ideal for long reads, we thought, but it didn't affect the read length
  anyone know what kind of read lengths they saw for this?
  We did full-length 16S sequencing in the field
  sample workflow: pt comes to ER/ward -> samples taken by local lab for micro, cultures -> molecular testing, NA extraction for sequencing, 16S and qPCR
  24 pts sampled + 2 from neighbouring island
  16S summary: wide range of pathogens identified, but the ones identified with both techniques were S. pyogenes and S. aureus
  in the absence of good epi data, hard to link sequences to symptoms
  cDNA WGS on nanopore: 2 isolates mixed together, assembled with Canu, screened with Centrifuge
  results verified using Illumina following deployment; 16S, WGS and qPCR results agreed nicely
  epidemic due to community-acquired, methicillin-sensitive, Panton-Valentine-positive S. aureus
Observations:
  bioinformatics is a limiting step
  portable sequencing is valuable in the right circumstances: need good access to samples and basic epi, but it can be overkill in many situations
The ARTIC network website provides a good description of the techniques and considerations described in this talk:
  generate a standardized response to go from sample to phylogeny in a hurry
  includes Field Lab, Sequencing Methods, Bioinformatics, Phylodynamics, and Visualization
  also kit lists and protocols for use by anyone interested in developing field sequencing capabilities
Bioinformatics: working with Andrew Rambaut and Trevor Bedford to develop it
  bioinformatics tool installation is listed on this page:
  currently the Ebola scheme is incorporated, but it can be easily adapted
  testing in DRC
  Porton Down still sticking with amplicon-based approach rather than whole cDNA; eventually want to get away from amplicon-based
Ebola workflow
  bioinformatics (run till done): basecalling, reference-based assembly with bwa/graphmap, polish genome, variant calling -> consensus (takes about 2 hours for Ebola)
  Protocol
  portable phylogenetics: alignment (muscle) -> tree (phyml)
  Method here
  total time from sample to phylogeny = 7h 37m
Future directions:
  expand list of viruses
  real-time phylogenetics:
  adapt to Flongle rather than flow cells
Why sequence the virus in country?
  simplify data management and relations
  data ownership
  build capacity and knowledge

Questions What instruments did you use/test for real-time PCR?

In case anyone (like me) had not heard of "parachute research":

Matthew Keller, Deployable NGS for Influenza Virus Field Surveillance and Outbreak Response
Influenza genomics team (IGT): global surveillance, running total 25,000 genomes
Pigs act as a "mixing vessel" for human, avian, and swine strains
Swine shows provide an avenue for extensive contact between humans and pigs
  Is the Australian movie "Babe" to blame for this problem?
  "Like a dog show but grosser" - Babe fans will disagree
MIA: mobile influenza analysis
  The MIA logo is a little bit creepy - kids' nightmare stuff :) <-- agreed
  moved from big QIAcube to TruTip
  sped up PCR
  use nanopore for sequencing
  fits in 2 suitcases and a cooler; the cooler is carried on, partitioned to get two temperatures
Took MIA to a large swine show
  "sick pigs and transmission everywhere.."
  "this pig has a runny nose and there was coughing"
  "pigs sound a lot like humans when they cough"
  started by screening with rapid tests: ~100 pigs, 7 positives (high false positives for this test though)
  set up in a horse stall, worked overnight: extraction through barcoding
  ran into software issues with MinKNOW
  ran sequencing for six hours in car ride and hotel room, did analyses immediately after that
Analysis:
  basecalling and demux (albacore)
  read mapping (IRMA)
  coverage plots
  trees (FastTree)
  FYI - Prof Eddie Holmes says he will reject any paper that uses FastTree; and I concur that it does do some strange stuff sometimes due to its heuristics. For viruses we can afford to use better tools. Suggest raxml, raxml-ng, iqtree
Results: 13 full genomes: 1 H3N2, 1 H1N1, 11 H1N2
  H1N2 of special interest: young children are naive, and young children are in contact with pigs at the show
  sent sequences to VPT and they started working on a prototype vaccine within 16 hours
  took samples back to Atlanta to confirm results on MiSeq
Timelines:
  MinION timeline was 16 hours, which still fits within the time it takes MiSeq to run (MiSeq workflow is at least 48 hours)
  vaccine developed within 4 weeks <- did I hear this right?
Future directions:
  improve logistics (ethanol and cold chain)
  improve screening (something better than rapid test)
  going to deploy again in Thailand and Puerto Rico
Question: did you follow up with the people involved in the outbreak once you saw transmission? Sounds like follow-up occurred, but perhaps by a different team?

Georgia Lagoudas, One Week, One Thousand Bacterial Genomes: Microfluidics for Molecular Epidemiology and High-resolution Intra-patient Bacterial Evolution
Blainey Lab, Broad Institute, MIT
How technology improvements and automation have transformed DNA sequencing:
  $1000 genome today, very different from 15 years ago and the Human Genome Project; sequencing is now widely accessible
  $1 per bacterial cell
  technology no longer limits large-scale bacterial sequencing
Preparing DNA for sequencing is the new bottleneck: library prep is roughly $200 per sample while sequencing itself is cheap
Why do we care about large-scale microbial WGS?
  determine what Abx resistance exists in pt samples
  monitor spread of pathogens in a hospital, track sequences
  interested in sequencing 1000s of pathogens involved in clinical trials
Microfluidics automate steps and decrease volume to overcome the sample prep bottleneck
  library prep $200 + sequencing $1 => now $5 + $1 due to smaller volumes (nL not uL)
  minimize human labor and also minimize volume of reagents required
  microfluidic devices have channels that can flexibly open or close using air pressure
  you only need to load your samples once, and then everything happens inside the channels (fewer tips!)
  high initial cost to buy this equipment, but easy to make that up with savings later
Comparison to other DNA library construction methods:
                          Benchtop   Robotic   Microfluidic
$/sample                  200        20        5
Time/100 samples          7h         7h        4h
End-to-end integration    No         No        Yes
Reproducible? Did a triplicate sample comparison and had good reproducible results; discrepancies corresponded to complex genome regions
High-resolution intra-patient bacterial evolution study made possible with microfluidics
  sequencing 1100 samples
  Burkholderia dolosa outbreak: one index pt and two other CF pts
  samples taken from lung, spleen, blood and LN
  What is the spatial distribution of bacterial strains across the lung and across the body?
  results still to come
Microfluidic device has yet to be commercialized
Microfluidic technology can be used for large-scale clinical studies
Can find validation data here:
They are open to partnering up to move this device towards commercialization
Have used with water and stool
Biggest requirement: must have enough cells (sample can't be very dilute)
Also can't have large chunks (clogs the valves)
Failure modes thus far have always been clogging and could have been prevented
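The per-sample figures quoted in the talk make the "high initial cost but easy to make up" point easy to check with arithmetic. A minimal sketch follows; the per-sample costs come from the notes (library prep plus $1 sequencing), while the $50,000 equipment price is purely hypothetical, since the talk gave no figure, and the function names are made up for illustration.

```python
import math

# Library prep cost per sample (from the talk) plus $1 sequencing per sample.
COST_PER_SAMPLE = {
    "benchtop": 200 + 1,
    "robotic": 20 + 1,
    "microfluidic": 5 + 1,
}

def total_cost(method, n_samples):
    """Consumables cost for n_samples, ignoring equipment and labor."""
    return COST_PER_SAMPLE[method] * n_samples

def break_even_samples(equipment_cost, old="benchtop", new="microfluidic"):
    """Samples needed before per-sample savings repay the equipment."""
    saving = COST_PER_SAMPLE[old] - COST_PER_SAMPLE[new]
    return math.ceil(equipment_cost / saving)

# The 1100-sample study mentioned in the talk.
benchtop_total = total_cost("benchtop", 1100)
microfluidic_total = total_cost("microfluidic", 1100)

# Hypothetical equipment price; not a number from the talk.
n_break_even = break_even_samples(50_000)
print(benchtop_total, microfluidic_total, n_break_even)
```

Under these assumptions the 1100-sample study alone would cover an instrument of that price several times over, but labor, failed runs and maintenance are left out of the sketch.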

Justin O'Grady, Rapid diagnosis of lower respiratory infection using nanopore based metagenomic sequencing
Infectious disease diagnostics and antibiotic prescribing
  "In a world of rising abx resistance, culture based diagnostics can be a bit like Brexit"
Problems:
  UTIs: trimethoprim replaced with nitrofurantoin (an inferior agent) due to resistance
  gonorrhoea: ciprofloxacin replaced with cephalosporins
  trimethoprim/cipro could be used in 60-70% of cases with rapid detection of susceptibility
Metagenomic sequencing-based infection diagnostics:
  agnostic pathogen detection
  detection of resistance markers
  epi, infection control, transmission, virulence
  enable narrow-spectrum antibiotic trials and use
MinION:
  speed is of the essence
  do multiplexing of clinical samples
  major advancements in technology in past few years
Major challenge: sequencing the pathogen in a vast excess of human DNA; human:bacterial ratio of 10^9:1 in septic blood
Solution: host DNA depletion or pathogen DNA enrichment
  have proprietary and in-house host DNA depletion strategies
  up to 10^6-fold reduction in human DNA levels
Resistance islands:
MinION UTI diagnosis: Schmidt et al. 2016, JAC - simple use case; decided to move on to respiratory tract infections after this
Lower respiratory tract infections: third most common cause of death, leading infectious cause
Is this a killer app for nanopore seq?
  high bacterial load: need 1 ng for lib prep, ~10^6 bacterial cells
  HAP/VAP mainly bacterial in aetiology
  thresholds can be used to remove contamination and barcode cross-talk
  preprint on bioRxiv:
  tested on ~80 samples
  shortened lib prep and overall procedure
  depletion not as good in sputum
Results - optimised method assay performance: 96.6% sensitivity; 41.7% specificity (LoD ~10^4 CFU/ml)
AMR story is much more complex
  detected a bunch of genes; hard to tell if those genes were associated with any particular pathogen
  looking at a new approach now: rather than detecting resistance genes, could we predict resistance and susceptibility based on the lineage of the strain detected?
Is human DNA depletion necessary? Yes, get much lower coverage with the non-depleted approach
Molecular epi, infection control
  (Pipeline) fast5 -> fastq, exclude <2000 bp, porechop, minimap2 for mapping, canu for assembly of mapped reads
  48h seq data: 34 contigs, longest 416 kb
Hospital-acquired pneumonia
  INHALE - pneumonia diagnosis study:
  5-year NIHR programme grant
  diagnostics evaluation study followed by RCT
  again using MinION
  took about 6 hours and cost about 120 USD per sample
  performance of MinION looking good but haven't finished full analysis yet
Summary:
  metagenomic sequencing-based identification and diagnostics has come of age
  sample prep is key
  enrichment/depletion/targeting makes sequencing-based diagnosis faster and cheaper
  real-time pathogen identification and resistance marker detection
Question: concerns with low specificity and treating inappropriately in the clinic
  he thinks that they have detected things that were there; the question is whether it matters and what to do with that information
  do we want to set clinical cutoffs? A number of reads that is clinically relevant?
  will take getting used to, moving from clinical culture to metagenomic data
  of course there is always a trade-off between sensitivity and specificity
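For readers less used to diagnostic accuracy metrics, the sensitivity and specificity quoted in the talk are simple ratios taken against a reference method (presumably culture here). The sketch below assumes hypothetical counts chosen only so that the ratios land near the quoted 96.6% and 41.7%; the real counts are in the preprint and were not given in these notes.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of reference-positive samples the assay also calls positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of reference-negative samples the assay also calls negative."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts, NOT taken from the study.
tp, fn = 28, 1   # reference-positive samples
tn, fp = 5, 7    # reference-negative samples

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity={sens:.1%} specificity={spec:.1%}")
```

A low specificity measured against culture does not by itself mean the extra detections are wrong, which is the point raised in the question at the end of the talk: the reference method may simply have missed organisms that sequencing picked up.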

Rita Colwell, Translating Metagenomics into Clinical Reality (Dr. Colwell unable to attend; presented by Dr. Nur Hasan, CSO of CosmosID)
AMR is a fast-growing threat
Community resistome: collection of all antibiotic resistance genes and their precursors carried by a community of both pathogenic and non-pathogenic commensal bacteria
Today's micro is transforming -> metagenomics
Needs to be easy to use and fast to be clinically relevant
CosmosID -
Possible sources of error in clinical metagenomics workflow:
  sample contamination: controls need to be included in each sampling run
  compositional bias; factors contributing to bias:
    genome size
    diverse taxonomic group abundance
    GC content
    genome status (complete vs not)
    type of microbes
    pan genome: closed pan genome is harder than open pan genome
Compared different library prep approaches on multiple different species
Many samples are overwhelmed with host DNA
  can either deplete or sequence deeper; they opted for sequencing deeper
  host DNA depletion can produce bias
Sequencing depth is sample dependent; LOD is affected by sequencing depth
Why sub-species/strain-level ID is important:
  there are definitely more clinically relevant subtypes
  commensal vs pathogenic
CosmosID:
  curated microbial genome databases
  machine learning
  strain-level resolution powered by curated databases
  participated in the CFSAN Pathogen Detection Challenge, won multiple awards
  in-house and independent benchmarking
Making an impact with strain-level identification:
  UTI as an example of low-biomass samples where they can still detect resistance; bioRxiv preprint -
  31h workflow
  wound biofilm identification and AMR characterization
    detection of lots of non-target micro-organisms
    polymicrobial infections more frequent, may be the rule rather than the exception; role in pathogenesis?
  infant gut microbiome study
    C. diff occurrence: found in all samples, differentially abundant genes
  microbiomes of blowflies and houseflies
    detection of H. pylori virulence and AMR-associated genes
New NGS service: poster 19

Monday, September 24: Session 2

Maia Majumder, MIT: Digital Disease Data for Outbreak Surveillance
Primary aims:
  assessment of transmissibility
  case count projections
  transmission risk analysis
Critical barriers:
  emergence due to ecosystem disruption
  globalization
  insufficient public health infrastructure
Exciting opportunities:
  novel data sources
  advances in computing
  interdisciplinary collaboration
3 classes of digital disease data (DDD):
  ETL-generated line list data
    extract, transform, load
    free text, has to be processed; often first done manually, should be automated
    result: line list data
  news and social media data
    Twitter, Google News, also aggregators
    HealthMap platform: publicly available, automated monitoring of >250 diseases, 15 languages, accurate death counts:
  Google search trends: public API, can help improve epi curves
3 classes of math models:
  network
    transmission risk analysis
    captures heterogeneity
    probabilistic mechanistic model of how individuals interact with each other and how disease is transmitted
  compartmental
    ODEs, e.g. SIR
    deterministic or stochastic
    strength: parameterization
    weakness: homogeneity assumptions
  phenomenological
    non-mechanistic (descriptive) model of population-level trends in transmission dynamics (e.g. reproduction number)
    even fewer parameters than a compartmental model
    can tell you what is happening but not why (not mechanistic)
How can we use what we have learned from DDD to inform how we use NGS data? Especially how we integrate NGS data with traditional epi data
Case studies
MERS in S Korea
  200 cases, largest outbreak outside Saudi Arabia
  extreme super-spreading as a result of a single individual who went "hospital shopping"
  Q1: can we reconstruct the transmission network using only public data?
  Q2: can we analyze risk factors for transmission?
  Data: text, in Korean from MOHW and in English from WHO, not machine readable
  "Nosocomial amplification of MERS-coronavirus in South Korea, 2015." -
  Challenge: network construction
    ETL: Korean -> English, text -> csv (first manual, eventually automated), update db
  Results: 84% of cases caused no transmissions, 16% caused all secondary infections
  Transmission risk analysis
    binary classification by logistic regression: caused secondary infection vs not
    Result: single best predictor was death
    possible mechanisms: higher incidence of LRT invasion among fatal cases, higher viral load among fatal cases
    hope to use this to identify cases for detailed contact tracing
  Parallel work: ETL automation, Ghosh et al. 2017 -
Mumps in Arkansas
  data-limited case study
  3000 cumulative cases Aug 2016-Mar 2017, concentrated in 33 counties, mostly in schools
  Q: What % of the communities affected were vaxxed?
  Absence of public data, so used case count data from HealthMap
  Majumder et al. 2017, Lancet ID
  R0 4-7 historically, Veff 90%, t 8-10d
  central question: is what is seen consistent with what we would expect based on how effective the vaccine is, or are people not vaccinating their kids?
  found R0 <1.5, vax % 70-89 (compromised herd immunity)
  example of using DDD to put numbers on vaccine hesitancy without diving into personal/identifiable data
Zika in Colombia
  considered rare, then got a massive epidemic in 2015-2016
  Q1: can we deduce transmissibility?
  Q2: can we project expected case counts?
  Sparse data again; HealthMap data showed an odd shape, likely due to news sources/reporting
  Combined Google search interest with ?
  "Utilizing Nontraditional Data Sources for Near Real-Time Estimation of Transmission Dynamics During the 2015-2016 Colombian Zika Virus Disease Outbreak." -
  R0 1.4-3.8, wide CI, but case counts match well with the data ultimately reported by the Colombian INS
Ethical considerations of DDD
  at aggregate level privacy issues are less problematic, but it is a different story for Twitter data or highly granular geographic data
  just because someone's Twitter profile is public doesn't mean that they want their data used for ID surveillance
  possible solution: confirmation of consent
  right now, no global standard for this
What data streams and modeling methods we use to address gaps in ID surveillance are context dependent
DDD and math modeling are a complement to (not a replacement for) traditional ID surveillance
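The mumps case study turns on the relationship between R0, vaccine effectiveness and coverage. A minimal sketch of the standard textbook relations is given below; it is not the model from the Lancet ID paper, and the function names are made up for illustration. It assumes homogeneous mixing and an "all-or-nothing" vaccine, so that a fraction `coverage * effectiveness` of the population is immune.

```python
def critical_coverage(r0, effectiveness):
    """Vaccination coverage needed for herd immunity.

    Textbook threshold: (1 - 1/R0) / vaccine effectiveness.
    Values above 1.0 mean the threshold cannot be reached by vaccination alone.
    """
    return (1 - 1 / r0) / effectiveness

def effective_r(r0, coverage, effectiveness):
    """Reproduction number once coverage * effectiveness of people are immune."""
    return r0 * (1 - coverage * effectiveness)

# Historical mumps range and vaccine effectiveness quoted in the talk.
low = critical_coverage(4, 0.9)    # R0 = 4
high = critical_coverage(7, 0.9)   # R0 = 7

# Illustrative only: what 70% coverage would do to an R0 of 4.
r_eff = effective_r(4, 0.70, 0.9)
print(f"needed coverage: {low:.1%} to {high:.1%}; R at 70% coverage: {r_eff:.2f}")
```

Under these assumptions the coverage needed is roughly 83% to 95%, so a community vaccinated at 70-89% can sit below the herd immunity threshold, which is the sense of "compromised herd immunity" in the notes.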

Tom Schenk Jr., KPMG: Predictive Analytics, Cities and Public Health
Slides:
technology - of the people, for the people, by the people
how can people benefit from such data:
  Chicago has released lots of open data
  got a community of local people working on this data to create new things for citizens
  example: website for when your street will be swept, and where your car was towed to if you forgot to move it; open source, adopted by other cities
  not just sharing data, but sharing ideas and code, so that other cities can benefit
City of Chicago leverages open data to create cooperative relationships, such as with academia
prediction (focus of today's talk)
example: can we create a model that can predict which food establishments are likely to fail? Where do we go first to inspect?
  10 different variables proved to predict critical violations
  taking these variables into account helps to find riskier establishments faster
  finding critical violations about one week faster, which reduces the risk of patrons becoming ill
  data and code are open and available on GitHub:
  challenging other cities to "beat" the current model
example: West Nile virus
  moderate issue in IL
  collect and test mosquito DNA, share data
  Data on mosquitoes
  use generalized linear mixed-effect models to predict WNV before positive tests come in; can get out there and spray earlier
example: clear water
  have to determine whether there will be elevated fecal indicator bacteria in Lake Michigan to tell people if the water is safe to swim in
  culture-based = too slow to inform swimmers for a particular day
  models incorporate the prior day's data, but yesterday's bacteria levels correlate very poorly with today's levels
  problem isn't overall accuracy, it is sensitivity
  solution: hybrid model
    turning it from a prediction problem into a missing values problem
    clusters beaches
    sensitivity went up to 11%, precision grew, false positives remained constant
  multivariate hybrid model didn't result in better model performance, likely because qPCR testing is capturing what you need to know from that day
  project done by citizen scientists
example: FINDER for predicting food establishments with critical violations
  optimization
  evaluations
questions re how to interact with people from hackathons - treat your citizens as stakeholders

Richard Goater: Pathogenwatch: A Global Platform for Genomic Surveillance wgsa now renamed to pathogenwatch (wgsa available until end of 2019) got lots of new fancy things you can do interactive trees What's new? improved tree pipeline, now up to 1000 genomes trees build faster, now created in golang can interrogate metadata while you build tree sign-in required to upload data data is private by default Clean way to see single samples in genome reports want to support pathogens from WHO priority list cgmlst for 18 schemes clustering single-linkage, allele differences test with 24,000 salmonella: 20 mins, 19 GB RAM (4 mins from cache) reasonable correlation with phylogenetic trees [demo showing exploration of clusters, correlating with map] Uses "Runner" Architecture reproducibility with docker standard stream interface assembly via stdin, results to json via stdout easier to integrate new analysis adding new analysis package tool in docker; assign to taxa in a json file; add visualization

Questions: Tell us more about visualization network visualization; layout run on client For large databases (e.g., allele schemes) are they also in docker? yes, the libraries/schemes are all in docker image
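
The single-linkage clustering on allele differences mentioned in the talk can be sketched as below. This is a minimal illustration and not Pathogenwatch's implementation: the profile format, the handling of missing loci, the threshold, and all names are assumptions.

```python
from itertools import combinations


def allele_distance(a, b):
    """Count loci where both profiles have a call and the calls differ.

    Loci with a missing call (None) in either profile are ignored,
    which is an assumption made for this sketch.
    """
    return sum(
        1 for x, y in zip(a, b)
        if x is not None and y is not None and x != y
    )


def single_linkage_clusters(profiles, threshold):
    """Group genomes so that any pair within `threshold` allele
    differences ends up in the same cluster (union-find)."""
    parent = {name: name for name in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if allele_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical four-locus cgMLST profiles.
profiles = {
    "A": [1, 1, 1, 1],
    "B": [1, 1, 1, 2],
    "C": [1, 1, 2, 2],
    "D": [5, 6, 7, 8],
}
print(single_linkage_clusters(profiles, threshold=1))
```

A and C differ at two loci, yet they share a cluster because each is within one difference of B. This chaining is characteristic of single linkage.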

Anamaria Crisan: A Method for Systematically Surveying Data Visualizations in Infectious Disease Genomic Epidemiology Slides: on biorxiv - and Bioinformatics - How do people choose a data viz? Systematic review of data viz Literature analysis phase and then a visualization analysis phase (qualitative and then quantitative) Some parts automated, some manual Why, how, and how many Automated text mining to identify topics Visualizations analyzed manually 18000 pprs -> topic clustering by tsne Articles clustered around pathogens adjutant! - literature analysis tool in R 801 figures, 49 tables "missed opportunity tables" : data that should have been visualized GEVIT: genomic epi visualization typology; doesn't yet tell you if a particular viz is the best one for the job most common: phylogenetic tree, followed by table chart combinations were common, also often overlaying metadata gevit in action breaking down visualization chart type, chart combination, chart enhancement available in gallery - impact move away from ad-hoc visualizations (awareness of alternatives) implications for tool dev need support for complex and expressive visual design lots of manual steps that happen when making a composite tools next steps automate analysis (literature analysis, etc) use taxonomy (gevit) to create a gevit API R integration for visualization with ggplot Plug: Graduating soon, looking for employment!

Questions Do these represent formalisms in scientific community? sometimes people do follow common trends. Important to have an awareness of different purposes for different visualizations. Did you find examples of chart types overrepresented just from common software? yes, some software visualizations did show up in our analysis

James Davis: Using Machine Learning to Predict Antimicrobial Minimum Inhibitory Concentrations and Associated Genomic Features for Nontyphoidal Salmonella Overall goals: to develop tools to predict AMR phenotypes given genome predict genome regions associated with AMR AMR prediction as a machine learning classification problem not terribly complex "Antimicrobial Resistance Prediction in PATRIC and RAST." - Prediction of S/R in Klebsiella pneumoniae - But what if we cast AMR prediction as a regression problem predict MICs in Klebsiella using XGBoost published in scientific reports - Plug: Poster 88 & 90 in Tuesday's session. In this talk: work on nontyphoidal Salmonella and predicting MICs preprint on biorxiv - FDA collaboration using NARMS isolates from 2002-2016, most are food contamination isolates, some animal isolates wealth of data, but how many genomes do you need for an accurate MIC prediction model? systematically reduced genome number while keeping diversity and measured accuracy of model at each step accuracy starts at 88%, tapers but doesn't hit plateau 4500 genomes = over 95% accuracy, after that start to get very very similar strains accuracy by MIC value if not good sampling for given MIC for given antibiotic, don't do well accuracy by metadata category no obvious biases accuracy of models built from previous years accuracies are lower (partitioning data) but fairly stable around 88% Could this ever be a diagnostic? looking over all antibiotics, 7 meet very major error rate requirements, 8 don't Can look for important resistance and susceptibility k-mers susceptibility k-mers of special academic interest to him ongoing work moving MIC models into PATRIC improve AMR gene sampling looking at global strains classification of nanopore data understanding susceptible k-mers matched patient studies would prediction of what antibiotic to use make a patient better off?

Questions What does the distribution of kmers look like for your model? Data in supplemental materials of the scientific report publication Classification of resistance in metagenomics? Has not worked on this area formally - only pure culture isolates, but has ideas.
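
Two ingredients of the MIC-as-regression framing can be sketched without any ML library: turning a genome into k-mer counts, and scoring predicted MICs. The notes do not say how accuracy was defined; scoring a prediction as correct when it falls within one two-fold dilution of the measured MIC is an assumption here, and the sequences, MIC values, and function names are invented for illustration. The XGBoost model itself is not reproduced.

```python
import math
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a nucleotide sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def within_one_dilution(true_mic, predicted_mic):
    """True if the two MICs differ by at most one two-fold dilution step.

    MICs are measured on a doubling scale, so the comparison is done
    on log2 values (assumed scoring rule for this sketch).
    """
    return abs(math.log2(true_mic) - math.log2(predicted_mic)) <= 1.0 + 1e-9


def dilution_accuracy(true_mics, predicted_mics):
    """Fraction of predictions within one two-fold dilution of the truth."""
    hits = sum(
        within_one_dilution(t, p) for t, p in zip(true_mics, predicted_mics)
    )
    return hits / len(true_mics)


# Invented measured and predicted MICs (ug/mL) for four isolates.
measured = [4, 8, 16, 2]
predicted = [8, 8, 2, 2]
print(kmer_counts("ACGTAC", 2))
print(dilution_accuracy(measured, predicted))
```

In a full pipeline the k-mer counts would form the feature matrix and the log2 MIC would be the regression target.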

Tuesday, Sep 25

Human and Environmental Microbial Health: A Global Perspective - Jack A. Gilbert (@gilbertjacka) "Once the diversity of the microbial world is catalogued, it will make astronomy look like a pitiful science" Earth microbiome project: work together to standardize methods so that people/data harmonize better 400+ collaborators, 100k 16S'd samples Most useful descriptors: Host-assoc. (rhizosphere most diverse) Animal v. plant Site Free-living (sediments/soil most diverse) Saline Non-saline ... Use to create ontologies for more useful cataloging Huge sampling depth reveals a continuum rather than discrete groups "Microbial nestedness" - phyla tend to be gained but not lost Sub-OTUs (via deblur/DADA2) are habitat-specific Predict taxa/communities based on environmental data 65-75% accurate Use model to predict functions: picrust Nested prediction: microbial products More fine-grained geography: great prairie areas with or without agricultural nitrogen Using "undisturbed" areas, e.g. graveyards Fierer et al Science 2013 Verrucomicrobia disturbed by add'l nitrogen Brewer et al Nature Microbiology 2016 Highly distributed single species Network analysis of soil bac/arch/fungal communities by climate reveals region-specific (highly adaptive) vs. flexible communities Ma et al, ISMEJ, 2016 Organisms in the north highly correlated with few organisms, opposite relationship further south Consequence of environment, moisture Fertilization reduces soil bacterial diversity Wood et al 2015 J Applied Ecology Microbiome-wide association study: association between microbial activity and disease Example of C. 
diff carriage maintained by particular community Mosquitoes attracted by particular microbial molecules--microbiota transplant shifts bite location Fruit-fly-fancy: microbiota transplant -> recipient mate preference the same flies as the donor Modulating microbes as an asthma therapy Hutterites: 25%, Amish, 4% asthma prevalence Immunological differences: Amish have more (and younger) neuts, fewer eos: old neuts more proinflammatory (higher turnover in Amish b/c see more bugs?) Mice treated with dust from Amish vs. Hutterite houses Treatment with Amish house dust -> less allergic/asthmatic response Cooties are real: 38 million microbes released per hour Microbial fingerprinting: mapping communities back to individuals (people and pet) Reducing stress on microbiome as a medical goal

Questions: How stable is allergic phenotype? Need to change it during development, hard to perturb adult animal

Yu Wang: The Genetic Diversity of Salmonella and Listeria Isolates from Food Facilities Since 2008, WGS used at FDA How to interpret WGS analysis (snp distance thresholds, bootstrap support, tree topology) snp distances established by comparing isolates from the same facility and figuring out base level from that isolates from same facility should be fairly similar, doesn't always hold up q1: facility match : what is the probability that 2 isolates are collected from the same food facility if their genetic distance is no more than d snps? q2: what is the probability that the genetic distance of 2 isolates is no more than d snps if both of them are collected from the same food facility fda: most focus on q1 Data came from SRA, FACTS (FDA/ORA), GIMS (FDA/CFSAN), and Pathogen Detection (NCBI) 6,351 Salmonella isolates (2,196 facilities, 779 clusters), 5321 listeria isolates (846 facilities, 248 clusters) listeria: used CFSAN SNP pipeline to calculate snp distances after got pairwise distances, calculated probabilities divided into 4 groups low distance negative, low dist positive; high distance negative, high dist positive at 20 snp distance; probability of two listeria isolates from same facility is 70%; for salmonella 65% at 0 snp distance; probability of two isolates from same facility is 94% for Listeria and only 82% for Salmonella critiquing results possible biases sampling not random; dirty facilities overrepresenting data; isolate-pair classified as +/- only by responsible firms; methodology; unknown patterns in supply chains variability (experiment to experiment; sampling noise) Low genetic distances (# of SNPs) between isolates from apparently different places can signal false negatives after further digging evaluation of false omission rate (FOR) by examining false negative pairs within different thresholds of SNP distances differences in probabilities of facility matches between listeria/salmonella, even after bias adjustment Subsampling: 2 isolates 
randomly selected from each facility with at least 2 isolates, repeat summary for salmonella/listeria: if genetic distance <= 20 SNPs; likely collected from same facility but, differences in probability for salmonella and listeria Closely related isolates (<10-20 SNPs) generally originate from the same facility A similar set of slides can be seen here :

Questions Re: false negatives--any reports on de novo mutation rates? Not known Any suggestions for food facilities? How many samples? (re: sampling bias) Targeting facility is more important, more swabs are better, but hard to give a specific number 3-5k swabs sequenced annually by FDA, follow-up sampling differs from initial
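
The two questions from the talk (q1 and q2) are conditional probabilities over the same four groups of isolate pairs. The sketch below estimates both from a list of labelled pairs. The pair data, the threshold, and the function name are invented for illustration; this is not the FDA analysis, and the false omission rate evaluation is not covered.

```python
def facility_match_probabilities(pairs, d):
    """Estimate q1 and q2 at SNP threshold d.

    pairs: iterable of (snp_distance, same_facility) tuples.
    q1 = P(same facility | distance <= d)
    q2 = P(distance <= d | same facility)
    """
    low_same = low_diff = high_same = 0
    for distance, same_facility in pairs:
        if distance <= d and same_facility:
            low_same += 1
        elif distance <= d:
            low_diff += 1
        elif same_facility:
            high_same += 1
    close_pairs = low_same + low_diff
    same_pairs = low_same + high_same
    q1 = low_same / close_pairs if close_pairs else float("nan")
    q2 = low_same / same_pairs if same_pairs else float("nan")
    return q1, q2


# Invented isolate pairs: (pairwise SNP distance, collected at same facility?)
pairs = [
    (0, True), (3, True), (15, True), (40, True),
    (5, False), (18, False), (60, False), (120, False),
]
q1, q2 = facility_match_probabilities(pairs, d=20)
print("q1 (same facility given close):", q1)
print("q2 (close given same facility):", q2)
```

The two numbers answer different questions and need not agree, which is why the notes record that FDA focuses mostly on q1.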

Nikki Shariat: CRISPR-SeroSeq: A Novel Amplicon-based Tool For Probing Salmonella Serovar Diversity Salmonella is the leading bacterial cause of foodborne illness 2500 serotypes, combos of O and H antigens not all serovars are created equal, how and which species they are found in also geographic differences seen thermal resistance toxin/dna damage want to survey serovars present in one sample culture, then pick some clones for serotyping when we're picking clones, we are missing minority serovars developed crispr-seroseq: amplicon based ngs tool to identify serovars 2 crisprs are present in salmonella create a crispr "profile" spacer differences in crispr can be used to subtype salmonella in single serovar: spacer content is highly conserved however there is spacer variation within a serovar that can be used to subtype a serovar somewhat similar to spa typing in S. aureus, spoligotyping in TB can we harness serovar-specific spacers to map to serovars? how (CRISPR-SeroSeq): sample, grow everything dna isolation get pcr products, not sure what the primers were sequence -> match to crispr-seroseq database Pilot study: Was able to identify minority serovars in a population that would have required impossibly large number of isolates from colonies Next step: poultry Serovar diversity in poultry has fluctuated when salmonella is detected and flocks are culled, new serovar will come in and take over after Poultry sample collection: bootsock samples fecal samples 2 days (farms, chicken houses, processing plants) CRISPR-SeroSeq can identify serovar populations in poultry lots of heterogeneity in number of serovars in samples were also able to separate between multiple strains within certain serovars Slight aside: Salmonella Kentucky: found in domestic food animals, infrequently isolated from humans 2 groups, asked if one group was found more frequently group 1 was seen throughout food animal samples they studied, so where is group 2 coming from? 
group 2 more prevalent in South Africa (?) and Europe groups have different resistance profiles suggests they come from different ecological environments Back to poultry data: they were able to identify a group 2 Kentucky strain highlights importance of being able to identify background serovars CRISPR-SeroSeq can be used to investigate Salmonella populations in different Ag systems can we examine the phenotypes of minority serovars Switching gears: Salmonella is cultured on a suite of different enrichment media TT and RV most common sometimes you see differences between the two worked to address the potential biases of different enrichment media question: is there a difference between enriching different volumes of samples? no, there doesn't seem to be much of a difference found that as a minority, S. Enteritidis is preferentially detected in TT Next project: mapping Salmonella populations on the Susquehanna River address persistence and transmission of Salmonella in water 4 day sampling trip, mapped 30 sites detected multiple serovars in all samples (up to 6) of 30 sites, 25 were positive for salmonella Salmonella Give is most frequently detected serovar enrichment broth influences serovar diversity dominant serovar can flip depending on which enrichment broth you use Applications for CRISPR-SeroSeq working on developing web-based tool so people can plug in sequence data and get Salmonella Serovar content 4 potential future applications: predict/model serovars emerging in niches pre/post antibiotic treatment what is the diversity of salmonella serovars in food how good/biased are salmonella enrichment procedures Environment: What is the risk of agricultural land use to fresh water? How long can salmonella persist? How far can Salmonella be transmitted? Question: can humans be infected with multiple serovars? 
she doesn't know but would bet that they could be Just a thought (complete speculation on my part): There may be a selection/enrichment effect similar to that seen in the various media types. A mixed serovar culture on the way in might not have the same component ratios on the way out. Could even be a single serovar.

Caroline Barretto: A Validation Approach of an End-to-End Whole Genome Sequencing Workflow for Source Tracking of Listeria monocytogenes and Salmonella enterica Nestle: wide product portfolio, worldwide both factories and R&D Have a program to sample for pathogens in both raw materials and finished products see a clear benefit in WGS in factory setting, have been using for a while now Benefits of WGS for source tracking in food industry: knowing root cause of contamination, avoid recurring events can lab be source? is isolate persistent in factory env? is isolate unique to factory? corrective actions needed? WGS gives precision, insight, provide leads early on Nestle Research: development, validation, support for deployment to operations isolate, DNA extract, seq on hiseq/miseq, bioinformatics and interpretation Bioinformatics pipeline consists of CFSAN pipeline v1 and GARLI Validation approach based on Struelens 1996, Van Belkum 2007 Stability: selected major serotypes of Listeria and Salmonella related to food Reproducibility: .. Repeatability Discriminatory power Epidemiological concordance ( reanalyzed Jackson et al. 2016: L. monocytogenes outbreak ) for Salmonella : Inns et al. 2017 .. ref to their ppr Is pipeline robust to change in platform, software? PacBio vs Illumina contigs -> no impact SPAdes vs skesa: 6x faster using skesa, same number of snps and conclusions Nestle partnering with authorities, universities, industry towards ISO standards Thank you to bioinformaticians for making software freely available, impact to food safety industry

Arthur Pightling: Measuring the Influences of Contamination on Whole-Genome Sequence Analyses of Foodborne Pathogens Questions: What levels of contamination in Illumina datasets influence downstream analyses? How can we detect levels of contamination that influence downstream analyses? going to make a large fastq and assembly dataset available on figshare Design identify organisms in GenomeTrakr with closed assemblies closely related to another isolate Simulated reads from those genomes to contaminate and shuffle the dataset, also inter species contaminants, e.g. listeria with salmonella pruned trees to less than or equal to 20 taxa use CFSAN SNP pipeline and cgMLST to analyze simulated "contaminated" datasets and record distances from focal individuals to closely related isolates 8 subjects of each species of interest were analyzed Pipelines tested: CFSAN, LmCGMLST SNP matrices generated by CFSAN SNP pipeline and analyzed with GARLI recording bootstrap reports for monophyly and nearest neighbor Metrics from pipelines and assemblies are good indicators of contamination Results from Listeria with CFSAN SNP pipeline, contamination doesn't seem to influence SNP counts until 40% contamination within species contaminants causing the biggest change interspecies don't cause trouble because they aren't mapping to the reference strength of reference guided approach bootstrap support: see drops at the 40-50% contamination level Results from Salmonella similar: see influence at 50% contamination same results with within species contamination causing the most trouble bootstrap support: see drops at around 30% contamination Results from E coli same general patterns cgMLST seems to be a little bit more sensitive to contamination started seeing false positives at as low as 10% see more effect with interspecies contaminants than SNP pipeline, especially at high levels Metrics you can use to detect contamination MLST: average missing alleles and partial alleles worth doing this 
in your data, definitely informative Average N50 drops quickly with contamination Assembly length NOT as useful although if your assembly is much longer than it should be, that could be a red flag Number of contigs if you count the number of contigs above 500 nucleotides, good indicator of contamination threshold for this can differ by bug Question: do you use external tools like MASH? k-mer hashing is greatly affected by contaminants Comment: consensus base calling threshold of CFSAN can explain these results
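
Two of the assembly metrics named above, N50 and the number of contigs above 500 nucleotides, are simple to compute from a list of contig lengths. The sketch below uses invented contig lengths for a clean and a contaminated assembly; the numbers are illustrative only and no alert threshold is implied, since the notes say the threshold differs by organism.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total of sorted
    (longest first) contig lengths reaches half the assembly size."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def contigs_above(contig_lengths, min_length=500):
    """Number of contigs longer than min_length nucleotides."""
    return sum(1 for length in contig_lengths if length > min_length)


# Invented contig lengths (nucleotides).
clean = [3_000_000, 1_500_000, 400_000, 300]
contaminated = [800_000] * 4 + [20_000] * 50 + [1_000] * 300

for label, lengths in (("clean", clean), ("contaminated", contaminated)):
    print(label, "N50:", n50(lengths), "contigs > 500 nt:", contigs_above(lengths))
```

In this made-up example the contaminated assembly shows the pattern described in the talk: a much lower N50 and many more contigs above 500 nucleotides, while the total length changes little.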

Chao Jiang: Dynamic Human Environmental Exposome Revealed by Longitudinal Personal Monitoring Exposome: total env exposures, "personal bubble" of 10^3-10^7 microbial particles, biological and abiological (also physical) Important role in allergy/asthma Can we track env exposures at the personal level? adapted an existing device to develop a wearable device for sampling Sampling: DNA -> sequencing one person sampled for the longest time and across the widest geo region, remaining participants stayed mostly local Questions: what constitutes the exposome? plant pollen, spores, brochosomes (from insects), inorganic materials Sequencing, excluded human reads, identified 2500+ species in > 70 bill reads Dynamics at individual level What drives dynamics? spatial, env, technical variables - spatial and env played largest role individual genera subjected to diverging influences from different sources PC plot: location is important Seasonal effect: able to reproduce ecological findings from other studies ML model can predict season given sample Chemical features location and season dependent dynamics correlation between biotic and abiotic: sCCA Evolutionary, ecological and health implications exposure to pathogens: rare K pneumoniae, etc.. not validated with further assays Nt diversity and SNP density highly correlated Beyond taxonomy: intraspecies diversity ..? mainly in bacteria and viruses? Species interactions: map species found on to global interaction network

Mushal Allam: Whole-genome Sequencing Analyses to Investigate a Nationwide Outbreak of Listeriosis Caused by Ready-to-Eat Processed Meat Products, South Africa, 2017-2018 Unable to attend due to visa issues

Blake Hanson: Nanopore Sequencing to Understand AMR we've had a sweet period for some years when AMs worked first effective treatment with penicillin was in 1942 1945 Nobel Prize for penicillin, then 1948 resistance to penicillin (paraphrasing the presenter: now we are in deep, deep trouble) considering bang for the buck, we get a lot more out of am drugs than many others we may be exiting the antibiotic age thus, enter antibiotic stewardship, aka how not to throw the remaining ams away genomics good identify pathogens when traditional diagnostics fails allows for strain typing detect am genes and virulence factors can go beyond single bacterial genomes and look at community bad cost lack of widespread availability of sequencing infrastructure difficult analyses rapidly evolving technology limited by what we don't know: need better database of resistance mechanisms Barriers to implementation turnaround time gen-phen correlation analytical pipelines critical to last two: well curated, dynamic ref dbs Horiz gene transfer largely drives AMR Resolution of transposon repeats / clin e coli isolate of unknown origin TZP resistance: saw increasing # copies of bla gene is this high cp number plasmid? Nanopore: longest read 67kb, identified 5 repeats, accuracy 72% aligned to 5-mer of repeat: multiple reads align to either side of traI gene with island interrupting the gene tandem repeats insert into chromosome confirmation of multiple tn4401 transposons on single plasmids sequenced with minion had to go the long way around the plasmid found two copies 1 large plasmid 181kb containing two separate copies of tn4401 characterization of extensively drug resistant p. 
aeruginosa ppr submitted extremely drug resistant using long reads, was able to find two beta-lactamases created full, perfect assemblies, from two genomes that were very drug resistant genomes nearly identical, but many mobile element differences MRCA couple of years ago characterization of multiple resistant isolates from single pt medical tourism e coli, klebsiella, candida (not sequenced), pseudomonas sequenced E. coli and K. pneumoniae - see similar resistance mechanisms (few bp differences) large deletion in one of them Comparison of two co-circulating strains of k pneumoniae chromosomal similarity mobile elements driving structural changes of bacteria plasmids more diverse, but some similar MGEs (identical transposon structures) barriers to implementation turnaround time need better assemblers still going with canu need them to maintain structural variants should be scalable with higher sequence coverage dealing with long reads with circularization genotype-phenotype correlation larger surveillance of diversity of amr mechanisms analytical pipelines better implementation of methods/dbs to allow for community efforts to sequence a variety of bacteria for assessment of AMR mechanisms Questions Heterogeneity in colony of AMR genes? we are looking into it, but don't report on this now What about short read pipelines for polishing long reads? we are using such pipelines to get enough accuracy. Do confirm (visually) for correct pileups. Issues with just long read polishing. Are you just relying on nanopore now (not pacbio)? cost is an issue. also issues with pacbio - longer reads lead to higher errors. Svant? software coming out? Should be released on github with publication.

Mike Feldgarden: AMR Resources at NCBI’s Pathogen Portal NCBI's role: create repositories for AMR ... Data: SRA, genbank, GenomeTrakr, etc. AMRfinder: combined BLAST/HMM AMR gene discovery Example: Routine genomic surveillance finds the 4th US mcr-1+ isolate (Vasquez et al 2016, MMWR - stool sample from pediatric pt, O157 isolated and sequenced -> SRA amr module detected an mcr-1 gene CDC notified, published Novel gene discovery: FosA7 Fosfomycin is used to treat uncomplicated UTIs In 2017 AMRfinder identified a possible Fos gene that was widespread (in 100s of isolates) Rehman et al 2017: novel gene, glutathione transferase - found in ~2.5% of salmonella Building an AMR db starts with protein domain experts, large scale dbs, manual extraction from literature HMM Lots of manual curation: standardize names, verify HMM cutoffs, curate start sites, only full-length genes Haft 2018 - Utility of HMMs: beta-lactamases in genbank Refseq vs Genbank (refseq curated by NCBI, Gb user-submitted) Only 11% of sequences labelled b-lacs are actually b-lacs Berglund et al 2017 microbiome - Lessons learned: issues with nomenclature two partial, overlapping, incomplete nomenclature systems used ramirez 2010 many synonyms: CARD is a good resource for AME synonyms trying to get agreement too many things called OXA, some as different as yeast v human AMRFinder Comprehensive test case: NARMS data 1000s of genomic and AST data to confirm genotype predictions tested against resfinder (was done a while ago, could be changes) examined ~6000 isolates 88% calls identical between two systems some misclassifications due to missing sequences (250 Resfinder, 0 AMRFinder) underspecification (5 AMRFinder, 0 ResFinder) overspecification - novel or partial sequences (0 AMRFinder, 977 ResFinder) phenotypes consistency was high resistance markers for fluoroquinolones are linked with intermediate susceptibility but not clinical resistance in S. 
enterica lessons overall, high consistency some interesting results in data antibiogram submissions Need more!!! [live demo] future improvements to AMRFinder expanded organism point mutations drug classes affected by gene/mutation AMR gene browser

NCBI Resources Curated AMR Gene download: AMR HMM Download: Table of AMR gene accessions and names: Isolate Browser:

Questions Is there any support to add amr/phenotypic data afterwards? Yes, support for this. Did any antibiogram data you have had validated HMM data? HMM models don't always work. good luck with beta-lactamases. Difference between NCBI/resfinder (NCBI full genes, ResFinder partial genes). We found some differences in false positives with own comparison. Is this related to differences you saw? Probably not. Most of AMR genes we assembled were complete. Did not have a lot of partials.

Emily Snavely: Development and Validation of a Clinical Whole-Genome Sequencing Pipeline for the Detection of Antimicrobial Resistance Genes in Bacterial Isolates AR Lab Network Detect, respond, prevent, innovate concerned with specific resistances (specific organisms) uses real-time PCR to detect CRE, MDR PA, colistin resistance AMR Pipeline development choose bioinformatics tools (spades, mlst, abricate, etc) AMR Pipeline: illumina data -> trim reads for Q -> species ID -> de novo assembly and QC Finding AR genes: ABRicate uses 3 dbs NCBI, ResFinder, ARG-ANNOT pipeline compares genomic location of gene, pick gene with max cov Reporting: capture identification, mlst, assembly qc, ar genes (40% coverage over 60% identity), plasmid replicons Validation CLEP NGS validation guideline assessing reproducibility intra-assay three clinical isolates analyzed in triplicate inter-assay three clinical isolates analyzed on 3 diff days accuracy verification 62 isolates across multiple organisms WGS compared to AST and real-time PCR for one example organism 100% correlation btwn PCR and AMR pipeline WGS provides more conclusion WGS AMR pipeline helps identify AR genes not detected by other methods
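
The step "pipeline compares genomic location of gene, pick gene with max cov" can be sketched as merging hits from several databases and keeping, for each overlapping location, the hit with the highest coverage. This is a hypothetical illustration and not the presenter's pipeline or ABRicate's API: the `Hit` record, the greedy overlap rule, and the example coordinates are assumptions, and the default thresholds simply reuse the 40% coverage and 60% identity figures from the notes.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One AR gene hit reported against one database (hypothetical record)."""
    database: str
    contig: str
    start: int
    end: int
    gene: str
    coverage: float  # percent of the reference gene covered
    identity: float  # percent nucleotide identity


def overlaps(a, b):
    """True if two hits share at least one position on the same contig."""
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def best_hits(hits, min_coverage=40.0, min_identity=60.0):
    """Filter hits by thresholds, then keep one hit per genomic location,
    preferring the highest coverage (ties broken by identity)."""
    passing = [
        h for h in hits
        if h.coverage >= min_coverage and h.identity >= min_identity
    ]
    kept = []
    for hit in sorted(passing, key=lambda h: (-h.coverage, -h.identity)):
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: (h.contig, h.start))


# Invented hits from three databases for one assembly.
hits = [
    Hit("ncbi", "contig1", 100, 980, "blaKPC-2", 100.0, 100.0),
    Hit("resfinder", "contig1", 110, 980, "blaKPC-3", 98.0, 99.8),
    Hit("argannot", "contig2", 5, 300, "aac(3)", 35.0, 90.0),
    Hit("ncbi", "contig2", 1000, 1800, "sul1", 100.0, 99.5),
]
for hit in best_hits(hits):
    print(hit.contig, hit.gene, hit.database, hit.coverage)
```

Resolving by location rather than by gene name avoids reporting the same locus several times when the databases name it differently.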

David Rasko: Diversity Among blaKPC-containing Plasmids in Escherichia coli and Other Bacterial Species Isolated from the Same Patients The plan: identify samples that contain a carbapenem-resistant E. coli and ask: What is the genomic diversity of the E. coli Are there other species isolates that are also CR from the same samples? Are the mechanisms similar? Potentially shared? The samples 29 subjects with E. coli isolate that was CR in the clinical lab all were PCR positive for blaKPC2 or 3 were from surgical drainage sites, urine, sputum, blood, BAL, and more 10 subjects also had CR organisms of a different organism Are plasmids moving between these species? Lots of diversity of resistance to other antibiotics actual measurements of AMR match the genomics in some cases and in others do not The E coli genomes Obtained all blaKPC2 and 3 E. coli in GenBank A lot more diversity among the E. coli isolates in the current study but still majority of isolates are ST131 or ST648 selected one isolate for PacBio complete sequencing (YDC107) not ST131 or 648, it is phylogroup D, fewer complete genomes of phylogroup D looked like there were a bunch of different plasmids in this isolate Results: no resistance genes in the chromosome 4 large plasmids over 40KB with multiple resistance mechanisms two phage that were not integrated into the chromosome pYDC107_70 is one plasmid 6 resistance mechanisms resistance genes surrounded by lots of transposons (version of this plasmid) also in K pneumo and S marcescens strains in same patient 3 other E. coli strains had similar plasmids but not identical ones mobility of plasmid into other species: functional studies underway rates of transfer mechanisms can we target transfer as an antimicrobial therapeutic? 
pYDC107_184 (another plasmid) appears to be more mosaic than the 70KB plasmid (not conserved) complete plasmid not identified in any other isolate Summary genomics has identified plasmids that appear to be moving between isolates within patient samples is E. coli the sink or just part of the strain? other plasmids not as common but may still be involved in transfer a lot more complete genomes and plasmids needed to get into how this system is working

Question: How many plasmids have you seen in one isolate? have some vtec isolates with 12 independent replicons. For some plasmids, not sure if they are mosaic plasmids. Since these can occur at different frequencies, do you have any tricks? caesium-chloride preps any method to automatically detect if there was horizontal transfer in plasmids? not that I am aware of. plasmids tend to get lost in genome sequencing. hard to do in automated fashion. if you got deep enough sampling of different patients ?? serial longitudinal sampling is only way to do this. some people starting to work on transmission networks.

Finlay Maguire: AMRtime: Rapid Accurate Identification of Antimicrobial Resistance Determinants from Metagenomic Data AMR identification from metagenomes pulling out only reads related to amr genes CARD database pulling out lots of metadata (ARO ontology) multiple models for different AMR homology models, protein variant models, rRNA gene variant models, efflux pump, gene cluster Difficulties with AMR in metagenomes? genes are very rare issues with false positives wildly different abundances some genes present once, others present 100s of times overlaps in sequence space for AMR genes difficult to separate out other constraints (want tool easy to use, minimal resources) AMRtime structure reads -> filtering -> sensitive homology classification -> homology predictions variant predictions metamodel predictions filter homology search as filter (blastx, diamond, etc) doesn't work particularly well k-mer approaches large missing data returned balance in approaches between true positives and false positives sensitive homology class. sequence similarity encoding sequence bitscore matrix (reads x bitscores) imbalanced training data encoding -> SMOTE -> ... initial classifier quite low revised classifier initially classify to AMR family separate classifier in each AMR family cross-validation best: normalised bitscore random forest performed quite well family-level classifiers varied a bit more in performance many good, subset is problematic problematic related to sequence-diversity in family. many shared k-mers -> poorer performance ongoing work benchmarks, comparisons to other approaches threshold identification integrate into CARD and IRIDA conclusion direct homology searches poor but, useful when combined with machine learning k-mer approaches poor

Questions: Did you do benchmarking against ShortBRED? Yes, found ShortBRED struggling on full-sized metagenomes.

Torsten Seemann: How to Write Bioinformatics Software no one will use. Microbiological Diagnostic Unit - oldest public health lab in Australia (1897); national reference lab for Salmonella, Listeria, EHEC. Bioinformatician, honorary microbiologist. Written and maintained >10 packages (prokka, snippy, etc): Unix command line software tools. How to get a bioinformatics headache: see a tweet about a new tool / read abstract, sounds good / fail to find link to source, google / attempt to compile and install / google fixes / finally build it / run on small set / error out / delete and never try again. Unhelpful core dumps are common - lack of good error messages. Blog post on 10 minimum requirements for acceptable tools -> GigaScience requested a paper (PLOS 10 simple rules club was exclusive). Should you stay? Yes if you write tools, yes if you need to identify bad tools as a user. Should you write a new tool? No -> if it exists, if you can't maintain it, if you won't use it. Yes -> if you need it, if you'll use it, if you want others to use it, desire to give back. "Eating your own dog food" - if you develop it, you should have to use it. One of the most popular success stories in this is Amazon: Amazon employees are forced to use the API, not some back channel, so if the API is terrible, they have their own incentive to improve it. Bioinformaticians should use their own tools! Prokka (stands for prokaryotic annotation) - prokaryotic genome annotation, likely Torsten's best known tool. Lessons from prokka: nearly all feedback positive, but maintenance is a burden. Discoverability: choose a home base - where are you going to put your tool so people can find it? GitHub, Bitbucket, GitLab, personal or lab site? Sourceforge (don't use it anymore per Torsten), NOT Google Code. Don't use your university website if it is likely to go away when you finish your PhD or leave. Naming: try to be unique, google first! Misspelling English words is a common approach. Avoid dodgy acronyms. 
(JABBA awards) Just Another Bogus Bioinformatics Acronym (JABBA). First impressions count / Keep It Simple Stupid (KISS). First page of documentation: WHAT does it do? How do I install? How do I run it? Try to keep it in one place. Usability (VHS versus Betamax): VHS won; Betamax was slightly better... but why did VHS win? Sony kept Betamax private-ish... VHS was cheaper and they gave it to everyone. Lesson from VHS vs. Betamax. Expectations from a piece of software: print something useful if no parameters are given ("people use --help for instructions"); always have a help flag, standard format for unix commands; always have a version flag [sorry, Bracken doesn't have this - Jen] or (if you're looking for Bracken) it's also on his earlier point about not leaving student code on university websites... Check that dependencies are installed! - brew/conda do this for you. Always let users control output filenames - if they just overwrite the output files all the time, then you can run 1000x and only get outputs for the last one. Run with minimum parameters - sensible defaults [people are lazy and like defaults]. Validate nums/strings. Command line interface: use the standard "getopt" interface available to you in whatever you're coding in. Unix exit codes: small non-negative integer, loose standards - traditionally 0 = success, anything else is a failure. The $? variable stores the exit code; can 'echo $?' to view it if needed. Use stdin, stderr, stdout: < for stdin, > for stdout, 2> for stderr, | for pipe; allows easy chaining together of tools. Don't invent new file formats! Put headings in your columnar data files (TSV or CSV). Don't use XML for structured data, use JSON or YAML. Keeping audience: "each error encountered during installation will halve your audience". Traditional packaging systems: debian, redhat; cross-platform packaging: brew, conda, others (many people switched to these); language-specific repositories: python (pypi), etc. Publish: preprint (biorxiv); method-focused (Bioinformatics); software-focused (Journal of Open Source Software): alternative model to publish software, focused on quality of software, reviewing on github. Plug it: twitter - ask someone popular you know to retweet it... if you have no followers, no one will see it; blog, conferences. Supporting users: reply to emails, issues page (e.g. github), monitor biostars/seqanswers [if someone asks a Bracken/Kraken question here... prob won't get answered... try github instead plz - Jen Lu], mailing list, update docs, fix bugs. Take home messages!! Make it as painless as possible to install; keep documentation clear/simple; get people to use it BEFORE you publish; people are not judging coding skills (or Torsten isn't, but people might be...) but they will curse you if you waste their time; most users are grateful - leads to free beer [according to Torsten... never seen this myself. 
- Jen ] --- "Some of you owe me beers" - Torsten. A good tool is worth MUCH more than a paper. Torsten - what I am working on next - TorstyVerse suite. Ready: snippy 4.x - rapid SNP calling and core SNP alignments (version 3 has problems that are fixed, so upgrade); shovill 1.x - wrapper around SPAdes to make it faster + better; Nullarbor 2.x - new plugin architecture. Improvements: abricate - AMR gene calling - support NCBI hierarchy and classes; Prokka 1.14 --> ISfinder + AMR, better ncRNA annotation. Planned: Mokka - metagenome annotation; Prokka 2 - genome annotation --> GO terms, plugins, pseudo-genes :) QUESTIONS: "How to write a bad database" -- speaker TBD. "Provide an input example" - some kind of testing/example data with input/output -- GitHub can help you with this, test suites. Can the community provide better supported software - maturity in free open-source software, more long lived, more supportive? A: need more software engineers, but hard to get a salary for one; Canu has a full-time software engineer that maintains it; software engineer + bioinformatician $$$$. "conda will win the war - not really a war": brew is an all-in-one solution; conda allows you to set up separate environments with different versions of software - this is why conda will win.
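Several of the command-line expectations from the talk (help and version flags, sensible defaults, user-controlled output filename, validated numbers, stdin/stdout by default, errors on stderr, exit code 0 only on success, a header row in the TSV) can be shown in one small skeleton. This is only a sketch: the tool name `seqcount`, its options, and what it computes are invented for illustration, and Python's `argparse` stands in for the "getopt" interface.

```python
import argparse
import sys

VERSION = "0.1.0"  # hypothetical version string


def build_parser():
    parser = argparse.ArgumentParser(
        prog="seqcount",  # hypothetical tool name
        description="Count sequences and bases in a FASTA file.")
    # argparse adds -h/--help automatically; --version is explicit.
    parser.add_argument("--version", action="version",
                        version=f"%(prog)s {VERSION}")
    parser.add_argument("fasta", nargs="?", default="-",
                        help="input FASTA (default: stdin)")
    parser.add_argument("-o", "--output", default="-",
                        help="output TSV (default: stdout)")
    parser.add_argument("--min-len", type=int, default=0,
                        help="ignore sequences shorter than this (default: 0)")
    return parser


def summarise(lines, min_len=0):
    """Return (n_sequences, n_bases) for an iterable of FASTA lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)
        elif line and lengths:
            lengths[-1] += len(line)
    kept = [n for n in lengths if n >= min_len]
    return len(kept), sum(kept)


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.min_len < 0:  # validate numbers before doing any work
        print("error: --min-len must be >= 0", file=sys.stderr)
        return 1
    if args.fasta == "-":
        n_seqs, n_bases = summarise(sys.stdin, args.min_len)
    else:
        try:
            with open(args.fasta) as handle:
                n_seqs, n_bases = summarise(handle, args.min_len)
        except OSError as exc:
            print(f"error: cannot read {args.fasta}: {exc}", file=sys.stderr)
            return 1
    out = sys.stdout if args.output == "-" else open(args.output, "w")
    try:
        out.write("sequences\tbases\n")  # header row for columnar output
        out.write(f"{n_seqs}\t{n_bases}\n")
    finally:
        if out is not sys.stdout:
            out.close()
    return 0  # 0 = success; any non-zero value signals failure
```

In a real script this would end with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard, so that `echo $?` in the shell reflects the outcome and the tool can sit in a pipeline such as `cat genome.fa | seqcount > counts.tsv 2> log.txt`.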

Beth Neuhaus: The CDC Domestic Influenza Surveillance System: From Pipe Dream to High-Performance Reality. #flufighter. Starts with public service announcement about flu season. 100th year of the flu pandemic. Global Influenza Surveillance and Response System (GISRS): Ziegler et al. WHO disease outbreak news entries analysis - relative comparison of diseases that have been reported over the years. Influenza surveillance: domestic disease prevalence and severity indicators; gis.cdc link. Virologic surveillance - weekly report, starting at the beginning of the flu season. How do they go from NGS --> vaccines? Collect circulating viruses from pts who show influenza-like illness, as many samples as possible; in progress this week at CDC: determine whether or not to change the current vaccine; full influenza genome seqs, compare. In 70s, used to do phenotypic analysis -> 90s Sanger-based -> now 10000 seqs per year sequenced, take 2000 to propagate in ferrets and then phenotypic analysis. Jester et al 2018 EID - ~1 million patient reviews -> 85000 -> subset of ~10000 for NGS and further analysis (tier 4 and 5). LABEL: Shepard et al 2014 - LABEL = Lineage Assignment By Extended Learning, used to define the H5, H7 and H9 lineages. IRMA: Shepard 2016 BMC Genomics - IRMA = iterative refinement meta-assembler; flu, ebola, sars; hat tip to lots of open source components; made it in-house because found that many reads got dropped by other progs. Output: visualization of SNVs, phase map. Process: type from IRMA + BLAST -> verify -> LABEL for genetic group designation -> curate consensus seqs. APHL AIMS cloud for data services. Progress: rapid generation of NGS data, now at 25000 total genomes as of Aug 2018; 3 1/4 day TAT in local PHLs is the fastest, usually 5 days TAT including data analysis; less chance of passage-mediated changes due to choice of host. Flu cloud has allowed early identification of anomalies, e.g. viruses related to pediatric deaths, DR strains, coinfections, re-emergence of older strains. "More plumbing": how does all this work at the CDC? Make use of Hadoop; vis with Tableau. H3 HA geospatial occurrence: visualized by relative number of changes from a reference in aa diffs vs time. "65 years of influenza surveillance by a World Health Organization-coordinated global network" - integration of genetic and antigenic characterization data. Visualization: influenza surveillance, genomic epidemiology surveillance, fingerprinting of virus strains, cloud computing. Next steps: additional surveillance and prevention data sources; automated release of NGS reads to NCBI. The Junior Disease Detectives: Operation Outbreak graphic novel. Keller 2018 Sci Rep: direct RNA sequencing of coding complete influenza A genome -

Questions Still collecting minor variant data? Tools to use this? yes, but we haven't tapped into this data yet. Best place to download the flu genomes? consensus level data at NCBI. Some data is coming soon. Still have to work out some issues/policies.

Anders Goncalves: AusTrakka: Enabling Data Sharing for Surveillance --- or Why Your Parents Were Right. Tools and platforms for sharing (or that enable sharing) exist (e.g., IRIDA, INNUENDO, GenomeTrakr); the problem is not technology, but psychological. Sharing at personal cost enables people to do the same later. AusTrakka as a prosocial experiment: labs have the option of participating, no confidential metadata, all labs can see their genomic data in the context of others. Just fastq and minimal metadata. Similar to what GenomeTrakr does, but at a smaller scale; hopefully they will be willing to share with GenomeTrakr later. The technical hurdle is that people complain that uploading data is TOO DIFFICULT. Q: Is AusTrakka an open db or closed? A: Only open to PH contributors.

Brian Ondov: Mash Screen: Fast Sequence Containment Estimation Using MinHash. Based on MinHash algorithm: "on the resemblance and containment of docs"; works for NGS (k-mers) just like web searches. What if you want to do metagenomic data? (Is my genome contained within the metagenome data?) -> doesn't work that well. You can stream read data through all sketches in a database to get containment. Validation: Shakya 2013. Six-frame validation using HMP ORFs - reads (Diamond mapping). Performance: 8 Gb per CPU-hour - multithreaded - 10-15 Gb of RAM. Can be used to detect contamination in sequencing data. 440 CPU days for 10,000,000,000 comparisons (genomes vs. metagenomes [SRA]). Novel virus discovery: screen against RefSeq -> look up nearest neighbor in large genomes/metagenomes table -> find SRA run; tested with polyomavirus and it worked well.
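The "stream read data through all sketches to get containment" idea can be illustrated with a toy version: sketch each reference as its smallest k-mer hashes, stream the read k-mers once, and report the fraction of each sketch that was seen. This is not Mash Screen's implementation; the function names, the hash choice (BLAKE2b), and the tiny parameters are assumptions, and reverse-complement (canonical) k-mers are deliberately ignored to keep it short.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only in this sketch)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=1000):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = sorted({kmer_hash(km) for km in kmers(seq, k)})
    return set(hashes[:size])


def screen(reads, sketches, k=21):
    """Stream reads once; return the containment estimate per reference.

    sketches: dict mapping reference name -> sketch (set of hashes).
    Containment is estimated as the fraction of a reference's sketch
    hashes that were observed anywhere in the reads.
    """
    wanted = set().union(*sketches.values())
    seen = set()
    for read in reads:
        for km in kmers(read, k):
            h = kmer_hash(km)
            if h in wanted:
                seen.add(h)
    return {name: len(s & seen) / len(s)
            for name, s in sketches.items() if s}
```

Because only the reference sketches are kept in memory and the reads are consumed as a stream, the cost scales with the read volume rather than with the number of pairwise comparisons.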

Sam Minot: Gene-level metagenomic analysis identifies microbial genes reproducibly associated with IBD and CRC across independent clinical cohorts. Focusing on discovery aspect of applied microbiology, looking towards applications to therapeutics. Ontologies of microbiome analysis: shotgun -> microbial taxa/metabolic pathways/assembled genomes <-> microbial genes; can be challenging to connect different ontologies to each other. Hypothesis: what if instead, we took as our primary ontology the microbial genes? Data: colorectal cancer datasets: Discovery (Zeller et al 2015) [maybe?], Validation (?). prokka; FAMLI is also part of the analysis/validation pipeline; DIAMOND for taxonomic; functional annotation using eggNOG mapper. Pipeline: de novo assembly based: metaSPAdes & prokka -> MMseqs2 to cluster genes -> quantify with DIAMOND; building de novo gene database and using that for primary analysis. Big problem: dimensionality (too many genes): 10.4M genes x 199 samples. Decided not to use assembly to group together the genes but rather to go down the route of co-abundant genes as unit of analysis. Co-abundant genes (CAGs) - consistently on same chromosome. Hierarchical clustering doesn't work, too many comparisons; big data heuristics -> approximate nearest neighbor algorithm. Result: able to cluster 10.4M genes into 57k CAGs, i.e. from 10.4M genes to a "dealable" set of clusters. 7.6% of CAGs in discovery cohort associated with cancer; validation cohort: 41.9% of CAGs validated. Can look at prevalence of genes among people that either do or do not have cancer. Association of microbial genes with cancer is reproducible across four independent cohorts. Now have a list of genes, and can map back to things like strain collections, and ask which genomes have these genes; can lead towards further investigation into biological mechanism.

Ivan Liachko: From Contigs to Chromosomes: A Hi-C Based Graph Assembly Tool Significantly Improves Metagenome Contiguity and -Cfree Metagenomic Deconvolution. From Phase Genomics. "It's easy to sequence things, it's hard to put them together"; problem exacerbated in shotgun metagenomics. Hi-C captures the 3D structure of chromosomes: take a cell, treat it with crosslinker, crosslinker goes into cells, links regions that are physically close to one another, sequence these junctions (Lieberman-Aiden Science 2009). Any time two sequences are interacting, they must have started out in the same cell; plasmids and phage will crosslink to their hosts. ProxiMeta Hi-C example: unculturable genomes from a bacterial vaginosis sample from Fred Hutch; can pull apart species, can also separate strains. ProxiMeta outputs completeness, novelty, abundance, ... Other examples: fecal samples from Jen Gardy's cat, human fecal samples. Connecting plasmids/phages to host: Marbouty Sci Adv 2017; Press et al bioRxiv 2017. Hundreds of genomes from rumen samples: Stewart et al Nat Comms 2018. From complex wastewater community: Top lab, U Idaho. Metagenome assemblers make many misjoins; assembler creates ambiguous paths; can we use Hi-C to guide the assembly? PGMA: proximity guided metagenome assembler, improves SPAdes assembly. Kits available for virtually any sample type. Q/A: If you don't want to sequence everything and just need to know which MO has which R gene, can enrich for specific targets and look for the linkage.

Karin Lagesen: ORION - One health suRveillance Initiative on harmOnization of data collection and interpretatioN. One Health European Joint Project: foodborne zoonoses, AMR, and emerging threats; 38 vet and PH institutions from 19 countries; 2 "integrative" projects; purpose is to develop joint resources such as infrastructure across 3 targeted areas. Little state of the art established for doing surveillance in a One Health perspective; need to work across sectors; goal is to strengthen interinstitutional collaboration and transdisciplinary knowledge transfer. Consists of work packages: making a One Health codex; surveillance infrastructure such as ontologies; inventories for data, methods, tools, and systems; next phase is to see if they can implement and improve some of the methodologies out there; want to create a knowledge hub for One Health surveillance. Bioinformatics part: use of seq for surveillance is currently a very diverse field: some are very far along and some haven't started; diversity in what is being analyzed and how analyses are being done; want to integrate and harmonize results; need for establishing pathways for doing sequencing for surveillance. Reasons for diversity: funding, legal, regulatory, goals, personnel, etc; makes harmonization challenging but needed. Focus areas: infrastructure (sequences, compute, storage, personnel); pipelines (analysis platforms, workflow systems, software); outcomes (cg/wgMLST schemes, SNP analyses, phylogeny); surveillance, i.e. how to use the results (resolution, clustering methods, outbreak detection). Their goal is to analyze the 4 focus areas and provide solutions within each component, taking into account the starting point of each institution; want to provide updatable "how-to" -> harmonized endpoints.

Questions What is the key to building a good sustainable community? Communication. Easy for people to contribute. Feedback.

Dmitry Antipov: Plasmid Detection and Assembly in Genomic and Metagenomic Datasets. Why look in metagenomes for plasmids? Untapped resource of still unknown plasmids: Jorgensen et al PLoS One 2014 - hundreds of novel plasmids. Plasmid assembly challenges: standard ones (bubbles, chimeric reads, etc); repeats within plasmids; repeats shared by different plasmids; repeats shared by plasmid and chromosome. Metagenomes: plasmid assembly challenges amplified. Known approaches: regular metagenomic assembler; plasmidome sequencing; Recycler (Rozov et al Bioinformatics 2017) - based on SPAdes, high false positive rate; cultivation + plasmidSPAdes (Antipov et al Bioinformatics 2016). plasmidSPAdes idea: since many plasmids have copy numbers exceeding 1, they often form cycles in the assembly; chromosomes and plasmids may differ in copy number; chromosomal contig removal improves plasmid assembly; basically taps into the difference between chromosomal and plasmid coverage. plasmidSPAdes on metagenomes: but chromosomal copy number can vary between chromosomes, so need variable coverage cutoffs. metaplasmidSPAdes approach: went over this quickly, can hopefully find details elsewhere! metaplasmidSPAdes results: [didn't catch... get impression it seems to work well/lead to discovery of novel plasmids]. Take home: some plasmids in genomic and metagenomic data can be extracted using coverage info; plasmidic sequences can be verified with a gene-centric approach; known plasmid variety can be dramatically increased after reassembling already sequenced genomic and metagenomic datasets. plasmidSPAdes already incorporated in SPAdes package; metaplasmidSPAdes: wait for next SPAdes release.
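The coverage idea for a single isolate (chromosomal contigs share one coverage level, multi-copy plasmids sit well above it) can be sketched as below. This is loosely inspired by the notes and is not the plasmidSPAdes algorithm, which works on the assembly graph; the function names, the tuple layout, and the 30% deviation threshold are assumptions for illustration only.

```python
def weighted_median_coverage(contigs):
    """Length-weighted median coverage, used as the chromosomal estimate.

    contigs: list of (name, length, coverage) tuples.
    """
    if not contigs:
        raise ValueError("no contigs given")
    ordered = sorted(contigs, key=lambda c: c[2])  # sort by coverage
    half = sum(length for _, length, _ in ordered) / 2
    running = 0
    for _, length, coverage in ordered:
        running += length
        if running >= half:
            return coverage


def putative_plasmid_contigs(contigs, max_dev=0.3):
    """Names of contigs whose coverage deviates from the chromosomal
    estimate by more than max_dev (a fraction; assumed threshold)."""
    chrom_cov = weighted_median_coverage(contigs)
    return [name for name, _, coverage in contigs
            if abs(coverage - chrom_cov) > max_dev * chrom_cov]
```

A single-copy plasmid has the same coverage as the chromosome and is missed by this filter, which fits the take-home that only some plasmids can be extracted from coverage alone. In a metagenome there is no single chromosomal coverage level, which is why the talk called for variable cutoffs.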

Questions: Could this be applied to viruses, phage, etc? We are interested in viruses in metagenomes; some problems with quasispecies in metagenomes. Will this replace the need for Circlator (will it circularize)? Unsure. Any plans to make the most of the false positives, e.g. optimize for pulling out genomic islands? Unsure.

Daniel Baker: Dash: Efficient Genomic Set Operations Using HyperLogLogs. Dash uses approximate counting and HyperLogLogs. Approximate counting: trades an exact solution for log(log(n)) space. HyperLogLog: partitions the approximate counter into subsets and uses the harmonic mean to reduce variance; same asymptotic performance as MinHash; instead of an exact estimate over a subset of your k-mers you do an inexact estimate over all of your k-mers. Accuracy (E. coli genomes): Dash is generally less biased. How fast? Mash: sketching dominates cost; Dash: comparison dominates cost; does way more genome comparisons per CPU second than Mash. In general: more accurate and faster than Mash, and scales well; as processors hit the limit of how fast they will go, we need to move towards using more processors. Works on uncompressed and standard compressed files. "New cardinality estimation algorithms for HyperLogLog sketches" - in general: anything you can do with sets of k-mers and comparisons with genomes, you can use approximate counting and HyperLogLog to do. code:
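For readers unfamiliar with HyperLogLog, the toy implementation below shows the pieces the notes mention: items are partitioned into registers, each register keeps a small approximate counter, and the harmonic mean combines them. It is a textbook sketch and not the Dash code; the class and function names, the BLAKE2b hash, and the default of 2^12 registers are assumptions, and it uses the classic estimator with linear-counting correction rather than the improved estimators cited in the talk.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog over string items (e.g. k-mers)."""

    def __init__(self, p=12):
        self.p = p                      # p >= 7 for the alpha formula below
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        x = int.from_bytes(digest, "big")
        index = x >> (64 - self.p)              # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        # Harmonic mean of 2**register across all registers.
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate

    def union(self, other):
        """Sketch of the set union: element-wise maximum of registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged


def jaccard(a, b):
    """Estimate |A n B| / |A u B| by inclusion-exclusion on the sketches."""
    union = a.union(b).cardinality()
    if union == 0:
        return 0.0
    intersection = a.cardinality() + b.cardinality() - union
    return max(intersection, 0.0) / union
```

With 4096 registers the relative error of a single cardinality estimate is roughly 1.04 / sqrt(4096), about 1.6%; set operations such as the Jaccard estimate inherit and combine those errors.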