Using individual scale de novo assembly to identify mutations causing a phenotypic trait
Scott Geib, USDA-ARS Daniel K. Inouye Pacific Basin Agricultural Research Center, Hilo, HI
Ceratitis capitata: the Mediterranean fruit fly
• Not Drosophila
• Pest infesting over 300 different hosts
• Threat to the U.S., not officially established in CA
• But infestations are found periodically
• California produces $18 BILLION of agriculture and represents almost half of total agriculture in the US
• Cost $50-200k in quarantine efforts
• Sterile insect male release is part of the eradication program

SIT strain genetics
• SIT (Sterile Insect Technique) strain
• Reared by the billions (in Hawaii and Guatemala)
• Is a genetic sexing strain
• Carries two sex-linked traits
  • Temperature sensitive lethal (tsl), females die at increased temperature
  • White pupae (wp), female pupae are white
• Females are killed in the egg stage, which halves the cost of rearing
• Desire to replicate this in other species

[Figure: chromosome diagrams. SIT strain: females X X, with both copies of chromosome 5 carrying tsl wp; males X plus Y/5 and 5/Y translocation chromosomes carrying tsl+ wp+, and one chromosome 5 carrying tsl wp. Wildtype lab line: females X X and males X Y, with both copies of chromosome 5 carrying tsl+ wp+.]

The making of a sterile male
Eggs incubated in water bath at 32 degrees Celsius
• Colony produces eggs
• Females are selected out
• Eggs hatch and larvae feed
• Larvae pupate, are dyed and irradiated
• Flies are shipped, eclose and are distributed by plane

Los Angeles Basin Release Regions

Detections over time
• Despite efforts to keep flies out, they are consistently detected each year
[Figure: 2014 Detections]

Goals
• Identify causative mutations for white pupae and temperature sensitive lethal traits in medfly

Approach
1. Genetic cross with inbred lab line, isolate trait in wild-type background
2. Construct linkage map and perform QTL analysis to identify region of genome associated with mutation
3. Through whole genome sequencing of individuals, characterize potential causative mutations within the highly associated regions
4. Confirm mutation and re-create phenotype through generation of CRISPR mutants

1. Isolate trait using a genetic cross

2. Generate genome wide variant data and high resolution linkage map
A genome assembly exists as a result of the i5k (5000 insect genomes) project (ALLPATHS-LG; BCM)
A genotype by sequencing (GBS) approach
Elshire RJ, Glaubitz JC, Sun Q, et al. A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. Orban L, ed. PLoS ONE. 2011;6:e19379. doi:10.1371/journal.pone.0019379.
• F4 white and brown pupa individuals targeted for sequencing (in addition to P-gen and parents of F4)
• 285 samples total
• Sequenced 1 x 75 bp HiSeq, ~1 million reads/sample
• Map to reference, identify SNPs (~80,000 identified)
• Use SNPs across mapping population to calculate linkage map

• We placed 80% of the genome into linkage groups
• Generated 6 major linkage groups (matching expected autosome and sex chromosome number)
• 80% of the genome “super scaffolded” into 12 pieces

QTL LOD score
• By assessing the phenotype of each individual (white or brown pupae) we can calculate a score based on linkage disequilibrium
• Scaffolds associated with chromosome 5 had the highest scoring loci
• This is a draft assembly, so there are many scaffolds on this chromosome
• Peak at Scaffold43, position 1,353,742

3. Identification of putative causative mutation
• Linkage map and QTL identify the relative location in the genome associated with the white pupae mutation.
• Putative region smaller than 1 Mb on a single scaffold of the draft genome assembly (Scaffold43:1353742)
• Most GBS loci are in non-coding regions
• Utilize whole genome re-sequencing of individuals and interrogation of this region of the genome to identify the putative mutation.
• Compare variant finding approaches as a user
  • Mapping based approach (GATK)
  • Assembly graph approach (DISCOVAR/DISCOVAR denovo)

Generating data for DISCOVAR/GATK
• Individual flies were subjected to PCR-free library prep methods (~500 ng DNA / fly or less)
• Size selection to target a 450 bp fragment
• For each sample, 2 x 250 bp sequencing on HiSeq2500 Rapid Run, approximately 70M read clusters per library
• Six F4 flies (3 white and 3 brown) sequenced on a single Rapid Run to ~60X coverage

DISCOVAR de novo as an individual scale assembler
• Contig assembly only, no scaffolding
• 6 individual assemblies
• Medfly genome size ~450 Mb

Sample                                            W3M          W5M          W7M          B44F         B47F         B56F
# read pairs (estimated coverage)                 79.6 (42x)   90 (47x)     141.7 (74x)  88.5 (46x)   69.8 (36x)   69.1 (36x)
Frag insert size (determined from sequence data)  480          485          480          500          470          480
Scaffold N50                                      28826        40279        18817        37117        29943        6454
Mean base quality                                 35.3         35.3         35.3         35           35.3         35.3
Starting DNA amount (ng)                          178          214          94           274          286          304
Starting DNA peak size                            5.3          13.5         8            3.8          6.3          8.9

• Contig N50 between 6.4 kb and over 40 kb
• While smaller than suggested by DISCOVAR, 40 kb is much larger than contigs derived from “standard ALLPATHS” assembly
• Able to anchor the assemblies to the reference draft genome
• Genotype by comparing graph structure between assemblies.

Possibility of post-assembly scaffolding
• Another goal of this project was to test DISCOVAR for de novo assembly of other insect species (lower cost/better contig quality than ALLPATHS??)
• Utilizing existing jumping libraries, we can scaffold to similar size as ALLPATHS (SSPACE)
• Looking at utility of Hi-C datasets to superscaffold
• DISCOVAR + Hi-C = Chromosome scale assembly (????)
  • Maybe need some pre-scaffolding?
  • Combine with linkage data?
  • Received Hi-C data this week ...

DISCOVAR as a variant caller
• Currently it is difficult to pull out variants across the entire genome from a DISCOVAR de novo analysis (in the works)
• Today, focused specifically within Scaffold43, surrounding the linked loci (from QTL analysis)
• Example graph structure of 6 genomes together across a small region:
  • White pupae: homozygous alternative
  • Brown: homozygous for reference
  • *** one brown was heterozygous (.44/.50)

Comparing results of these 6 genomes to the reference and to each other at genome-wide scale
• Using GATK and DISCOVAR
• Look at the genotype data
• Verifying linkage map
• Identify possible assembly issues from reference assembly

[Figure: Comparison of QTL score to variants discriminating phenotypes from WGS (100 kb window). Plotted value: # of SNPs with homozygous calls in all white pupae F4s and an opposing call in the brown pupae samples (homo or het).]
[Figure: A chromosome 5 (linked) scaffold (white males only)]
[Figure: An unlinked scaffold, potential scaffolding error]

Back to variant calling at a more detailed scale
• Using GATK and DISCOVAR, called variants across Scaffold43 (~3.3 Mb)
• Variant impact analyzed using SnpEff and the current NCBI RefSeq annotation set
• Overall, some consistency between variants called:
  • DISCOVAR called 87,542 variants (76k SNPs / 11k INDELs)
  • GATK called 106,873 variants (85k SNPs / 22k INDELs)
  • 61,105 identical between methods
  • DISCOVAR found several very large insertions (100s of bp) not identified by GATK; all were in non-coding regions.

Generating “short list” of putative mutations
• Making some assumptions (disclaimer)
  • Mutation is in a coding region
  • Not accounting for non-coding mutations that may impact gene expression or regulation
  • Very little regulatory info available for this non-model genome
• Overall, found five major mutations that:
  • Are consistent between phenotypes
  • Cause major impact
    • Non-synonymous mutations
    • Frameshift mutations and/or premature stop codons

Scaffold      Position   Ref   Alt
Scaffold 43   800831     C     T
Scaffold 43   837972     C     C
Scaffold 43   1576424    G     A
Scaffold 43   2259830    AAC   A
Scaffold 43   2262779    C     A
Scaffold 43   2410888    A and the longer allele below (Ref/Alt order not recoverable from the slide)
              ACAACAGGCATGCCAGCAAGTTGT GGCCGTCTTCCAACAACATGCTGCT ACAACTACAACAGCCAAATGACGAG CCCGCCGTTGCAGCCTCAGCACCAG CCAAGGCTACATATGCAACACTGCG ACATGCGATGGTTGTAGAGGCGCA AGC

4. Demonstrate mutations
• Using the union of linkage mapping, whole genome variant calling, structural impact, and RNAseq, identified a prioritized “short list” of mutations causing white pupae and temperature sensitive lethal
• Verify mutations through Sanger sequencing and SNP assays
• Re-create phenotype through CRISPR-Cas targeted editing, currently in progress ...

Conclusions
• A combination of approaches allows identification of putative mutations causing phenotypic traits
• Graph-based variant calling seems to be of similar quality to mapping based approaches in a non-model system (difficult to validate without a gold set of variants)
• Potential advantage of getting a de novo assembly: identify novel structure of a specific genome and not carry over errors of the reference assembly
• Cost is higher, but comparable to 30x coverage of standard HiSeq data.

Acknowledgements
Pacific Basin Agricultural Research Center, Geib Lab: Sheina Sim, Bernarda Calla, Steve Tam, Brian Hall, Teddy DeRego
APHIS-PPQ: Norman Barr, Raul Ruiz
DISCOVAR Group: David Jaffe
Funding source/Resources: USDA Farm Bill; USDA ARS Moana HPC Cluster; NSF XSEDE Consortium

Questions?
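
Supplementary sketch: the genome-wide comparison in section 3 counts, per 100 kb window, the SNPs that are homozygous in all white pupae F4s and have an opposing call (homozygous or heterozygous) in the brown pupae samples. The Python below is a minimal illustration of that filter, not the pipeline used in the talk. The sample names come from the sample table; the genotype encoding (allele-index pairs), the function names, the scaffold name `toy_scaffold`, and all positions and calls in `toy_variants` are made up for illustration.

```python
from collections import Counter

# Sample names taken from the sample table; everything else is hypothetical.
WHITE = ("W3M", "W5M", "W7M")
BROWN = ("B44F", "B47F", "B56F")


def is_hom(gt):
    """True for a fully called homozygous diploid genotype, e.g. (1, 1)."""
    return None not in gt and gt[0] == gt[1]


def discriminates(genotypes):
    """Return True if a site separates the two phenotype groups.

    genotypes maps sample name -> (allele, allele); None marks a missing allele.
    Rule: all white samples share one homozygous genotype, and every brown
    sample is called and differs from it (homozygous or heterozygous).
    """
    white = [genotypes[s] for s in WHITE]
    if not all(is_hom(gt) for gt in white) or len(set(white)) != 1:
        return False
    white_gt = white[0]
    for s in BROWN:
        gt = genotypes[s]
        if None in gt or tuple(sorted(gt)) == white_gt:
            return False
    return True


def window_counts(variants, window=100_000):
    """Count discriminating sites per (scaffold, window index)."""
    counts = Counter()
    for scaffold, pos, genotypes in variants:
        if discriminates(genotypes):
            counts[(scaffold, (pos - 1) // window)] += 1
    return counts


def calls(white, brown):
    """Helper for the toy data: build a genotype dict from two lists of calls."""
    return {**dict(zip(WHITE, white)), **dict(zip(BROWN, brown))}


# Made-up sites on a made-up scaffold, only to exercise the filter.
toy_variants = [
    ("toy_scaffold", 1_000, calls([(1, 1)] * 3, [(0, 0)] * 3)),            # kept
    ("toy_scaffold", 2_000, calls([(1, 1)] * 3, [(0, 1), (0, 0), (0, 0)])),  # kept: het brown allowed
    ("toy_scaffold", 150_000, calls([(1, 1), (0, 1), (1, 1)], [(0, 0)] * 3)),  # dropped: het white
    ("toy_scaffold", 160_000, calls([(1, 1)] * 3, [(1, 1), (0, 0), (0, 0)])),  # dropped: brown matches white
]

if __name__ == "__main__":
    for key, n in sorted(window_counts(toy_variants).items()):
        print(key, n)
```

In practice the genotype dictionaries would be filled from the GATK or DISCOVAR call sets; the parsing step is left out here because it depends on the caller's output format.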