Pathway and Network Analysis Workshop
December 8, 2011, Montreal
Quaid Morris, Scooter Morris, Piet Molenaar, Gary Bader, Tero Aittokallio, Boris Steipe
Canadian Bioinformatics Workshops

Workshop outline
• Introduction to gene lists, gene attributes, and interaction networks
• Pathway enrichment analysis
  – Theory: Fisher’s exact test, background and multiple test correction
  – Practical: DAVID
• Gene recommender systems
  – Function prediction and gene-centered network browsing using GeneMANIA

Introduction to Gene Lists, Attributes and Networks

Outline
• Gene lists and gene attributes
  – Where they come from
  – What they mean
• Networks
  – What they are
  – Analysis
  – Use in biology
  – Biological questions/applications

Interpreting Gene Lists
• My cool new screen worked and produced 1000 hits! …Now what?
• Genome-Scale Analysis (Omics)
  – Genomics, Proteomics
• Tell me what’s interesting about these genes
  – Are they enriched in known pathways, complexes, functions?
[Figure: ranking or clustering, analysis tools and prior knowledge about cellular processes lead to “Eureka! New heart disease gene!” Image source: GenMAPP.org]

Where Do Gene Lists Come From?
• Molecular profiling e.g. mRNA, protein
  – Identification: gene list
  – Quantification: gene list + values
  – Ranking, Clustering (biostatistics)
• Interactions: Protein interactions, microRNA targets, transcription factor binding sites (ChIP)
• Genetic screen e.g. of knock out library
• Association studies (Genome-wide)
  – Single nucleotide polymorphisms (SNPs)
  – Copy number variants (CNVs)

What Do Gene Lists Mean?
• Biological system: complex, pathway, physical interactors
• Similar gene function e.g. protein kinase
• Similar cell or tissue location
• Chromosomal location (linkage, CNVs)

Gene Attributes
Available in databases:
• Function annotation
  – Biological process, molecular function, cell location
• Chromosome position
• Disease association
• DNA properties
  – TF binding sites, gene structure (intron/exon), SNPs
• Transcript properties
  – Splicing, 3’ UTR, microRNA binding sites
• Protein properties
  – Domains, secondary and tertiary structure, PTM sites
• Interactions with other genes

What is the Gene Ontology (GO)?
• Set of biological phrases (terms) which are applied to genes:
  – protein kinase
  – apoptosis
  – membrane
• Dictionary: term definitions
• Ontology: A formal system for describing knowledge
Jane Lomax @ EBI, www.geneontology.org

GO Structure
• Terms are related within a hierarchy
  – is-a
  – part-of
• Describes multiple levels of detail of gene function
• Terms can have more than one parent or child

What GO Covers?
• GO terms divided into three aspects:
  – cellular component
  – molecular function
  – biological process (important pathway source)
• Example terms: glucose-6-phosphate isomerase activity; cell division

Terms
• Where do GO terms come from?
  – GO terms are added by editors at EBI and gene annotation database groups
  – Terms added by request
  – Experts help with major development
  – 32029 terms, >99% with definitions (as of July 15, 2010):
    • 19639 biological_process
    • 2859 cellular_component
    • 9531 molecular_function

Annotations
• Genes are linked, or associated, with GO terms by trained curators at genome databases
  – Known as ‘gene associations’ or GO annotations
  – Multiple annotations per gene
• Some GO annotations created automatically (without human review)

Annotation Sources
• Manual annotation
  – Curated by scientists
    • High quality
    • Small number (time-consuming to create)
  – Reviewed computational analysis
• Electronic annotation
  – Annotation derived without human validation
    • Computational predictions (accuracy varies)
    • Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin

For your information: Evidence Types
• Experimental Evidence Codes
  – EXP: Inferred from Experiment
  – IDA: Inferred from Direct Assay
  – IPI: Inferred from Physical Interaction
  – IMP: Inferred from Mutant Phenotype
  – IGI: Inferred from Genetic Interaction
  – IEP: Inferred from Expression Pattern
• Computational Analysis Evidence Codes
  – ISS: Inferred from Sequence or Structural Similarity
  – ISO: Inferred from Sequence Orthology
  – ISA: Inferred from Sequence Alignment
  – ISM: Inferred from Sequence Model
  – IGC: Inferred from Genomic Context
  – RCA: Inferred from Reviewed Computational Analysis
• Author Statement Evidence Codes
  – TAS: Traceable Author Statement
  – NAS: Non-traceable Author Statement
• Curator Statement Evidence Codes
  – IC: Inferred by Curator
  – ND: No biological Data available
• IEA: Inferred from Electronic Annotation
See http://www.geneontology.org

Wide & Variable Species Coverage
Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.
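
The "be aware of annotation origin" point on the evidence-code slides above can be made concrete with a short Python sketch. It groups the evidence codes as listed on the slide and drops electronically inferred (IEA) annotations from a gene list. The function names and the example annotation triples are hypothetical and chosen only for illustration; real GO annotation files carry more fields.

```python
# Evidence codes grouped as on the slide above.
# IEA is the code assigned without human review.
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author_statement": {"TAS", "NAS"},
    "curator_statement": {"IC", "ND"},
    "electronic": {"IEA"},
}


def evidence_group(code):
    """Return the group name for a GO evidence code, or 'unknown'."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return "unknown"


def drop_electronic(annotations):
    """Keep only (gene, term, code) triples whose code is not electronic."""
    return [a for a in annotations if evidence_group(a[2]) != "electronic"]


# Hypothetical (gene, GO term, evidence code) triples, for illustration only.
example_annotations = [
    ("GENE_A", "GO:0000724", "IDA"),
    ("GENE_A", "GO:0000724", "IEA"),
    ("GENE_B", "GO:0000724", "ISS"),
]
reviewed = drop_electronic(example_annotations)
print(reviewed)
```

Whether to exclude IEA annotations is a judgment call: it raises annotation quality but can sharply reduce coverage for less-studied genomes.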

Accessing GO: QuickGO
http://www.ebi.ac.uk/ego/
See also AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

Sources of Gene Attributes
• Ensembl BioMart (eukaryotes)
  – http://www.ensembl.org
• Entrez Gene (general)
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
  – E.g. SGD: http://www.yeastgenome.org/
• Many others: discuss during lab

Ensembl BioMart
• Convenient access to gene list annotation
• Steps: select genome, select filters, select attributes to download
• BioMart 0.7 (use this one)
BioMart demo: http://www.biomart.org

What Have We Learned?
• Many gene attributes in databases
  – Gene Ontology (GO) provides gene function annotation
• GO is a classification system and dictionary for biological concepts
• Annotations are contributed by many groups
• More than one annotation term allowed per gene
• Some genomes are annotated more than others
• Annotation comes from manual and electronic sources
• GO can be simplified for certain uses (GO Slim)
• Many gene attributes available from Ensembl and Entrez Gene

Gene Lists Overview
• Interpreting gene lists
• Gene function attributes
  – Gene Ontology
    • Ontology Structure
    • Annotation
  – BioMart + other sources
• Gene identifiers and mapping

Gene and Protein Identifiers
• Identifiers (IDs) are ideally unique, stable names or numbers that help track database records
  – E.g. Social Insurance Number, Entrez Gene ID 41232
• Gene and protein information stored in many databases
  – Genes have many IDs
• Records for: Gene, DNA, RNA, Protein
  – Important to recognize the correct record type
  – E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins e.g. in RefSeq, which stores sequence.

For your information: Common Identifiers
(On the original slide, recommended identifiers are shown in red.)
• Gene
  – Ensembl: ENSG00000139618
  – Entrez Gene: 675
  – Unigene: Hs.34012
• RNA transcript
  – GenBank: BC026160.1
  – RefSeq: NM_000059
  – Ensembl: ENST00000380152
• Protein
  – Ensembl: ENSP00000369497
  – RefSeq: NP_000050.2
  – UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
  – IPI: IPI00412408.1
  – EMBL: AF309413
  – PDB: 1MIU
• Species-specific
  – HUGO HGNC: BRCA2
  – MGI: MGI:109337
  – RGD: 2219
  – ZFIN: ZDB-GENE-060510-3
  – FlyBase: CG9097
  – WormBase: WBGene00002299 or ZK1067.1
  – SGD: S000002187 or YDL029W
• Annotations
  – InterPro: IPR015252
  – OMIM: 600185
  – Pfam: PF09104
  – Gene Ontology: GO:0000724
  – SNPs: rs28897757
• Experimental Platform
  – Affymetrix: 208368_3p_s_at
  – Agilent: A_23_P99452
  – CodeLink: GE60169
  – Illumina: GI_4502450-S

Identifier Mapping
• So many IDs!
  – Mapping (conversion) is a headache
• Four main uses
  – Searching for a favorite gene name
  – Link to related resources
  – Identifier translation
    • E.g. Genes to proteins, Entrez Gene to Affy
  – Unification during dataset merging
    • Equivalent records

ID Mapping Services
• Synergizer
  – http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
  – http://www.ensembl.org
• PICR (proteins only)
  – http://www.ebi.ac.uk/Tools/picr/
Synergizer demo (http://llama.med.harvard.edu/synergizer/translate/); also see BioMart

ID Mapping Challenges
• Avoid errors: map IDs correctly
• Gene name ambiguity: not a good ID
  – e.g. FLJ92943, LFS1, TRP53, p53
  – Better to use the standard gene symbol: TP53
• Excel error-introduction
  – OCT4 is changed to October-4
• Problems reaching 100% coverage
  – E.g. due to version issues
  – Use multiple sources to increase coverage
Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004 Jun 23;5:80

Recommendations
• For proteins and genes
  – (doesn’t consider splice forms)
• Map everything to Entrez Gene IDs using a spreadsheet
• If 100% coverage desired, manually curate missing mappings
• Be careful of Excel auto conversions, especially when pasting large gene lists!
  – Format cells as ‘text’

What Have We Learned?
• Genes and their products and attributes have many identifiers (IDs)
• Genomics requirement to convert or map IDs from one type to another
• ID mapping services are available
• Use standard, commonly used IDs to reduce ID mapping challenges

Networks
• Represent relationships
  – Physical, regulatory, genetic, functional interactions
• Useful for discovering relationships in large data sets
  – Better than tables in Excel
• Visualize multiple data types together
  – See interesting patterns

Mapping Biology to a Network
• A simple mapping
  – one compound/node, one interaction/edge
• A more realistic mapping
  – Cell localization, cell cycle, cell type, taxonomy
  – Only represent physiologically relevant interaction networks
• Edges can represent other relationships
• Critical: understand what nodes and edges mean

Protein Sequence Similarity Network
http://apropos.icmb.utexas.edu/lgl/

Six Degrees of Separation
• Many people in N America are connected by at most six links
• Which path should we take?
• Shortest path by breadth first search
  – If two nodes are connected, will find the shortest path between them
• Are two proteins connected? If so, how?
• Biologically relevant?
http://www.time.com/time/techtime/200406/community.html

Biological Questions
• Step 1: What do you want to accomplish with your list or network (hopefully part of experiment design!)
  – Summarize biological processes or other aspects of gene function
  – Find a controller for a process (TF, miRNA)
  – Find new pathways or new pathway members
  – Discover new gene function
  – Correlate with a disease or phenotype (candidate gene prioritization)
  – Perform differential analysis: what’s different between samples?
Other Questions?

Applications of Network Biology
• Gene Function Prediction: shows connections to sets of genes/proteins involved in same biological process
• Detection of protein complexes/other modular structures: discover modularity & higher order organization (motifs, feedback loops)
• Network evolution: biological process(es) conservation across species
• Prediction of new interactions and functional associations: statistically significant domain-domain correlations in protein interaction network to predict protein-protein or genetic interaction
Tools shown: jActiveModules (UCSD), PathBlast (UCSD), MCODE (University of Toronto), DomainGraph (Max Planck Institute)
humangenetics-amc.nl

Applications of Network Informatics in Disease
• Identification of disease subnetworks: identification of disease network subnetworks that are transcriptionally active in disease.
• Subnetwork-based diagnosis: source of biomarkers for disease classification, identify interconnected genes whose aggregate expression levels are predictive of disease state
• Subnetwork-based gene association: map common pathway mechanisms affected by collection of genotypes
Tools shown: Agilent Literature Search, PinnacleZ (UCSD), Mondrian (MSKCC)
humangenetics-amc.nl
June 2009

http://cytoscape.org
Network visualization and analysis
• Pathway comparison
• Literature mining
• Gene Ontology analysis
• Active modules
• Complex detection
• Network motif search
UCSD, ISB, Agilent, MSKCC, Pasteur, UCSF, Unilever, UToronto, U Texas

Network Analysis using Cytoscape
Find biological processes underlying a phenotype
[Figure: databases, literature, expert knowledge and experimental data provide network information for network analysis]

Manipulate Networks
• Automatic Layout
• Filter/Query
• Interaction Database Search

Active Community
• Help: http://www.cytoscape.org
  – 8 tutorials, >10 case studies
  – Mailing lists for discussion
  – Documentation, data sets
• Annual Conferences
• 10,000s users, 2500 downloads/month
• >80 Plugins Extend Functionality
  – Build your own, requires programming
Cline MS et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc. 2007;2(10):2366-82

Where to start?
Cytoscape tutorials: http://opentutorials.cgl.ucsf.edu/index.php/Portal:Cytoscape

Pathway enrichment analysis

Enrichment Analysis Intro

Outline
• What is Gene Set Enrichment Analysis?
• Theory: Fisher’s exact test and multiple test correction
• DAVID enrichment analysis tool

What is Gene Set Enrichment Analysis?
• Break down cellular function into gene sets
  – Every set of genes is associated to a specific cellular function, process, component or pathway
• Find known gene sets (e.g. pathways) enriched in a gene list (e.g. from gene expression)
  – Look for significant enrichment (more on how this works later)
Example gene sets (Daniele Merico):
• Nuclear Pore: Gene.AAA, Gene.ABA, Gene.ABC
• Cell Cycle: Gene.CC1, Gene.CC2, Gene.CC3, Gene.CC4, Gene.CC5
• Ribosome: Gene.RP1, Gene.RP2, Gene.RP3, Gene.RP4
• P53 signaling: Gene.CC1, Gene.CK1, Gene.PPP

Enrichment Test
[Figure: experimental data (microarray experiment, gene expression table) and gene-set databases (a priori knowledge + existing experimental data) feed the enrichment test, which produces an enrichment table (e.g. Spindle 0.00001, Apoptosis 0.00025) used for interpretation and hypotheses]

DAVID demo
http://david.abcc.ncifcrf.gov/
http://david.abcc.ncifcrf.gov/tools.jsp

Step 1: Define your gene list
• Either
  – (a) Copy and paste your list in
  – (b) Upload a gene list file
  – (c) Choose an example gene list (so, click “demolist1” on next slide)
• If you choose the wrong identifier, you may need to use the conversion tool
• Click to define the type of list (now we are doing a gene list, next slide we will define the background)

Step 2: Choose background
You can either upload a background list (in the upload tab, see previous step) or choose one of the background sets. The example list is from the Human U95A array, so select that background.

Step 3: Check list & background

Step 4: Run Enrichment Analysis
Click where indicated on the slide.

Step 5: Select categories of gene lists to test
• Click to expand a category
• Red indicates default selections; click check marks if you want to change them
• Click once you’ve selected your sets

Step 6: View enrichment
• Columns shown: gene set name; # of genes in set with annotation; EASE P-value
• 4.0E-4 means 4.0 x 10^-4

Step 7: Change parameters if desired
Set “count” to 1 and “EASE” to 1 if you want maximum # of categories. Beware, only corrects within category.

Step 8: Download results
Download spreadsheet (in tab-delimited text format). Beware: if you click you may get text in a browser window; if this happens, right-click to save as a file.
• Step 8a: What can happen if no right-click
• Step 8b: How it looks in a spreadsheet

Outline of theory component
• Fisher’s Exact Test for calculating enrichment P-values (also used for calculating EASE score)
• Multiple test corrections:
  – Bonferroni
  – Benjamini-Hochberg FDR
• Other enrichment tests widely used but not covered here:
  – GSEA for ranked lists
  – See: http://www.broadinstitute.org/gsea/index.jsp

Fisher’s exact test, a.k.a. the hypergeometric test
• Gene list: RRP6, MRD1, RRP7, RRP43, RRP42
• Background population: 500 black genes, 4500 red genes
• Null hypothesis: List is a random sample from population
• Alternative hypothesis: More black genes than expected
• P-value from the null distribution: 4.6 x 10^-4

2x2 contingency table for Fisher’s Exact Test
                   In gene list   Not in gene list
In gene set             4               496
Not in gene set         1              4499
e.g.: http://www.graphpad.com/quickcalcs/contingency1.cfm

Important details
• To test for under-enrichment of “black”, test for over-enrichment of “red”.
• The EASE score used by DAVID subtracts one from the observed overlap between gene list and gene set, to ensure that more than one gene from the list is in the gene set.
• Need to choose “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), only use that population as background.
• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. (More on this later.)

Multiple test corrections

How to win the P-value lottery, part 1
• Random draws … 7,834 draws later …
• Expect a random draw with observed enrichment once every 1 / P-value draws
• Background population: 500 black genes, 4500 red genes

How to win the P-value lottery, part 2
• Keep the gene list the same, evaluate different annotations
• Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42
• Different annotations: RRP6, MRD1, RRP7, RRP43, RRP42

Simple P-value correction: Bonferroni
• If M = # of annotations tested: Corrected P-value = M x original P-value
• Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws.
• The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”.

Bonferroni correction caveats
• Bonferroni correction is very stringent and can “wash away” real enrichments.
• Often one is willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.

False discovery rate (FDR)
• FDR is the expected proportion of the observed enrichments due to random chance.
• Compare to Bonferroni correction, which is a bound on the probability that any one of the observed enrichments could be due to random chance.
• Typically FDR corrections are calculated using the Benjamini-Hochberg procedure.
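
The Fisher’s exact test, Bonferroni and Benjamini-Hochberg slides can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code DAVID runs: the function names are hypothetical, the test is the one-sided (over-enrichment) hypergeometric tail, and the EASE modification is not applied. The first call uses the numbers from the slides (4 of 5 listed genes in a set of 500, background of 5000); the list of four P-values is invented for illustration.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, set_size, population_size):
    """One-sided Fisher's exact (hypergeometric) P-value: probability that a
    random list of `list_size` genes contains at least `list_hits` set members."""
    total = comb(population_size, list_size)
    max_hits = min(list_size, set_size)
    tail = sum(
        comb(set_size, k) * comb(population_size - set_size, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


def bonferroni(p_values):
    """Corrected P-value = M x original P-value, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Q-values: adjusted P = P x M / rank (rank 1 = smallest P-value), then
    the minimum adjusted P-value over that rank and all larger ranks."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # increasing P-value
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Slide example: 4 of the 5 listed genes are "black"; 500 black genes in 5000.
p = enrichment_p_value(list_hits=4, list_size=5, set_size=500, population_size=5000)
print(f"{p:.1e}")

# Invented P-values, for illustration only.
example = [0.04, 0.005, 0.03, 0.01]
print(bonferroni(example))
print(benjamini_hochberg(example))
```

The first value agrees with the 4.6 x 10^-4 quoted on the slide. Both corrections return values in the input order, so they can be attached directly to an enrichment table.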
The smallest FDR at which a given test would be called significant is often called its “q-value”.

For your information: Benjamini-Hochberg example
1. Sort P-values of all tests in increasing order (rank 1 = smallest P-value).
2. Adjusted P-value = P-value x [# of tests] / Rank
3. Q-value = minimum adjusted P-value at the given rank or at any rank further down the table.
4. The P-value threshold is the highest-ranking P-value for which the corresponding Q-value is below the desired significance threshold (here, for FDR < 0.05, P = 0.04 at rank 50).

Rank  Category                      P-value  Adjusted P-value       FDR / Q-value  FDR < 0.05?
1     Transcriptional regulation    0.001    0.001 x 53/1 = 0.053   0.042          Y
2     Transcription factor          0.01     0.01 x 53/2 = 0.27     0.042          Y
3     Initiation of transcription   0.02     0.02 x 53/3 = 0.35     0.042          Y
…     …                             …        …                      …              …
50    Nuclear localization          0.04     0.04 x 53/50 = 0.042   0.042          Y
51    RNAi activity                 0.05     0.05 x 53/51 = 0.052   0.052          N
52    Cytoplasmic localization      0.06     0.06 x 53/52 = 0.061   0.061          N
53    Translation                   0.07     0.07 x 53/53 = 0.07    0.07           N

Reducing multiple test correction stringency
• The correction to the P-value threshold depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be
• Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.

Multiple test correction in DAVID
• In DAVID, the “Benjamini-Hochberg” column corresponds to the false discovery rate as it is typically defined. It is unclear what DAVID’s separate “FDR” column means.
• DAVID does multiple test correction separately within each category of gene sets, so adding more categories does not change the FDRs or P-values. Be careful how you report these numbers.

Summary
• Enrichment analysis:
  – Statistical tests
    • Gene list: Fisher’s Exact Test
    • Gene rankings: GSEA, also see Wilcoxon rank-sum, Mann-Whitney U-test, Kolmogorov-Smirnov test
  – Multiple test correction
    • Bonferroni: stringent, controls probability of at least one false positive*
    • FDR: more forgiving, controls expected proportion of false positives*; typically uses Benjamini-Hochberg
* False positive = Type 1 error, i.e. an enrichment is observed although there is no real association

Gene function prediction with GeneMANIA

Outline
• Concepts in gene function prediction:
  – Guilt-by-association
  – Gene recommender systems
• GeneMANIA demo
• Gene function prediction use cases
• Scoring interactions by guilt-by-association

Using genome-wide data in the lab
Pathway-based networks
Protein-protein interaction data
Genetic interaction data
?!?
Microarray expression data

Genomics revolution, the bad news
Genomics datasets are:
• noisy,
• redundant,
• incomplete,
• mysterious,
• massive

Google can’t do biology

Guilt-by-association principle
[Figure: microarray expression data (genes x conditions) and the derived co-expression network. Labels: Cell cycle (CDC3, CLB4, CDC16), Protein degradation (RPT1, RPN3, RPT6), unknown genes UNK1 and UNK2]
Eisen et al (PNAS 1998)
Fraser AG, Marcotte EM. A probabilistic view of gene function. Nat Genet. 2004 Jun;36(6):559-64

GeneMANIA Demo
• Main site (stable but still fun): http://www.genemania.org
• Beta site (new and edgy but possibly unreliable): http://beta.genemania.org

Two types of functional prediction
• “Give me more genes like these”
  – e.g. find more genes in the Wnt signaling pathway, find more kinases, find more members of a protein complex
• “What does my gene do?”
  – Goal: determine a gene’s function based on who it interacts with: “guilt-by-association”.

“Give me more genes like these”
• Input: query list (CDC48, CPR3, MCA1, TDH2) plus network and profile data
• Gene recommender system, e.g. GeneMANIA, STRING (http://www.string-db.org), bioPIXIE (http://pixie.princeton.edu/pixie/)
• Output: shown from GeneMANIA

“What does my gene do?”
• Input: query list (CDC48) plus network and profile data
• Gene recommender system then enrichment analysis, e.g. GeneMANIA, bioPIXIE
• Output

Composite functional interaction/linkage/association networks
[Figure: pathway-based networks, protein-protein interaction data, genetic interaction data and microarray expression data are combined into a composite functional association network]

Pre-computed functional interaction networks
• Pre-combine networks e.g. by simple addition or Naïve Bayes
[Figure: Genetic + Co-complexed + Co-expression networks over cell cycle genes (CDC23, CDC27, APC11), DNA repair genes (RAD54, XRS2, MRE11) and unknown genes UNK1, UNK2]
Citations on slide: Tong et al. 2001; Jeong et al 2002; Pavlidis et al, 2002; Marcotte et al, 1999; bioPIXIE

Composite networks: One size doesn’t fit all
• Gene function could be a/the:
  – Biological process,
  – Biochemical/molecular function,
  – Subcellular/Cellular localization,
  – Regulatory targets,
  – Temporal expression pattern,
  – Phenotypic effect of deletion.
• Some networks may be better for some types of gene function than others

Query-specific composite networks
[Figure: the Genetic, Co-complexed and Co-expression networks are multiplied by weights w1, w2, w3 and summed into one composite network over the same genes]
Citations on slide: Tong et al. 2001; Jeong et al 2002; Pavlidis et al, 2002; Lanckriet et al, 2004; Mostafavi et al, 2008

Two rules for network weighting
• Relevance: The network should be relevant to predicting the function of interest
  – Test: Are the genes in the query list more often connected to one another than to other genes?
• Redundancy: The network should not be redundant with other datasets, particularly a problem for co-expression
  – Test: Do the two networks share many interactions?
  – Caveat: Shared interactions also provide more confidence that the interaction is real.

Scoring nodes by guilt-by-association
• Query list (“positive examples”): MCA1, CDC48, CPR3, TDH2
• Each node receives a score, from high to low

Two main algorithms
• Direct neighborhood
• Label propagation

Node scoring algorithm details
• Direct neighbour node score depends on:
  – Strength of links to positive examples
  – # of positive neighbors
• Label propagation node score depends on:
  – Strength of links and # of positive direct neighbors
  – # of shared neighbors with positive examples
  – “modular structure” of network

Label propagation example
[Figure: network before and after label propagation]

Three parts of GeneMANIA:
• A large, automatically updated collection of interaction networks.
• A query algorithm to find genes and networks that are functionally associated to your query gene list.
• An interactive, client-side network browser with extensive link-outs

GeneMANIA data sources
Network types (legend: * minor curation, ** major curation)
• Co-expression*
• Co-localization**
• Pathways
• Physical interactions
• Genetic interactions*
• Shared domains
• Predicted interactions**
• Other: MGI, Chemogenomics
Additional data:
• Gene ID mappings from Ensembl and Ensembl Plant
• Network/gene descriptors from Entrez-Gene and Pubmed
• Gene annotations from Gene Ontology, GOA, and model org. databases

Gene identifiers
• All unique identifiers within the selected organism, e.g.
  – Entrez-Gene ID
  – Gene symbol
  – Ensembl ID
  – Uniprot (primary)
  – also, some synonyms & organism-specific names
• We use Ensembl database for gene mappings (but we mirror it once / 3 months, so sometimes we are out of date)

Cytoscape plugin + QueryRunner
http://www.genemania.org/plugin/
http://cytoscapeweb.cytoscape.org/

Other resources for GeneMANIA info
• “About” page on GeneMANIA interface
  – http://www.genemania.org/pages/about.jsf
• OpenHelix tutorial (not available everywhere)
  – http://www.openhelix.com/
  – http://www.openhelix.com//cgi/tutorialInfo.cgi?id=113
• Our papers:
  – GeneMANIA website: Warde-Farley et al, NAR 2010
  – GeneMANIA algorithm: Mostafavi et al, GB 2008
  – Cytoscape plugin: Montojo et al, Bioinfo 2010
  – Cytoscape Web: Lopes et al, Bioinfo 2010

Who?
• Principal Investigators: Quaid Morris, Gary Bader
• Outreach, students and developers: David Warde-Farley, Sylva Donaldson, Sara Mostafavi, Ovi Comes, Christian Lopes, Max Franz, Harold Rodriguez, Khalid Zuberi, Jason Montojo, Farzana Kazi
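
The direct-neighbourhood scoring described under “Node scoring algorithm details” can be sketched as follows. This is a toy illustration of the guilt-by-association idea, not GeneMANIA’s actual algorithm: the function name is hypothetical, the score is assumed to be a plain sum of link weights to the query genes, and the edge weights are invented (only the gene names come from the slides).

```python
def direct_neighbour_scores(edges, query):
    """Score each non-query gene by the summed weight of its links to
    query genes ("positive examples"), highest score first."""
    query = set(query)
    scores = {}
    for (a, b), weight in edges.items():
        # Each edge is undirected, so check it in both directions.
        for gene, partner in ((a, b), (b, a)):
            if partner in query and gene not in query:
                scores[gene] = scores.get(gene, 0.0) + weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weighted network: gene names follow the slides, weights are invented.
edges = {
    ("CDC48", "UNK1"): 0.75,
    ("MCA1", "UNK1"): 0.5,
    ("CPR3", "UNK2"): 0.25,
    ("UNK2", "RAD54"): 0.5,
}
query = ["MCA1", "CDC48", "CPR3", "TDH2"]
ranking = direct_neighbour_scores(edges, query)
print(ranking)
```

In this sketch UNK1 outranks UNK2 because it has both stronger links and more positive neighbours, while RAD54 gets no score because it touches no query gene directly. Label propagation would additionally let score flow through shared neighbours such as UNK2.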