Computer Aided Indexing
Overview of the key features of software candidates
Mary Huxlin
Australia
Human vs. Computer-Aided Indexing

Human indexing
- logical
- controllable
- resource intensive
- inconsistent (varying subject knowledge, human error)
- not scalable to the same degree as automated indexing

Computer-Aided Indexing
- accurate, though accuracy is limited
- difficult to train
- lack of control
- consistent
- speeds up the process; greater throughput

Content management technologies
- Use various algorithms for statistical analysis, semantic processing, NLP and neural networks (AI), most often in combination.
- Offer automatic or semi-automatic (hybrid) solutions for:
  - extracting relevant content from a document or web page
  - mapping documents into one or multiple pre-defined or customer-specific hierarchies of categories (taxonomies)
  - machine-aided indexing
  - thesaurus development

Definitions
- Thesaurus: one type of controlled vocabulary; a collection of terms referring to specific concepts, together with variants and conceptual relationships.
- Taxonomy (from Greek taxis, meaning arrangement or division, and nomos, meaning law): the science of classification according to a pre-determined system that divides a subject area hierarchically into progressively smaller subdivisions.
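The two definitions above can be illustrated with a small data structure. The following is only a minimal sketch, not the model used by any of the reviewed products: the class names `Concept` and `Thesaurus`, the methods `resolve` and `path`, and the sample terms are all hypothetical, chosen for illustration. It assumes each concept has one preferred term, any number of variants, and at most one broader term, so that following the broader links yields a taxonomy path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Concept:
    """One thesaurus entry: a preferred term, its variants, and a broader term."""
    preferred: str
    variants: Set[str] = field(default_factory=set)
    broader: Optional[str] = None  # preferred term of the parent concept, if any


class Thesaurus:
    """Hypothetical controlled vocabulary with a single-parent hierarchy."""

    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}
        self._lookup: Dict[str, str] = {}  # lower-cased label -> preferred term

    def add(self, preferred: str, variants=(), broader: Optional[str] = None) -> None:
        self.concepts[preferred] = Concept(preferred, set(variants), broader)
        for label in (preferred, *variants):
            self._lookup[label.lower()] = preferred

    def resolve(self, label: str) -> Optional[str]:
        """Map any known label (preferred or variant) to its preferred term."""
        return self._lookup.get(label.lower())

    def path(self, label: str) -> List[str]:
        """Return the taxonomy path from the top category down to the term."""
        term = self.resolve(label)
        chain: List[str] = []
        # Stop at the top of the hierarchy, at an unknown parent, or on a cycle.
        while term is not None and term not in chain:
            chain.append(term)
            concept = self.concepts.get(term)
            term = concept.broader if concept else None
        return list(reversed(chain))


# Illustrative sample data only.
thesaurus = Thesaurus()
thesaurus.add("Chemical elements")
thesaurus.add("Rare earths", variants=["rare earth elements", "REE"],
              broader="Chemical elements")
thesaurus.add("Cerium", variants=["Ce"], broader="Rare earths")

print(thesaurus.resolve("REE"))
print(thesaurus.path("ce"))
```

Keeping variants in a separate lookup table means an indexer can accept whatever wording appears in a document while always assigning the preferred term, and the broader links place that term in the hierarchy.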
The integration of classification and thesauri in an automated environment results in the construction of a TAXONOMY.

Taxonomy process
(Figure reproduced from Delphi Group White Paper)

Taxonomy for the Semantic Web
- defines topics and their relationships
- improves user and technology efficiency

Example (whatis.com)
Browse categorically: All Categories > Computing Fundamentals > General Computing Terms
Word list for General Computing Terms: tag (searchWebServices), Tag Image File Format (whatis), task (whatis), taxonomy (whatis), TCB (searchSystemsManagement), TCP (searchNetworking)

taxonomy
Taxonomy (from Greek taxis meaning arrangement or division and nomos meaning law) is the science of classification according to a pre-determined system, with the resulting catalog used to provide a conceptual framework for discussion, analysis, or information retrieval. In theory, the development of a good taxonomy takes into account the importance of separating elements of a group (taxon) into subgroups (taxa) that are mutually exclusive, unambiguous, and taken together, include all possibilities. In practice, a good taxonomy should be simple, easy to remember, and easy to use. One of the best known taxonomies is the one devised by the Swedish scientist, Carl Linnaeus, whose classification for biology is still widely used (with modifications). In Web portal design, taxonomies are often created to describe categories and subcategories of topics found on the Web site. The categorization of words on whatis.com is similar to any Web portal taxonomy.

Statistical Text Analysis and Clustering
Measures:
- Co-occurrences of words.
  - If “Java” is used in connection with “Starbucks”, the document is more likely to be about coffee than about a programming language.
- Relative placement of words.
  - Words in the first lines of a document or in the title are likely to be more important than information contained in the copyright section.
- Word frequency, placement and grouping, as well as the distance between words in a document.

Advanced statistical text analysis and clustering
- Bayesian probability uses statistical models derived from words in training sets, together with pattern analysis, to assign the probability of correlation.
  - If a given document contains the words “cerium” and “gadolinium”, it is more than likely about Rare Earths, which leads to the assumption that the names of other metals such as “neodymium” or “ytterbium” will occur.
- Pattern analysis improves the precision of statistical analysis and clustering by resolving ambiguous or multiple meanings of words, which it learns through an iterative process.
  - “SHRIMP” in a document may refer to a method of analysis (sensitive high-resolution ion microprobe) or to a crustacean.

Semantic and Linguistic Clustering
- Linguistic (natural language processing) software:
  - analyses the structure of sentences, identifying the subject, verbs and objects
  - applies sentence structure analysis to extract the meaning
  - uses stemming, i.e. reducing a word to its root (prone to overkill!)
- Documents are clustered or grouped depending on the meaning of words, using a thesaurus / knowledge base, probabilistic grammar, recognition of idioms, verb chain recognition, and noun phrase identifiers.
- Only a slight improvement over statistically generated phrases.

Rule-based classification/indexing
- Rules enable the system to think like humans.
- Identity rules identify concepts that match, or are equivalent in meaning to, Thesaurus terms (forbidden terms).
- Context rules (proximity, case of letters, location in the document).
- “If-Then” or “If-EndIf” rules are used when word meanings are ambiguous.
- Rules can be a powerful and flexible means of automatically classifying content based not just on the content itself but also on the metadata that describes the content (e.g.
subject categories, journal title).
- The downside of a rule-based system:
  - human experts (= expensive) have to write and maintain the rules
  - rules can be complex and thus prone to failure

Rule-based indexing: “If rules”
Text to match: SAFETY
    IF (NEAR “reactor”) WITHIN 3 WORDS
        USE Reactor safety
    ENDIF
    IF (WITH “standard”) WITHIN SENTENCE
        USE Safety standards
    ENDIF

Machine learning
- An iterative process.
- Identifies patterns in manually indexed sample texts (training set) and makes predictions about unseen text; also called computational linguistics.
- Improves its performance based on experience.
- Requires a large number of documents.

Linguistic DNA
- Statistical and linguistic processing (extraction)
  - Key concepts extracted have full semantic meaning on their own
- Enhanced with Logico-Deductive Reasoning and Fuzzy Logic techniques (manipulation)
  - Fuzzy concepts are highly context-dependent

Application Programming Interfaces
- Software applications such as portals, content management systems, knowledge management systems, search and retrieval software, data extraction, and data mining can all benefit from automatically or semi-automatically generated taxonomies.
- Practically all the software under consideration is sold with Application Programming Interfaces (APIs) for integration into existing local applications.

Assessment
- Most of the companies profiled are very technology-centric and spend much of their marketing effort trying to convince us of the advantages of their approach or methodology.
- None of the products works “out of the box”; all require a closer relationship between the user and the supplier of the technology.
- The bottom line is to understand how these differences affect system performance in the only environment that matters: our unique data environment.

Assessment (cont.)
- The main business of the reviewed software candidates is to categorise information automatically, identifying where documents belong within a taxonomy: a solution to the “infoglut” problem.
- They are market driven; they have, or could develop, a CAI capability if there is a sizeable market for it.
- Some of the products analysed have an advanced CAI component, warranting further investigation.

How can we measure CAI software performance?
- Recall = proportion of the correct index terms that were generated
- Precision = proportion of the generated index terms that are correct
- Overindexing = proportion of incorrect index terms generated
- Cost benefit (ROI)

ISSUES
- Training
- Integration with WinFIBRE (or similar products)
- Tools for authoring and maintenance
- Could we maintain the quality of the database?
- Ensure the user can navigate from need to

“...Control exercised by machines, far from enslaving human beings, will liberate them for tasks only they can perform…”
Wellisch, H.H. (1998). Indexing after the millennium 3: the indexer as helmsman. The Indexer, 21(2), 89.
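The recall, precision and overindexing measures listed above can be computed by comparing the index terms a CAI product generates for a document with the terms a human indexer assigned. The sketch below is illustrative only. The function name `indexing_metrics` and the sample terms are hypothetical, the human-assigned terms are assumed to be the "correct" set, and overindexing is assumed to mean the share of generated terms that are incorrect (the slides do not state the denominator).

```python
from typing import Dict, Iterable


def indexing_metrics(generated: Iterable[str], correct: Iterable[str]) -> Dict[str, float]:
    """Compare machine-generated index terms with the correct (human) terms."""
    generated_set, correct_set = set(generated), set(correct)
    hits = generated_set & correct_set    # correct terms the software found
    extras = generated_set - correct_set  # generated terms that are wrong

    # Guard against empty sets to avoid division by zero.
    recall = len(hits) / len(correct_set) if correct_set else 0.0
    precision = len(hits) / len(generated_set) if generated_set else 0.0
    # Assumption: overindexing is measured against the generated terms,
    # which makes it the complement of precision.
    overindexing = len(extras) / len(generated_set) if generated_set else 0.0

    return {"recall": recall, "precision": precision, "overindexing": overindexing}


# Hypothetical example: one document indexed by a human and by the software.
human_terms = {"Reactor safety", "Safety standards", "Rare earths", "Cerium"}
machine_terms = {"Reactor safety", "Rare earths", "Coffee"}

scores = indexing_metrics(machine_terms, human_terms)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

In this example the software finds two of the four correct terms (recall 0.50) and two of its three suggestions are correct (precision about 0.67). Averaging such scores over a sample of already indexed records would give one basis for comparing candidate products on our own data.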