Transcript wyner.info
Natural Language Processing Techniques for Managing Legal Resources
JURIX 2009 Tutorial
Erasmus University, School of Law, Rotterdam, The Netherlands
December 16, 2009
Adam Wyner, University College London
[email protected]
www.wyner.info/LanguageLogicLawSoftware

Overview
• Preliminary comments.
• Natural Language Processing elements.
• GATE introduction.
• Samples of GATE applied to legal resources - legislation, case based reasoning, and gazettes.
• From GATE to ontologies and logical representations.

Main Point
Legal text expressed in natural language can be automatically annotated with semantic mark ups using natural language processing systems such as the General Architecture for Text Engineering (GATE). Using the annotations, we can (in principle) extract the information from the texts, then use it for answering queries or reasoning.

Outcome from the Tutorial
• Overview of key issues and objectives of NLP with respect to legal resources.
• Idea of how one NLP tool (GATE) works and can be used.
• Idea of what some initial results might be.
• Sense of what can be done.

Audience
• Law school students, legal professionals, public administrators. Get things done that are relevant to them.
• AI and Law researchers. A collaborative framework for research and development.

What the Tutorial is....
• A report of learning and working with this material. A faster way in.
• An invitation to collaborate as a research and development community using a common framework.
• A presentation of initial elements of a work in progress, not a through-and-through prototype or full-fledged system.

Open Data Lists, Rules, and Development Environment
• Contribute to the research community and build on past developments.
• Teaching and learning.
• Interchange. The Semantic Web chokes on different formats.
• Distributed research, streamed results.
• On balance, academic research ought to contribute to the common good rather than be proprietary. If you need to own it, work at a company.
• No publication without replication. Text analytics has an experimental aspect.

Sample Texts
• Legislation (EU and UK) for structure and rules.
• Case decisions (US on intellectual property and crime) for details and CBR factors.
• Gazettes (UK).
• On paper: What information do you want to identify? How can you identify it (e.g. how do you know what you know)? What do you want to do with it?

Semantic Web
• Want to be able to do web-based information extraction and reasoning with legal documents, e.g. find the attorneys who get decisions for plaintiffs in a corpus of case law.
• The machine only “sees” strings of characters, while we “see” and use meaning. John Smith, for plaintiff..... The decision favours plaintiff.
• How can we do this? Semantic annotation of documents, then extraction of those annotations in meaningful relations.
• “Self-describing” documents.

What is the Problem?
Natural language supports implicit information, multiple forms with the same meaning, the same form with multiple meanings (context), and dispersed meanings:
• Entity ID: Jane Smith, for plaintiff. Jane Smith, Jane R. Smith, Smith, Attorney Smith.... Jane Smith in one case decision need not be the same Jane Smith in another case decision.
• Relation ID: Edgar Wilson disclosed the formula to Mary Hays.
• Jane Smith represented Jones Inc. She works for Dewey, Chetum, and Howe. To contact her, write to [email protected]
As for names, so too with sentences.

Knowledge Light v. Heavy Approaches
• Statistical approaches - compare and contrast large bodies of textual data, identifying regularities and similarities. Sparse data problem. No rules extracted. Useful for ranking documents for relevance.
• Machine learning - apply learning algorithms to known material to extend results to unknown material. Needs known, annotated material. Useful for text classification. Black box: one cannot really know the rules that are learned and use them further.
• Lists, rules, and processes - know what we are looking for.
Know the rules and can further use and develop them. Labour and knowledge intensive.

Knowledge Light v. Heavy Approaches
• Some mix of the approaches.
• The importance of humanly accessible explanation and justification in some domains of the law warrants a knowledge-heavy approach.

Overview
• Motivations and objectives of NLP in this context.
• General Architecture for Text Engineering (GATE).
• Processing and marking up text.
• Other technologies for parsing and semantic interpretation (C&C/Boxer).

Motivation
• Annotate large legacy corpora.
• Address growth of corpora.
• Reduce the number of human annotators and tedious work.
• Make annotation systematic, automatic, and consistent.
• Annotate fine-grained information: names, locations, addresses, web links, organisations, actions, argument structures, relations between entities.
• Map from well-drafted documents in NL to RDF/OWL.

Approaches
• Top-down vs. Bottom-up approaches:
• Both do initial (and iterative) analysis of the texts in the target corpora.
• Top-down defines the annotation system, which is applied manually to texts. Knowledge intensive in development and application.
• The annotation system is ‘defined’ in terms of parsing, lists of basic components, ontologies, and rules to construct complex mark ups from simpler ones. Apply the annotation system to text, which outputs annotated text. Knowledge intensive in development.
• Bottom-up reconstructs and implements linguistic knowledge. However, there are limits....
• Convergent/complementary/integrated approaches.

Objectives of NLP
• Generation – convert information in a database into natural language.
• Understanding – convert natural language into a machine-readable form. Support inference?
• Information Retrieval – gather documents which contain key words or phrases. Preindex (a list of which documents a word appears in) the corpus to speed retrieval (e.g. Google). Rank documents in terms of “relevance”. Documents must be read to identify information.
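The preindexing idea just mentioned can be sketched in a few lines of Python. This is a minimal illustration of an inverted index, not GATE code; the toy corpus and function name are invented for the example:

```python
# Minimal inverted index: for each word, record which documents it appears in.
# Retrieval then intersects posting sets instead of scanning every text.
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,")].add(doc_id)
    return index

docs = {
    "case1": "John Smith for plaintiff. The decision favours plaintiff.",
    "case2": "Jane Smith for defendant.",
}
index = build_index(docs)

# Documents containing both "smith" and "plaintiff":
hits = index["smith"] & index["plaintiff"]
```

Note that retrieval of this kind only finds documents; as the slide says, the documents must still be read to identify the information in them.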
Objectives of NLP
• Text Summarization – summarize (in a paragraph) the main meaning of a text or corpus.
• Question Answering – queries made and answers given in natural language with respect to some corpus of texts.
• Information Extraction – identify and extract information from documents, which is then reused or represented. The information should be meaningfully related.
• Information extraction can be used to improve information retrieval.

Objectives of NLP
• Automatic mark up to overcome the annotation bottleneck.
• Develop ontologies.
• Semantic representation for modelling and inference.
• Semantic representation as an ‘interlanguage’ for translation.
• Provide gold-standard corpora.
• To understand and work with human language capabilities.

Subtasks of NLP
• Syntactic parsing into phrases/chunks (prepositional, nominal, verbal,....).
• Identify semantic roles (agent, patient,....).
• Entity recognition (organisations, people, places,....).
• Resolve pronominal anaphora and co-reference.
• Address ambiguity.
• Focus on entity recognition (parsing happens, anaphora can be shown, others are working on semantic roles, etc).

Computational Linguistic Cascade
• Sentence segmentation – divide text into sentences.
• Tokenisation – words identified by spaces between them.
• Part of speech tagging – noun, verb, adjective.... Determined by lookup and relationships among words.
• Morphological analysis – singular/plural, tense, nominalisation, ...
• Shallow syntactic parsing/chunking – noun phrase, verb phrase, subordinate clause, ....
• Named entity recognition – the entities in the text.
• Dependency analysis – subordinate clauses, pronominal anaphora,...
• Each step guided by pattern matching and rule application.

Development Cycle
• Text -> Linguistic Analysis -> Knowledge Extraction
• Cycle back to text and linguistic analysis to improve knowledge extraction.
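The first steps of the cascade can be sketched with plain Python string handling. This is a toy illustration only — GATE's actual components are far richer, and the tag lexicon here is invented for the example:

```python
import re

def sentence_split(text):
    # Sentence segmentation: split after sentence-final punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    # Tokenisation: word characters grouped together; punctuation split off.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Part-of-speech tagging by simple lookup; real taggers also use the
# relationships among neighbouring words, as the slide notes.
LEXICON = {"the": "DET", "dog": "NN", "cat": "NN", "ran": "VBD"}

def pos_tag(tokens):
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens]

text = "The dog ran. The cat ran."
for sent in sentence_split(text):
    print(pos_tag(tokenise(sent)))
```

Each function feeds the next, mirroring the "pipeline" idea that GATE makes explicit.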
GATE
• General Architecture for Text Engineering (GATE): an open source framework which supports plug-in NLP components to process a corpus of text. Is “open” open?
• A GUI to work with the tools.
• A Java library to develop further applications.
• Where to get it? Lots of tutorial information: http://gate.ac.uk/
• Components and sequences of processes, each process feeding the next in a “pipeline”.
• Annotated text output.

Loading and Running GATE with ANNIE
• Start the GUI.
• LC on File > Load ANNIE System > Choose with Defaults. Adds Processing Resources and an Application.
• RC on Language Resources > New > Select GATE document > Browse to document > OK.
• When added, RC on the document (BNA sample) > New Corpus with this document.
• RC on ANNIE under Applications to see the pipeline.
• At Corpus, select the Corpus created. Run.

GATE Example

Inspecting the Result
• RC on the document (not the Corpus) > Show, which shows the text.
• LC on Annotation Sets, LC on Annotations List.
• On the right, see a coloured list with check boxes for annotations; below, see a box with headings.
• Selecting an annotation highlights the relevant text in colour. In the List box below, we get detailed information about location, Id, and features.

Inspecting the Result
• For Location, we have “United Kingdom”, with locType = country, matching Ids, and the rules that have been applied.
• Similarly for JobTitle, Lookup (from Gazetteers), Sentence, SpaceToken, Split (for sentences), and Token (every “word”).
• Note the different information provided by Lookup and Token, which is useful for writing rules.
• Will remark on Type, Start/End, Id, and features.

GATE Example
GATE Example
GATE Example

XML -- Inline
XML is a flexible, extensible framework for mark up languages. The mark ups have beginnings/endings. Inline XML is strictly structured in a tree (note contains body, body could contain date, no overlap) and is “inline” with the text.
Compare to standoff, which allows overlap and sets the text off from the annotations. Standoff allows reprocessing since the text is constant.

XML -- Standoff

GATE Output Inline
In the GATE Document Editor, the Annotations can be deleted (RC > Delete). We have left just Location and JobTitle. To output text with annotations that are XML compatible, RC on the document in Language Resources, then Save preserving document format. Further processing can be done using XSLT.

GATE Output Offset - part 1a
In the GATE Document Editor, the Annotations can be deleted (RC > Delete). We have left just Location and JobTitle. To output text with annotations that are in XML, RC on the document in Language Resources, then Save as XML. This is the top part. The text is serialized, and annotations relate to positions in the text.

GATE Output - part 1b
GATE ANNIE Annotations
GATE ANNIE Annotations
Organisations and Quotations. Case references.

GATE
• Language Resources: corpora of documents.
• Processing Resources: lexicons, ontologies, parsers, taggers.
• Visual Resources: visualisation and editing.
• The resources are plug-ins, so they can be added, removed, or modified. We see this later with ANNIC (Annotations in Context) and Onto Root Gazetteer (using ontologies as gazetteers).

GATE
A source document contains all its original mark up and format.
• John Smith ran.
A GATE document is:
• Document = text + (annotations + features)
<Person, gender = “male”>John Smith</Person> <Verb, tense = “past”>ran</Verb>
Not really the way it appears in GATE, but the idea using XML.

GATE Annotations
• Have types (e.g. Token, Sentence, Person, or whatever is designed for the annotation).
• Belong to annotation sets (see later).
• Relate to start and end offset nodes (earlier).
• Have features and values that store a range of information, as in (not GATE, but XML-ish): <Person, gender = “male”>John Smith</Person> <Verb, tense = “past”>ran</Verb>

GATE Construction: From smaller units, compose larger, derivative units.
Gazetteers: Lists of words (or abbreviations) that fit an annotation: first names, street locations, organizations....
JAPE (Java Annotation Patterns Engine): Build other annotations out of previously given/defined annotations. Use this where the mark up is not given by a gazetteer. Rules have a syntax.

GATE – A Linguistic Example
Lists:
• List of Verb: like, run, jump, ....
• List of Common Noun: dog, cat, hamburger, ....
• List of Proper Name: Cyndi, Bill, Lisa, ....
• List of Determiner: the, a, two, ....
Rules:
• (Determiner + Common Noun) | Proper Name => Noun Phrase
• Verb + Noun Phrase => Verb Phrase
• Noun Phrase + Verb Phrase => Sentence
Input:
• Cyndi likes the dog.
Output:
• [s [np Cyndi] [vp [v likes] [np [det the] [cn dog]]]].

Lists, Lists of Lists, Rules
• Coalesce diverse yet related information in a list, e.g. organisation.lst. What is included here depends on.... What is Looked Up from the list is associated with the “master category”.
• Make a master list of the lists in lists.def, which contains organisation.lst, date.lst, legalrole.lst.....
• The master list indicates the majorType of things looked up in the list, e.g. organisation, and the minorType, e.g. private, public (and potentially more features). Two lists may have the same majorType, but different minorTypes. Useful so rules can apply similarly or differently according to major or minor types.

GATE organisation.lst
GATE Gazetteer – a list of lists

What Goes into a List?
• A 'big' question. Depends on what one is looking for, how one wants to find it, and what rules one wants to apply.
• Every difference in character is a difference in the string, even if the 'meaning' is the same: B.C. b.c. bc b.c bC. May01,1950 May 01 1950 01 May 1950
• More examples later.
• By list or by rule....

Token, Lookup, Feature, Annotation
• Token - a string of characters delimited by spaces. In The big brown dog chased the lazy cat there are eight tokens.
Token information includes syntactic part of speech (noun, verb,....) and string details (orthography, kind, position, length,....).
• Lookup - look up a string in a list and assign it major or minor types. The “bottom semantic” layer of the cascade.
• Annotation - subsequent mark ups which depend on Token, Lookup, or prior annotations.
• Feature - additional Token, Lookup, or Annotation information.

Rolling your Own
• Create lists and a gazetteer.
• Add processing resources.
• Add documents and make a corpus.
• Construct the pipeline - an ordered sequence of processes.
• Run the pipeline over the corpus.
• Inspect the results.

GATE JAPE
JAPE rule idea (not the real thing).
<FirstName>aaaa</FirstName><LastName>bbbb</LastName> => <WholeName><FirstName>aaaa</FirstName><LastName>bbbb</LastName></WholeName>
FirstName and LastName we get from the Gazetteer. WholeName we construct using the rule. For complex constructions, one must have a range of alternative elements in the rule.

GATE JAPE
• Header - rule name, annotations to use, processing features, processing priority....
• Left hand side of rule (LHS) - refers to various mark ups that are found in the text, relies on order, uses expressions of combination or iteration, and identifies what portion is to be annotated as per the rule.
• Right hand side of rule (RHS) - annotates as per the rule (plus some information).
• Can have Java on the RHS, but will not cover this.

GATE JAPE
? means optional
GATE JAPE
GATE JAPE

Other GATE Components
• Plug in other parsers or work with other languages.... (no)
• Machine learning component. (no)
• Search for annotations in context to refine gazetteers and JAPE rules (ANNIC). (yes)
• Develop an ontology, import it into GATE, then mark up elements automatically or manually (Onto Root Gazetteer). (yes)

GATE – Problems and Issues
• Any difference in the characters of the basic text or in the annotations is an absolute difference:
• theatre and theater are different strings for entities.
Put variants in Gazetteers.
• Organisation and Organization are different annotations.
• Output in XML is possible, but GATE mark up allows overlapping tags, which are barred in standard XML. Must rework GATE XML with XSLT to make it standard XML.
• It is hard to reach 100% accuracy for a variety of reasons (it depends on the consistency of the input), but 85-95% is achievable.

GATE – Extraction
• So far we have really only covered annotation. Where is the extraction bit?
• Currently, GATE has no plug-in to support extraction of information with respect to a rich schema template, e.g. give cases, parties, attorneys, factors, and outcomes.
• With further processing using tools outside GATE, this can be done: XSLT, Java, .... Example....? Use ontologies (I think the direction to go...). Yet, we can output as presented earlier.

GATE on Legal Resources
• Legislative structure for a rule book (structure identification).
• Rule detection for inference (general, UK Nationality Act).
• Elements of cases (CATO intellectual property, California criminal cases).
• Gazette/Notices information (TSO/OPSI).

Legislative Structure
• Legislative structure for a rule book that is used for compliance.
• Identify and annotate the structure of legislation.
• Show what, then how.
• Look for “posts” which can help one identify “content”.
• RuleBookTest.xgapp

Insurance and Reinsurance (Solvency II)

Desired Output
• Reference Code: Article 1
• Title: Subject Matter
• Level: 1.0 Description: This Directive lays down rules concerning the following:
• Level: 1.1 Description: the taking-up and pursuit, within the Community, of the self-employed activities of direct insurance and reinsurance;
• Level: 1.2 Description: the supervision in the case of insurance and reinsurance groups;
• Level: 1.3 Description: the reorganisation and winding-up of direct insurance undertakings.

GATE Annotation

Comments
• The article is not a logical statement, but identifies the matters with which the directive is concerned.
• Each statement of the article may be understood as a conjunct: the rules concern a, b, and c. However, we have not (yet) represented this.
• The JAPE rules work for this example, but need to be further refined to work with the whole legislation.
• Break down the text into useful segments that can support identification.

Lists
• roman_numerals_i-xx.lst: It has majorType = roman_numeral. A list of roman numerals from i to xx.
• rulebooksectionlabel.lst: It has majorType = rulebooksection. A list of section headings such as: Subject matter, Scope, Statutory systems, Exclusion from scope due to size, Operations, Assistance, Mutual undertakings, Institutions, Operations and activities.

JAPE Rules
• ArticleSection.jape: What is annotated with Article (from the lookup) and a number is annotated ArticleFlag.
• ListFlagLevel1.jape: A number followed by a period or closed parenthesis is annotated ListFlagLevel1.
• ListFlagLevel1sub.jape: A number followed by a letter followed by a period is annotated ListFlagLevel1sub.
• ListFlagLevel2.jape: A string of lower case letters followed by a closed parenthesis is annotated ListFlagLevel2.
• ListFlagLevel3.jape: A roman numeral from a lookup list followed by a closed parenthesis is annotated ListFlagLevel3.

JAPE Rules
• RuleBookSectionLabel.jape: Looks up section labels from a list and annotates them SectionType. For example, Subject matter, Scope, and Statutory systems.
• ListStatement01.jape: A string which occurs between a SectionType annotation and a colon is annotated ListStateTop.
• ListStatement02.jape: A string which occurs between a ListFlagLevel1 and a semicolon is annotated SubListStatementPrefinal.
• ListStatement03.jape: A string which occurs between a ListFlagLevel1 and a period is annotated SubListStatementFinal.

JAPE Rules
JAPE Rules
Note the use of “or”.
JAPE Rules
JAPE Rules
Repeat getting tokens so long as they are not punctuation. + is one or more tokens. Negation.
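The intent of the list-flag rules above can be mimicked with regular expressions. This is a rough Python analogue, not JAPE: the patterns and the rule-ordering trick are my own reconstruction of the descriptions on the slide:

```python
import re

# Rough regex analogues of the JAPE list-flag rules. Order matters: roman
# numerals like "i)" would otherwise also match the Level2 lower-case-letter
# pattern, mirroring the rule-priority issue JAPE handles with its headers.
PATTERNS = [
    ("ListFlagLevel1",    re.compile(r"^\d+[.)]$")),     # number + period or ')'
    ("ListFlagLevel1sub", re.compile(r"^\d+[a-z]\.$")),  # number + letter + period
    ("ListFlagLevel3",    re.compile(r"^(i|ii|iii|iv|v|vi|vii|viii|ix|x)\)$")),
    ("ListFlagLevel2",    re.compile(r"^[a-z]+\)$")),    # lower-case letters + ')'
]

def classify_flag(token):
    """Return the first flag annotation whose pattern matches, else None."""
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return None
```

So `"1."` and `"2)"` come out as ListFlagLevel1, `"3a."` as ListFlagLevel1sub, `"iv)"` as ListFlagLevel3, and `"b)"` as ListFlagLevel2.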
Rule Detection
• Rule detection with a general example and a specific one (UK Nationality Act).
• Sentence classification.
• From extraction almost to executable logical form (Haley's manual translation and proprietary logical form).
• Conditional.xgapp

Rule Detection
Rule Detection
Problems with: list recognition “(x)”, use of “;”, use of “--”, and use of “or”.

Lists and Rules
• No particular lists. Used the list detection from the previous exercise (so particular to that context).
• AntecedentInd01: annotates the token “if” or “If” in the text as a conditional flag.
• AntecedentInd02: annotates a sentence as a conditional if it includes the conditional flag.
• Another (better?) way: use “if” to identify antecedents and consequents; a sentence is conditional if it has one or more antecedent sentences and one consequent sentence.

Lists and Rules
• ConditionalParts 01 and 05: annotate a sentence portion as an antecedent between a conditional flag and some punctuation.
• ConditionalParts 02, 03, and 04: annotate a sentence portion as a consequent where it appears: between a sentence and some conditional flag; after “then” and a period; or before a conditional flag and a list indicator (e.g. a colon).

Rule Detection
Rule Detection
Rule Detection
* is zero or more tokens, but should be +.
Rule Detection

Case Factors and Elements
• Factors in CATO. Relate to ANNIC.
• Case parts in California criminal cases. Relate to Onto Root Gazetteer.

Case Based Reasoning
• The CATO case base is a set of cases concerning intellectual property.
• Given a current undecided case, compare and contrast the “factors” of the current case against the factors of decided cases in the case base. Decide the current case following the decisions of decided cases.
• If a current case has exactly the same factors as a decided case, the decision in the current case is decided as it was in the decided case.
• A complex counter-balancing of various factors (and their circumstances and weightings...)
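The exact-match step of the comparison above can be sketched as a set comparison. This is a toy Python sketch; the factor names and case names are invented, and real CATO-style reasoning weighs partial matches far more subtly than this:

```python
# Each decided case is a set of factors plus the outcome it reached.
decided_cases = {
    "Alpha v. Beta": ({"secrets-disclosed-outsiders"}, "defendant"),
    "Gamma v. Delta": ({"security-measures"}, "plaintiff"),
}

def decide(current_factors):
    """If a decided case has exactly the same factors, follow its decision."""
    for name, (factors, outcome) in decided_cases.items():
        if factors == current_factors:
            return outcome, name
    # Otherwise the factors must be counter-balanced; not modelled here.
    return None, None

outcome, precedent = decide({"security-measures"})
```

Here the current case matches the decided case exactly, so it follows that precedent; any non-identical factor set falls through to the unmodelled counter-balancing step.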
Case Based Reasoning
The Factors are abstractions of fact patterns that favour one side or the other. Suppose you have a product and a secret way of making it.
• Secrets disclosed to outsiders: You announce the method, thereby divulging the secret. If you try to sue someone who uses the method, you are likely not to win.
• Security measures: You have lots of security measures to protect your method and never publicly divulge it. If you try to sue someone who uses the method, you are likely to win.

Case Based Reasoning
• The task is simply to find linguistic indicators for the factors in case texts.
• This is currently done “manually”.
• We do this “roughly”, then look at ANNIC, which is a tool we can use to look at the matter more closely.
• CATOCaseFactors.xgapp

Case Based Reasoning
From Aleven 1997

Case Based Reasoning - Creating a Concept
• We are looking for a “concept”, which is an abstraction over particular forms.
• Looked up “disclose” and “secret” in WordNet and made two lists, one for each “concept”:
disclosure.lst: announce, betray, break, bring out, communicate, confide, disclose, discover, divulge, expose, give away, impart, inform, leak, let on, let out, make known, pass on, reveal, tell
secret.lst: confidential, confidentiality, hidden, private, secrecy, secret
• The majorType of the disclosure list is “disclose” and that of the secret list is “secret”.

Case Based Reasoning - Rules
Similarly for Secret

Case Based Reasoning - Examples
Want to refine these results by looking at context.

ANNIC - Annotations in Context
• A plug-in tool which helps in searching for annotations, visualising them, and inspecting features. Useful for JAPE rule development.
• How to plug in, load, and run.
• CATOCaseFactors.xgapp

ANNIC - Instantiating an SSD
• RC on Datastores > Create datastore > Lucene Based Searchable DataStore
• At the input window, provide the following parameters:
DataStore URL: Select an empty folder where the data store is created.
Index Location: Select an empty folder.
This is where the index will be created.
Annotation Sets: Provide the annotation sets that you wish to include in or exclude from being indexed. Make this list empty (CHECK).
Base-Token Type: The tokens which your documents must have to get indexed.

ANNIC - Instantiating an SSD
At the input window, provide the following parameters:
• Index Unit Type: The unit of index (e.g. Sentence). We use the Sentence unit.
• Features: Specify the annotation types and features that should be included in or excluded from being indexed (e.g. exclude SpaceToken, Split, or Person.matches).
• Click OK. A new empty searchable SSD will be created.

ANNIC - Instantiating an SSD
• Create an empty corpus and save it to the SSD.
• Populate the corpus with some documents. Each document in the corpus is automatically indexed and saved to the data store.
• Load some processing resources and then a pipeline. Run the pipeline over the corpus.
• Once the pipeline has finished (and there are no errors), save the corpus in the SSD by RC on the corpus, then “Save to its datastore”.
• Double click on the SSD file under Datastores. Click on the “Lucene DataStore Searcher” tab to activate the search GUI.
• Now you can specify a search query over your annotated documents in the SSD.

ANNIC - The GUI
• Top - area to write a query, select corpus, annotation set, number of results, and size of context.
• Middle - visualisation of annotations and values given the search query.
• Bottom - a list of the matches to the query across the corpus, giving the left and right contexts relative to the search results.
• Annotations can be added (green) or removed (red).
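The bottom pane's display of matches with left and right context is essentially a keyword-in-context (KWIC) view. The idea can be sketched in Python — an illustration only, since ANNIC itself queries a Lucene index of annotations rather than scanning raw tokens:

```python
def kwic(tokens, target, window=3):
    """Return (left context, match, right context) triples for each hit,
    like the bottom pane of the ANNIC GUI."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "the employee did not disclose the secret formula".split()
print(kwic(tokens, "disclose"))
```

Seeing "not" in the left context of "disclose" is exactly the kind of observation that motivates refining a factor rule, as in the negation query shown below.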
ANNIC - The GUI

ANNIC - Queries (a subset of JAPE)
• String
• {AnnotationType}
• {AnnotationType == String}
• {AnnotationType.feature == featureValue}
• {AnnotationType1, AnnotationType2.feature == featureValue}
• {AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue}
• Trandes - returns all occurrences of the string where it appears in the corpus.

ANNIC - Queries (a subset of JAPE)
• {Person} - returns annotations of type Person.
• {Token.string == "Microsoft"} - returns all occurrences of “Microsoft”.
• {Person}({Token})*2{Organization} - returns Person followed by zero to two tokens followed by Organization.
• {Token.orth=="upperInitial", Organization} - returns a Token with the feature orth set to "upperInitial" which is also annotated as Organization.
• {Token.string=="Trandes"}({Token})*10{Secret} - returns the string “Trandes” followed by zero to ten tokens followed by Secret.
• {Token.string =="not"}({Token})*4{Secret}

ANNIC - Example
ANNIC - Example

Case Details
We would like to annotate a range of case details such as:
• Case citation
• Names of parties
• Roles of parties
• Sort of court
• Names of judges
• Names of attorneys
• Final decision....
Look at some of this and relate it to ontologies.
DSACaseInfo.xgapp

California Criminal Cases
California Criminal Cases
California Criminal Cases

Onto Root Gazetteer
• A plug-in tool which uses an ontology as a gazetteer. The ontology can be created and modified in GATE. Can add individuals. Some steps in automating ontology creation and population.
• Using the ontology, one can query, draw inferences, and write rules with another tool (Protege).
• How to plug in, load, and run.
• An example -- CBR-OWL.xgapp
• Check the ontology and add individuals.

Onto Root Gazetteer
• Links text to an ontology by creating Lookup annotations which come from the ontology.
• Richly structured.
• Relates textual and ontological information by adding instances.
• Richer annotations that can be used for further processes.

Onto Root Gazetteer
• Add the Onto Root Gazetteer plug-in.
• Add the Ontology Tools.
• Create (or load) an ontology with OWLIM. This is the ontology that is the language resource that is then used by Onto Root Gazetteer. Suppose this ontology is called myOntology.
• OWLIM can only use OWL-Lite ontologies.
• Create processing resources with default parameters:
Document Reset PR
RegEx Sentence Splitter
ANNIE English Tokeniser
ANNIE POS Tagger
GATE Morphological Analyser

Onto Root Gazetteer
• Create an Onto Root Gazetteer PR and initialise it as:
Ontology: select the previously created myOntology
Tokeniser: select the previously created Tokeniser
POSTagger: select the previously created POS Tagger
Morpher: select the previously created Morpher.
• Create a Flexible Gazetteer PR. Select the previously created OntoRootGazetteer for gazetteerInst. For inputFeatureNames, click on the button on the right and, when prompted with a window, add ‘Token.root’ in the provided text box, then click the Add button. Click OK, give a name to the new PR (optional), and then click OK.

Onto Root Gazetteer
• Create an application: right click on Applications, New –> Pipeline (or Corpus Pipeline).
• Add the following PRs to the application in this order:
Document Reset PR
RegEx Sentence Splitter
ANNIE English Tokeniser
ANNIE POS Tagger
GATE Morphological Analyser
Flexible Gazetteer
• Run the application over the selected corpus.
• Inspect the results. Look at the Annotation Set with Lookup and also the Annotation List to see how the annotations appear.
• NAY..... It is not working this way.

Onto Root Gazetteer
• Editing the ontology (using the tools in GATE to add classes, subclasses, etc).
• Annotating the texts manually with respect to the ontology (highlighting a string and hovering brings out a menu).
• Adding instances to the ontology (one has a choice to add instances).
• The ontology can then be exported into an ontology editor (Protege) and used for reasoning.
Not shown.
Onto Root Gazetteer
Onto Root Gazetteer

Content in Gazette Notices
Not glamorous, but useful. www.london-gazette.co.uk search insolvency
Content in Notices

C&C/Boxer – Motivations and Objectives
• Fine-grained syntactic parsing – can identify not only parts of speech, but grammatical roles (subject, object) and phrases (e.g. a verb plus a direct object is a verb phrase).
• Contributes to NL to RDF/OWL translation – individual entities, data and object properties?
• Input to semantic interpretation in FOL – test for consistency, support inference, allow rule extraction.

C&C/Boxer
• C&C is a parser based on combinatory categorial grammar.
• Boxer provides a semantic interpretation, given the parse. The semantic interpretation is a form of first order logic – discourse representation theory.
• Needs some manipulation. The parser outputs the ‘best’ parse, but that might not be what one wants; the semantic representation might need to be selected.

C&C/Boxer
• Try it out at: http://svn.ask.it.usyd.edu.au/trac/candc
• Various representations – C&C, Graphic, XML Parse, Prolog.
• Not perfect (or even clear), but a step in the right direction and something concrete to build on.

C&C/Boxer - Parse
C&C/Boxer - DRT
∀x [man’(x) -> happy’(x)]
Dynamic, so the assignment function can grow with context.

C&C/Boxer - Prolog
A woman who is born in the United Kingdom after commencement of the act is happy.
A woman who is born in the United Kingdom after commencement of the act is a British citizen if her mother is a British citizen when she is born.

Other Topics
• Controlled Languages
• An expressive subset of grammatical constructions and lexicon.
• Guided input so that only well-formed, unambiguous expressions are entered.
• Translation to FOL.
• Machine Learning
• Annotating a set of documents to make a ‘gold standard’.
• Train the system on the gold standard and unannotated documents.
• Test accuracy and adjust.
• No information on how the algorithm works.
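The DRT slide's formula, a universally quantified conditional ("every man is happy"), can be checked against a small finite model in Python. This is a toy model checker for that one formula shape; the domain and predicate extensions are invented:

```python
# Evaluate forall x [man(x) -> happy(x)] over a finite domain, where each
# predicate is represented by the set of individuals that satisfy it.
domain = {"john", "bill", "fido"}
man = {"john", "bill"}
happy = {"john", "bill", "fido"}

def forall_implies(domain, antecedent, consequent):
    """True iff every x in the domain satisfying the antecedent also
    satisfies the consequent (material implication)."""
    return all((x not in antecedent) or (x in consequent) for x in domain)

# "Every man is happy" holds in this model:
result = forall_implies(domain, man, happy)
```

Testing a formula for truth in a model is of course weaker than the consistency checking and inference mentioned above, but it shows concretely what the FOL target representation buys: the annotated text becomes something a program can evaluate.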
Evaluation
• An evaluation sheet was given out at the start. It would be helpful to get feedback with comments, questions, suggestions, ideas....

Conclusions
• Different approaches to mark up.
• Burdens of initial analysis, coding, and labour.
• Top-down is far ahead of bottom-up, but this is a matter of the focus of research effort.
• Converging, complementary, integrated approaches.
• Potential to enrich annotations further for finer-grained information.

References
• Manu Konchady (2008). Building Search Applications: Lucene, LingPipe, and Gate.
• Graham Wilcock (2009). Introduction to Linguistic Annotation and Analytics Technologies.
• Bransford-Koons. Dynamic semantic annotation of California case law. MSc Thesis, San Diego State University.
• Thakker, Osman, and Lakin. JAPE Tutorial.

Thanks
• For your attention!
• To Phil Gooch, Hazzaz Imtiaz, and Emile de Maat for discussion.
• To the London GATE User's Group.
• To the GATE Community and discussion list.