Encoding Croatian Corpora

Marko Tadić (www.hnk.ffzg.hr/mt)
Department of linguistics/Institute of linguistics, Faculty of philosophy, University of Zagreb (www.ffzg.hr / www.ffzg.hr/zzl/zzl-home.htm)
Tübingen, 2001-02-22

Lecture plan
Monolingual corpora
– Croatian National Corpus (HNK)
Bilingual corpora
– Croatian-English parallel corpus
– Croatian-Slovenian parallel corpus
– Acquis translations parallel corpus

Croatian National Corpus (HNK) 1
project of the Ministry of Science and Technology of the Republic of Croatia 130718, Computational processing of Croatian language, formally started 1996, actually 1998
theoretical foundations (www.hnk.ffzg.hr/cilj) in 1995, published:
– Tadić (1996) Računalna obradba hrvatskoga i nacionalni korpus, Suvremena lingvistika 41-42, 603-612
– Tadić (1998) Raspon, opseg i sastav korpusa suvremenoga hrvatskoga jezika, Filologija 30-31, 337-347
need for the reference corpus of Croatian
– 1st step: written
– later: some 10% spoken
a tentative solution for its composition, size, time-span and structure was elaborated
accessibility via WWW service was suggested

HNK 2: structure
30m: 30-million Corpus of Contemporary Croatian
– texts from 1990 until today
– different domains and genres
– representativeness for contemporary Croatian standard
HETA: Croatian Electronic Text Archive (Hrvatski Elektronski Tekstovni Arhiv)
– whole texts older than 1990
– whole texts of complete publications after 1990 which would upset the balance (representativeness) of 30m

HNK 3: 30m text typology
Category                                   %      Words
1. Informative texts/Faction              76   22800000
1.1. newspapers                           37   11100000
1.1.1. daily                              22    6600000
1.1.2. weekly                              9    2700000
1.1.3. bi-weekly                           3     900000
1.1.4. irregular                           3     900000
1.2. magazines                            17    5100000
1.2.1. weekly                             10    3000000
1.2.2. bi-weekly                           1     300000
1.2.3. monthly                             3     900000
1.2.4. bi/tri-monthly                      3     900000
1.3. books                                22    6600000
1.3.1. journalism                          7    2100000
1.3.2. crafts etc.                         2     600000
1.3.3. science                            13    3900000
2. Imaginative texts/Fiction              21    6300000
2.1. prose                                21    6300000
2.1.1. novels                             13    3900000
2.1.2. stories                             7    2100000
2.1.3. diaries, travel notes...            1     300000
3. Mixed texts                             3     900000
3.1. imaginative-factographic pieces       1     300000
3.2. essays                                1     300000
3.3. speeches                              1     300000

HNK 4: corpus on www
http://www.hnk.ffzg.hr
Testing V 1.0: 1998-12-05
– 30m: 3 Mw
Testing V 1.1: 1999-02-14 & 1999-07-20
– 30m: 7.67 Mw
– HETA: 2.9 Mw from CD-ROM: Classics of Croatian literature, Naklada Bulaja, Zagreb, 1999
Testing V 1.1 (approx. 10 Mw) of corpus is www accessible
– text format: quasi HTML, no XML
– no POS marking
Testing V 1.2 (approx. 17 Mw)
– being filled right now
– no additional retrieval facilities

HNK 5: Statistics
www.hnk.ffzg.hr/stats
Item                                   Value
Hits                                   261182
Total Data Transferred                 7.28 gigabytes
Total Visiting Users                   28871
Time Period                            November 27, 1998, 08:43 AM to December 31, 2000, 11:47 PM
Average Hits per User                  9.05
Average Users per Day                  37.69
Average Data Transferred per Day       9.74 megabytes
Hits cached by Client                  67983 (26.03%)
Report generated on                    January 11, 2001 at 11:44 AM
Incomplete downloads/file requests     3037 (1.16%)
Log spans a period of                  766 days
Total failed requests                  16574 (6.35%)
Unique IP Addresses                    9480
Average Data Transferred per User      264.50 kilobytes
Average Hits per Day                   340.97
Average Data Transferred per Hit       29.24 kilobytes
Each user has visited approximately    3.05 times
Hits on Pages                          123105
Hits on Files                          18620
Hits on Images                         102883

Domain Name                           Hits   Percentage
Croatia/Hrvatska (.hr)              119542     63.01%
Commercial (.com)                    22371     11.79%
Germany (.de)                        18312      9.65%
Network (.net)                       11400      6.01%
Austria (.at)                         2848      1.50%
Educational (.edu)                    2037      1.07%
Slovenia (.si)                        1378      0.73%
Netherlands (.nl)                      990      0.52%
Australia (.au)                        870      0.46%
Czech Republic (.cz)                   782      0.41%
Italy (.it)                            770      0.41%
Canada (.ca)                           747      0.39%
Yugoslavia (.yu)                       698      0.37%
France (.fr)                           682      0.36%
Sweden (.se)                           669      0.35%
Poland (.pl)                           555      0.29%
Bosnia and Herzegovina (.ba)           502      0.26%
Russian Federation (.ru)               396      0.21%
United Kingdom (.uk)                   396      0.21%
Japan (.jp)                            373      0.20%
Switzerland (.ch)                      326      0.17%
New Zealand (.nz)                      294      0.15%
Denmark (.dk)                          276      0.15%
Slovakia (Slovak Republic) (.sk)       224      0.12%
Hungary (.hu)                          205      0.11%
Non-profit Organization (.org)         171      0.09%
Israel (.il)                           166      0.09%
Belgium (.be)                          148      0.08%
Greece (.gr)                           145      0.08%
Norway (.no)                           145      0.08%
Macedonia (.mk)                        118      0.06%
Spain (.es)                             96      0.05%
Finland (.fi)                           96      0.05%
Kuwait (.kw)                            83      0.04%
Portugal (.pt)                          71      0.04%
United States (.us)                     70      0.04%
Ukraine (.ua)                           65      0.03%
Ireland (.ie)                           64      0.03%
Bulgaria (.bg)                          64      0.03%
Brazil (.br)                            57      0.03%
Estonia (.ee)                           57      0.03%

HNK 6: text conversion and encoding
XML
– XCES (XML version of CES)
– Ide, Bonhomme & Romary (2000)
DIVs, Ps, Ws
S-boundary detection algorithm
– problem with ordinal numbers written with punctuation (Croatian orthography puts a full stop after an ordinal, e.g. "12. festival")
input text formats
– WWW: HTML, XML
– DTP: RTF, DOC, QXD, WP, TXT etc.
conversion
– 2XML: custom-made software
• input: HTML, RTF / output: XML, no header
• two-step conversion by user-defined scripts
• enables high level of automation

HNK 7: corpus format 1
<?xml version="1.0"?>
<!DOCTYPE cesDoc PUBLIC "-//CES//DTD XML cesDoc//EN" "xcesDoc.dtd" [ ]>
<cesDoc version="3.19">
<cesHeader type="text" version="3.19">
  <fileDesc>
    <titleStmt>
      <h.title>Electronic version of Vecernji list, vl990311</h.title>
      <respStmt>
        <respType>XCES markup prepared by</respType>
        <respName>Bosko Bekavac</respName>
      </respStmt>
    </titleStmt>
    <extent>
      <wordCount>4456</wordCount>
      <byteCount>25385</byteCount>
    </extent>
    <publicationStmt>
      <distributor>Project MZT RH 130718</distributor>
      <pubAddress>Institute of linguistics</pubAddress>
      <telephone>+385 1 6120-142</telephone>
      <fax>+385 1 6856-118</fax>
      <eAddress>http://www.ffzg.hr/zzl/zzl-home.htm</eAddress>
      <idno>76676665676</idno>
      <availability status="free"> </availability>
      <pubDate>1999-12-20</pubDate>
    </publicationStmt>
    <sourceDesc>
      <biblStruct>

HNK 7: corpus format 2
<BODY>
<DIV0 type="article">
<HEAD type="nn">U GORICI SVETOJANSKOJ ODRŽAN 12.
FESTIVAL PJEVAČA AMATERA</HEAD>
<HEAD type="na">Ivana osvojila županijski Sanremo</HEAD>
<HEAD type="pn">* Od 20 natjecatelja žiri je najboljom proglasio Ivanu Erdeljac s pjesmom "Crazy", druga je Antoni
<FIGURE>Publici su se najviše svidjeli Marija Šalić i Petar Puhijera</FIGURE>
<P>Pod medijskim pokroviteljstvom "Večernjeg lista" i Radio Jaske, a uz pomoć DIR "Rubinić" kao generalnog te još održan je 12. festival pjevača amatera.</P>
<P>Prve festivalske večeri, na kojoj su nastupila 22 izvođača do 15 godina, prvu nagradu stručnog žirija odnijela Nikolini Oslaković iz Gornje Reke za pjesmu "Neka mi ne svane", a treća Mariji Jurini iz Desinca za pjesmu "Ginem" nagradu dodijelila Natali Rajnović iz Jaske za pjesmu "Don"t ever cry", a treću Aniti Oslaković iz Desinca za pjes pjesmom "Izdali me".</P>
<P>Druga večer - s dvadeset starijih izvođača iz Jaske, Karlovca, Bjelovara, Zagreba i Velike Gorice - bila je oso nije bilo lako odabrati najbolje.</P>
<P>Nakon poduže stanke tijekom koje su izbrojani glasovi - a koju su publici kratili gost večeri Ivo Pattiera te s stručnog žirija, prvu nagradu i zlatnu plaketu "Večernjaka" dobila je Karlovčanka Ivana Erdeljac za vrlo dobro otp a treća Kseniji Cvetetić iz Petrovine za pjesmu "Neka mi ne svane".</P>
<P>Publika je najviše glasova dodijelila svetojansko-zagrebačkom duetu Mariji Šalić i Petru Puhijeri za interpreta mjesto publika je svrstala "Svetojanske tamburaše" koji su nastupili s pjesmom "Dobro jutro", a na treće Zagrepčan
<P>Najboljom debitanticom završne večeri proglašena je Zagrepčanka Marina Posilović s pjesmom "Piši, piši mi", a n suseda, suseda". Čini se da su ovogodišnje nagrade - a bilo ih je doista mnogo, od sedmodnevnog boravka u Opatiji, Oni koji ih nisu dobili, a možda su ih također zaslužili, neka se ovaj put utješe pljeskom publike, a dogodine će županije - nastavlja se.</P>
<BYLINE>N. Godrijan-Videc</BYLINE>
</DIV0>
</BODY>

HNK 8: corpus format 3
tokenization
– TOKENIZER: custom-made software
• input: XML
• output 1: tabbed file for data-base input
• output 2: tokenized XML

output 1 (one token per row: token, document ID, offset, type)
<BODY>                  vl990301gr01    1   X
<DIV0 type="article">   vl990301gr01    7   X
<HEAD type="nn">        vl990301gr01   28   X
U                       vl990301gr01   44   R
GORICI                  vl990301gr01   46   R
SVETOJANSKOJ            vl990301gr01   53   R
ODRŽAN                  vl990301gr01   66   R
12                      vl990301gr01   78   B
.                       vl990301gr01   80   I
FESTIVAL                vl990301gr01   82   R
PJEVAČA                 vl990301gr01   91   R
AMATERA                 vl990301gr01  104   R
</HEAD>                 vl990301gr01  111   X
<HEAD type="na">        vl990301gr01  118   X
Ivana                   vl990301gr01  134   R
osvojila                vl990301gr01  140   R
županijski              vl990301gr01  149   R
Sanremo                 vl990301gr01  165   R
</HEAD>                 vl990301gr01  172   X
<HEAD type="pn">        vl990301gr01  179   X
*                       vl990301gr01  195   I
Od                      vl990301gr01  197   R
20                      vl990301gr01  200   B
natjecatelja            vl990301gr01  203   R
žiri                    vl990301gr01  216   R
je                      vl990301gr01  226   R
najboljom               vl990301gr01  229   R
proglasio               vl990301gr01  239   R
Ivanu                   vl990301gr01  249   R
Erdeljac                vl990301gr01  255   R
s                       vl990301gr01  264   R
pjesmom                 vl990301gr01  266   R
"                       vl990301gr01  275   I
Crazy                   vl990301gr01  276   R
"                       vl990301gr01  281   I
,                       vl990301gr01  282   I
druga                   vl990301gr01  284   R
je                      vl990301gr01  290   R
Antonija                vl990301gr01  293   R
Mikita                  vl990301gr01  302   R
s                       vl990301gr01  309   R
pjesmom                 vl990301gr01  311   R

HNK 9: corpus format 4
output 2: tokenized XML
<BODY>
<DIV0 type="article">
<HEAD type="nn">
<W type="R">U</W> <W type="R">GORICI</W> <W type="R">SVETOJANSKOJ</W> <W type="R">ODRŽAN</W> <W type="B">12</W> <W type="I">.</W> <W type="R">FESTIVAL</W> <W type="R">PJEVAČA</W> <W type="R">AMATERA</W>
</HEAD>
<HEAD type="na">
<W type="R">Ivana</W> <W type="R">osvojila</W> <W type="R">županijski</W> <W type="R">Sanremo</W>
</HEAD>
<HEAD type="pn">
<W type="I">*</W> <W type="R">Od</W> <W type="B">20</W> <W type="R">natjecatelja</W> <W type="R">žiri</W> <W type="R">je</W> <W type="R">najboljom</W> <W type="R">proglasio</W> <W type="R">Ivanu</W> <W type="R">Erdeljac</W> <W type="R">s</W> <W type="R">pjesmom</W>
<W type="I">"</W> <W type="I">"</W> <W type="I">,</W> <W type="R">druga</W> <W type="R">je</W> <W type="R">Antonija</W> <W type="R">Mikita</W> <W type="R">s</W> <W type="R">pjesmom</W> <W type="I">"</W> <W type="R">To</W> <W type="I">"</W> <W type="I">,</W> <W type="R">a</W> <W type="R">treće</W> <W type="R">je</W> <W type="R">mjesto</W> <W type="R">osvojila</W> <W type="R">Ksenija</W> <W type="R">Cvetetić</W>
</HEAD>
<FIGURE>
<W type="R">Publici</W> <W type="R">su</W> <W type="R">se</W> <W type="R">najviše</W> <W type="R">svidjeli</W> <W type="R">Marija</W> <W type="R">Šalić</W> <W type="R">i</W> <W type="R">Petar</W> <W type="R">Puhijera</W>
</FIGURE>
<P>
<W type="R">Pod</W> <W type="R">medijskim</W> <W type="R">pokroviteljstvom</W> <W type="I">"</W> <W type="R">Večernjeg</W> <W type="R">lista</W> <W type="I">"</W> <W type="R">i</W> <W type="R">Radio</W> <W type="R">Jaske</W> <W type="I">,</W> <W type="R">a</W> <W type="R">uz</W> <W type="R">pomoć</W> <W type="R">DIR</W> <W type="I">"</W> <W type="R">Rubinić</W> <W type="I">"</W> <W type="R">kao</W> <W type="R">generalnog</W> <W type="R">te</W> <W type="R">još</W> <W type="R">sedamdesetak</W> <W type="R">drugih</W> <W type="R">sponzora</W> <W type="I">,</W> <W type="R">u</W> <W type="R">petak</W> <W type="R">i</W> <W type="R">u</W> <W type="R">subotu</W> <W type="R">u</W> <W type="R">Gorici</W> <W type="R">Svetojanskoj</W> <W type="R">pokraj</W>

HNK 10: POS annotation 1
Croatian – morphologically rich language
• nouns: 7 cases, 2 numbers, 3 genders
• adjectives: the same + 2 forms (definite & indefinite), 3 degrees of comparison
• adverbs: 3 degrees of comparison
• pronouns: 7 cases, 2 numbers, 3 genders, 3 persons
• numerals: 7 cases, 3 genders
• verbs:
– 2 numbers, 3 persons
– 3 simple, 3 periphrastic tenses (with difference in 3 genders and 2 numbers in participles)
– 2 additional participles
– 2 conditionals
– imperative
– very complex system of aspects (perfective & imperfective/iterative)
a lot of syntactic relations coded by morphology
– POS annotation and lemmatization more important than for e.g. English

HNK 11: POS annotation 2
Croatian morphological lexicon
– 36000 headwords
– GenOblik2 morphological generator, Tadić (1994)
MULTEXT-East MSD recommendation
– 6 CEE languages
– Croatian specification added in 1998
– Erjavec: MULTEXT-East recommendation V 2.0 ?
matching with corpus

lexicon sample (word form, lemma, MSD; "=" marks a word form identical to the lemma)
=             abeceda     Ncfsn
abecede       abeceda     Ncfsg
abecedi       abeceda     Ncfsd
abecedu       abeceda     Ncfsa
abecedo       abeceda     Ncfsv
abecedi       abeceda     Ncfsl
abecedom      abeceda     Ncfsi
abecede       abeceda     Ncfpn
abeceda       abeceda     Ncfpg
abecedama     abeceda     Ncfpd
abecede       abeceda     Ncfpa
abecede       abeceda     Ncfpv
abecedama     abeceda     Ncfpl
abecedama     abeceda     Ncfpi
=             abolicija   Ncfsn
abolicije     abolicija   Ncfsg
aboliciji     abolicija   Ncfsd
aboliciju     abolicija   Ncfsa
abolicijo     abolicija   Ncfsv
aboliciji     abolicija   Ncfsl
abolicijom    abolicija   Ncfsi
abolicije     abolicija   Ncfpn
abolicija     abolicija   Ncfpg
abolicijama   abolicija   Ncfpd
abolicije     abolicija   Ncfpa
abolicije     abolicija   Ncfpv
abolicijama   abolicija   Ncfpl
abolicijama   abolicija   Ncfpi
=             abrazija    Ncfsn
abrazije      abrazija    Ncfsg
abraziji      abrazija    Ncfsd
abraziju      abrazija    Ncfsa
abrazijo      abrazija    Ncfsv
abraziji      abrazija    Ncfsl
abrazijom     abrazija    Ncfsi
abrazije      abrazija    Ncfpn
abrazija      abrazija    Ncfpg
abrazijama    abrazija    Ncfpd
abrazije      abrazija    Ncfpa
abrazije      abrazija    Ncfpv
abrazijama    abrazija    Ncfpl

HNK 12: POS annotation 3

HNK 13: POS annotation 4
automatically annotate a 1 Mw corpus
manual correction
use it as training data for the TnT tagger

Parallel corpora
Croatian-English parallel corpus
Slovene-Croatian parallel corpus
Acquis translations corpus

HR-EN parallel corpus 1
source: Croatia Weekly
– like USA Today: different domains
• politics, economy and finance, tourism, ecology, culture, art, events, sports
– 12 pages, A3
– prepared in Croatian, then translated by a professional translation office
availability
– 118 issues
– started January 1998, finished May 2000
– access to all texts in electronic form in both languages

HR-EN parallel corpus 2
Articles: 4,343
Sentences:
– HR 67,694 (15.59 s/article avg.)
– EN 75,390 (17.36 s/article avg.)
Tokens:
– HR 1,490,964 (22.03 w/s avg.)
– EN 1,796,744 (23.83 w/s avg.)
– Total 3,287,708

HR-EN parallel corpus 3

HR-EN parallel corpus 4
Sentence marking
– </S><S> insertion after punctuation followed by capital letter
– filtered for known exceptions: Mr., Mrs., Miss., dr., St. etc.
– problem of ordinal numbers, which Croatian orthography writes with punctuation
Vanilla aligner alignments
Type   Alignments   Articles           Share
0:1         310     in 235 articles     0.45%
1:0          25     in 12 articles      0.04%
1:1       56783     in 4143 articles   84.12%
1:2        8611     in 3288 articles   12.76%
2:1        1391     in 1012 articles    2.06%
2:2         379     in 345 articles     0.56%
Total alignments: 67499 in 4143 articles

HR-EN parallel corpus 5
encoding problem: How to store alignments?
Tadić (2000): LREC2000
(X)CES way:
– each language in a separate document
– <S id="...">
– pointers to IDs of aligned sentences in 3rd document

HR-EN parallel corpus 6

Acquis translations parallel corpus
Croatia is on its way to becoming a candidate country for the EU
Translation of the AC (acquis communautaire) = the only task common to all candidate countries
translating 200,000 pages of EU OJ into Croatian (ca 60 Mw)
translating 100,000 pages of Croatian legislation into English/French...
Ministry of European integration of the Republic of Croatia
– organizing the translation process
– 200 freelance translators or translation companies
– existing on-line lexical databases (CELEX...): no Croatian terms and/or translation equivalents (TE)
– how to maintain the consistency of translations?
EuroVoc = translated into Croatian
– thesaurus of European Commission terms
Institute of linguistics
– proposal for a joint project of preparation of AC texts for translation
– marking of terms found in EuroVoc and TE suggestion

AC translations parallel corpus 3

AC translations parallel corpus

AC translations parallel corpus 5

AC translations parallel corpus 6

AC translations parallel corpus 7
if we put <S>s and </S>s and give them ID-attributes in both original and translation, we can use the whole of AC as a huge translation memory
parallel corpus aligned at the <S> level = TM
– just a matter of encoding
• alignment and/or <TU> marking
term marking
– <W>-level marking needed
– several encoding solutions

AC translations parallel corpus 8
solution 1: term tags intermixed with corpus data
<P> <S>
<W id="845">The</W>
<term><W id="846">European</W> <W id="847">Parliament</W></term>
<W id="848">may</W>
<W id="849">ask</W>...
</S>... </P>...
problem: non-contiguous multi-W terminological units

AC translations parallel corpus 9
solution 2: term marking in stand-off annotation, i.e. in another XML document linked to corpus data
<P> <S>
<W id="845">The</W>
<W id="846">European</W>
<W id="847">Parliament</W>
<W id="848">may</W>
<W id="849">ask</W>...
</S>... </P>...

<W id="765">Europski</W>
<W id="766">bi</W>
<W id="767">parlament</W>
<W id="768">mogao</W>
<W id="769">tražiti</W>

<term_unit id="en122"> <link xtargets="846 ; 847"/> </term_unit>
<term_unit id="hr345"> <link xtargets="765 ; 767"/> </term_unit>
allows marking of non-contiguous terms

AC translations parallel corpus 10
solution 3: term marking with translation equivalent suggestion
<P> <S>
<W id="845">The</W>
<W id="846">European</W>
<W id="847">Parliament</W>
<W id="848">may</W>
<W id="849">ask</W>...
</S>... </P>...

<W id="765">Europski</W>
<W id="766">bi</W>
<W id="767">parlament</W>
<W id="768">mogao</W>
<W id="769">tražiti</W>

<term_unit id="en122"> <link xtargets="846 ; 847"/> </term_unit>
<term_unit id="hr345"> <link xtargets="765 ; 767"/> </term_unit>
<tu><link xtargets="en122 ; hr345"/></tu>

AC translations parallel corpus 11
XLink
– W3C Working Draft, 2000-02-21 (http://www.w3.org/TR/xlink)
– XML's powerful linking tool
– allows stand-off annotation (Ide et al. 2000)
• no changes in corpus data <= annotation of read-only data
• multimodal corpora annotation
– time-line links
– links of language data with audio or video (paralinguistic data)
Systems using XLink intensively
– MATE workbench (McKelvie et al. 2000)
– LDC (Bird & Liberman 2000)
– ...

Some methodological remarks 1
some skepticism
what exactly do we do by putting annotations in corpora?
– adding secondary data to our primary data in order to be able to retrieve information later
– adding categories selected from a prepared list and applying them to our corpus data
not concerned here with meta-description (usually in headers)
secondary data = result of interpretation of primary data by adding already prepared categories
– we get a lot of information which could not be collected any other way
– could we miss some phenomena which we haven't foreseen in the stage of category preparation?

Some methodological remarks 2
example on the very basic level of word boundary
nikoji, zam. pridj. nijedan, nikakav (pronominal adjective: "none, not any"; Anić, Vladimir: Rječnik hrvatskoga jezika, 1991)
Ni u kojem se slučaju ne smiješ okrenuti! ("Under no circumstances may you turn around!")
oligo- and poly-saccharides...
Ivan je Šikić radosno krenuo nizbrdo. ("Ivan Šikić cheerfully set off downhill.")
– How many words do we have here?
– Is it a trivial question?
– opposition between “graphic words” and lemmas
not to mention syntax and/or semantics

Some methodological remarks 3
putting only one kind of secondary/interpretive data in corpus
– filtering only those linguistic phenomena which we are able to grasp by our already prepared categories
– missing phenomena for which we are not prepared
keeping our secondary/tertiary/... data apart from basic resource data
– allows other researchers to have their own secondary etc. data and different interpretations
– allows us to compare different interpretive data interpersonally and/or automatically
XML and the concept of stand-off annotation give us a tool for that
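
A minimal sketch of how the stand-off term marking from the AC slides (solutions 2 and 3) could be resolved against read-only corpus data. It is an illustration under stated assumptions, not the software used in the project: the <annotations> wrapper element and all function names are hypothetical, attribute values are quoted so that the fragments parse as XML, and Python's standard xml.etree.ElementTree module stands in for whatever XLink-aware tooling would be used in practice.

```python
import xml.etree.ElementTree as ET

# Primary data: tokenized sentences, never modified by the annotation.
EN_CORPUS = """<P><S>
<W id="845">The</W> <W id="846">European</W> <W id="847">Parliament</W>
<W id="848">may</W> <W id="849">ask</W>
</S></P>"""

HR_CORPUS = """<P><S>
<W id="765">Europski</W> <W id="766">bi</W> <W id="767">parlament</W>
<W id="768">mogao</W> <W id="769">tražiti</W>
</S></P>"""

# Secondary data kept in a separate document. The <annotations> root is a
# hypothetical wrapper, added only so that the fragment is a single XML tree.
STANDOFF = """<annotations>
<term_unit id="en122"><link xtargets="846 ; 847"/></term_unit>
<term_unit id="hr345"><link xtargets="765 ; 767"/></term_unit>
<tu><link xtargets="en122 ; hr345"/></tu>
</annotations>"""


def index_words(corpus_xml):
    """Map each <W> id to its word form."""
    return {w.get("id"): w.text for w in ET.fromstring(corpus_xml).iter("W")}


def targets(element):
    """Split the xtargets list of the element's <link> child into ids."""
    return [t.strip() for t in element.find("link").get("xtargets").split(";")]


def resolve_terms(standoff_xml, words):
    """Resolve every <term_unit> to the word forms it points at."""
    root = ET.fromstring(standoff_xml)
    return {u.get("id"): [words[i] for i in targets(u)]
            for u in root.iter("term_unit")}


def resolve_pairs(standoff_xml, terms):
    """Resolve every <tu> to a tuple of term strings, one per linked term unit."""
    root = ET.fromstring(standoff_xml)
    return [tuple(" ".join(terms[i]) for i in targets(tu))
            for tu in root.iter("tu")]


# Both corpora share one id space in this toy example, so the indexes can be merged.
words = {**index_words(EN_CORPUS), **index_words(HR_CORPUS)}
terms = resolve_terms(STANDOFF, words)
pairs = resolve_pairs(STANDOFF, terms)
print(terms)
print(pairs)
```

The Croatian term is resolved from ids 765 and 767 although the clitic "bi" (766) stands between them, which is the non-contiguous case that inline <term> tags cannot express. Because the corpus documents are only read, a second researcher could supply a different stand-off document over the same <W> ids.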