Corpus of manually lemmatised Polish noun and adjective phrases
Licence and credits
The dataset is licensed under Creative Commons Attribution 3.0.
The corpus is based on a random subset of syntactically annotated documents taken from the Polish Corpus of Wrocław University of Technology (KPWr), version 1.1 (http://nlp.pwr.wroc.pl/kpwr).
Annotators: Marcin Oleksy and Jan Wieczorek.
Project coordinator: Adam Radziszewski (name dot surname at pwr.wroc.pl).
Some rights reserved. Wrocław University of Technology, 2013.
The corpus consists of those documents from the KPWr corpus that had already been annotated with syntactic chunks. Two chunk types were considered: NP and AdjP. NP chunks cover both actual noun phrases and prepositional phrases (preposition + NP). AdjP chunks are adjective phrases, which are annotated only when they are not part of a larger NP.
For details on the assumed syntactic annotation principles, please consult the paper available on the KPWr website.
AdjP chunks are infrequent, so the number of instances may be too small for reliable experiments.
Phrase lemmatisation is understood as assigning a base form (lemma) to each phrase instance (of NP or AdjP type). A phrase lemma is a phrase of the same type that could appear in a dictionary or as a keyphrase.
With some exceptions regarding proper names, lemmatisation requires that the syntactic head of the phrase be in the nominative case. Often the number is changed to singular, and sometimes the gender is changed as well. Correct lemmatisation often requires changing more word forms than just the head, e.g. the adjectives that modify the head.
In the case of prepositional phrases, lemmatisation requires removal of phrase-initial prepositions, thus prepositional phrases are “lemmatised to real noun phrases”.
The corpus consists of a number of documents. This package preserves our division into development (dev) and evaluation (eva) data. Both directories contain a few subdirectories corresponding to the original directory structure of KPWr (e.g. the subdirectory named blogi contains documents belonging to the blogs subcorpus of KPWr). The documents are stored in XML files.
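The directory layout described above can be inspected with a short script. The following is only a sketch: it assumes that the two top-level directories are literally named dev and eva and that each document is an .xml file somewhere below a subcorpus subdirectory. The function name count_documents is made up for this illustration.

```python
from collections import Counter
from pathlib import Path


def count_documents(package_root):
    """Count XML documents per (split, subcorpus) pair.

    Assumed layout: <package_root>/<dev|eva>/<subcorpus>/.../*.xml
    (the directory names "dev" and "eva" are an assumption).
    """
    counts = Counter()
    root = Path(package_root)
    for split in ("dev", "eva"):
        split_dir = root / split
        if not split_dir.is_dir():
            continue  # tolerate a package that lacks one of the splits
        for xml_path in sorted(split_dir.rglob("*.xml")):
            parts = xml_path.relative_to(split_dir).parts
            if len(parts) < 2:
                continue  # file placed directly in the split directory
            # The first path component below the split is the subcorpus name.
            counts[(split, parts[0])] += 1
    return counts
```

Calling count_documents on the unpacked package directory returns a Counter keyed by pairs such as ("dev", "blogi"), which gives a quick overview of how many documents each subcorpus contributes to each split.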
Each XML file is stored in the CCL format (see the CCL format specification) and contains the unchanged original KPWr annotation enhanced with lemmatisation information. This package uses the same file naming scheme as KPWr 1.1, so filenames (document ids) can be mapped 1:1 to the original KPWr documents. The original annotation that was present in the CCL files is kept intact. Even where we spotted mislabelled chunk boundaries, we did not correct them at this stage. Instead, the annotators were told to come up with a lemma that would correspond to the actual boundaries.
Note: the original KPWr 1.1 package also contains .rel.xml files that describe inter-chunk and inter-NE relations. We did not copy these files here, but if you need the relation-level annotation, just copy the original .rel.xml files; they will still be valid.
Lemmatisation information is stored in token-level properties. Phrase lemmas are assigned to those tokens that are marked as chunk heads. For example, in the following fragment, the token “otwarciu” is marked as the NP head (<ann chan="chunk_np" head="1">1</ann>) and is assigned the NP lemma (<prop key="chunk_np:lemma">otwarcie WTZ</prop>).
<ann chan="chunk_agp" head="1">1</ann>
<ann chan="chunk_np" head="1">1</ann>
<prop key="chunk_np:lemma">otwarcie WTZ</prop>
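Phrase lemmas stored this way can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree and relies on assumptions about the CCL layout that go beyond the fragment above: each token is taken to be a tok element whose orth child holds the word form, with the ann and prop elements as its direct children. The function name phrase_lemmas is hypothetical.

```python
import xml.etree.ElementTree as ET


def phrase_lemmas(ccl_source):
    """Yield (channel, head_orth, lemma) for each token that carries a phrase lemma.

    ccl_source is a file name or a binary file object.
    Assumes <tok> elements with <orth>, <ann> and <prop> children.
    """
    tree = ET.parse(ccl_source)
    for tok in tree.iter("tok"):
        orth = tok.findtext("orth", default="")
        # Channels in which this token is marked as the chunk head.
        head_channels = {
            ann.get("chan") for ann in tok.findall("ann") if ann.get("head") == "1"
        }
        for prop in tok.findall("prop"):
            # Keys look like "chunk_np:lemma": channel name, colon, "lemma".
            channel, sep, name = prop.get("key", "").partition(":")
            if sep and name == "lemma" and channel in head_channels:
                yield channel, orth, (prop.text or "").strip()
```

For the fragment shown above, a token with the word form “otwarciu” would yield the triple ("chunk_np", "otwarciu", "otwarcie WTZ"). Matching the property prefix against the head channels avoids hard-coding channel names.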
Besides the lemmatisation information, we decided to keep the automatically induced transformations (under the better-performing configuration, i.e. without the ‘lem’ transformation). The transformations for both NP and AdjP chunks are stored under the lem_pattern key (they should be prefixed with the chunk/channel name, but are not in this version; sorry for that). Note: the transformations induced in the dev part were manually corrected where the induction procedure failed completely (this does not guarantee that they are always valid). No manual intervention took place in the evaluation data.
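To get an overview of the stored transformations, the values under the lem_pattern key can be tallied per document. This is a sketch only: it assumes the same token layout as elsewhere in the CCL files (tok elements with prop children) and treats each transformation as an opaque string, since its internal syntax is not described here. The function name lem_pattern_counts is invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def lem_pattern_counts(ccl_source):
    """Count how often each lem_pattern value occurs in one CCL document.

    The key carries no channel prefix in this version, so NP and AdjP
    transformations are counted together.
    """
    counts = Counter()
    for tok in ET.parse(ccl_source).iter("tok"):
        for prop in tok.findall("prop"):
            if prop.get("key") == "lem_pattern":
                counts[(prop.text or "").strip()] += 1
    return counts
```

Summing the resulting counters over the dev documents gives the distribution of the manually corrected transformations, which can then be compared with the uncorrected ones in the evaluation data.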