IOBBER is a chunker for Polish. Its job is to recognise syntactic phrases (chunks) in Polish text.
The name comes from the IOB tags that are assigned to tokens to represent chunks (strictly speaking, we use the IOB2 representation).
Here is an example sentence annotated with NP and VP chunks:
[Dziennikarka]NP [zarzucała]VP [Rutkowskiemu]NP [to]NP, że [całe jego działanie ws. zaginięcia]NP [to]VP [„show”]NP
IOBBER is a reimplementation of the CRF++ chunker available in Disaster. Thanks to the new implementation, it is
- fast (~4.5k tokens/s on an i7),
- able to read multiple input formats and output in 2 formats for chunked corpora (+ a diagnostic one called
- suitable for pipeline processing,
- easily extensible: features for classification are expressed in the WCCL language,
- NEW: IOBBER also recognises chunks' syntactic heads.
Note: the above processing speed refers to chunking of already morphosyntactically tagged input. We also provide a tool called iobber_txt that passes plain text through a tagger and then the chunker. The tool employs the WCRFT tagger, so this configuration is slower, limited by the speed of WCRFT.
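To make the IOB2 representation concrete, here is a minimal sketch that converts chunk spans into per-token IOB2 tags: the first token of every chunk gets a `B-` tag, the following tokens of the same chunk get `I-`, and tokens outside any chunk get `O`. The function name, the span format and the tokenisation of the example sentence are assumptions made for illustration; this is not IOBBER's internal code.

```python
def spans_to_iob2(n_tokens, spans):
    """Convert (start, end, label) spans with an exclusive end into IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label          # IOB2: every chunk starts with B-
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the chunk
    return tags


# Assumed tokenisation of the example sentence shown earlier on this page.
tokens = ["Dziennikarka", "zarzucała", "Rutkowskiemu", "to", ",", "że",
          "całe", "jego", "działanie", "ws.", "zaginięcia", "to", "„show”"]
spans = [(0, 1, "NP"), (1, 2, "VP"), (2, 3, "NP"), (3, 4, "NP"),
         (6, 11, "NP"), (11, 12, "VP"), (12, 13, "NP")]

iob2_tags = spans_to_iob2(len(tokens), spans)
for token, tag in zip(tokens, iob2_tags):
    print(token, tag)
```

Note that the comma and "że" stay outside any chunk and therefore receive the `O` tag.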
- Adam Radziszewski, Adam Pawlaczek, 2013. „Incorporating head recognition into a CRF chunker”. In: IIS 2013, Warsaw, Poland, June 17-18, 2013. [Full text]
- Adam Radziszewski, 2012. „Metody znakowania morfosyntaktycznego i automatycznej płytkiej analizy składniowej języka polskiego” (PhD Thesis / Doktorat), Politechnika Wrocławska. [Full text]
- Adam Radziszewski, Marek Grác, 2013. „Using Low-Cost Annotation to Train a Reliable Czech Shallow Parser”. In: Text, Speech, and Dialogue, 575–582. Springer Berlin Heidelberg. [Full text]
- Adam Radziszewski, Marek Maziarz, Jan Wieczorek, 2012. „Shallow syntactic annotation in the Corpus of Wrocław University of Technology”. Cognitive Studies 12. [Full text]
KPWr configuration and supplied trained model¶
The software comes with a default configuration and a model trained on the Wrocław University of Technology Corpus (KPWr), which defines the following chunks:
- Noun Phrases (chunk_np): possibly complex noun and prepositional phrases (both are labelled NP here), limited to clause boundaries. Also, top-level coordination is always split (i.e. if the coordinated elements have no common syntactic superordinate NP, they constitute separate chunks).
- Adjective Phrases (chunk_adjp): top-level adjective phrases, i.e. annotated only when not modifying any superordinate NP.
- Verb Phrases (chunk_vp): (complex) verbs + adverbs that clearly modify the verbs + infinitive modifiers. Nominal arguments are not included; they constitute separate chunks.
- Agreement Phrases (chunk_agp): simple noun or adjective phrases based on morphological agreement on number, gender and case, possibly also containing indeclinable elements that modify other parts of a chunk. AgPs are based on local accommodations, while NPs, AdjPs and VPs are based on sentence predicate-argument structure.
The above set of chunks is grouped into two layers: one for Agreement Phrases, the other for NPs, VPs and AdjPs together (chunks within one layer should not overlap, while overlaps across layers do happen).
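The layer constraint can be illustrated with a small sketch: chunks are modelled as token spans, and an overlap check is run within each layer and then across layers. The layer names, the span format and the example spans are hypothetical; only the chunk type names come from the list above.

```python
from itertools import combinations


def overlapping_pairs(chunks):
    """Return pairs of chunks whose (start, end) token spans intersect.

    Spans are half-open: `end` is the index just past the last token.
    """
    return [
        (a, b)
        for a, b in combinations(chunks, 2)
        if a[0] < b[1] and b[0] < a[1]
    ]


# Hypothetical annotation of a 6-token sentence, grouped into two layers.
layers = {
    "agreement": [(0, 2, "chunk_agp"), (3, 5, "chunk_agp")],
    "syntactic": [(0, 5, "chunk_np"), (5, 6, "chunk_vp")],
}

# Within a layer no two chunks should overlap.
within = {name: overlapping_pairs(chunks) for name, chunks in layers.items()}

# Across layers overlaps are expected: here both AgPs lie inside the NP.
cross = overlapping_pairs(layers["agreement"] + layers["syntactic"])

print(within)
print(cross)
```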
Input to IOBBER when using this configuration must be morphosyntactically tagged (we recommend the new WCRFT tagger).
After successful installation of the software, the config and model for KPWr will also be available for instant use. For instance, to annotate XCES-encoded tagged input:
iobber kpwr.ini -d model-kpwr11-H my_xces_input.xml -i xces -O ccl_chunked_output.xml
Installation of IOBBER itself is optional. You can also run the underlying main Python module straight away (given the dependencies are installed). In such a case, you have to provide full paths to the config and model directory, e.g. (assuming the IOBBER source tree is in
python workspace/iobber/iobber/iobber.py workspace/iobber/iobber/data/kpwr.ini -d workspace/iobber/iobber/data/model-kpwr11-H/ my_xces_input.xml -i xces -O ccl_chunked_output.xml
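If you call the chunker from a Python script, the two command lines above can be assembled programmatically. The sketch below only builds the argument list from the options already shown (`-d`, `-i`, `-O`); the helper names are made up for illustration, and actually running the command assumes that `iobber` (or the given script path) and its dependencies are available.

```python
import subprocess


def build_iobber_command(config, model_dir, input_path, output_path,
                         input_format="xces", executable=("iobber",)):
    """Assemble the iobber command line using the options shown above."""
    return list(executable) + [
        config,
        "-d", model_dir,
        input_path,
        "-i", input_format,
        "-O", output_path,
    ]


def run_iobber(command):
    """Run the assembled command; raises CalledProcessError on failure."""
    subprocess.run(command, check=True)


# Installed variant: config and model are referred to by name.
installed_cmd = build_iobber_command(
    "kpwr.ini", "model-kpwr11-H", "my_xces_input.xml", "ccl_chunked_output.xml")

# Source-tree variant: full paths, run through the Python interpreter.
source_cmd = build_iobber_command(
    "workspace/iobber/iobber/data/kpwr.ini",
    "workspace/iobber/iobber/data/model-kpwr11-H/",
    "my_xces_input.xml", "ccl_chunked_output.xml",
    executable=("python", "workspace/iobber/iobber/iobber.py"))

print(" ".join(installed_cmd))
print(" ".join(source_cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.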
To process plain text, it is recommended to also install the WCRFT tagger (along with a trained tagger model). With it installed, you will be able to use the iobber_txt tool bundled with IOBBER, e.g.
echo 'Polacy wciąż jadają zbyt mało ryb.' | iobber_txt -
More information may be found in the User guide.
NKJP configuration and model (NEW)¶
We have also prepared a pre-trained model that was trained on simplified syntactic data from the National Corpus of Polish.
Read more here.
Training and config files¶
The underlying algorithm is based on Conditional Random Fields and may be trained on corpora with manual chunk annotation. The chunker is able to adapt to various chunk definitions.
IOBBER is parametrised with a config file that specifies the following (in the file names below, CONFIG stands for any name):
- Tagset to use (the name should match a .tagset file available in the system or current dir, as sought by the corpus2 library)
- Is the IOBBER input morphosyntactically tagged or just morphologically analysed (i.e. multiple tags per token, not disambiguated)
- Layers to use; each layer corresponds to a separate chunking problem that may involve one or a few chunk types. Chunks at a layer may not overlap with each other but may overlap with chunks at other layers. Layer definition consists of names of chunk types (so-called channels, e.g.
- Parameters passed to the CRF++ classifier (optional).
- (Defined in the CONFIG.ccl file) WCCL file that specifies all features that may be used for classification. Features common to all the layers should be put under the default section. Note that the feature values will subsequently be processed with CRF++ template files, hence it is not necessary to create separate feature instances for the same feature referring to different positions (e.g. class), as this may be done later in the template file.
For each layer, a feature template must be given (CONFIG-LAYER.txt) using the CRF++ syntax. See kpwr.ini and its referenced WCCL files and feature templates for a working example.
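As an illustration of how one feature can be referred to at several positions in a template file, the sketch below generates CRF++ unigram template lines of the form `Uxx:%x[row,col]`, where `row` is the token offset relative to the current token and `col` is the feature column. Which column a given WCCL feature ends up in is an assumption here (column 0 is used arbitrarily), and the helper name is made up; consult the supplied KPWr templates for the real layout.

```python
def unigram_templates(column, offsets, start_id=0):
    """Generate CRF++ unigram templates for one feature column.

    Each line has the form 'Uxx:%x[row,col]': `row` is the token offset
    relative to the current position, `col` is the feature column index.
    """
    return [
        "U{:02d}:%x[{},{}]".format(start_id + i, offset, column)
        for i, offset in enumerate(offsets)
    ]


# One feature (assumed to sit in column 0) observed in a window of
# two tokens to the left and two to the right of the current token.
template_lines = unigram_templates(column=0, offsets=range(-2, 3))
template_lines.append("B")  # CRF++ bigram template over adjacent output tags

print("\n".join(template_lines))
```

This is why a single feature definition in the WCCL file is enough: the window of positions is expressed in the template, not in the feature itself.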
The chunk-annotated documents from KPWr 1.1 merged into one CCL file are available here (Creative Commons 3.0 Unported Licence, see KPWr website for details). Note that this dump contains only chunks and their heads, while syntactic relations were discarded. If you're interested in full annotation, please download the whole KPWr package.
1 Depending on the config, morphosyntactically analysed input may be sufficient. The default config for KPWr requires fully tagged input.
How to obtain the code and install¶
IOBBER is available under GNU LGPL 3.0 through our public Git repository:
git clone http://nlp.pwr.wroc.pl/iobber.git
The software is written in Python using a couple of C++ libraries with Python wrappers. It has only been tested under GNU/Linux.
The following dependencies are needed:
- Corpus2 with Python support
- WCCL installed with Python support (requires swig and Python headers; also requires Corpus2)
- CRF++ with Python support (install CRF++ itself first, then enter the python subdir and install the Python wrappers)
(All these dependencies are also required by the WCRFT tagger; if you use WCRFT and it is working, you already have all the dependencies installed correctly.)
If you want to be able to process plain text directly, please also install the WCRFT tagger itself (along with its trained model, which can be downloaded from the WCRFT site).
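A quick way to verify that the Python wrappers are visible to your interpreter is to try importing them. The sketch below assumes the wrappers are importable as `corpus2`, `wccl` and `CRFPP`; these module names are an assumption and should be adjusted if your installation differs. Run it with the same interpreter that will run IOBBER.

```python
def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed module names of the Python wrappers for the dependencies above.
ASSUMED_WRAPPERS = ["corpus2", "wccl", "CRFPP"]

missing = missing_modules(ASSUMED_WRAPPERS)
if missing:
    print("Not importable: " + ", ".join(missing))
else:
    print("All assumed wrapper modules can be imported.")
```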
If the above packages have been correctly installed, the installation of iobber is simple:
sudo python setup.py install
This will install the Python modules (the iobber package), the iobber executable, the default configuration for KPWr, and a ready-to-use trained model.
NOTE: installation is recommended, but not necessary. You can also run the main Python module (iobber/iobber/iobber.py) directly. In such a case, you will need to give the full path to the employed configuration (e.g. iobber/iobber/data/kpwr.ini) and the path to the directory with the trained model (e.g.
Prepared VM image: if you want to try out IOBBER without having to install its dependencies, we also provide a convenient Virtual Machine image with all the necessary dependencies installed (and more). More information at the bottom of this page.
Contact and bug reporting¶
To report bugs and feature requests, use our bugtracker.
Comments and discussion are also welcome; please contact the author (Adam Radziszewski,