DALIA version 2.1
Francesco Zamblera
April 25th, 2006
____________________________________________________________

Table of Contents

1. Introduction.
2. Getting started with DALIA.
   2.1 How to install DALIA.
       2.1.1 How to install a precompiled binary.
       2.1.2 How to compile from the sources.
   2.2 The configuration files.
       2.2.1 lingvoj.txt
       2.2.2 arities.txt
       2.2.3 ug_verb.txt, ug_noun.txt, ug_prep.txt
3. How to run DALIA
   3.1 The menu
4. Grammars
   4.1 Naming conventions
   4.2 Lexicon
       4.2.1 Paraphrase rules
             4.2.1.1 Type of paraphrase rules
       4.2.2 Dialect
   4.3 .pho files
5. Things to come
______________________________________________________________________

1. Introduction.

DALIA is a parser, generator and translator for minimalist languages. The program is released under the GNU GPL. DALIA's source code is written in PASCAL and compiled using the GNU FreePASCAL compiler. This document explains how to use DALIA.

2. Getting started with DALIA.

2.1. How to install DALIA.

To install the program, you can either download a precompiled binary, if there is one for your platform, or compile from the sources. For version 2.1, precompiled binaries are available for FreeDOS, Linux (i386) and Windows.

2.1.1. How to install a precompiled binary.

To install a precompiled binary, just extract the binary .zip file to the folder you want, and run DALIA from that directory. DALIA needs some configuration files to be present in the same directory as the executable. These files are already included in the package you downloaded.

On Linux, remember to create a new folder called "temp" in the same directory before you run DALIA. If the Linux version does not run, try changing the permissions of the files (for example, making the binary executable).

2.1.2. How to compile from the sources.

To compile DALIA's source code you will first need to download and install the FreePASCAL compiler for your platform (it has recently been released for Mac too).
To compile DALIA:

  o Get the FreePASCAL compiler and install it following its instructions;

  o Edit the file OS_SPEC.pas according to the needs of your operating system (for Linux, Windows and DOS you only have to uncomment the appropriate lines and comment out those of the other operating systems);

  o Compile the source code;

  o For DALIA to work, download the "grammars.zip" file from DALIA's distribution, unpack it, and move the compiled binary to that folder.

2.2. The configuration files.

To work, DALIA needs the following configuration files:

  o ``lingvoj.txt'', a list of the available languages. Each language must have a .lex and a .pho file;

  o ``arities.txt'', a list of the arity prefixes;

  o ``ug_verb.txt, ug_noun.txt, ug_prep.txt'', three files which contain the predicate hierarchy.

You can customize the configuration files to suit your needs. These files have the following structure:

2.2.1. lingvoj.txt

Each line of this file contains an entry for a language. Each entry begins with the name of the language (there are no restrictions on the name), followed by the language code between square brackets, for example:

     English [eng]
     Quechua (Bolivian) [qub]

Codes should be at most three characters long, and are case-sensitive. You can also use a complex code: after the language code, put a colon, then the dialect code. For example,

     Quechua (Bolivian) [qu:bol]
     Quechua (Ecuadorian) [qu:eq]

See ``Dialects'' for more information about how DALIA can handle several dialects of the same language.

2.2.2. arities.txt

This file must contain the arity labels 0, 1 and 2, to which you can add the prefixes you choose as aliases for predicate arities. Each predicate used in semantic representations must have an arity, that is, a label specifying the arguments the predicate can take. Predicates are written

     arity\category:predicate

where category can be omitted. The file arities.txt must contain three lines.
The first lists the aliases for 0-arity predicates; the second and the third list the aliases for 1-arity and 2-arity predicates, respectively. There can be other (comment) lines, which must begin with the character %. Example:

     %ARITY 0
     e d 0
     %ARITY 1
     E D p 1
     %ARITY 2
     r 2

2.2.3. ug_verb.txt, ug_noun.txt, ug_prep.txt

These files contain the hierarchy of functional heads in the verbal, nominal and adpositional projections, respectively. The hierarchy is listed in increasing order: lower elements are written before higher ones, so the first element is the lowest in the hierarchy and the last is the highest. Each line contains just one entry, namely a category belonging to that hierarchy. See the individual files in the distribution.

The hierarchy of functional heads is needed to expand the LF of a sentence for translation into another language.

3. How to run DALIA

To run DALIA, invoke dalia_21 at a command line prompt (after switching to the directory that contains the executable and the configuration files). On Linux, you will have to type ./dalia_21. On Windows, you can simply double-click on the icon.

You can run DALIA in verbose mode by invoking the program with the option -v. In verbose mode, you can follow the derivations step by step (at each step, you have to hit a key to go on). DALIA then prints a lot of information about the derivation, both to the screen and to the file output.log.

3.1. The menu

When you run DALIA, you first get information about copyright (GNU GPL). Then a series of menus appears, which guides you through the various options (parsing, generation, translation or simple morphological analysis; input from file or from keyboard; output to screen or to file; language selection). You should have no problem following the menus.

4. Grammars

A grammar of a language consists of two files: a .lex file and a .pho file.

4.1. Naming conventions

The name of the .lex file must be the same as the language code you provided in the file ``lingvoj.txt''.
If you have a complex code (language:dialect), you must use only the language code (without the dialect extension), because all dialects use the same .lex file. However, each dialect has a separate .pho file, which is named after the full code, with the colon replaced by an underscore (language_dialect). So, if you have these two entries in the file lingvoj.txt

     Quechua (Ecuadorian) [qu:ecu]
     Quechua (Bolivian) [qu:bol]

you will need the following three files:

  o qu.lex

  o qu_ecu.pho

  o qu_bol.pho

4.2. Lexicon

A minimalist grammar for a particular language consists only of a lexicon, as syntax is universal. The lexicon is contained in the .lex file. Each line is an entry (you can leave blank lines, and you can insert comments - a comment begins when a % is found, and extends to the end of the line). Each entry has three parts:

  1. The semantic representation of the lexical entry (henceforth LF);

  2. The phonetic form of the lexical entry (PF), which can be empty;

  3. The morphological features (MF).

Braces, brackets and parentheses are used to delimit LF, PF and MF, respectively. So, a lexical entry looks like:

     {A\ACT:EAT} [eat] (=k v)

where:

  o [eat] is the phonetic form of the entry;

  o A\ACT:EAT is the predicate EAT, belonging to the category ACT and having arity 2 (we must define A as an alias for arity 2 in the file ``arities.txt'');

  o (=k v) are the morphological features: the entry carries the feature =k and is a verb (v).

4.2.1. Paraphrase rules

If a lexical entry has empty MF, the system interprets it as a paraphrase rule. In a paraphrase rule, braces enclose a string of predicates (possibly a single predicate), and brackets delimit a second string of predicates. Paraphrase rules are scanned at the beginning of the generation, and if the semantic representation matches the first string, the second is substituted for it. Paraphrase rules are applied in the order in which they are found.
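
The entry layout and the ordered application of paraphrase rules described above can be illustrated with a short Python sketch. It is not DALIA's own (PASCAL) implementation, only an illustration under stated assumptions: the names parse_entry and apply_paraphrases are hypothetical, predicates inside a string are assumed to be separated by spaces, a rule is assumed to match when its first string occurs as a contiguous run of predicates, and the DEVOUR/MUCH rule is an invented example.

```python
import re

# Assumed layout of one .lex line: {LF} [PF] (MF), optionally followed by
# other material; a '%' starts a comment that runs to the end of the line.
ENTRY_RE = re.compile(
    r"^\{(?P<lf>[^}]*)\}\s*\[(?P<pf>[^\]]*)\]\s*\((?P<mf>[^)]*)\)"
)


def parse_entry(line):
    """Return (lf, pf, mf) for a lexicon line, or None for blank/comment lines."""
    line = line.split("%", 1)[0].strip()  # drop the comment, if any
    if not line:
        return None
    match = ENTRY_RE.match(line)
    if match is None:
        raise ValueError("not a lexical entry: " + line)
    return tuple(match.group(name).strip() for name in ("lf", "pf", "mf"))


def apply_paraphrases(predicates, rules):
    """Apply each (first, second) rule in the order given.

    Every contiguous occurrence of `first` in the predicate list is
    replaced by `second`; later rules see the output of earlier ones.
    """
    result = list(predicates)
    for first, second in rules:
        if not first:
            continue  # a rule with an empty first string can never match
        rewritten = []
        i = 0
        while i < len(result):
            if result[i:i + len(first)] == first:
                rewritten.extend(second)
                i += len(first)
            else:
                rewritten.append(result[i])
                i += 1
        result = rewritten
    return result


# A tiny lexicon: one ordinary entry and one hypothetical paraphrase rule
# (empty MF), which rewrites DEVOUR as EAT plus MUCH.
lexicon = [
    r"{A\ACT:EAT} [eat] (=k v)   % ordinary entry",
    r"{A\ACT:DEVOUR} [A\ACT:EAT e\MUCH] ()   % paraphrase rule",
]
entries = [e for e in map(parse_entry, lexicon) if e is not None]

# In a paraphrase rule, braces and brackets both hold strings of predicates.
rules = [(lf.split(), pf.split()) for lf, pf, mf in entries if mf == ""]

print(apply_paraphrases([r"A\ACT:DEVOUR", r"e\APPLE"], rules))
```

Because the rules are kept in a plain list, the order of application is simply the order in which the entries appear in the lexicon.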
When DALIA is started and the language is selected, all the paraphrase rules are extracted from the lexicon and written to a special file (which has the same name as the lexicon, with suffix "$p"). Then another file, called the inverse paraphrase file, is built, with the paraphrase rules in the inverse order. While the paraphrase file is used in the generation process, the inverse paraphrase file is used in parsing. Paraphrase rules whose second string is empty are not inserted in the inverse paraphrase file.

4.2.1.1. Type of paraphrase rules

  o Automatic paraphrase rules are inserted by the system. They consist of pronominalization rules: for each lexical item with arity prefix "e" (entity), a paraphrase is inserted which replaces the first string with the second:

     {e\PRED p\PN} [d\PN]

  o Paraphrase rules are characterized by empty MF. If the empty MF is written () the paraphrase is inserted in the paraphrase file before the automatic paraphrases.

  o If the empty MF is written as (#) the paraphrase is inserted after the automatic paraphrases.

  o If the empty MF is written as (^) the corresponding paraphrase is inserted only in the paraphrase file, and not in the inverse paraphrase file. That is, it will be applied only by the generator, and not by the parser.

  o If the empty MF is written as (*) then the paraphrase is a neutralization rule: the generator will apply it like a simple paraphrase. The parser, however, when it encounters a neutralization rule, will derive two final LFs: one in which the neutralization rule has been applied, and one without the application of that rule.

4.2.2. Dialect

Dialects use the same .lex file. By default, all entries are common to every dialect. To restrict an entry to a particular dialect, the entry must be followed, on the same line, by two asterisks and the dialect code. Example: suppose we are writing a lexicon for both Bolivian and Ecuadorian Quechua.
Then, if the language codes are, respectively, qu:bol and qu:ecu, we can have entries like these:

     {d\I} [nyuka] (n:1) **ecu
     {d\I} [nuqa] (n:1) **bol

4.3. .pho files

A .pho file is interpreted by the finite-state transducer which starts the parsing process (by splitting the input string into morphemes, which are then fed into the chart) and ends the generation process (by deriving the final phonetic form from the morpheme string). The finite-state transducer will be released separately with its own documentation. Here, some hints for the DALIA user follow:

  o A .pho file is a list of patterns. Each line is a pattern (blank lines can be inserted, as well as comments, started by %, as for the lexicon). Each pattern defines a correspondence between a lexical and a surface form. The lexical form is the one used in the phonetic form of the lexical entries, while the surface form is the actual form we want to derive (in the generation process) or from which we start (in the parsing process).

  o A .pho file can also contain variable declarations. A variable declaration starts with the symbol & followed (without any space) by the name of the variable, a space, a "=" sign, a space, then a list of the possible values of the variable, separated by commas (without spaces). For example,

     &C = p,t,k,b,d,g,m,n,N,y,w
     &STOP = p,t,k,b,d,g
     &V = a,i,u

  o In the patterns, variables can be used with numerical indices. For example, you can write

     V1kV2 V1gV2

    to mean that lexical "k" corresponds to surface "g" between vowels. If you write the previous pattern as

     VkV VgV

    then "k" is realized as "g" only between two identical vowels.

  o Example: English will contain two entries with phonetic form [-s] (one of them being the plural morpheme).
    The transducer will contain the following correspondence, to ensure that -s is realized as "es" after a sibilant:

     s-s ses *

  o An asterisk is used, as in the previous example, to make a pattern obligatory: that pattern is always applied when it matches the input, and any alternative patterns which match the same input are discarded. Without the asterisk, you would get both "houses" and "houss" as surface forms for the lexical "hous-s".

  o Right contexts can be expressed in both the lexical and the surface column of the pattern, separated from the rest of the form by a backslash. The previous English example could also be written as

     s\-s se *

  o A wildcard, written "." or "...", can be used in the contexts of the patterns to match any character or sequence of characters. Example:

     s-\n...sh sh *

    This means that lexical "s" is realized as surface "sh" in the context n ... sh. The dots don't match a word boundary.

  o A word boundary is written as "#" in the lexical part of the pattern, and as a double underscore in the surface part.

5. Things to come

Here are some desiderata for DALIA's next release (2.2):

  o More extensive documentation, with more examples;

  o A series of command-line options, in order to run DALIA in non-interactive mode;

  o Fuller grammar modules (the demos included here are just that).