4. Grammars

 

A grammar of a language consists of two files: a .lex file and a .pho file.

 

 

4.1. Naming conventions

 

The name of the .lex file must be the same as the language code you provided in the file ``lingvoj.txt''.

If you have a complex code (language:dialect), you must use only the language code (without the dialect extension), because all dialects use the same .lex file.

However, each dialect has a separate .pho file, which is named by the full code (language_dialect), with the colon replaced by an underscore. So, if you have two such entries in the file lingvoj.txt

 

Quechua (Ecuadorian) [qu:ecu]

Quechua (Bolivian) [qu:bol]

 

you will need the following three files:

 

o qu.lex

o qu_ecu.pho

o qu_bol.pho
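
The naming rule above can be sketched in a few lines of Python. This is only an illustration, not part of DALIA: the function name grammar_files is hypothetical, and the sketch assumes that a code without a dialect part simply yields a .pho file named after the language code.

```python
def grammar_files(code):
    """Return the (.lex, .pho) file names for a code from lingvoj.txt.

    `code` is either "language" or "language:dialect".
    """
    # All dialects share the .lex file, so only the language part is used.
    language = code.split(":", 1)[0]
    # The .pho file uses the full code, with the colon replaced by an underscore.
    return language + ".lex", code.replace(":", "_") + ".pho"


if __name__ == "__main__":
    for code in ("qu:ecu", "qu:bol"):
        print(code, "->", grammar_files(code))
```

Both Quechua codes map to the same qu.lex but to different .pho files, which gives the three files listed above.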

 

 

4.2. Lexicon

 

A minimalist grammar for a particular language consists only of a lexicon, as syntax is universal.

The lexicon is contained in the .lex file. Each line is an entry. You can leave blank lines and insert comments: a comment begins at a % character and extends to the end of the line.

Each entry has three parts

 

1. The semantic representation of the lexical entry (henceforth LF);

2. The phonetic form (which can be empty) of the lexical entry (PF);

3. The morphological features (MF).

 

Braces, brackets and parentheses delimit, respectively, LF, PF and MF. So, a lexical entry looks like:

 

{A\ACT:EAT} [eat] (=k v)

 

where:

 

o [eat] is the phonetic form (PF) of the entry;

 

o A\ACT:EAT is the semantic representation (LF): the predicate EAT, belonging to the category ACT, and having arity 2 (A must be defined as an alias for arity 2 in the file ``arities.txt'');

 

o (=k v) gives the morphological features (MF): the entry has the feature =k and is a verb (v).
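
As a rough illustration of the entry format, the following Python sketch splits one lexicon line into its three parts. It is not DALIA's own reader: the function name parse_entry is hypothetical, and the sketch assumes that the three parts appear in the order shown in the example entry (braces, brackets, parentheses), that % never occurs inside an entry, and that nothing else needs to be read from the line.

```python
import re

# One entry: {LF} [PF] (MF), in the order used by the example entry.
ENTRY = re.compile(
    r"\{(?P<lf>[^}]*)\}\s*\[(?P<pf>[^\]]*)\]\s*\((?P<mf>[^)]*)\)"
)


def parse_entry(line):
    """Return a dict with keys lf, pf, mf, or None for a blank/comment line."""
    # A comment starts at % and runs to the end of the line.
    text = line.split("%", 1)[0].strip()
    if not text:
        return None
    match = ENTRY.match(text)
    if match is None:
        raise ValueError("not a lexical entry: " + repr(line))
    return {"lf": match["lf"], "pf": match["pf"], "mf": match["mf"]}


if __name__ == "__main__":
    print(parse_entry(r"{A\ACT:EAT} [eat] (=k v)  % a verb"))
```

An entry whose parentheses are empty comes back with an empty mf string, so the same function can also be used to spot the paraphrase rules described in the next section.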

 

 

4.2.1. Paraphrase rules

 

If a lexical entry has an empty MF, the system interprets it as a paraphrase rule. In a paraphrase rule, braces enclose a string of predicates (possibly a single predicate), and brackets delimit a second string of predicates. Paraphrase rules are scanned at the beginning of generation, and if the semantic representation matches the first string, the second is substituted for it.

Paraphrase rules are applied in the order they are found.
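
Why the order matters can be seen in a small Python sketch. It is a simplification made up for this text, not DALIA's matcher: it assumes that predicates are separated by spaces and that a rule matches only a literal, contiguous run of predicates. The function name apply_rules and the predicates A, B, C... are hypothetical.

```python
def apply_rules(lf, rules):
    """Apply (first, second) paraphrase rules to `lf`, in the given order."""
    tokens = lf.split()
    for first, second in rules:
        pattern, replacement = first.split(), second.split()
        if not pattern:
            continue  # nothing to match
        i = 0
        while i + len(pattern) <= len(tokens):
            if tokens[i:i + len(pattern)] == pattern:
                # Substitute the second string for the first one.
                tokens[i:i + len(pattern)] = replacement
                i += len(replacement)
            else:
                i += 1
    return " ".join(tokens)


if __name__ == "__main__":
    rules = [("A B", "C"), ("C", "D E")]
    print(apply_rules("X A B Y", rules))        # first rule feeds the second
    print(apply_rules("X A B Y", rules[::-1]))  # reversed order: no feeding
```

With the rules in the first order, the output of the first rule is rewritten again by the second; with the order reversed, it is not.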

When DALIA is started and the language is selected, all the paraphrase rules are extracted from the lexicon and written to a special file (which has the same name as the lexicon, with the suffix "$p").

Then another file, called the inverse paraphrase file, is built, with the paraphrase rules put in the inverse order. While the paraphrase file is used in the generation process, the inverse paraphrase file is used in parsing.

Paraphrase rules whose second string is empty are not inserted in the inverse paraphrase file.

 

4.2.1.1. Types of paraphrase rules

 

o Automatic paraphrase rules are inserted by the system. They are pronominalization rules: for each lexical item with arity prefix "e" (entity), a paraphrase is inserted which replaces

 

{e\PRED p\PN}

 

with

 

[d\PN]

 

o Paraphrase rules in the lexicon are characterized by an empty MF. If the empty MF is written as

 

()

 

the paraphrase is inserted in the paraphrase file before the automatic paraphrases.

 

o If the empty MF is written as

 

(#)

 

the paraphrase is inserted after the automatic paraphrases.

 

o If the empty MF is written as

(^)

 

the corresponding paraphrase is inserted only in the paraphrase file, and not in the inverse paraphrase file. That is, it will be applied only by the generator, and not by the parser.

 

o If the empty MF is written as

 

(*)

 

then the paraphrase is a neutralization rule: the generator applies it like a simple paraphrase. The parser, however, when it encounters a neutralization rule, derives two final LFs: one in which the neutralization rule has been applied, and one in which it has not.
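
How the inverse paraphrase file relates to the paraphrase file can be sketched in Python as follows. This is a guess at the bookkeeping, not DALIA's code: the function name inverse_paraphrases and the tuple layout (first string, second string, MF marker) are hypothetical. The sketch only reverses the order of the rules and drops the ones that must not reach the parser; whether the two strings of each rule are also swapped is not stated here, so they are left as they are.

```python
def inverse_paraphrases(rules):
    """Build the rule list for the parser from the generator's rule list.

    Each rule is a (first, second, marker) tuple, where marker is the
    content of the empty MF: "", "#", "^" or "*".
    """
    kept = [
        (first, second, marker)
        for first, second, marker in rules
        # Rules with an empty second string are not inserted.
        if second.strip()
        # Rules marked (^) are used by the generator only.
        and marker != "^"
    ]
    # The remaining rules are put in the inverse order.
    return kept[::-1]


if __name__ == "__main__":
    rules = [
        ("A", "B", ""),
        ("C", "", "#"),
        ("D", "E", "^"),
        ("F", "G", "*"),
    ]
    for rule in inverse_paraphrases(rules):
        print(rule)
```

In this made-up rule list, only the first and the last rule survive, and they come out in reverse order.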

 

 

4.2.2. Dialect

 

Dialects use the same .lex file. By default, all entries are common to every dialect. To restrict an entry to a particular dialect, the entry must be followed, on the same line, by two asterisks and the dialect code.

Example: suppose we are writing a lexicon for both Bolivian and Ecuadorian Quechua. Then if the language codes are, respectively, qu:bol and qu:ecu, we can have entries like these:

 

{d\I} [nyuka] (n:1) **ecu

{d\I} [nuqa] (n:1) **bol
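
 

Selecting the entries that are valid for one dialect can be sketched in Python like this. The sketch is only an illustration: the function names split_dialect and entries_for are hypothetical, and it assumes that the first double asterisk on a line always introduces the dialect code.

```python
def split_dialect(line):
    """Split a lexicon line into (entry, dialect); dialect may be None."""
    text = line.split("%", 1)[0].strip()   # drop the comment, if any
    entry, marker, dialect = text.partition("**")
    return entry.strip(), (dialect.strip() if marker else None)


def entries_for(lines, dialect):
    """Yield the entries that are common or restricted to `dialect`."""
    for line in lines:
        entry, tag = split_dialect(line)
        if entry and tag in (None, dialect):
            yield entry


if __name__ == "__main__":
    lexicon = [
        r"{d\I} [nyuka] (n:1) **ecu",
        r"{d\I} [nuqa] (n:1) **bol",
        r"{A\ACT:EAT} [eat] (=k v)",
    ]
    for entry in entries_for(lexicon, "ecu"):
        print(entry)
```

For the dialect ecu, the sketch keeps the entry with [nyuka] and the unrestricted entry, and skips the one marked **bol.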

 

4.3. .pho files

 

A .pho file is interpreted by the finite-state transducer which starts the parsing process (by splitting the input string into morphemes, which are then fed into the chart) and ends the generation process (by deriving the final phonetic form from the morpheme string).

The finite-state transducer will be released separately with its own documentation. Here, some hints for the DALIA user follow:

 

o A .pho file is a list of patterns. Each line is a pattern (blank lines can be inserted, as well as comments started by %, as in the lexicon). Each pattern defines a correspondence between a lexical and a surface form. The lexical form is the one used in the phonetic form of the lexical entries, while the surface form is the actual form we want to derive (in the generation process) or from which we start (in the parsing process).

 

o A .pho file can also contain variable declarations. A variable declaration starts with the symbol & followed (without any space) by the name of the variable, a space, a "=" sign, a space, then a list of the possible values of the variable, separated by commas (without spaces). For example,

 

&C = p,t,k,b,d,g,m,n,N,y,w

&STOP = p,t,k,b,d,g

&V = a,i,u

 

o In the patterns, variables can be used with numerical indices. For example, you can write

 

V1kV2 V1gV2

 

to mean that lexical "k" corresponds to surface "g" between vowels. If you write the previous pattern as

 

VkV VgV

 

then "k" is realized as "g" only between two identical vowels.

 

o Example: the English lexicon will contain two entries with phonetic form

 

[-s]

 

(the third-person singular morpheme and the plural morpheme). The transducer will contain the following correspondence, to ensure that -s is realized as "es" after a sibilant:

 

s-s ses *

 

o An asterisk is used, as in the previous example, to make a pattern obligatory: the pattern is always applied when it matches the input, and any alternative patterns that match the same input are discarded. Without the asterisk, you would get both "houses" and "houss" as surface forms for the lexical "hous-s".

o Right contexts can be expressed in both the lexical and the surface column of the pattern, separated from the rest by a backslash. The previous English example could also be written as

 

s\-s se *

 

o A wildcard, written "." or "...", can be used in the contexts of the patterns to match any character or sequence of characters. Example:

 

s-\n...sh sh *

 

This means that lexical "s" is realized as surface "sh" in the context n ... sh.

 

The dots don't match a word boundary.

 

o A word boundary is written as "#" in the lexical part of the pattern, and as a double underscore in the surface.
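
 

To make the variable mechanism more concrete, here is a small Python sketch that reads variable declarations and expands a pattern into the concrete correspondences it stands for. It is not the transducer and makes several assumptions: the function names parse_variables and expand_pattern are hypothetical; a variable in a pattern is recognized as a declared name optionally followed by digits; and contexts, wildcards, word boundaries and the asterisk are ignored.

```python
import itertools
import re


def parse_variables(lines):
    """Collect '&NAME = v1,v2,...' declarations into a dictionary."""
    variables = {}
    for line in lines:
        text = line.split("%", 1)[0].strip()      # drop comments
        match = re.fullmatch(r"&(\S+) = (\S+)", text)
        if match:
            variables[match.group(1)] = match.group(2).split(",")
    return variables


def expand_pattern(lexical, surface, variables):
    """Return the concrete (lexical, surface) pairs for one pattern."""
    if not variables:
        return [(lexical, surface)]
    # Try longer names first, so that STOP is not read as some shorter name.
    names = sorted(variables, key=len, reverse=True)
    token = re.compile("(" + "|".join(re.escape(n) for n in names) + r")(\d*)")
    # Each distinct occurrence form (V, V1, V2...) is one slot to fill.
    slots = {m.group(0): m.group(1)
             for m in token.finditer(lexical + " " + surface)}
    keys = sorted(slots)
    pairs = []
    for values in itertools.product(*(variables[slots[k]] for k in keys)):
        binding = dict(zip(keys, values))
        fill = lambda m: binding[m.group(0)]
        pairs.append((token.sub(fill, lexical), token.sub(fill, surface)))
    return pairs


if __name__ == "__main__":
    variables = parse_variables(["&V = a,i,u"])
    print(expand_pattern("VkV", "VgV", variables))
    print(len(expand_pattern("V1kV2", "V1gV2", variables)))
```

The pattern without indices is expanded with the same vowel on both sides of "k", while the indexed pattern lets the two vowels vary independently.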