4. Grammars
A grammar of a language consists of two files: a .lex file and a .pho file.
4.1. Naming conventions
The name of the .lex file must be the same as the language code you provided in
the file lingvoj.txt.
If you have a complex code (language:dialect), you must use only the language
code (without the dialect extension), because all dialects use the same .lex
file.
However, each dialect has a separate .pho file, which is named after the full
code (language_dialect), with the colon replaced by an underscore. So, if you
have these two entries in the file lingvoj.txt
Quechua (Ecuadorian) [qu:ecu]
Quechua (Bolivian) [qu:bol]
you will need the following three files:
o qu.lex
o qu_ecu.pho
o qu_bol.pho
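As an illustration of the naming rule above, the following Python sketch
derives the expected file names from a list of language codes. The function
name grammar_files is hypothetical and is not part of DALIA; the sketch also
assumes that a code without a dialect simply yields one .lex and one .pho file.

```python
def grammar_files(codes):
    """Return the sorted file names required for the given language codes.

    Each code is either "language" or "language:dialect".
    Hypothetical helper for illustration only; not part of DALIA.
    """
    files = set()
    for code in codes:
        language = code.split(":", 1)[0]
        # All dialects of a language share one .lex file.
        files.add(language + ".lex")
        # Each code gets its own .pho file, with ":" replaced by "_".
        files.add(code.replace(":", "_") + ".pho")
    return sorted(files)


print(grammar_files(["qu:ecu", "qu:bol"]))
# ['qu.lex', 'qu_bol.pho', 'qu_ecu.pho']
```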
4.2. Lexicon
A minimalist grammar for a particular language consists only of a lexicon, as
syntax is universal.
The lexicon is contained in the .lex file. Each line is an entry. You can leave
blank lines and insert comments: a comment begins at a % and extends to the end
of the line.
Each entry has three parts:
1. The semantic representation of the lexical entry (henceforth LF);
2. The phonetic form of the lexical entry (PF), which can be empty;
3. The morphological features (MF).
Braces, brackets and parentheses are used to delimit, respectively, LF, PF and
MF. So, a lexical entry looks like:
{A\ACT:EAT} [eat] (=k v)
where:
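o [eat] is the phonetic form of the entry;
o A\ACT:EAT is the semantic representation: the predicate EAT, belonging to the
category ACT, and having arity 2 (A must be defined as an alias for that
arity);
o (=k v) is the morphological feature string: the entry carries the feature =k
and is a verb (v).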
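The entry layout described above can be sketched in Python as follows. This is
only an illustration, not DALIA's own reader: the names ENTRY_RE and
parse_entry are hypothetical, the sketch assumes that % never occurs inside an
entry, and it covers only the three parts LF, PF and MF.

```python
import re

# Assumed layout of one entry: {LF} [PF] (MF)
ENTRY_RE = re.compile(
    r"^\{(?P<lf>[^}]*)\}\s*\[(?P<pf>[^\]]*)\]\s*\((?P<mf>[^)]*)\)$"
)


def parse_entry(line):
    """Parse one .lex line; return None for blank or comment-only lines."""
    # A comment starts at the first % and runs to the end of the line.
    text = line.split("%", 1)[0].strip()
    if not text:
        return None
    match = ENTRY_RE.match(text)
    if match is None:
        raise ValueError("not a lexical entry: " + repr(line))
    # Dictionary with the keys "lf", "pf" and "mf".
    return match.groupdict()


entry = parse_entry(r"{A\ACT:EAT} [eat] (=k v)  % a comment")
print(entry["lf"], "|", entry["pf"], "|", entry["mf"])
```

An entry whose parentheses are empty is parsed with an empty "mf" value.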
4.2.1. Paraphrase rules
If a lexical entry has an empty MF, the system interprets it as a paraphrase
rule. In a paraphrase rule, braces enclose a string of predicates (possibly a
single predicate), and brackets delimit a second string of predicates.
Paraphrase rules are scanned at the beginning of generation, and if the
semantic representation matches the first string, the second is substituted
for it.
Paraphrase rules are applied in the order in which they are found.
When DALIA is started and the language is selected, all the paraphrase rules
are extracted from the lexicon and written to a special file (which has the
same name as the lexicon, with the suffix "$p").
Then another file, called the inverse paraphrase file, is built, with the
paraphrase rules put in inverse order. While the paraphrase file is used in
generation, the inverse paraphrase file is used in parsing.
Paraphrase rules whose second string is empty are not inserted in the inverse
paraphrase file.
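The following Python sketch illustrates one possible reading of the mechanism
described above. It is not DALIA's implementation and rests on several
assumptions: an LF is modelled as a space-separated string of predicates, a
rule matches a contiguous run of predicates, and the inverse list is obtained
by reversing the rule order and swapping the two strings of each rule. The
function names and the placeholder predicates p1 ... p5 are hypothetical.

```python
def replace_tokens(tokens, old, new):
    """Replace every contiguous occurrence of `old` in `tokens` with `new`."""
    result, i = [], 0
    while i < len(tokens):
        if old and tokens[i:i + len(old)] == old:
            result.extend(new)
            i += len(old)
        else:
            result.append(tokens[i])
            i += 1
    return result


def apply_rules(lf, rules):
    """Apply (first, second) rules to an LF string, in the order given."""
    tokens = lf.split()
    for first, second in rules:
        tokens = replace_tokens(tokens, first.split(), second.split())
    return " ".join(tokens)


def inverse_rules(rules):
    """Reverse the rule order and swap both strings (an assumption),
    skipping rules whose second string is empty."""
    return [(second, first)
            for first, second in reversed(rules)
            if second.strip()]


rules = [("p1 p2", "p3"), ("p3 p4", "p5"), ("p9", "")]
generated = apply_rules("p1 p2 p4", rules)
print(generated)                                     # p5
print(apply_rules(generated, inverse_rules(rules)))  # p1 p2 p4
```

The third rule has an empty second string, so it is left out of the inverse
list, as stated above.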
4.2.1.1. Types of paraphrase rules
o Automatic paraphrase rules are inserted by the system. They are
pronominalization rules: for each lexical item with arity prefix "e" (entity),
a paraphrase rule of the following form is inserted:
{e\PRED p\PN} [d\PN]
o Paraphrase rules are characterized by an empty MF. If the empty MF is written
as
()
the paraphrase is inserted in the paraphrase file before the automatic
paraphrases.
o If the empty MF is written as
(#)
the paraphrase is inserted after the automatic paraphrases.
o If the empty MF is written as
(^)
the corresponding paraphrase is inserted only in the paraphrase file, and not
in the inverse paraphrase file. That is, it will be applied only by the
generator, and not by the parser.
o If the empty MF is written as
(*)
then the paraphrase is a neutralization rule: the generator will apply it like
a simple paraphrase. The parser, however, when it encounters a neutralization
rule, will derive two final LFs: one in which the neutralization rule has been
applied, and one without the application of that rule.
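The four ways of writing an empty MF can be summarized with a small Python
sketch. The names RULE_KINDS, rule_kind and parser_readings are hypothetical
labels chosen for this illustration; the sketch only restates the behaviour
described in the list above and is not taken from DALIA itself.

```python
# Content of the parentheses -> kind of paraphrase rule.
RULE_KINDS = {
    "": "before-automatic",   # ()  inserted before the automatic paraphrases
    "#": "after-automatic",   # (#) inserted after the automatic paraphrases
    "^": "generation-only",   # (^) not written to the inverse paraphrase file
    "*": "neutralization",    # (*) the parser keeps both readings
}


def rule_kind(mf):
    """Return the paraphrase kind for an MF string, or None for an
    ordinary lexical entry with real morphological features."""
    return RULE_KINDS.get(mf.strip())


def parser_readings(lf_without, lf_with, kind):
    """LFs the parser keeps when a rule of `kind` matches.

    `lf_without` is the LF before the rule, `lf_with` the LF after it.
    """
    if kind == "generation-only":
        return [lf_without]            # the parser never applies the rule
    if kind == "neutralization":
        return [lf_with, lf_without]   # with and without the rule
    return [lf_with]


print(rule_kind("*"))     # neutralization
print(rule_kind("=k v"))  # None
```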
4.2.2. Dialect
Dialects use the same .lex file. By default, all entries are common to every
dialect. To restrict an entry to a particular dialect, the entry must be
followed, on the same line, by two asterisks and the dialect code.
Example: suppose we are writing a lexicon for both Bolivian and Ecuadorian
Quechua. Then, if the language codes are, respectively, qu:bol and qu:ecu, we
can have entries like these:
{d\I} [nyuka] (n:1) **ecu
{d\I} [nuqa] (n:1) **bol
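A minimal Python sketch of this selection is shown below. The function name
entries_for_dialect is hypothetical, and the sketch assumes that the sequence
** occurs in a line only as the dialect marker.

```python
def entries_for_dialect(lines, dialect):
    """Keep the entries that are unmarked or marked with `dialect`."""
    kept = []
    for line in lines:
        # Drop comments, then skip blank lines.
        text = line.split("%", 1)[0].strip()
        if not text:
            continue
        entry, marker_found, code = text.partition("**")
        # Unmarked entries are common to every dialect.
        if not marker_found or code.strip() == dialect:
            kept.append(entry.strip())
    return kept


lexicon = [
    r"{d\I} [nyuka] (n:1) **ecu",
    r"{d\I} [nuqa] (n:1) **bol",
]
print(entries_for_dialect(lexicon, "ecu"))
```

With the two Quechua entries above, selecting "ecu" keeps only the nyuka
entry and selecting "bol" keeps only the nuqa entry.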
4.3. .pho files
A .pho file is interpreted by the finite-state transducer, which starts the
parsing process (by splitting the input string into morphemes, which are then
fed into the chart) and ends the generation process (by deriving the final
phonetic form from the morpheme string).
The finite-state transducer will be released separately with its own
documentation. Here, some hints for the DALIA user follow:
o A .pho file is a list of patterns. Each line is a pattern (blank lines can be
inserted, as well as comments, started by %, as in the lexicon). Each pattern
defines a correspondence between a lexical and a surface form. The lexical form
is the one used in the phonetic form of the lexical entries, while the surface
form is the actual form we want to derive (in the generation process) or from
which we start (in the parsing process).
o A .pho file can also contain variable declarations. A variable declaration
starts with the symbol & followed (without any space) by the name of the
variable, a space, a "=" sign, a space, then a list of the possible values of
the variable, separated by commas (without spaces). For example,
&C = p,t,k,b,d,g,m,n,N,y,w
&STOP = p,t,k,b,d,g
&V = a,i,u
o In the patterns, variables can be used with numerical indices. For example,
you can write
V1kV2 V1gV2
to mean that lexical "k" corresponds to surface "g" between vowels. If you
write the previous pattern as
VkV VgV
then "k" is realized as "g" only between two identical vowels.
o Example: English will contain two entries with phonetic form
[-s]
(the third-person singular morpheme and the plural morpheme). The transducer
will contain the following correspondence, to ensure that -s is realized as
"es" after a sibilant:
s-s ses *
o An asterisk is used, as in the previous example, to make a pattern
obligatory: that pattern will always be applied when it matches the input, and
any alternative patterns that match the same input are discarded. Without the
asterisk, you would get both "houses" and "houss" as surface forms for the
lexical "hous-s".
o Right contexts can be expressed in both the lexical and surface columns of
the pattern, separated from the rest by a backslash. The previous English
example could also be written as
s\-s se *
o A wildcard, written "." or "...", can be used in the contexts of the patterns
to match any character or sequence of characters. Example:
s-\n...sh sh *
This means that lexical "s" is realized as surface "sh" in the context
n ... sh. The dots don't match a word boundary.
o A word boundary is written as "#" in the lexical part of the pattern, and as
a double underscore in the surface part.
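To make the behaviour of variables and numerical indices in the hints above
more concrete, here is a small Python sketch. It is not the transducer, which
has its own documentation; it only enumerates the concrete correspondences
that a pattern with variables stands for. The names parse_variables and expand
are hypothetical, and the sketch assumes that every occurrence of a declared
variable name in a pattern is a variable reference.

```python
import itertools
import re


def parse_variables(lines):
    """Collect declarations of the form '&NAME = v1,v2,...'."""
    variables = {}
    for line in lines:
        text = line.split("%", 1)[0].strip()  # drop comments
        if text.startswith("&"):
            name, _, values = text[1:].partition(" = ")
            variables[name] = values.split(",")
    return variables


def expand(lexical, surface, variables):
    """Expand a pattern into concrete (lexical, surface) pairs.

    A reference is a declared name plus an optional numeric index; the
    same reference takes the same value everywhere in one pattern.
    """
    names = sorted(variables, key=len, reverse=True)  # longest name first
    ref_re = re.compile("(" + "|".join(map(re.escape, names)) + r")\d*")
    refs = sorted({m.group(0) for m in ref_re.finditer(lexical + " " + surface)})
    value_lists = [variables[ref_re.match(ref).group(1)] for ref in refs]
    pairs = []
    for combo in itertools.product(*value_lists):
        binding = dict(zip(refs, combo))

        def fill(text):
            return ref_re.sub(lambda m: binding[m.group(0)], text)

        pairs.append((fill(lexical), fill(surface)))
    return pairs


variables = parse_variables(["&V = a,i,u"])
print(len(expand("V1kV2", "V1gV2", variables)))  # 9
print(expand("VkV", "VgV", variables))
# [('aka', 'aga'), ('iki', 'igi'), ('uku', 'ugu')]
```

With two different indices the two vowels vary independently (nine pairs),
while the unindexed form yields only the three pairs with identical vowels.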