Explanation:
- -grammar
-grammar gra_fn
Grammar file.
Soup also looks for a map-strings file whose name is gra_fn with the .gra extension replaced by .map_strings.
Example: -grammar ../Grammars/EngToy.gra
-grammar gra_fn_1 gra_fn_2 ... gra_fn_N
Grammar file followed by grammar parts. Same as appending gra_fn_2 ... gra_fn_N to gra_fn_1.
Example: -grammar $gra_dir/Eng_all.gra $gra_dir/Eng_nouns.gra $gra_dir/Eng_verbs.gra $gra_dir/Eng_rest.gra
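The naming rule for the map-strings file can be illustrated with a small Python sketch. The helper name map_strings_path is hypothetical and not part of Soup; it only mirrors the rule stated above (drop .gra, append .map_strings).

```python
from pathlib import PurePosixPath

def map_strings_path(gra_fn):
    """Return the map-strings file name that corresponds to a grammar file name."""
    path = PurePosixPath(gra_fn)
    if path.suffix != ".gra":
        raise ValueError("grammar file name is expected to end with .gra")
    # (gra_fn - .gra) + .map_strings
    return str(path.with_suffix(".map_strings"))

print(map_strings_path("../Grammars/EngToy.gra"))
```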
-
-grammars
-grammars gra_fn_0 gra_fn_1 gra_id_1 ... gra_fn_N gra_id_N
Grammar file followed by pairs of (grammar file, grammar ID). Soup appends a colon and the ID of grammar i to each nonterminal
defined in grammar i.
Example: -grammars EngTravel.gra EngHotel.gra HTL EngTransportation.gra TPT EngEvents.gra EVT
appends :HTL to all LHSs defined in EngHotel.gra, e.g. [room] becomes [room]:HTL.
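As a rough illustration of how the -grammars argument list pairs up, here is a Python sketch. The function names parse_grammars_args and qualify are hypothetical; this is not how Soup is implemented, only a restatement of the rule above.

```python
def parse_grammars_args(args):
    """Split the values of -grammars into the first grammar file and (file, ID) pairs."""
    main, rest = args[0], args[1:]
    if len(rest) % 2 != 0:
        raise ValueError("every grammar file after the first needs a grammar ID")
    return main, list(zip(rest[0::2], rest[1::2]))

def qualify(nonterminal, gra_id):
    """Attach ':' + ID to a nonterminal, e.g. [room] -> [room]:HTL."""
    return f"{nonterminal}:{gra_id}"

main, pairs = parse_grammars_args(
    ["EngTravel.gra", "EngHotel.gra", "HTL",
     "EngTransportation.gra", "TPT", "EngEvents.gra", "EVT"])
print(main, pairs)
print(qualify("[room]", "HTL"))
```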
-
-shared
-shared shared_gra_fn_1 [... shared_gra_fn_N]
Shared grammar file(s). Each shared grammar file
must be self-contained, i.e. all LHSs referenced must be defined in the same file, and
cannot contain top-level LHSs, i.e. no LHS can be marked with an s.
Also, note that LHSs defined in a shared grammar file are not editable through G·Soup.
-
-allow_as_ANY
-allowed_as_ANY in_vocabulary_wildcard_matchable_fn
File containing those (in-vocabulary) words that are allowed to match the wildcard _$any$_.
Note that _$any$_ is always able to match out-of-vocabulary words.
-
-map_strings
-map_strings map_strings_fn
Where map_strings_fn is a file containing the global search and
replace string pairs following the syntax:
"search_string" --> "replace_string"
Example:
" i've " --> " i have "
" isn't " --> " is not "
' m"ochte ' --> ' moechte '
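The following Python sketch shows one way such a file could be read and applied. It is not Soup's code. It assumes that rules are applied in file order, that either quote character may delimit a string (as in the third example line), and that the utterance is padded with one space on each side so that patterns such as " i've " can match at the edges. All function names are hypothetical.

```python
import re

# "search" --> "replace", delimited by matching single or double quotes
RULE = re.compile(r"""^\s*(["'])(.*)\1\s*-->\s*(["'])(.*)\3\s*$""")

def read_map_strings(lines):
    """Return a list of (search, replace) pairs; lines that do not match are skipped."""
    rules = []
    for line in lines:
        m = RULE.match(line)
        if m:
            rules.append((m.group(2), m.group(4)))
    return rules

def apply_map_strings(utterance, rules):
    """Apply every rule as a global search and replace, in order."""
    text = f" {utterance.strip()} "  # assumption: pad so edge words can match
    for search, replace in rules:
        text = text.replace(search, replace)
    return text.strip()

lines = [
    '" i\'ve " --> " i have "',
    '" isn\'t " --> " is not "',
    "' m\"ochte ' --> ' moechte '",
]
rules = read_map_strings(lines)
print(apply_map_strings("i've seen it", rules))
```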
-
-develop
-develop development_fn
Development file.
development_fn should contain the utterances that the grammar is to be developed from. One
utterance per line, comments allowed.
-
-mark_boundaries
-mark_boundaries {ON, OFF}
If ON places input line within <s> (beginning of utterance) and </s> (end of utterance).
Example: okay becomes <s> okay </s>.
This is useful to allow certain rules to match only at the beginning or the end of an utterance.
Default value: OFF.
-
-all_TOP
-all_TOP {ON, OFF}
If ON all nonterminals are considered top-level, i.e. able to stand at the root of a parse tree.
Default value: OFF.
-
-case_sensitive
-case_sensitive {ON, OFF}
If ON the match between input words (or characters) and grammar terminals is sensitive to case.
Default value: OFF.
-
-editor_socket_port
-editor_socket_port port_number
Port number to which a G·Soup client will connect.
Example: -editor_socket_port 35000.
No default value.
-
-backend_socket_port
-backend_socket_port port_number
Port number to which a backend client will connect.
Example: -backend_socket_port 25000.
No default value.
-
-train
-train gra_train_fn
Grammar training file.
gra_train_fn should contain parse trees attainable with the given grammar, including
the auxiliary nonterminals. Given
parse trees are considered correct and induce a perturbation of the original uniform distributions
at the grammar node level, i.e. the training procedure increments counts and adjusts probabilities
on the RTN nodes and arcs along the path that leads to the desired parse.
-
-nb_train
-nb_train naive_bayes_train_fn
Naive Bayes subdomain model training file.
naive_bayes_train_fn should contain utterances annotated with subdomain using the same
grammar IDs as given by -grammars.
Example:
.I 3
.C GTR
.T hi i'd like to make a trip to pittsburgh
.I 4
.C HTL
.T i'll be traveling with my family and we're looking for hotel reservations
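A minimal Python sketch of a reader for this format follows. It assumes, based only on the example above, that .I starts a new record and carries an identifier, .C carries the subdomain (grammar ID) and .T carries the utterance text. The function name read_nb_train is hypothetical.

```python
def read_nb_train(lines):
    """Parse .I/.C/.T records into a list of dictionaries."""
    records, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(".I "):
            current = {"id": line[3:].strip()}
            records.append(current)
        elif line.startswith(".C ") and current is not None:
            current["subdomain"] = line[3:].strip()
        elif line.startswith(".T ") and current is not None:
            current["text"] = line[3:].strip()
    return records

example = """\
.I 3
.C GTR
.T hi i'd like to make a trip to pittsburgh
.I 4
.C HTL
.T i'll be traveling with my family and we're looking for hotel reservations
"""
records = read_nb_train(example.splitlines())
print(records)
```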
-
-generate
-generate N
Generates N sentences according to the current stochastic parameters. Useful e.g. to create synthetic data for statistical language modeling. The number at the end of each generated utterance indicates the average arc transition probability along the stochastically walked path.
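The idea of a stochastic walk can be sketched in Python as below. The toy network NET, its probabilities and the function name generate are invented for illustration, and the average is taken here as a plain arithmetic mean of the arc probabilities, which is an assumption about what Soup reports.

```python
import random

# Hypothetical toy network: state -> list of (probability, word, next_state)
NET = {
    "start": [(0.7, "hello", "name"), (0.3, "hi", "name")],
    "name": [(0.5, "pittsburgh", "end"), (0.5, "boston", "end")],
}

def generate(net, rng, start="start", end="end"):
    """Walk the network at random and return (sentence, average arc probability)."""
    words, probs = [], []
    state = start
    while state != end:
        arcs = net[state]
        # Pick one outgoing arc according to the arc probabilities
        p, word, state = rng.choices(arcs, weights=[a[0] for a in arcs])[0]
        words.append(word)
        probs.append(p)
    return " ".join(words), sum(probs) / len(probs)

rng = random.Random(0)
samples = [generate(NET, rng) for _ in range(3)]
for sentence, avg in samples:
    print(sentence, avg)
```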
-
-post_cut_parsed_island_up_to
-post_cut_parsed_island_up_to N
Used in combination with -post_cut_if_surrounded_by_at_least, specifies the maximum number of parsed words that can be cut.
- -post_cut_if_surrounded_by_at_least
-post_cut_if_surrounded_by_at_least N
Used in combination with -post_cut_parsed_island_up_to,
specifies the minimum number of unparsed words that an island of parsed words must be surrounded
by in order for it to be cut.
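One possible reading of how these two options interact is sketched below in Python. It is an assumption, not Soup's actual behavior, that "surrounded by at least N" means at least N unparsed words on each side of the island. The function name post_cut is hypothetical.

```python
def post_cut(parsed, max_island, min_surround):
    """parsed: one bool per input word (True = covered by a parse tree).

    Returns a copy in which small parsed islands surrounded by enough
    unparsed words are marked as unparsed.
    """
    result = list(parsed)
    n = len(parsed)
    i = 0
    while i < n:
        if not parsed[i]:
            i += 1
            continue
        j = i
        while j < n and parsed[j]:  # the island is parsed[i:j]
            j += 1
        left, k = 0, i - 1
        while k >= 0 and not parsed[k]:  # unparsed words to the left
            left += 1
            k -= 1
        right, k = 0, j
        while k < n and not parsed[k]:  # unparsed words to the right
            right += 1
            k += 1
        if j - i <= max_island and left >= min_surround and right >= min_surround:
            result[i:j] = [False] * (j - i)
        i = j
    return result

F, T = False, True
cut = post_cut([F, F, F, T, F, F, F, T, T, T], max_island=1, min_surround=2)
print(cut)
```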
- -skip_OOV
-skip_OOV {ON, OFF}
If ON filters out out-of-vocabulary terminals present in the input utterance.
Default value: OFF.
- -interpretations
-interpretations N
Shows up to N interpretations per input utterance.
Note that an interpretation is a sequence of non-overlapping parse trees.
- -print_tree
-print_tree {FANCY, ONE-LINE}
If FANCY prints tree with indenting proportional to node depth.
(Superseded by graphical trees in G·Soup.)
Default value: ONE-LINE.
- -print_probability
-print_probability {ON, OFF}
If ON prints probability of arc transitions.
Default value: OFF.
- -print_aux_NT
-print_aux_NT {ON, OFF}
If ON prints auxiliary nonterminals.
Default value: OFF.
- -print_as_DAG
-print_as_DAG {ON, OFF}
If ON prints interpretations as a sequence of parse directed acyclic graphs (DAGs) as opposed to the usual parse trees.
See parsing directed acyclic graphs.
Default value: OFF.
- -debug
-debug N
Sets debug messages level to N.
Default value: 0.
- -verbose
-verbose N
Sets general verbosity level to N.
Default value: 0.
Soup is the server, G·Soup is the client.
Note that currently the socket connection is not very robust, so make sure both
server and client are running.
5.1 Running Soup
To run Soup with, say, grammar EngToy.gra and communications port 35000:
soup -grammar EngToy.gra -editor_socket_port 35000
5.2 Running G·Soup
To run G·Soup as a client to a Soup process running on, say, a machine named
pong and port 35000:
java GSoup -hostname pong -port 35000
Unless you run G·Soup from the GSoup directory itself, you'll have to set the
$CLASSPATH environment variable, e.g.
setenv CLASSPATH /afs/cs/project/cmt-46/trans/Soup/GSoup
Also, G·Soup can be run as an applet, i.e. within a web browser, or through
appletviewer. If run as an applet you need to provide the host and port as parameters to
the applet's HTML tag, e.g.
<APPLET
CODEBASE="file://localhost/afs/cs/project/cmt-46/trans/Soup/GSoup"
CODE="GSoupApplet.class"
WIDTH=960
HEIGHT=630>
<PARAM NAME=socket_hostname VALUE="localhost">
<PARAM NAME=socket_port VALUE="35000">
Please enable Java.
</APPLET>
5.3 Binaries
Soup has been compiled and thoroughly tested for the following platforms:
Alpha   /afs/cs.cmu.edu/project/cmt-46/trans/Soup/soup_a
HP      /afs/cs.cmu.edu/project/cmt-46/trans/Soup/soup_h
Sun     /afs/cs.cmu.edu/project/cmt-46/trans/Soup/soup_s
Linux   /afs/cs.cmu.edu/project/cmt-46/trans/Soup/soup_l
G·Soup Java classes are available at:
/afs/cs.cmu.edu/project/cmt-46/trans/Soup/GSoup
Soup was originally conceived as a re-implementation of the
Phoenix parser developed by Wayne Ward as a stochastic analyzer. Although it
preserves most of the grammar formalism, Soup has been built from scratch.
Here are some old notes on Soup and how to convert Phoenix grammars
into Soup grammars.
- Stochastic framework
Given a context-free grammar (CFG), the parser converts it into recursive-transition networks (RTNs) with
the additional information of transition probabilities at each RTN node.
The stochastic parameters consist of probability distributions at each
RTN node (with the probabilities of all outgoing arcs of a node summing
to one) and currently default (before training) to uniform distributions.
- Training the stochastic parameters
Training is achieved by learning those distributions from a corpus of correct
parses: given a set of desired (but achievable with the given grammar) parse
trees, the training procedure increments counts and adjusts probabilities
on the RTN nodes and arcs along the path that leads to the desired parse.
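The count-and-normalize idea can be sketched in Python as follows. The class RTNNode is hypothetical, and the use of one pseudo-count per arc to represent the uniform default is an assumption; the source does not say how Soup combines the initial uniform distribution with the training counts.

```python
class RTNNode:
    """One RTN node with a probability distribution over its outgoing arcs."""

    def __init__(self, arcs, prior=1.0):
        # Assumption: an equal pseudo-count per arc gives the uniform default
        self.counts = {arc: prior for arc in arcs}

    def observe(self, arc):
        """Increment the count of an arc that lies on a desired parse path."""
        self.counts[arc] += 1.0

    def probability(self, arc):
        """Relative frequency; the probabilities of all outgoing arcs sum to one."""
        return self.counts[arc] / sum(self.counts.values())

node = RTNNode(["[hour]", "[minute]"])
before = node.probability("[hour]")   # uniform before training
for arc in ["[hour]", "[hour]"]:      # two training parses go through [hour]
    node.observe(arc)
after = node.probability("[hour]")
print(before, after, node.probability("[minute]"))
```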
- Using the stochastic parameters
There are two main usages of this new stochastic framework:
-
Better heuristics in parsing: arc probabilities are included in
the heuristic function that is used to guide the search in the parsing
stage. More likely paths are thus preferred and explored first.
-
Generation of synthetic data for language modeling: CFGs can be
employed not only for their primary purpose of parsing but also to
create a corpus of synthetic data that can in turn be used for statistical
language modeling, especially for those tasks or domains for which a corpus
of naturally-occurring text cannot be found. In that case it is important
that the synthetic data reflect natural patterns as closely as possible,
for which this type of stochastic modeling through RTN training is ideal.
- Additional novel features
-
No need to pre-compile the grammar: creation of RTNs is performed
dynamically at run-time (and very quickly e.g. in less than 2 seconds for the
English Spontaneous Scheduling Task (ESST) grammar that consists of 784 nets
and gives rise to 8099 nodes and 12848 arcs).
-
Efficient memory management: since all data structures are dynamically
created and destroyed without any hard-wired constants, there are no
restrictions as to the maximal size of, say, an RTN or a parse tree, and
at the same time only the memory that is really being used is allocated.
-
Maximum number of interpretations to be searched for (i.e. best plus
ambiguities) can be set as a command-line parameter. The value `all' is
also allowed.
-
Some checking of grammar consistency is performed: e.g. all nonterminals
used must have at least one rewrite rule, warnings issued for unreachable nonterminals, etc.
-
Option of case-sensitive matching (of input string tokens against
grammar terminals) is supported.
- Moving from Phoenix to Soup.
If you have a Phoenix grammar and want to use it with Soup
these are the points you need to pay attention to:
- Grammar has to be a single file (e.g. cat *.{gra,nt} > all)
- Grammar file name has to end with .gra, e.g. EngPar.gra
- Map-strings file has to be in the same directory as the grammar file and has to have same name body, with extension .map_strings, e.g. EngPar.map_strings
- Top-level concepts are marked with an s. There is no need for either nets or forms files.
- The grammar does not have to be pre-compiled, just give the ASCII file directly to Soup through the -grammar command-line argument.
- Please look at the error and warning messages issued by Soup as it reads in the grammar file.
- Subdomain probabilities integrated into search.
Instead of an inefficient two-pass search (first find parse trees, then
re-rank them according to the subdomain statistical model), Soup now conducts
a single-pass search that takes the subdomain probabilities into account as another
knowledge source. This raises the question of how to weight all the factors that
intervene in the calculation of the interpretation score (interpretation =
sequence of non-overlapping parse trees), namely:
number of words covered (maximize)
number of parse trees (minimize)
number of parse tree nodes (minimize)
number of wildcard matches (minimize)
probability of parse trees as paths along grammar arcs (maximize)
probability of words covered given subdomain picked (maximize)
Right now importance follows, approximately, the above ordering, but experiments would be needed to fine-tune it, especially on how to best combine the last two probabilities. In fact, a more theoretically sound model is being developed, namely the maximization of
P(words | domain) · P(interpretation | domain) · P(domain)
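The current ordering of factors can be pictured as a lexicographic comparison, sketched below in Python. This is a simplification: the text says the ordering holds only approximately, so a strict lexicographic key is an assumption, as are the class and field names and the use of log probabilities.

```python
from dataclasses import dataclass

@dataclass
class Interpretation:
    words_covered: int        # maximize
    trees: int                # minimize
    nodes: int                # minimize
    wildcard_matches: int     # minimize
    tree_logprob: float       # maximize
    subdomain_logprob: float  # maximize

def rank_key(i):
    """Smaller key = better; factors to maximize are negated."""
    return (-i.words_covered, i.trees, i.nodes, i.wildcard_matches,
            -i.tree_logprob, -i.subdomain_logprob)

a = Interpretation(5, 2, 9, 0, -3.0, -4.0)
b = Interpretation(5, 1, 10, 1, -6.0, -7.0)
c = Interpretation(4, 1, 5, 0, -1.0, -1.0)
ranked = sorted([a, b, c], key=rank_key)
print(ranked[0])
```

Here b wins over a because both cover five words and b uses fewer parse trees; c comes last because it covers fewer words, even though its probabilities are higher.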
- Parsing directed acyclic graphs (DAGs)
Instead of parse trees, Soup now creates parse DAGs to help contain true ambiguities.
Example. Given the grammar:
s[time]
(+[point])
[point]
([hour])
([minute])
[hour]
(1)
(2)
; ...
[minute]
(1)
(2)
; ...
and the input sentence 1 2, Soup finds the parse DAG:
[time] ( [point] ( { [hour] ( 1 ) | [minute] ( 1 ) } ) [point] ( { [hour] ( 2 ) | [minute] ( 2 ) } ) )
that, at print time (unless the command-line argument -print_as_DAG ON is given), gets expanded into the four parse trees:
[time] ( [point] ( [hour] ( 1 ) ) [point] ( [hour] ( 2 ) ) )
[time] ( [point] ( [hour] ( 1 ) ) [point] ( [minute] ( 2 ) ) )
[time] ( [point] ( [minute] ( 1 ) ) [point] ( [hour] ( 2 ) ) )
[time] ( [point] ( [minute] ( 1 ) ) [point] ( [minute] ( 2 ) ) )
Note, however, that if [hour] and [minute] have a non-uniform
distribution under [point], less likely combinations may get pruned away.
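The expansion of a parse DAG into parse trees amounts to a Cartesian product over the alternatives at each ambiguous node. The Python sketch below illustrates this for the [time] grammar given above. The data representation (tuples for nodes, lists for alternatives) and the function names are choices made for this example and say nothing about Soup's internal structures; pruning of unlikely combinations is not modeled.

```python
from itertools import product

def expand(node):
    """Return every parse tree (as a one-line string) encoded by a DAG node.

    A node is a terminal string, a (label, children) tuple, or a list of
    alternative nodes (the "{ a | b }" of the DAG notation).
    """
    if isinstance(node, str):   # terminal word
        return [node]
    if isinstance(node, list):  # alternatives: union of their expansions
        return [tree for alt in node for tree in expand(alt)]
    label, children = node
    # One tree per combination of child expansions
    return [f"{label} ( {' '.join(combo)} )"
            for combo in product(*(expand(c) for c in children))]

def point(word):
    """[point] over a word that is ambiguous between [hour] and [minute]."""
    return ("[point]", [[("[hour]", [word]), ("[minute]", [word])]])

dag = ("[time]", [point("1"), point("2")])
trees = expand(dag)
for tree in trees:
    print(tree)
```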
- More efficient training.