Explanation:
  • -grammar
    -grammar gra_fn
    Grammar file.
    Soup also looks for a map-strings file named after the grammar file, with the .gra extension replaced by .map_strings, i.e. (gra_fn - .gra) + .map_strings.
    Example: -grammar ../Grammars/EngToy.gra
    -grammar gra_fn_1 gra_fn_2 ... gra_fn_N
    Grammar file followed by grammar parts. Same as appending gra_fn_2 ... gra_fn_N to gra_fn_1.
    Example: -grammar $gra_dir/Eng_all.gra $gra_dir/Eng_nouns.gra $gra_dir/Eng_verbs.gra $gra_dir/Eng_rest.gra
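
    The file-name rule for the map-strings file can be sketched as follows. This is a minimal illustration, assuming the rule means "strip a trailing .gra, then append .map_strings"; the function name map_strings_path is hypothetical and not part of Soup.

```python
def map_strings_path(gra_fn: str) -> str:
    """Derive the map-strings file name from a grammar file name."""
    suffix = ".gra"
    # Strip a trailing ".gra" if present, then append ".map_strings".
    stem = gra_fn[: -len(suffix)] if gra_fn.endswith(suffix) else gra_fn
    return stem + ".map_strings"


print(map_strings_path("../Grammars/EngToy.gra"))  # ../Grammars/EngToy.map_strings
```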

  • -grammars
    -grammars gra_fn_0 gra_fn_1 gra_id_1 ... gra_fn_N gra_id_N
    Grammar name followed by pairs of (grammar file, grammar ID). Soup attaches ":" + gra_id_i to each nonterminal defined in grammar file gra_fn_i.
    Example: -grammars EngTravel.gra EngHotel.gra HTL EngTransportation.gra TPT EngEvents.gra EVT
    appends :HTL to all LHSs defined in EngHotel.gra, e.g. [room] becomes [room]:HTL.
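
    The pairing of files with IDs and the resulting renaming can be sketched as below. This only illustrates the argument layout and the ":" + ID suffix; the helper names pair_grammars and tag_lhs are hypothetical, and Soup's own implementation is not shown in this document.

```python
def pair_grammars(args):
    """Split a -grammars argument list into the leading name and (file, ID) pairs."""
    name, rest = args[0], args[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected (grammar file, grammar ID) pairs")
    return name, list(zip(rest[0::2], rest[1::2]))


def tag_lhs(lhs, gra_id):
    """Attach ':' + grammar ID to a nonterminal, e.g. [room] becomes [room]:HTL."""
    return f"{lhs}:{gra_id}"


name, pairs = pair_grammars(
    "EngTravel.gra EngHotel.gra HTL EngTransportation.gra TPT EngEvents.gra EVT".split()
)
print(name, pairs)
print(tag_lhs("[room]", "HTL"))  # [room]:HTL
```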

  • -shared
    -shared shared_gra_fn_1 [... shared_gra_fn_N]
    Shared grammar file(s). Each shared grammar file
    • must be self-contained, i.e. all LHSs referenced must be defined in the same file, and
    • cannot contain top-level LHSs, i.e. no LHS can be marked with a leading s (as in s[time]).
    Also, note that LHSs defined in a shared grammar file are not editable through G·Soup.

  • -allow_as_ANY
    -allowed_as_ANY in_vocabulary_wildcard_matchable_fn
    File containing those (in-vocabulary) words that are allowed to match the wildcard _$any$_. Note that _$any$_ is always able to match out-of-vocabulary words.

  • -map_strings
    -map_strings map_strings_fn
    Where map_strings_fn is a file containing the global search and replace string pairs following the syntax:
    "search_string" --> "replace_string"
    Example:

     " i've " --> " i have "
     " isn't " --> " is not "
     ' m"ochte ' --> ' moechte '
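
    A minimal sketch of reading such rules and applying them to an utterance. It assumes one rule per line, matching quote characters on each side, and rules applied in file order; padding the utterance with spaces so that rules like " i've " can match at the edges is also an assumption. The names parse_rules and apply_rules are hypothetical.

```python
import re

# One rule per line: "search" --> "replace", in double or single quotes.
RULE = re.compile(r"""^\s*(["'])(.*)\1\s*-->\s*(["'])(.*)\3\s*$""")


def parse_rules(text):
    """Return a list of (search, replace) pairs; lines that are not rules are skipped."""
    rules = []
    for line in text.splitlines():
        m = RULE.match(line)
        if m:
            rules.append((m.group(2), m.group(4)))
    return rules


def apply_rules(utterance, rules):
    """Apply each search/replace pair in order (assumed behavior)."""
    s = f" {utterance} "  # assumption: pad so word-delimited rules match at the edges
    for search, replace in rules:
        s = s.replace(search, replace)
    return s.strip()


rules = parse_rules('" i\'ve " --> " i have "\n" isn\'t " --> " is not "')
print(apply_rules("i've heard it isn't far", rules))  # i have heard it is not far
```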

  • -develop
    -develop development_fn
    Development file.
    development_fn should contain the utterances that the grammar is to be developed from. One utterance per line, comments allowed.

  • -mark_boundaries
    -mark_boundaries {ON, OFF}
    If ON, places each input line between <s> (beginning of utterance) and </s> (end of utterance). Example: okay becomes <s> okay </s>.
    This is useful to allow certain rules to match only at the beginning or at the end of an utterance.
    Default value: OFF.
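
    As a small illustration of the boundary marking described above (the function name is hypothetical, and trimming surrounding whitespace is an assumption):

```python
def mark_boundaries(line: str) -> str:
    """Wrap an input line in <s> ... </s> utterance markers."""
    return f"<s> {line.strip()} </s>"


print(mark_boundaries("okay"))  # <s> okay </s>
```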

  • -all_TOP
    -all_TOP {ON, OFF}
    If ON all nonterminals are considered top-level, i.e. able to stand at the root of a parse tree.
    Default value: OFF.

  • -case_sensitive
    -case_sensitive {ON, OFF}
    If ON, the match between input words (or characters) and grammar terminals is case-sensitive.
    Default value: OFF.

  • -editor_socket_port
    -editor_socket_port port_number
    Port number to which a G·Soup client will connect.
    Example: -editor_socket_port 35000.
    No default value.

  • -backend_socket_port
    -backend_socket_port port_number
    Port number to which a backend client will connect.
    Example: -backend_socket_port 25000.
    No default value.

  • -train
    -train gra_train_fn
    Grammar training file.
    gra_train_fn should contain parse trees attainable with the given grammar, including the auxiliary nonterminals. Given parse trees are considered correct and induce a perturbation of the original uniform distributions at the grammar node level, i.e. the training procedure increments counts and adjusts probabilities on the RTN nodes and arcs along the path that leads to the desired parse.
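
    The count-and-renormalize idea can be sketched for a single grammar node as below. This is only an illustration: the Node class is hypothetical, and starting every arc with a count of 1 (to obtain the initial uniform distribution) is an assumption, not a documented detail of Soup.

```python
class Node:
    """One grammar (RTN) node with counted outgoing arcs."""

    def __init__(self, arcs):
        # Assumption: every arc starts with count 1, i.e. a uniform distribution.
        self.counts = {arc: 1 for arc in arcs}

    def observe(self, arc):
        """Record that a training parse tree took this arc."""
        self.counts[arc] += 1

    def probabilities(self):
        """Relative frequencies of the outgoing arcs."""
        total = sum(self.counts.values())
        return {arc: c / total for arc, c in self.counts.items()}


point = Node(["[hour]", "[minute]"])
point.observe("[hour]")  # a training tree expanded [point] as [hour]
print(point.probabilities())
```

    After one observation, [hour] holds two of the three counts, so it becomes twice as likely as [minute] under this node.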

  • -nb_train
    -nb_train naive_bayes_train_fn
    Naive Bayes subdomain model training file.
    naive_bayes_train_fn should contain utterances annotated with subdomain using the same grammar IDs as given by -grammars.

    Example:

     .I 3
     .C GTR
     .T hi i'd like to make a trip to pittsburgh
     .I 4
     .C HTL
     .T i'll be traveling with my family and we're looking for hotel reservations
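
    A minimal reader for this format might look as follows. The interpretation of the fields (.I as example index, .C as subdomain label, .T as utterance text) is inferred from the example above and is an assumption, as is the function name parse_nb_train.

```python
def parse_nb_train(text):
    """Parse .I/.C/.T records into a list of dicts (field meanings are assumed)."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(".I "):
            current = {"id": line[3:].strip()}  # a new record starts at .I
            records.append(current)
        elif line.startswith(".C ") and current is not None:
            current["subdomain"] = line[3:].strip()
        elif line.startswith(".T ") and current is not None:
            current["text"] = line[3:].strip()
    return records


sample = ".I 3\n.C GTR\n.T hi i'd like to make a trip to pittsburgh\n"
print(parse_nb_train(sample))
```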

  • -generate
    -generate N
    Generates N sentences according to the current stochastic parameters. Useful e.g. to create synthetic data for statistical language modeling. The number at the end of each generated utterance indicates the average arc transition probability along the stochastically walked path.
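
    The idea of a stochastic walk with an averaged arc probability can be sketched on a toy grammar. Everything here is illustrative: the grammar, its probabilities and the function names are hypothetical, and taking the arithmetic mean of the chosen arc probabilities is an assumption about what "average" means.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of (probability, expansion).
GRAMMAR = {
    "[greeting]": [(0.7, ["hello"]), (0.3, ["hi", "[name]"])],
    "[name]": [(0.5, ["alice"]), (0.5, ["bob"])],
}


def walk(symbol, rng, probs):
    """Expand symbol by sampling arcs; record the probability of each arc taken."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal word
    options = GRAMMAR[symbol]
    weights = [p for p, _ in options]
    p, expansion = rng.choices(options, weights=weights, k=1)[0]
    probs.append(p)
    words = []
    for child in expansion:
        words.extend(walk(child, rng, probs))
    return words


def generate_sentence(start, rng):
    """Return (sentence, mean probability of the arcs along the walked path)."""
    probs = []
    words = walk(start, rng, probs)
    return " ".join(words), sum(probs) / len(probs)


rng = random.Random(0)
for _ in range(3):
    print(*generate_sentence("[greeting]", rng))
```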

  • -post_cut_parsed_island_up_to
    -post_cut_parsed_island_up_to N
    Used in combination with -post_cut_if_surrounded_by_at_least. Specifies the maximum number of parsed words that can be cut.

  • -post_cut_if_surrounded_by_at_least
    -post_cut_if_surrounded_by_at_least N
    Used in combination with -post_cut_parsed_island_up_to. Specifies the minimum number of unparsed words that an island of parsed words must be surrounded by in order for it to be cut.
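
    One possible reading of these two options is sketched below. It rests on assumptions that this document does not confirm: that "surrounded by at least N" means at least N unparsed words on each side of the island, and that an island at the very start or end of the utterance is never cut. The function names are hypothetical.

```python
def unparsed_run(parsed, start, step):
    """Count consecutive unparsed words from index start, moving by step (+1 or -1)."""
    count, k = 0, start
    while 0 <= k < len(parsed) and not parsed[k]:
        count += 1
        k += step
    return count


def post_cut(parsed, max_island, min_surround):
    """parsed[i] is True if word i is covered by a parse tree.

    Returns a copy in which short, isolated parsed islands are marked unparsed.
    """
    result = list(parsed)
    i, n = 0, len(parsed)
    while i < n:
        if not parsed[i]:
            i += 1
            continue
        j = i
        while j < n and parsed[j]:
            j += 1  # the island is parsed[i:j]
        left = unparsed_run(parsed, i - 1, -1)
        right = unparsed_run(parsed, j, +1)
        if j - i <= max_island and left >= min_surround and right >= min_surround:
            result[i:j] = [False] * (j - i)
        i = j
    return result


# A one-word island between two-word unparsed stretches is cut; the longer one is kept.
print(post_cut([False, False, True, False, False, True, True, True], 1, 2))
```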

  • -skip_OOV
    -skip_OOV {ON, OFF}
    If ON filters out out-of-vocabulary terminals present in the input utterance.
    Default value: OFF.

  • -interpretations
    -interpretations N
    Shows up to N interpretations per input utterance.
    Note that an interpretation is a sequence of non-overlapping parse trees.
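
    The non-overlap condition can be illustrated with word-index spans. This is a sketch under the assumption that each parse tree covers a contiguous range of words, given as (start, end) with end exclusive and listed in left-to-right order; the function name is hypothetical.

```python
def is_interpretation(spans):
    """True if consecutive (start, end) spans do not overlap (end is exclusive)."""
    return all(
        prev_end <= start
        for (_, prev_end), (start, _) in zip(spans, spans[1:])
    )


print(is_interpretation([(0, 2), (3, 5)]))  # True: word 2 is simply left unparsed
print(is_interpretation([(0, 3), (2, 5)]))  # False: both trees claim word 2
```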

  • -print_tree
    -print_tree {FANCY, ONE-LINE}
    If FANCY prints tree with indenting proportional to node depth.
    (Superseded by graphical trees in G·Soup.)
    Default value: ONE-LINE.

  • -print_probability
    -print_probability {ON, OFF}
    If ON prints probability of arc transitions.
    Default value: OFF.

  • -print_aux_NT
    -print_aux_NT {ON, OFF}
    If ON prints auxiliary nonterminals.
    Default value: OFF.

  • -print_as_DAG
    -print_as_DAG {ON, OFF}
    If ON prints interpretations as a sequence of parse directed acyclic graphs (DAGs) as opposed to the usual parse trees. See parsing directed acyclic graphs.
    Default value: OFF.

  • -debug
    -debug N
    Sets debug messages level to N.
    Default value: 0.

  • -verbose
    -verbose N
    Sets general verbosity level to N.
    Default value: 0.



5. Running Soup and G·Soup

Soup is the server, G·Soup is the client. Note that currently the socket connection is not very robust, so make sure both server and client are running.

5.1 Running Soup

To run Soup with, say, grammar EngToy.gra and communications port 35000:
  soup -grammar EngToy.gra -editor_socket_port 35000

5.2 Running G·Soup

To run G·Soup as a client to a Soup process running on, say, a machine named pong and port 35000:
  java GSoup -hostname pong -port 35000
Unless you run G·Soup from the GSoup directory itself, you'll have to set the $CLASSPATH environment variable, e.g.
  setenv CLASSPATH /afs/cs/project/cmt-46/trans/Soup/GSoup

Also, G·Soup can be run as an applet, i.e. within a web browser, or through appletviewer. If run as an applet you need to provide the host and port as parameters to the applet's HTML tag, e.g.

<APPLET CODEBASE="file://localhost/afs/cs/project/cmt-46/trans/Soup/GSoup"
  CODE="GSoupApplet.class"
  WIDTH=960
  HEIGHT=630>
  <PARAM NAME=socket_hostname VALUE="localhost">
  <PARAM NAME=socket_port VALUE="35000">
  Please enable Java.
</APPLET>


5.3 Binaries

Soup has been compiled and thoroughly tested for the following platforms:

Alpha /afs/cs.cmu.edu/project/cmt-46/trans/Soup/soup_a
HP /afs/cs.cmu.edu/project/cmt-46/trans/Soup/soup_h
Sun /afs/cs.cmu.edu/project/cmt-46/trans/Soup/soup_s
Linux /afs/cs.cmu.edu/project/cmt-46/trans/Soup/soup_l

G·Soup Java classes are available at:

    /afs/cs.cmu.edu/project/cmt-46/trans/Soup/GSoup



6. Moving from Phoenix to Soup

Soup was originally conceived as a re-implementation, as a stochastic analyzer, of the Phoenix parser developed by Wayne Ward. Although it preserves most of the grammar formalism, Soup has been built from scratch. Here are some old notes on Soup and how to convert Phoenix grammars into Soup grammars.

7. Latest developments

  • Subdomain probabilities integrated into search.
    Instead of an inefficient two-pass search (first find parse trees, then re-rank them according to the subdomain statistical model), Soup now conducts a single-pass search that takes the subdomain probabilities into account as another knowledge source. This raises the question of how to weight all the factors that enter into the calculation of the interpretation score (interpretation = sequence of non-overlapping parse trees), namely:

      number of words covered (maximize)
      number of parse trees (minimize)
      number of parse tree nodes (minimize)
      number of wildcard matches (minimize)
      probability of parse trees as paths along grammar arcs (maximize)
      probability of words covered given subdomain picked (maximize)

    Right now importance follows, approximately, the above ordering, but experiments would be needed to fine-tune it, especially on how best to combine the last two probabilities. In fact, a more theoretically sound model is being developed, namely the maximization of

    P(words | domain) · P(interpretation | domain) · P(domain)
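
    Both scoring schemes can be sketched as follows. The first part treats the factor list as a strict lexicographic ordering, which is a simplifying assumption since the text only says the importance follows that ordering approximately. The second part evaluates the product above in log space. All names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Interpretation:
    words_covered: int      # maximize
    num_trees: int          # minimize
    num_nodes: int          # minimize
    wildcard_matches: int   # minimize
    tree_prob: float        # probability of the parse trees as grammar paths (maximize)
    subdomain_prob: float   # probability of covered words given the subdomain (maximize)


def rank_key(it):
    """Larger is better: maximized factors as-is, minimized factors negated."""
    return (it.words_covered, -it.num_trees, -it.num_nodes,
            -it.wildcard_matches, it.tree_prob, it.subdomain_prob)


def best(interpretations):
    """Pick the best interpretation under the assumed lexicographic ordering."""
    return max(interpretations, key=rank_key)


def log_domain_score(p_words_given_domain, p_interp_given_domain, p_domain):
    """log of P(words | domain) * P(interpretation | domain) * P(domain)."""
    return (math.log(p_words_given_domain)
            + math.log(p_interp_given_domain)
            + math.log(p_domain))


a = Interpretation(5, 2, 10, 0, 0.10, 0.10)
b = Interpretation(5, 1, 12, 1, 0.05, 0.05)
print(best([a, b]))  # b: same coverage, fewer parse trees
```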


  • Parsing directed acyclic graphs (DAGs)
    Instead of parse trees, Soup now creates parse DAGs to help contain true ambiguities.
    Example. Given the grammar:
    s[time]
       (+[point])

    [point]
       ([hour])
       ([minute])

    [hour]
       (1)
       (2)
       ; ...

    [minute]
       (1)
       (2)
       ; ...


    and the input sentence 1 2, Soup finds the parse DAG:

    [time] ( [point] ( { [hour] ( 1 ) | [minute] ( 1 ) } ) [point] ( { [hour] ( 2 ) | [minute] ( 2 ) } ) )
    which, at print time (unless command-line argument -print_as_DAG on is given), gets expanded into the four parse trees:

    [time] ( [point] ( [hour] ( 1 ) ) [point] ( [hour] ( 2 ) ) )

    [time] ( [point] ( [hour] ( 1 ) ) [point] ( [minute] ( 2 ) ) )

    [time] ( [point] ( [minute] ( 1 ) ) [point] ( [hour] ( 2 ) ) )

    [time] ( [point] ( [minute] ( 1 ) ) [point] ( [minute] ( 2 ) ) )

    Note, however, that if [hour] and [minute] have a non-uniform distribution under [point], less likely combinations may get pruned away.
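
    The expansion of a parse DAG into its parse trees amounts to a Cartesian product over the alternatives at each ambiguous node. The sketch below illustrates this for the example, using the [hour] and [minute] nonterminals defined in the grammar; the data layout and function name are hypothetical, and no pruning is modeled.

```python
from itertools import product

# Each [point] slot holds its alternative subtrees.
dag = [
    ["[hour] ( 1 )", "[minute] ( 1 )"],
    ["[hour] ( 2 )", "[minute] ( 2 )"],
]


def expand(dag):
    """Enumerate every parse tree encoded by the DAG, one alternative per slot."""
    trees = []
    for choice in product(*dag):
        points = " ".join(f"[point] ( {alt} )" for alt in choice)
        trees.append(f"[time] ( {points} )")
    return trees


for tree in expand(dag):
    print(tree)
```

    With two alternatives in each of two slots this yields 2 x 2 = 4 trees, which is why packing the ambiguity into a DAG keeps the representation compact as the input grows.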

  • More efficient training.
    Instead of the generate-and-test approach (first generate interpretations, then compare them against given training example), now the path along the grammar corresponding to the desired parse is directly constructed from the example. This requires that the training examples include the auxiliary nonterminals, e.g. as output by Soup when given -print_aux_NT on.


8. More about Soup


Soup: A parser for real world spontaneous speech

Site maintained by:
Céline Morel