CS 22

Clab 18: Grammars


Our goal will be to have some fun and to develop a data-directed sentence generating system. Let's say a sentence is an article followed by a noun phrase followed by a verb phrase. You might imagine defining a system to generate random sentences as follows:

(define (nth n x) (if (= n 0) (car x) (nth (- n 1) (cdr x)))) (define nounlis '(dog dean professor student class)) (define verblis '(ran ate slept drank talked)) (define adjlis '(red slow dead nice quietly)) (define advlis '(quickly happily well)) (define (sentence) (append (append '(the) (nounphrase)) (verbphrase))) (define (pickrand lst) (nth (random (length lst)) lst)) (define (noun) (pickrand nounlis)) (define (verb) (pickrand verblis)) (define (adverb) (pickrand advlis)) (define (adjective) (pickrand adjlis)) (define (either a b) (if (= (random 2) 0) (a) (b))) (define (verbphrase) (either (lambda () (list (verb))) (lambda () (list (verb) (adverb))) ) ) (define (nounphrase) (either (lambda () (list (noun))) (lambda () (cons (adjective) (nounphrase))) ) ) Paste these definitions into a scheme window, save as sentences.scm and generate a few sentences. > (sentence) (the slow professor talked quickly) > (sentence) (the red dean drank) > (sentence) (the dog drank) A few of the interesting ones I got were: (the dog ran) (the class drank) (the nice class drank quickly) (the professor ran well) (the slow dog ate happily) Of course I got some strange ones, too: (the red dead student ran quietly) (the slow nice dead dead class ran quietly) We took a description of a sentence (in English) and wrote procedures to generate sentences. If the description of sentence changed, we would have to redefine our procedures. Today we would like to define procedures that can take a formal description of sentences (a grammar) as input data and from that description (=data) generate sentences. Our key goal is this: If the definition of sentence changes, we will change the description of sentences, BUT we will NOT have to change our procedures.

We can formally describe the sentences like those above using the following "context-free grammar":

sentence -> article nounphrase verbphrase nounphrase -> adj* noun adj* -> () | adj adj* verbphrase -> verb adverb noun -> dog | dean | professor | student | class adj -> red | slow | dead | nice verb -> ran | ate | slept | drank | talked adverb -> quickly | happily | well | quietly article -> the Each line is a "rewrite rule". For example, any time verbphrase appears, a verb followed by an adverb can be substituted. The names on the left of -> are called nonterminal symbols. The other words (dog dean professor student class red slow dead nice ran ate slept drank talked quickly happily well quietly the) are called terminal symbols. Nonterminal symbols can be rewritten. Terminal symbols cannot be rewritten. This is a special case of a phrase structure grammar. To generate a phrase: start with the nonterminal for the phrase and keep rewriting it according to the rules of the grammar until you get to something with all terminal symbols. The vertical bar '|' indicates a choice. () indicates the empty string. For example to generate a nounphrase, we may choose the following ('=>' indicates 'is rewritten as') Derivation rule used nounphrase => adj* noun nounphrase -> adj* noun => adj adj* noun adj* -> adj adj* => adj adj adj* noun adj* -> adj adj* => slow adj adj* noun adj -> slow => slow red adj* noun adj -> red => slow red noun adj* -> () => slow red student noun -> student We stop generating here because we have a string of all terminal symbols. Derivations in grammars are inherently nondeterministic. There are other ways that the same nounphrase could have been derived. There are an infinite number of different nounphrases that can be derived from this grammar. There is a lot more to formal grammars than we are discussing here. They are very useful in both linguistics and computer science. They are discussed further in CS courses on compiler design, programming languages, and theory. They are also discussed in a number of Linguistics courses.

We can represent the grammar shown above as a list as follows:

(define simpgrammar '((sentence -> (article nounphrase verbphrase)) (nounphrase -> (adj* noun)) (adj* -> () (adj adj*)) (verbphrase -> (verb adverb)) (noun -> dog dean professor student class) (adj -> red slow dead nice quietly) (verb -> ran ate slept drank talked) (adverb -> quickly happily well) (article -> the) ) ) The representation of the grammar is a list of rules. Each rule consists of a sublist. If the right side of a rule is just a choice of simple objects, each choice is listed with a space in between (like the right side for adj). If a choice is a compound object, then that object is put in a sublist (like the right side of verbphrase ).

Our task is to write a procedure 'generate' that given a phrase, ph, and a grammar, G, will generate a random phrase of type ph from grammar G. In order to do this in a reasonable way, we will introduce some abstraction barriers. Define simpgrammar as above. Then complete the definition of each of the following:

;;; given a grammar rule of the form (ls -> rs) return ls (define (ruleleft rule) ... ;;; given a grammar rule of the form (ls -> rs) return (rs) (define (ruleright rule) ... ;;; given a grammar that consists of a list of rules of the ;;; form (ls -> rs) return a list of the possible rewrites ;;; for the nonterminal nonterm in the grammar specified (define (rewrites nonterm grammar) ... Here are some sample applications: : (ruleleft '(sentence -> (article nounphrase verbphrase)) ) sentence : (ruleright '(sentence -> (article nounphrase verbphrase)) ) ((article nounphrase verbphrase)) : (rewrites 'sentence simpgrammar) ((article nounphrase verbphrase)) : (rewrites 'adj* simpgrammar) (() (adj adj*)) Before we define generate, we need one more abstraction barrier: ;;;nonterm? returns a value that will be considered true ;;; (i.e. not #f) if cand is a nonterminal in the grammar ;;;specified by the second parameter. returns #f otherwise. (define (nonterm? cand grammar) (assoc cand grammar)) We also need a couple of functions that we used before: (define (nth n x) ;;;returns item (n-1) from list x (if (= n 0) (car x) (nth (- n 1) (cdr x)))) (define (pickrand lst) ;;;returns a random element of list lst (nth (random (length lst)) lst)) We are now ready to try to define the procedure generate. generate should return a list that consists of a random phrase of the type specified by the first parameter derived using the grammar specified by the second parameter. Here are three examples. > (generate 'nounphrase simpgrammar) (nice class) > (generate 'sentence simpgrammar) (the quiet nice dog ate quietly) > (generate 'sentence simpgrammar) (the student drank quietly) Here is my first attempt: ;;; generate returns a list that consists of a random ;;; phrase of the type specified by the first parameter ;;; derived using the grammar specified by the second parameter. (define (generate phrase grammar) ;;not quite right (define (gen phrase) ;;; helping function which lets me (generate phrase grammar)) ;;; skip the second parameter (cond ((list? phrase) ;;; if the phrase is a list, apply gen (map gen phrase)) ;;;recursively to each element ((nonterm? phrase grammar) ;;; if the phrase is a nonterminal ;;; pick one of the possible substitutions ;;; for phrase and apply generate recursively (gen (pickrand (rewrites phrase grammar)))) (else (list phrase)) ;;;phrase must be a terminal, return it ) ) Here are the results of two invocations: > (generate 'sentence simpgrammar) ((the) (((slow) ((quiet) ())) (dog)) ((ran) (happily))) > (generate 'sentence simpgrammar) ((the) (((dead) ((slow) ((nice) ((red) ())))) (professor)) ((talked) (quickly))) Unfortunately, I got more parentheses than I wanted. What did I do wrong? Please fix it for me and generate some funny sentences. Hint: recall flatmap (not a standard scheme function but one discussed by your authors on page 123).

Now for the payoff. The new improved generate should work for any context-free grammar represented as specified. So try it on the grammar below.

(define limpgrammar '((sentence -> (nounphrase verbphrase)) (nounphrase -> (article adj* noun pp*) (name) (pronoun)) (adj* -> () (adj adj*)) (verbphrase -> (verb adverb) (verb nounphrase pp*)) (noun -> dog class professor student candidate ball dean table) (adj -> red slow dead big little boring nice quiet long-winded) (verb -> ran ate slept drank hit took saw talked) (adverb -> quickly happily well slowly quietly) (article -> the a) (pp* -> () (pp pp*)) (pp -> (prep nounphrase)) (prep -> to in by with on) (name -> george al kim lee terry robin) (pronoun -> he she it these those that) ) ) Here are a couple of the less silly sentences generated: (robin slept well) (al saw well) (a professor by lee slept happily) Here are a couple of the really bad sentences generated: (a dead slow little slow little long-winded dean talked well) (a table with a big dean to kim slept that) (a student by the class in george took a class to lee to a quiet nice student on robin in a dean with the dean on the boring professor by these with these on a dean to these) You can get some pretty stupid sentences from this grammar. We have not done anything about case (e.g. 'on he') or whether a verb should take a direct object. But you can see that with some refinement, it might be possible to build a sentence generator. You may want to try to make up better grammars. Take a linguistics course to learn more.

Now try to generate some arithmetic expressions using the following expression grammar. Save anything that is valuable BEFORE you try this.

(define exprgram '((expr -> (ident) ( "(" expr ")" ) (numb) (expr + expr) (ident) ;;;repeats to make sure pickrand can stop (expr * expr) (numb) (expr - expr) (ident) (expr / expr)) (ident -> x y z cat force mass) (numb -> 1 2 3 40 50 60))) We have to make the parentheses strings because parentheses have special meaning in scheme. We have to repeat (numb) and (ident) on the right side of expr in order to lessen the probability of very long sequences of recursive calls to expr. : (generate 'expr exprgram) (x + cat) : (generate 'expr exprgram) ("(" cat - 3 * 1 - y - "(" y ")" ")") You can get rid of the quotes around the parentheses by invoking display, e. g. : (display (generate 'expr exprgram)) (( 2 ) - z) The procedure generate is another example of data directed programming. The data (in this case the grammar) tells the program what to do. This data consists of a set of rules so this is also an example of a "rule-based production system".