At the time of writing this response (3am), I'm less than halfway through the paper, and I haven't really gotten to the good part yet. It sounds like a really cool experiment, and I hope I have the chance to read the rest of the paper over lunch. Unfortunately, I must now go to bed, because I have a midterm exam in the morning. My thoughts thus far:
-The title had me hoping for a connectionist architecture that really implemented recursion somehow. I was slightly disappointed when it was just a regular old SRN. ;-) But that's just my interest in exploring the computing capacity and capabilities of NNs. As far as the experiment goes, the SRN architecture seems reasonable. After all, it appears that language performance (as opposed to competence) is not truly recursive either; at least, not infinitely so.
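For anyone who hasn't played with one: here's a minimal sketch of an Elman-style SRN in Python/NumPy, the way I understand the architecture (the sizes and details here are my own toy choices, not the paper's). The only "memory" is the copy of the previous hidden state, which is why its recursion is bounded rather than true recursion:

import numpy as np

class SimpleRecurrentNetwork:
    """Minimal Elman-style SRN: the hidden state is copied into
    'context' units and fed back in alongside the next input."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 0.1, (n_hidden, n_in))
        self.W_ctx = rng.normal(0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0, 0.1, (n_out, n_hidden))
        self.context = np.zeros(n_hidden)

    def step(self, x):
        # new hidden state depends on the current input AND the previous hidden state
        h = np.tanh(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h
        y = self.W_out @ h
        e = np.exp(y - y.max())
        return e / e.sum()  # softmax: a distribution over possible next words

# one-hot words in, next-word prediction out
srn = SimpleRecurrentNetwork(n_in=10, n_hidden=8, n_out=10)
sentence = [np.eye(10)[i] for i in (3, 1, 4)]
predictions = [srn.step(word) for word in sentence]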
-p. 166: Help! I do not understand the statistical benchmark thing. Hopefully that will be explained in more detail later in the paper. I need that cleared up, either by the paper as I read further, or by someone who understands that paragraph better than I did.
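My best guess, for what it's worth: a "statistical benchmark" is probably just a simple n-gram model trained on the same corpus, so the SRN has to beat a predictor that only looks at the previous word or two. That's an assumption on my part, but a bigram version would look something like this:

from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each other word."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        for prev, nxt in zip(sentence, sentence[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, prev):
    """Probability distribution over the next word, given only the previous one."""
    counts = follows[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

corpus = [["N", "V"], ["N", "N", "V", "V"], ["N", "V", "N", "V"]]
model = train_bigram(corpus)
print(predict_next(model, "N"))  # {'V': 0.8, 'N': 0.2}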
Linguistic question: does English have cross-dependency recursion? The cross-dependent examples seemed very foreign to me, more so than the other kinds of recursion. If English exhibits this structure, could someone give me an example? If it doesn't, what languages do?
In a similar vein: the corpus constitution struck me as a bit unnatural. This may be reasonable and necessary for initial experimentation, or I may have a misunderstanding of the linguistic reality here, but it seems to me that a corpus with one recursive structure mixed with right-branching isn't very true to real natural language. If my instinct is correct, that's a pretty severe simplification. I would think that natural language would exhibit a more complex mixture of structures, and probably not in such clean ratios. Am I wrong about this? Will a given language generally only exhibit one of these three recursive mechanisms?
I have a general concern about the Christiansen & Chater paper: the network's ability to predict the next input unit is compared to grammaticality judgements of human subjects. While the grammaticality judgements are (at least in theory) a reflection of competence, I'm not so sure the same can be said about the ability to predict the next input.
Another thing on Christiansen and Chater: they briefly talk about different languages having different kinds of (combinations of) recursion. It would have been interesting to model some specific combinations and see whether performance depends on what other kinds of recursion appear in the language.
I was surprised that Allen & Seidenberg chose to lesion nodes between the semantic and hidden layers when modelling Broca's aphasia. The reason is that, at least to my understanding, Broca's aphasia is above all else a production disorder, not a comprehension disorder (although receptive impairment may be part of it). However, the results show that production is actually less impaired than comprehension of syntactic structures. The area of the brain traditionally associated with comprehension is Wernicke's area. So either function is not as localized as was thought, or in some cases of agrammatism it is more the connection between Broca's and Wernicke's areas that gets injured than Broca's area itself, resulting in comprehension problems, or they lesioned the wrong part of the network...
My biggest problem in Allen and Seidenberg so far is that there is no explanation of their concept of modelling 'comprehension.' From their very brief and mysterious explanation of the 'comprehension task,' it sounds like the network was only supposed to 'understand' individual words, which it did by showing the desired state of the 'semantic' nodes 11 'ticks' after the word was entered. In other words, if I'm getting it right, all it does (if it learns correctly) is spit a word back out (though in a different form) 11 steps after the word is input. I'm not clear how this would be any kind of 'comprehension,' how it would depend at all on sentence structure, or how it would show anything about the network capturing the interactions of the meanings of different items in a sentence. The delay and the arbitrary nature of the association from one representation to another are both impressive, and I don't understand how they would work. But it's not clear to me that they require, even metaphorically, understanding 'the semantics of a sequence.'
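To make my confusion concrete, here is what I think the training data for the 'comprehension task' boils down to, if I'm reading it right (this is my interpretation, and the word-to-feature pairings are invented, not theirs):

DELAY = 11  # ticks between a word's input and its required semantic output

word_semantics = {            # hypothetical word -> semantic-feature mappings
    "dog":    (1, 0, 0, 1),
    "chased": (0, 1, 1, 0),
    "cat":    (1, 0, 1, 0),
}

def comprehension_pairs(sentence):
    """Yield (tick_in, word, tick_out, target_semantics) training pairs.
    Note: each target depends only on its own word; nothing about the
    rest of the sentence enters into it."""
    for t, word in enumerate(sentence):
        yield t, word, t + DELAY, word_semantics[word]

for pair in comprehension_pairs(["dog", "chased", "cat"]):
    print(pair)

If that's the task, then the targets are a pure word-by-word lookup with a delay, which is exactly why I don't see where 'comprehension' of a sequence comes in.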
The same seems to apply to the 'production task': the input here is 'meaning sequences'; in other words, semantics is here understood as really consisting of strings of words, structured identically to the desired syntactic output. So the work the network has to do in using syntax is zero. Yay for the network. If this were to show anything about a network simulating the proper linkage between 'meaning' and syntactic form, it would only be on the condition that 'meaning' (e.g. 'mentalese,' or 'logical form') already has syntactic structure which happens to be identical to that of English. In which case, there would be nothing significant that the authors would be trying to model... and it would indeed be a success that the authors have modelled nothing significant. Of course I understand that the way the meanings of semantic units really interact in language ('binding' and so on) is complicated and difficult to represent linearly and concisely, and I don't know how to do it. But it seems to me this is the whole point. If we solve this problem for the network, what problem are we really left with?
Oh, this might be a simple error, but do their average 'percentages', supposedly 0.95 percent (?) of words (not sentences, of course) correctly comprehended and 0.93 percent correctly produced, really show that these are 'simple tasks' for the network? The last class I got a 0.95 percent in, I failed.
As I was saying: it seems to me that, given that success in these tasks is measured only by mapping the form and meaning of individual words, if both 'form' and 'meaning' are given identical syntactic structure ('patterns'), what we are left with is trivial and no longer shows anything relevant to the actual relationship between syntax and semantics. And in this case, not even by broad analogy. Unless... in the section on 'grammaticality judgment,' they suggest that the network has come to associate formal and semantic 'patterns,' not just the individual words, and that inputting a 'pattern' on either level of representation stimulates the production of the corresponding pattern on the other. It would be useful to know how this works, and how they know. If the task was 'trivial' and the network was always getting the right word, how do we know it was significant that the network always got the right category of word? (Their 'patterns' were defined in terms of categories like ACTION, MANNER and so on, right?) They say the correspondence of 'patterns' between the two layers is 'a property of the network.' All that's clear to the reader at this point is that they WANTED this to be 'a property of the network'; they haven't demonstrated in what way it in fact is, or that it isn't just an emergent property of a task which could be seen more simply as delayed rote association of words: input and output do have the same structure, but so far there is no evidence that the network knows that this is even structure.
Hmmm, I guess the point is that the correspondence of patterns on the two levels IS an emergent property of the delayed word-association task, BUT that this can mirror grammaticality, in that if the input deviates from the expected pattern, word association is deranged. Now, we know that the delayed word-association task is an extreme and odd abstraction, and furthermore that we do understand the words in real ungrammatical sentences, so I'm still confused about what they think they're showing. At most, they show the possibility 'that an ungrammatical sentence results in a deviation from the normal course of processing'; however, I think formal linguists would agree with this. They would just want to know: how and why? In this model, the answer seems to be that what is 'normal' is whatever the network is used to, or was trained on, and that there are no rules governing the process, so that to ask these questions is useless. In this model, it would be totally impossible for syntax to be underdetermined by children's linguistic exposure. I wonder how much has been written trying to prove this last point; it certainly gets repeated all over the place without evidence or clarification.
Finally, the Linebarger discussion they quote at the end, to show formal linguistics in a maximally bad light, is weird because: 1) He's wasting time trying to figure out what the syntactic structure of the sentence would be IF the grammar could generate such a syntactic structure; I don't know if anybody really thinks you should do this, if asked directly. 2) He inexplicably ignores that English never has [trace] (i.e., an element seemingly 'missing' from the syntactic structure, which IS grammatical in 'Who did Frank think [X] was going to get the job?' and '...the man Frank thought [X] was going to get the job') unless an earlier element in the sentence (relative or interrogative) tells you it may be coming. After a relative or interrogative followed by an NP, as in the two examples I just gave, it's a foregone conclusion that [trace] is coming: the following NP (subject) shows that the relative or interrogative isn't playing the subject role; it must be playing some other role that is normally represented later in the clause, where we put a [trace]. 3) If you know that 'think' takes a full sentential clause as a complement, that trace works as I have just said it does, and that 'Was going to get the job' is ungrammatical in English (we need a subject, however you want to explain this), it shouldn't be mysterious that Linebarger's sentence is ungrammatical. Possibly, what we need is not an overthrow of linguistics, but fewer rambling and overly technical linguists.
The connectionist model proposed by Allen and Seidenberg seems like a much more robust version of Elman's (1990) model. Interestingly, the article tackles unforeseen issues with the previous models of language, and provides a good basis for the network they create.
The issue of the competence/performance distinction seems valid, although the argument presented is somewhat hard to comprehend. As for the non-linguistic components of speech, I totally agree about the importance of every detail and seemingly redundant piece of information in bolstering a child's ability to acquire language, but I simply have an issue with the idea that these bits of info are 'non-linguistic'. I'd think that every piece of data coming from a person's mouth relates to linguistics in some form, and just because a few linguists in the past haven't dealt with them because they were extraneous doesn't mean that they don't have a linguistic quality to them.
The approach they take to learning grammar is an interesting one: having the model produce sentences and comparing them to the input. However, what happens when the sentence is input as a passive, and the network creates the active form instead? Wouldn't the distance between them be great even though the passive form is just as grammatical?
Anyway, one of the most compelling aspects of the model is that the results account for the 'colorless green ideas' and agrammatic patient data. This stuff should definitely be accounted for when trying to model language acquisition.
Not being the cynic that Sean is, I found the Christiansen and Chater article very convincing. That the same kinds of confusion and forgetfulness emerge from their models as from humans, when the models are so simple, is surprising.
I don't know anything about human syntax, and I thought of a couple of questions while reading C & C. First, could someone give me an example of an English rule that cannot be expressed as a set of context-free rules? I can't think of one. Second, are there rules in any language that cannot be expressed as context-sensitive rules? (i.e. Are all real grammars context-sensitive?)
A relatively minor point on C & C p. 172 caught my eye. They mention experimental data that seems to suggest that "constraints on complex recursive structures ... may derive from non-linguistic constraints." I was just wondering how what Pinker promotes in The Language Instinct relates to this. Would he argue that it all derives from the same working memory constraints? If anything, the SRN models seem to connect the constraints to the processing structure, and this data seems to imply that the same underlying structure processes more than just language.
One result that was very non-intuitive to me is how little changing the number of hidden-layer nodes affected performance. One would think that increasing the number of hidden nodes would allow more information to be stored, thus dramatically improving results at depths of 3+. Any ideas as to why the additional nodes weren't used by the network? Would tweaking the learning parameters help?
I'm kind of confused by the Allen and Seidenberg article. Maybe I'm just confused about the setup. First, the simple form-to-meaning stuff. If I'm reading p. 12 correctly, they have 97 input nodes, only one of which is on at a time. These nodes are connected to 50 hidden nodes (p. 10), which are connected to 297 output nodes. Both the input and output nodes are connected to two 15-node clean-up hidden layers. Then they present results based on novel data (pp. 15-16). So do the novel utterances contain new words, or just new sequences of words? If just new sequences, why is it weird that each word produces the same output as during training? How is that novel data? Am I just misreading the entire thing?
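Just to check my reading, here's the geometry written out as a toy sketch. The layer sizes are the ones I pulled from pp. 10-12; the connectivity of the clean-up layers is my guess, and the weights are obviously untrained:

import numpy as np

# Layer sizes as I read them from the paper (pp. 10-12)
N_FORM, N_HIDDEN, N_SEM = 97, 50, 297
N_CLEANUP = 15  # size of each clean-up hidden layer

rng = np.random.default_rng(0)
W_form_hid = rng.normal(0, 0.1, (N_HIDDEN, N_FORM))
W_hid_sem  = rng.normal(0, 0.1, (N_SEM, N_HIDDEN))
# My guess at the clean-up layers: small loops attached to the form
# and semantic layers that let activation settle over several ticks.
W_form_cu = rng.normal(0, 0.1, (N_CLEANUP, N_FORM))
W_cu_form = rng.normal(0, 0.1, (N_FORM, N_CLEANUP))
W_sem_cu  = rng.normal(0, 0.1, (N_CLEANUP, N_SEM))
W_cu_sem  = rng.normal(0, 0.1, (N_SEM, N_CLEANUP))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One word = one form node on at a time, per p. 12
word = np.zeros(N_FORM)
word[3] = 1.0
hidden = sigmoid(W_form_hid @ word)
semantics = sigmoid(W_hid_sem @ hidden)  # 297-node semantic pattern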
Next, the grammaticality checker. So, using the same network, they activate the form nodes, then wait a bit and see if the network stabilizes to an equilibrium in which the form nodes are the same. If so, grammatical; if not, ungrammatical. Is this even close to right? Do they wait for stabilization, then check the distance? Or do they just wait "several ticks" and see what happens? Are all of the results based on one network, or did they train many times and average? I'm not quite sure what's going on.
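If I had to code up what I THINK they're doing, it would be something like this; the update rule, the number of ticks, and the threshold are all pure guesses on my part:

import numpy as np

class ToyAttractorNet:
    """Stand-in for the trained network: one recurrent update per tick."""
    def __init__(self, n, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.1, (n, n))
    def update(self, state):
        return np.tanh(self.W @ state)

def grammaticality_check(net, form_pattern, n_ticks=20, threshold=0.5):
    """My guess at the procedure: set the form nodes, let activation
    flow for several ticks, then measure how far the form layer has
    drifted from the input. Small drift => 'grammatical'."""
    state = form_pattern.copy()
    for _ in range(n_ticks):
        state = net.update(state)
    return np.linalg.norm(state - form_pattern) < threshold

net = ToyAttractorNet(8)
pattern = np.eye(8)[2]
print(grammaticality_check(net, pattern))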
What's this "continuous activation function" (p. 10) of which they speak? When they set the form nodes during training and test runs, do they maximize/minimize it?
In general, C & C are a lot more thorough in their handling of potential criticisms, whereas A & S seem mostly interested in throwing a model out, saying it's viable, and seeing what people think.
I'm a little confused by their "counting recursion" constructions. The first time they mention them, they give the example sentences "if if if S1 then S2 then S3 then S4" and "if if S1 then S2 then S3." I can't imagine anyone actually SAYING those, though, and I kept waiting for them to give an example of counting recursion that would be at least slightly more likely to be uttered in normal speech. They gave sample English equivalent sentences again when they described their benchmarks, but this time they didn't even attempt to give an example for counting recursion. Instead, they said that they would have the pattern "aabb" be equivalent to "NNVV" -- a decision that seemed somewhat arbitrary without any sample sentences. They also said it was expected that the network would perform best on the counting recursion, and that people perform well on that kind of recursion too (I think) -- which seems surprising if there aren't any good natural-language examples of it that sound like they would ever be spoken. Their only examples sounded really confusing even with one level of recursion...
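To convince myself I understood the abstract patterns, I wrote out toy generators for the three recursion types (my own reconstruction of the paper's artificial languages, not their code):

def counting_recursion(n):
    """a^n b^n: any noun can 'close' any verb; only the counts must match.
    E.g. n=2 -> N N V V, the pattern C&C map to 'aabb'."""
    return ["N"] * n + ["V"] * n

def center_embedding(pairs):
    """Nested dependencies: the LAST noun pairs with the FIRST verb,
    as in German-style center-embedding."""
    return [f"N{i}" for i in pairs] + [f"V{i}" for i in reversed(pairs)]

def cross_serial(pairs):
    """Crossed dependencies: the FIRST noun pairs with the FIRST verb,
    as in Dutch-style cross-dependency."""
    return [f"N{i}" for i in pairs] + [f"V{i}" for i in pairs]

print(counting_recursion(2))         # ['N', 'N', 'V', 'V']
print(center_embedding(["1", "2"]))  # ['N1', 'N2', 'V2', 'V1']
print(cross_serial(["1", "2"]))      # ['N1', 'N2', 'V1', 'V2']

Seeing it this way, "counting" really is the weakest constraint of the three, since it tracks no pairings at all, which I suppose is why the network was expected to do best on it.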
Christiansen and Chater
The competence/performance debate is an extremely important battleground for connectionists trying to model linguistic behavior. Since full-blown recursion requires a computational device with the power of a Turing machine, a modeler hoping to give a subsymbolic account of language must argue convincingly that a human’s comprehension and production of phrases relies only on "quasi-recursive" cognitive processes. To this end, Christiansen and Chater begin their paper with one example after another where human performance does not seem in line with the capacity for recursion described by Chomsky.
The traditional way of explaining poor human performance at tasks involving recursion while still retaining the symbol-processing account – namely, positing processing constraints that supposedly mimic human working memory limitations – seems ad hoc and incapable of explaining the richness of experimental data. While the network simulations run by Christiansen and Chater have their own share of problems, I believe that such a quasi-recursive account ultimately provides a better explanation of how people actually handle embedded grammatical structures.
Daniel Fairchild
In general, I found the parallels between the performance of SRNs and humans that the article described to be pretty compelling. In particular, the evidence that it was the general structure of the SRN, rather than the architecture of a specific network, that produced that performance was cool.
One problem: on pages 191-192, they seem to be making predictions about human performance on embedded if-then constructions based solely on the nature of the constructions (and they seem to call this empirical, too). This doesn't seem like proper science to me.
I left for the weekend without my readings, so I was not able to read the one we received in class.
Allen and Seidenberg's paper was a welcome read in that it is the first model we have encountered that deals with complex syntactic phenomena. It also attempts to incorporate a limited semantic representation. These two features, in addition to the fact that it attempts to model both production and comprehension, make its findings a lot more palatable than some of the other papers we have read. The hypothesis that syntax may be acquired through statistical means is more believable when the training sequences represent advanced structures as opposed to Tarzan speak.
But this paper lacked some specification that I would have liked. What was their input representation like? The paper seemed to indicate that the types were represented orthogonally, but I'm not sure. Also, what was it about how function words were represented that made them more susceptible to error in processing in impaired networks? Since the semantic representations of function words are not immediately obvious, a more explicit representation would have been nice.
Mostly, though, this approach appears to be a step up from Elman's, which is interesting to me because my midterm group tried to replicate and extend Elman's model. Thus, I am interested in testing this model's capabilities directly, though I think it is too complicated to handle before the end of the semester.