I found the Rumelhart and McClelland article to be very compelling, though I had a few questions about their process. Why do they start with all of their connections set to zero (p. 201) instead of using random starting weights? On p. 221, they compare giving the model the target output to a parent correcting their child's incorrect words. I'm not sure if it would make a difference here or not, but I was a little curious about the effect of the fact that kids are NOT always corrected... if they're speaking and no adults are around, or adults are around but don't correct them (because they think the mistake is "cute" or for whatever other reason), would this reinforce the form that the kid used, since feedback is usually only given when the child makes a mistake? Or would it neither strengthen nor weaken the same response in the future? I also wondered why the more frequent irregulars were not still presented to the model more often than the less frequent words during the second phase of the training (don't the 10 most common verbs continue to be the most common words that the child hears/uses? And shouldn't that have an influence on the learning rate?)
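On the zero-weights question, I tried to convince myself that starting at zero isn't obviously fatal for a network with no hidden units. Below is a little sketch I threw together (mine, not theirs; the sizes, the random "teacher" targets, and the learning rate are all made up) using the same kind of perceptron-style error correction they describe. Different weights diverge from zero as soon as the inputs and error signals differ, so there is nothing like the symmetry problem you would worry about with hidden units.

```python
import numpy as np

# Toy sketch (my own, not R&M's actual code): a single-layer pattern
# associator trained with a perceptron-style rule, starting from
# all-zero weights.

rng = np.random.default_rng(0)
n_in, n_out, n_patterns = 20, 10, 30
X = rng.integers(0, 2, size=(n_patterns, n_in))      # binary "Wickelfeature-like" inputs
teacher = rng.normal(size=(n_in, n_out))
T = ((X @ teacher) > 0).astype(int)                  # targets from a random linear-threshold "teacher"

W = np.zeros((n_in, n_out))   # all connections start at zero, as in the paper
b = np.zeros(n_out)
lr = 0.1

for epoch in range(200):
    for x, t in zip(X, T):
        y = ((x @ W + b) > 0).astype(int)   # threshold output units
        err = t - y                         # perceptron-style error signal
        W += lr * np.outer(x, err)          # weights only move where the error is nonzero
        b += lr * err

y_all = ((X @ W + b) > 0).astype(int)
print("fraction of output bits correct:", (y_all == T).mean())
```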
Then I read the Pinker and Prince article and ended up feeling like I'd been completely duped by Rumelhart and McClelland. :) I thought they raised some excellent points that made me question the validity of the results of the R&M article. They point out that the R&M model can just as easily learn processes like mirror-image reversal, which occur in no language, while stem reduplication (which does occur in natural languages) cannot be described generally by the R&M model. Even more important, to me, is their argument about generalization. I find it very disconcerting that a model like the one posited by R&M could not make simple voicing generalizations, or generalize across the perfect/passive participle and verbal adjective forms that are the same for many words. Finally, they point out that the model does not explain how homophonous verbs could have different past tenses, since they are identical phonologically. One question I did have about the Pinker and Prince article was why they felt that the model did not in any way represent the relatedness of words like write and written, in which the written vowel remains the same but the spoken sound changes somewhat. In these cases, it seems like the model would be unchanged for some of the vowel features and changed for others, representing a shift in the vowel without a complete change to a new, unrelated vowel.
Rumelhart and McClelland:
p. 196 mentions "perceptual facilitation" of letters in made-up words in R&M's early model. This reminded me of our discussion of this very phenomenon in Intro to CogSci last year. It's neat to see it come back, explained in a connectionist manner more overtly or explicitly than it was then. I wonder if similar success is found when examining the visual (image) or auditory analogs to this phenomenon.
I've also been reading Elman in preparation for the midterm project. I noticed that Elman seems to be dealing with complex networks with an advanced ability to simulate performance, whereas the network in the R&M paper seems to be a simpler model, with the focus more on how well it simulates learning or development. I think it's important to make this distinction, and to decide what level of attention you wish to give to development and performance when conducting an experiment on an artificial NN. Have studies analogous to this past-tense acquisition study been done, examining the acquisition of syntax?
It seemed to me that the biggest weakness of the model presented in R&M is that the model is always presented with the root verb and the correct past tense (in any single trial). This seems like a rather gross simplification (R&M acknowledge this to some limited extent) and not very true to real life and the actual circumstances in which children have to learn the past tense. It isn't clear to me how much considering the true learning circumstances (as described briefly in the paper) would alter the results of the simulation; I don't know enough about developmental psychology or linguistics to hazard a guess. But it seems like rather a lot of complication to discount. While assuming a certain level of abstraction has already been achieved seems appropriate for models designed to simulate performance, I think a model designed to simulate development should be truer to developmental circumstances.
p. 237: are type VII verbs actually learned relatively quickly by children?
pp. 239-240 predict that verbs with relatively little change in the final phonemes from base to past tense will tend to exhibit more frequent past + ed regularization than verbs with greater root-to-past phonemic change. Has this prediction been tested (on real people) since the paper was written? If so, what were the results?
p. 241 uses the proportion of Wickelfeatures correctly generated, averaged over all verbs, as a measure of the overall performance of the network. This doesn't seem to me to be a very useful test. I want to know what percentage of the verbs it got totally right, not what percentage of each verb it got right averaged over all the verbs. Couldn't a large part of the latter figure be accounted for just by the usual similarity between the root and the past tense? Suddenly 90% doesn't seem like such an impressive figure.
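To make concrete why I'm suspicious of the averaged measure, here's a toy calculation with completely made-up numbers (nothing to do with their actual data): per-feature accuracy averaged over verbs can sit at 90% while only a third of the verbs come out entirely right.

```python
# Toy illustration (invented data, not R&M's results) of how per-feature
# accuracy averaged over verbs can look much better than whole-item accuracy.
predicted = {
    "walked": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # 1 = Wickelfeature correct
    "came":   [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "went":   [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
}

per_feature = sum(sum(v) / len(v) for v in predicted.values()) / len(predicted)
whole_item  = sum(all(v) for v in predicted.values()) / len(predicted)

print(f"avg. proportion of features correct: {per_feature:.2f}")   # 0.90
print(f"proportion of verbs fully correct:   {whole_item:.2f}")    # 0.33
```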
p. 244: has this double marking been found (scientifically) in real life?
I have not yet skimmed the (monstrous) Pinker and Prince article; apologies, etc. I'll try to take a look at it in the morning.
On Pinker and Prince: their main conclusion seemed to be something like "Rumelhart and McClelland's model doesn't properly account for the evidence, so our model must be right!" I wasn't impressed by their logic.
One of the things that they harped on a lot was the fact that the RM model easily models rules, like the reverse-everything rule, that occur in no known language. But a rule-based system has no restrictions of any kind on what kind of rules can be put into it. There is no reason why their model could not have just such a rule, and they offer none. In fact, this is a general characteristic of the article, that Pinker and Prince point out something that the RM model does or cannot do, but offer no indication of how a symbolic rule-based system would be different.
Another thing I noticed was what seemed to be a lack of understanding of the functioning of connectionist systems on the part of Pinker and Prince: they talk about the difference between generating the past tense of regular verbs by rule and simply memorizing the forms of irregular verbs, but in a connectionist system the distinction between memory and processing is lacking.
Pinker and Prince also seemed to spend a lot of time trashing the Wickelphones and Wickelfeatures. I agree with them that the representation is not ideal, but it is still just a representation. Maybe it has more importance than I think, but they seem to be attaching too much to it.
I had some comments on the Rumelhart and McClelland too, I think, but I can't find the article right now.
I am somewhat torn over Rumelhart and McClelland's use of Wickelphones and Wickelfeatures. On one hand, I like the direct influence of context in the input. It seems natural that we interpret phonemes in relation to the sounds surrounding them, and not as solitary units. On the other hand, I'm divided over the atemporal nature of the input. We definitely hear words as a chain of sounds; is it really reasonable to assume that we process a word as an amalgamation of all of its component sounds? Also, throwing all of the Wickelfeatures together deemphasizes the importance of proximity beyond the neighboring phonemes. If the network finds it necessary to look beyond the neighboring phonemes for context, it has to make a complicated set of transitive associations. To some extent, temporal information still exists in that there's probably only one way the Wickelphones (and to a large extent the Wickelfeatures) can tie together, but to extract that temporal information would require a great deal of overhead. I guess I'm wondering if it's really fair to ignore the temporal aspects of language and focus the network on immediate context. Would some string of Wickelphones be more accurate? (Like using a group of input nodes representing the first Wickelphone, a group representing the second, etc.)
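For my own sanity, here is roughly how I understand the Wickelphone decomposition itself, as a little sketch in my own notation (letters standing in for phonemes, '#' for word boundaries, and no Wickelfeature blurring):

```python
def wickelphones(word: str, boundary: str = "#") -> set[str]:
    """Decompose a word into context-sensitive trigrams (Wickelphones).

    A rough sketch of my reading of R&M: each phoneme is represented
    together with its left and right neighbors, '#' marks the word
    boundaries, and the result is an UNORDERED set.
    """
    padded = boundary + word + boundary
    return {padded[i - 1:i + 2] for i in range(1, len(padded) - 1)}

print(sorted(wickelphones("came")))   # ['#ca', 'ame', 'cam', 'me#']
```

The part that bothers me is right there in the return type: the result is a set, so the order only survives implicitly in the overlaps.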
I'm utterly confused by the output. The section on Generating Overt Responses completely lost me. I think I understand what they were doing with tying Wickelfeatures to Wickelphones in the first part of the section (216 to mid 219), but I'm confused by how they justify the constraint that the Wickelphones ``fit together'' (218). On one hand, the representation of a word as a whole bunch of Wickelphones/Wickelfeatures at once promotes the idea of atemporal interpretation of the word, but the fitting together re-imposes temporality by only allowing Wickelphones that chain together.
The second part of the Generating Overt Responses section (mid 219 on) also confused me. Were the two overt response translation mechanisms tied to the Wickelphones or the Wickelfeatures? They talk about both ``translating the pattern of Wickelphone activations'' and ``allow[ing] the Wickelfeature units to excite consistent response units'' (219). Pinker and Prince describe the mechanisms as being tied to the Wickelfeatures (95), but then how are they translating Wickelphone patterns? Why not tie them to Wickelphones? I'm confused. Also, what did they use to train the network? The overt responses or the output described in the first part of the section? Was the overt response model (e.g. the whole-string binding network) just used for analysis?
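The way I ended up picturing the ``fit together'' constraint is something like the greedy chaining sketch below. This is my own toy reconstruction, not their actual whole-string binding network: each successive Wickelphone has to overlap the previous one in two symbols, starting and ending at the word boundary.

```python
def chain(trigrams: set[str], boundary: str = "#") -> str:
    """Greedy sketch of re-imposing order on an unordered Wickelphone set.

    My own toy reconstruction of the ``fit together'' idea, not R&M's
    actual whole-string binding network: each successive trigram must
    overlap the previous one in two symbols, starting and ending at
    the word boundary.
    """
    current = next(t for t in trigrams if t[0] == boundary)
    word = current[1]
    while current[2] != boundary:
        current = next(t for t in trigrams if t[:2] == current[1:])
        word += current[1]
    return word

print(chain({"#ca", "cam", "ame", "me#"}))   # came
```

Of course this greedy version breaks the moment a trigram can continue in more than one way, which I suppose is part of why their actual mechanism has to be more elaborate than this.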
The Pinker and Prince article, in addition to being extremely helpful in deciphering the Rumelhart and McClelland, seems accurate in its criticisms of the RM model. The lack of lexical and morphological input, the oversimplification of the input representation, etc. are all valid arguments against the model, but as Pinker and Prince admit, ``The RM model is just one early example of a PDP model of a language, and Rumelhart and McClelland make it clear that it has been simplified in many ways and that there are many paths for improvement ...'' (172). What I find confusing about the article is how they generalize arguments against the RM model to PDP modeling in general.
Some examples:
I'm only halfway through the Pinker and Prince. I can't believe how long the article is: three times longer than any other I've read in Cognition.
p. 98. To some extent, it should not matter that the Wickelphone/Wickelfeature representation does not allow a network to see the spelling similarity between the present and past tense of such word pairs as write-written, since literacy is not necessary for language.
p. 112. I wish Pinker and Prince would preserve the distinction between a formal regularity in languages (such as "use in a name strips a morpheme of its original content," as in the case of the "Toronto Maple Leafs"), and what speakers of the language actually do with novel cases. For example, to the average person unfamiliar with the team, "Toronto Maple Leaves" might seem better than "Toronto Maple Leafs." Without empirical support, Pinker and Prince are begging the question if they are asserting that people follow this rule for a particular instance without prior experience with the instance or highly similar instances.
Similarly, it should not matter that the RM model allows for such arbitrary maps as S -> S^R (a string mapped to its reversal), which are not found in the world's languages, unless one assumes a priori (as Pinker and Prince do) that an innate, universal grammar exists that prohibits such maps. One of the major accomplishments of connectionism is that a general-purpose network can create very specific, non-arbitrary behavior by focusing on the regularities it discovers in its input. Keep in mind that what seems arbitrary in one language is a persistent rule in another.
While it's fair to attack particulars of the RM model since Rumelhart and McClelland base strong claims on its performance, a more interesting question is whether any eliminative connectionist or revisionist-symbol-processing connectionist model can tackle the problem of language acquisition. Obviously, the implementation of such a model would have to be orders of magnitude more complex than the RM model in order to handle the richness of language Pinker describes. Among other modifications, the model would have to include at least one layer of hidden units to solve such problems as differential inflection for homophones (which reads like a problem of linear inseparability), as well as a more life-like input representation that would allow for access at the lexical level.
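Just to make the linear-inseparability point concrete for myself, here's a toy sketch (mine, unrelated to either paper's actual implementation) of the kind of mapping a single layer of weights cannot compute but one hidden layer can; the weights are hand-picked rather than learned.

```python
import numpy as np

# Toy sketch (mine, not from either paper): an XOR-like mapping is the
# textbook case of linear inseparability.  No single layer of weights
# gets it right on all four inputs, but one hidden layer does.

def step(x):
    return (x > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Single layer of weights: the best a linear threshold can do is OR (or AND, etc.).
w, b = np.array([1.0, 1.0]), -0.5
print("single layer:", step(X @ w + b))        # [0 1 1 1]  -- OR, not XOR

# One hidden layer: h1 computes OR, h2 computes AND, output = OR AND NOT(AND).
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -2.0])
b2 = -0.5
H = step(X @ W1 + b1)
print("hidden layer:", step(H @ W2 + b2))      # [0 1 1 0]  -- XOR
```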
Wickelphones--while a good idea, perhaps, in theory (the trigrams apparently obviate the problem of ordering data)--can I just say that, if you were doing this research and had to refer to Wickelphones and Wickelfeatures all day long, would you not eventually burst out laughing? Okay, enough digression, but aside from the humor value inherent in their name, these Wickelwhatevers seem like they would only work for so long. The more words you introduce into the system's vocabulary, the higher the likelihood that a Wickelphone will repeat itself (or so it would seem to me). Perhaps they're unique enough that that won't happen--but even so, the trigrams seem to me to be another ridiculous abstraction away from the point. Yeah, it's a neat analytical tool, but what the heck do secondary German-sixth pivot chords in D-flat have to do with E major? I mean, what do Wickelthings have to do with actual speech perception? (Help, my classes have stopped making sense within their own frameworks!) Also--okay, it's a neat analytical tool, given that all the data have to go unordered into the network, but perhaps I missed a basic point here: why does it have to be unordered? This again seems kind of weird and artificial. Obviously, going into someone's ear, phonemes are ORDERED. That's kind of, oh, the BASIS of what makes up speech, and to destroy that by abstraction (especially by something named Wickelphones!) seems to kill the experiment from the beginning.
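For what it's worth, the repetition worry is easy to see with a self-contained toy check (letters standing in for phonemes, '#' for word boundaries): any word that contains the same trigram twice gets under-represented by an unordered set.

```python
# Self-contained toy check of the repetition worry: a trigram that
# occurs twice in a word shows up only once in an unordered set, so
# the set under-represents the word.
word = "banana"
padded = "#" + word + "#"
trigrams = [padded[i - 1:i + 2] for i in range(1, len(padded) - 1)]
print(trigrams)        # ['#ba', 'ban', 'ana', 'nan', 'ana', 'na#']  -- 6 trigrams
print(set(trigrams))   # only 5 distinct Wickelphones; the second 'ana' is lost
```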
Like Julie, I wondered why R&M started their network with weights all at zero--hadn't we been told that starting that way sometimes impedes learning? Then Pinker et al. said that that was essentially the same as having it randomly weighted in the beginning, which I don't quite understand.
Andrew pointed out that a huge fault of R&M seems to be that the network is presented with the present and past forms of a verb in a single trial, and that's really not what happens in real life. Two responses to this. One: perhaps it is what happens, in a way, after the initial stage (Stage 1 or whatever R&M called it). After learning the present forms, children must know what verb they want to use in the present before they try to produce the past form, yes? In order to get from the idea "I want ice cream" to what happened yesterday, the child has to think want -> wantED. But maybe this is only after they've generalized the rule and stopped memorizing separate lexical entries, which IS the initial stage, and therefore kind of takes down whatever rebuttal I just made, and provides me a segue to my second response. Two: if indeed presenting the network with both forms of the verb simultaneously is a very artificial way of going about things, how the heck are we supposed to do it? Reading R&M and then the (tome of a) response, I'm wondering how my group can even attempt to model past tense for the midterm. But I suppose those are questions for a different format and a (slightly) different audience.
The fact that homophony isn't accounted for by the RM model is weird. Then again, the only way TO account for it would be to incorporate other aspects besides phonological representation (syntax, semantics), which would lead to a much larger and much more complex system, which was not the point of the R&M model. Hmm, is there any end to any of this without modeling the whole brain? The more I read about this, the more it seems the answer is NO. I just realized that my sophomore paper's due at noon, so I have to jet.