The Protopapas article was very informative albeit odd... it was very much just an overview of different neural network approaches to natural language processing. We were told it was going to be a more general paper than the Nakisa/Plunkett, but I kept on waiting for him to give me his take on things. I probably should have known better, given that his abstract states as much.
All this being said, I liked his overview. His coverage of the major projects in the field was useful as a comparison to the Nakisa/Plunkett model. Of the approaches he discusses, I found the ART model most intriguing, even though the fix for its limitations negates its biological plausibility.
The Nakisa/Plunkett model was clearly the most interesting model described in either paper. What does Pinker think of GA approaches to developing competent neural networks? Also, how prevalent are GA approaches to phonetic processing currently? I would think that allowing evolution and competition of networks would be the most biologically plausible way to mimic a faculty of a highly evolved brain.
The article by Nakisa and Plunkett raised some interesting points. The idea of using several different learning rules in a single network was interesting, as was the combination of genetic algorithms and neural networks. Their results were also fairly interesting, and I was surprised to see that they were able to train their networks so successfully on such a small language sample. Certainly, children are exposed to a much greater language sample than was needed for the networks to distinguish phonemes, so it does not seem unlikely in light of their results that children could learn a surprisingly large part of their language knowledge even if they were subjected to slightly 'impoverished' input...
The Protopapas article seemed to offer a lot more theories and such, and its layout was somewhat bothersome to me at times. Questions were raised that weren't addressed or acknowledged until much later in the article...so I found myself reading about the TRACE network and jotting down notes about all of the glaring shortcomings of the network that weren't being addressed in the article, only to find that he mentioned all of them several pages later (after leaving me to fume over his oversights for a while :) A couple of questions that I was left with:
I'd like to talk about some of the experiments Protopapas mentions that he suggests show successes of neural networks teaching themselves to recognize phonemically distinctive features. Protopapas says, of this type of study in general, that 'successful applications provide an existential proof of the power of statistical generalization in the absence of predetermined structures, whereas patterns of failure illustrate the need for specific [that is, I assume, pre-existing, innate] linguistic constraints on the system.'
Yet it at least sounds like what he goes on to cite as 'successful applications' are really models of systems with some built-in constraints. For example, 'Watrous (1990) achieved remarkable performance on the three-consonant forced choice task...' Here a neural network had to learn to hear a distinction in place of articulation of voiced stops, to classify sounds correctly as either /b/, /d/ or /g/. This actually is not the success from scratch, without built-in features, that he says we are looking for. Though the net had no preexisting conception of the categories it was to find, it knew that it had to sort the data into three categories.
In fact this is more specificity than we really find with human languages. While three places of articulation for stops--labial, coronal, velar--are very common, the system of how many place of articulation distinctions there are is pretty varied. At least one language, Hawaiian, doesn't distinguish [t] from [k] or [n] from [ng] (and here there are no phonemically voiced stops)--rather these sounds appear as variants of the single phonemes written /k/ and /n/. Far more common is to have more than these three places of articulation. For example most languages in South Asia have two places of articulation (dental and retroflex) in our t/d area, and some have a third alveolar distinction. Some languages have palatal (on the roof of the mouth) or uvular (at the back of the palate) distinctions in what English considers parts of the velar region.
In order to 'provide an existential proof of the power of statistical generalization in the absence of predetermined structures,' we would need a study to show not that neural networks can learn certain distinctions that researchers try to teach them, but that neural networks can decide, with different results based on which data they hear, how many distinctions to make in the data and of what kinds they should be. Now to a linguist, it already appears that this is what humans do. Voicing, or the specific places of articulation used by Watrous, are not universally distinctive features. They are just statistically common among the world's languages, and this may have as much to do with our mouths and throats as with our brains. Of course our brains, ears, and mouths and throats have been evolving together, and it would stand to reason that the kinds of distinctions our mouths can make are by now the same distinctions our ears are good at picking up, but this need not say anything about an innate structure in the brain.
Of course the kind of study I'm suggesting we would need for neural nets to really show 'success' may be impossible without a much more complicated, multi-function net. I say this because humans do not learn phonemic distinctions in isolation from the meaning of language. Rather they try to hear distinctions of sound where there seem to be distinctions of meaning, and what are 'minimal pairs' for speakers of one language may be in 'free variation' (and not even heard as distinct) to speakers of another language; speakers learn distinctions based on which acoustic cues seem also to be cues to meaning.
Nakisa and Plunkett
The "white noise" condition interested and puzzled me the most in this
article. The mean fitness for the English-evolved networks trained on white
noise was, according to the authors, significantly lower than the fitness
for those trained on English and other languages. Nevertheless, I’m
surprised at just how well these white noise networks perform (7.17% fitness
versus 7.85% for English). By looking at Fig. 7 (p.120), it is easy to see
that the network gets better with the "number of training sentences"
(whatever that might mean for white noise), exhibiting the same curve as
networks trained with English. How can this be?
Protopapas
The state of the art in connectionist speech perception is less impressive
than I had imagined. There does not appear to be any network sophisticated
enough to parse a spoken sentence into its constituent words, even for
sentences that are syntactically unambiguous (I’d like a network to be able
to parse the sentence "Give me the catalogue," which cannot syntactically be
broken down as "Give me the cat a log", but which Shortlist, nevertheless,
interprets as the latter.). We’re thus nowhere near being able to decipher
sentences about time flies.
The Protopapas reading was fairly straightforward (mostly historical), so I don't have a lot of reaction notes. One question that did strike me has to do with something said on p. 412, about learning the sound system of one's native language "at the cost of losing the ability to perceive acoustic differences that are important for making distinctions in other languages." In Intro to Cognitive Science we discussed a "window" for learning foreign languages, after which it seems the brain gets much less malleable. My question is, is the "window" the result of biological considerations--i.e., hormone levels, maturation, cells dying/system decline--or is this a characteristic of neural networks? If you train a NN early with one set of data, it will have trouble later on integrating a second set into what it knows; if you train it on both sets initially, it achieves fluency in both...is this the explanation for that "window", or are there more dominating biological and economic factors?
As I was reading I found myself wondering what TRACE would do with a word pronounced really ssssssslllllllllloooooooooowwwwwwwllllllllllllyyyyyyyy. This question was mostly addressed later on in the discussion of TRACE's shortcomings, but actually, a bit of explanation about how both TRACE and ARTs work would be helpful to me--I read about them and looked at the diagrams, but I'm still a bit unclear.
I was also wondering how backprop would handle a node that activated itself, and two nodes that activate each other. There seems to be the potential for an infinite loop with the generalized delta rule, but I assume I'm just missing something...
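One standard way to avoid the infinite loop (my addition, not something either reading spells out) is to unroll the recurrent connection over a fixed number of time steps and backpropagate through that finite chain. Below is a minimal sketch for a single self-connected unit. The tanh activation, the squared-error loss, and all names (`forward`, `loss`, `grad_w`, `w`, `xs`) are assumptions chosen for illustration.

```python
import math


def forward(w, xs, h0=0.0):
    """Run one self-connected unit for len(xs) time steps.

    h_t = tanh(w * h_{t-1} + x_t); returns [h_0, h_1, ..., h_T].
    """
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs


def loss(w, xs, target):
    """Squared error on the final activation only (an illustrative choice)."""
    return 0.5 * (forward(w, xs)[-1] - target) ** 2


def grad_w(w, xs, target):
    """Gradient of the loss w.r.t. the self-connection weight.

    The loop runs once per time step, so it always terminates.
    """
    hs = forward(w, xs)
    dh = hs[-1] - target                  # dL/dh_T
    dw = 0.0
    for t in range(len(xs), 0, -1):
        dpre = dh * (1.0 - hs[t] ** 2)    # back through tanh at step t
        dw += dpre * hs[t - 1]            # same weight is reused at every step
        dh = dpre * w                     # error passed back to h_{t-1}
    return dw
```

Because the unrolled chain has exactly `len(xs)` links, the backward pass is an ordinary finite loop. Two units that activate each other unroll the same way, just with two activations per time step.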
I also must confess that I found Protopapas' poor writing habits somewhat distracting at times...he seemed to forget, occasionally, to end his sentences, and the clauses just kept coming. It was at times a bit hard to follow. (I claim exception to the people-in-glass-houses policy, because I'm writing at 4a.m....)
When reading Protopapas' mention of directionality, one question I had was how much impact unlimited connectivity could have on speech perception if one included syntactic and semantic models in the structures and tried to evolve the complete structure as one network. He addresses the issue briefly in the intro, saying that ``a good deal of progress can be achieved by breaking down the ... problem'' (410), but I wonder how valid it is to remove the more abstract elements. Intuitively, one might think that the problems with learning words could be helped by an integrated model. Also, the vast number of lexical candidates could be reduced by simply knowing the part of speech of the next word. Going a bit further, does anyone think conversational context helps parse speech sounds? If we're talking about building a house, do we use that context to expect one of ``carpet, carpenter, or car pollution'' (423)?
Also, I'm curious as to why all of the connectionist models mentioned by Protopapas have such rigid structure. Do less structured models become too complex to process in reasonable time? My first instinct when thinking about this problem would be to have as many connections as possible and have the learning algorithm choose the ones that need any weight. The network described in the Nakisa/Plunkett article, for example, seems to have much less structure. When trying to decipher words instead of phonemes do the less structured networks become too complex to handle?
(Protopapas p. 417) I thought the bit about noisy inputs actually making it easier for the network to learn things because it formed better generalizations was really interesting. It seems much better than current systems, which tend to break down completely if there is any sort of background noise. And it's more realistic, too, since humans certainly don't learn language in isolation from other sounds.
At a few points later in the article, after the section on TRACE, Protopapas seems to be criticizing TRACE for sometimes hearing phantom phonemes. But this is just what humans do; for example, there was the experiment in which an [s] was removed from a word and a cough spliced in in its place, and the listeners could not tell that anything was wrong. And humans frequently mishear words in similar ways.
(Protopapas p. 423) There wasn't much discussion of this, but there was a brief mention of an argument that phonetic representations could map directly to semantic representations, without a localist word level in between. This seems like it might make a great deal of sense.
(Protopapas p. 430, bottom of first column) He used the word "perseveration". The main OED entry for said word is "persevering, perseverance". Just some additional evidence for both the productivity and the redundancy of English morphology. Not to mention the "necessity of learning a new word frequently" theory (I had also never seen the word "orosensory" before).
(Nakisa and Plunkett) I don't quite understand the concept of the term "fitness". What exactly does the number represent? It seems odd that higher is better, but 7.85% is considered very good, as well as much better than 7.17%.
(Nakisa and Plunkett) General stuff that I thought interesting and useful: the combination of different learning rules into a single equation, and its use to allow different networks to use different learning rules. The variability of network architecture, changing the number of subnetworks and the ways in which they are connected. And the variability of the time constant, allowing different subnetworks within a single network to learn at different rates; I don't quite understand what this does specifically, but it does seem like it would produce interesting results.
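To make the "different learning rules in a single equation" idea concrete, here is a minimal sketch. The form of the equation and every name in it (`weight_update`, the four coefficients, `HEBBIAN`, `ANTI_HEBBIAN`) are hypothetical and are not taken from the paper; the point is only that one formula with adjustable coefficients can behave like several different rules, and that a genetic algorithm could then search over the coefficients.

```python
def weight_update(pre, post, coeffs, rate=0.1):
    """One hypothetical parameterized learning rule.

    pre, post: activations of the sending and receiving units.
    coeffs: (k_corr, k_pre, k_post, k_bias), the values a genetic
            algorithm could evolve (an assumption, not the paper's genome).
    """
    k_corr, k_pre, k_post, k_bias = coeffs
    return rate * (k_corr * pre * post + k_pre * pre + k_post * post + k_bias)


# Two coefficient settings that turn the same equation into different rules.
HEBBIAN = (1.0, 0.0, 0.0, 0.0)        # strengthen when both units are active
ANTI_HEBBIAN = (-1.0, 0.0, 0.0, 0.0)  # weaken when both units are active
```

With both units fully active, the first setting increases the weight and the second decreases it by the same amount, even though the code path is identical.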
The "Evolution of a Rapidly Learned Representation for Speech" paper was specifically interested in the innateness of language in infants and how connectionist models with innate architecture and learning rules could model such learning. It is interesting that they concluded by saying that innate architecture and learning rules are not enough. Environment is also important in learning the correct speech sounds. This makes sense and seems to call for an intermediate level between nature and nurture.
What is even more interesting, though, is their claim that only a brief exposure of two minutes is needed for setting the categorical boundaries for phonemic discrimination. Is this true of biological systems as well? Can an infant learn to distinguish speech patterns even before he/she leaves the womb? What if the infant is growing up in a bilingual environment? How can he/she distinguish categorical boundaries for the two separate languages he/she must discern?
From the two readings, competition seems to play a very important role within any model. Both make reference to the TRACE model although it is hardly similar to biological processes. Learning is also a big factor in good connectionist models and ART seems like a move in the right direction. The work done to differentiate LTM and STM is most innovative, and it goes along really well with what is known about psychology today.
It seems like there is much more work to do in the field of speech perception. Right now, it seems like there are too many issues to even tackle.
A. Protopapas: p. 411 - This whole thing about recognizing speech sounds based on the sound wave being really unreliable both makes sense and not. In Phonetics and Phonology a year and a half ago, one of the units we did was acoustics, where half of our homework assignments were to pair up pictures of the sound waves generated by speech with what was being said. Generally, with a little decoding they weren't that hard--the "sh" sound was tall and relatively sustained; consonants all were relatively distinctive. Then again, one of the assignments I'm remembering was matching transcriptions of the ten numerals, which I recall Pinker having specifically cited in a recent reading as being one of the easier tasks speech recognition people have. Also, stuff like later on the page (right-hand column, about "acoustic variability in the realization of lexical items"--based on emotion of the speaker, rapidity of speech, level of background noise, &c.), kind of destroys that. As with the example of DragonDictate last week, this makes sense if you've ever played around with your uncle's transcription program and had to train it to hear your voice, and quickly lost interest with what should be a cool toy. :-) Page 415 swings back in moderate support of my original idea, however:
"In principle, the acoustic signal alone contains sufficient cues for phonetic discrimination"--and of course it has to go on to say:--"with a single speaker and a single speaking rate"--but!--"and that these cues can be extracted automatically by statistical generalization of the sort an artificial neural network can be trained to do."
So--even though it's crazy that it has to be one person speaking at a constant rate, which is ridiculously sub-human in level of performance, speech sounds can still be parsed at a satisfactory level from an acoustic representation! Isn't that somehow relevant and cool?
Another point I wanted to respond to--starting on p. 412, and continuing later, the idea that phonetic analysis is affected by known or possible words. Protopapas asks,
"is phonetic analysis based on properties of the acoustic signal only (and knowledge of phonetics, of course), or does information about known (or possible) words affect how the acoustic signal is interpreted phonetically?"
How can this be a question? And if it is not one, how can it continue to affect connectionist models? (This is another problem I'm having; see below for that.) When you hear a word you've never heard before, chances are that you'll interpret it as a known one, parsing it in some way that makes sense with the vocabulary you already have. You will recognize the individual phonemes, it's true, but only after repetition will you realize that the collection of noise is a separate lexical entry, after which point you can integrate it into your vocabulary. Classical examples of this kind of subjective hearing are Pinker's "mondegreens" (again from last week's reading)-- "a girl with colitis goes by" (instead of "kaleidoscope eyes"). While the listener presumably had entries for both "colitis" and "kaleidoscope" in his/her lexicon, "a girl with colitis" makes a lot more sense to the casual, non-affected-by-the-acronym-for-the-title-to-that-song hearer than does a reference to "kaleidoscope eyes"--a poetic description that is more out of the ordinary than a reference to an inflammatory bowel disease. How can what you know not affect how you hear it?
p. 418, top of left column, talking about TRACE: "In order to overcome the problem of temporal representation, each unit is repeated many times--once for each time slice." In other words, because phonetic material is usually presented to TRACE as one big chunk, in order to differentiate what comes where, things have to be repeated. I thought this was one of the first big no-nos of this kind of modelling (at least, that's what Pinker led me to believe, and perhaps I should stop believing him on all matters of cognitive science, because he's proven himself to have a less-than-comprehensive grasp of the subject sometimes)--that you can't repeat nodes / chunks of code / &c., like in his basic sentence parser with "if the happy happy girl eats ice cream, then the happy boy eats hot dogs" (sorry, I don't have my book on me and I forget what page). In order to satisfactorily render the above sentence with all its possibilities, the chunk of code for the protasis and the apodosis has to be duplicated--once after the if statement, once after the then. Pinker pointed out that this was clumsy, unwieldy, and NOT what actually happens in the brain--and I thought we were largely concerned with what DOES happen in the brain. --This problem is acknowledged on p. 420 by the author himself: "the requirement that each unit on each level be repeated for each time slice leads to a very inelegant representation with no known homologue in biological systems." And TRACE, as they concede later, is showing its age.
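The cost of "once for each time slice" is easy to see in a toy sketch. This is only an illustration of the replication idea in the quote, not TRACE's actual architecture; the three-phoneme inventory, the function name `build_units`, and the slice count are all hypothetical.

```python
PHONEMES = ["b", "d", "g"]  # hypothetical tiny inventory, for illustration only


def build_units(labels, n_slices):
    """Create one unit per (label, time slice) pair, all starting at rest (0.0)."""
    return {(label, t): 0.0 for label in labels for t in range(n_slices)}


# Every phoneme detector is copied once per time slice, so the number of
# units grows as len(labels) * n_slices.
units = build_units(PHONEMES, n_slices=5)
```

Even with three phonemes and five slices there are already fifteen separate units, and each copy of /d/ is a distinct unit that knows nothing about the others, which is the inelegance the p. 420 quote is pointing at.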
Reading this all, especially all the problems with TRACE on page 420, I begin to wonder if there really is some kind of viable model for speech perception. Sure, the best systems, like the one cited above, can get a very high accuracy rate, but that's with no alteration in tone, background noise, &c. As I said above, that is ridiculously sub-human--is this kind of thing even viable? On that note, I must confess I haven't yet had time to read the second article, and that may answer my questions.