CS/PSYCH 129 Week 5 Reactions


Julie Corder

I found the research described in the PARSER article very interesting. It addressed a problem that other models usually ignore (the initial segmentation of words); however, it also ignored the problems of representation (by assuming that phonemes already had some sort of internal representation) and assumed an overly simplified syllable structure (always CV). It wasn't clear to me whether the test non-words contained any sequences that disobeyed the sonority hierarchy; if so, it seems possible that test subjects were using their familiarity with English to recognize non-words. It was interesting, though, that the model was able to perform so well, even with only a small percentage of the input.

One thing that bothered me while reading several of the articles was that I seem to remember being told at one point that some words shift their word boundaries over time; what was once a separate word may become an affix and vice-versa. How, then, should a model be expected to determine if a phrase like "has seen" is two separate words or just a verb stem, "seen," with a tense marker "ha"? Also, the example in the Shortlist article of parsing "catalog" seems to assume that there is something inherently wrong with parsing the string as "cat a log" -- but assuming that all three of these words are in the lexicon, wouldn't you need to be making use of some higher-level processing of the meaning of the word in syntactic/semantic terms to be able to say that "catalog" is a better interpretation than "cat a log"?
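The "catalog" ambiguity can be made concrete with a minimal sketch. It assumes a toy four-word lexicon and lets spelling stand in for phonemes; the function name `segmentations` and the lexicon contents are hypothetical and are not taken from the Shortlist article.

```python
def segmentations(s, lexicon):
    """Return every way to split s into a sequence of lexicon words."""
    if not s:
        return [[]]  # one way to parse the empty string: no words
    parses = []
    for end in range(1, len(s) + 1):
        prefix = s[:end]
        if prefix in lexicon:
            # Keep this word and parse whatever is left after it.
            for rest in segmentations(s[end:], lexicon):
                parses.append([prefix] + rest)
    return parses

# Hypothetical toy lexicon; letters stand in for phonemes.
toy_lexicon = {"cat", "a", "log", "catalog"}
parses = segmentations("catalog", toy_lexicon)
print(parses)
```

Both "cat a log" and "catalog" come back as complete parses, so a lexicon lookup by itself does not prefer one over the other; something extra (competition between candidates, frequency, or higher-level context) has to break the tie.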

I also realized, especially in the article by Christiansen, Allen, and Seidenberg, that my memory of statistics and analyzing experimental data is pretty fuzzy after a couple of years of not using it, and I could really use a quick refresher on what z-scores and statements like (x^2 = 11.51, P<0.001) mean -- I still have a vague understanding of them but I have a feeling I'm missing a lot by skimming over them because I don't really remember what they mean :)
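As a rough refresher, here is a small sketch of what those numbers measure. It reads "x^2" as a chi-square statistic and assumes 1 degree of freedom, which the reaction does not state; the function names are made up for illustration.

```python
import math

def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / sd

def two_tailed_p_from_z(z):
    """Chance of a standard-normal value at least as extreme as z, in either tail."""
    return math.erfc(abs(z) / math.sqrt(2))

def chi2_p_df1(chi2):
    """Upper-tail probability of a chi-square statistic.

    ASSUMPTION: 1 degree of freedom. With 1 df the statistic is a squared
    standard-normal value, so its tail reduces to a normal tail.
    """
    return math.erfc(math.sqrt(chi2 / 2))

p = chi2_p_df1(11.51)
print(f"{p:.5f}")
```

Under that 1-df assumption, a chi-square of 11.51 gives a p-value below 0.001, which is what "P<0.001" reports: a difference this large would turn up by chance less than one time in a thousand if there were no real effect.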


Will Quale

Brent:
Baby research continues to wow me, such as how 7-m.o. kids can't identify non-initial-stress words, but 10.5-m.o.'s can.

"Most models also forbid embedded words." But what about the one English infix? :> I remember an amusing episode of Babylon 5, where Delenn (an alien), enthusiastically learning English, spends the better part of a scene inserting it into nearly every word she speaks....

"Although forthedoggy is not a word...." I often parse unfamiliar speech passages into non-words such as this, and upon repeated hearings or further thought, suss out the intended words. There's no provision given here for updating the lexicon once later processing fixes wrong interpretations, but it seems like something that could improve results in this model.

What if this model already knows the words "car" and "pet": will it ever parse "carpet" as a single word? I don't see why it would; this seems like a shortcoming. Likewise, there's little provision for learning about prefixes and suffixes independently of specific words they appear in; at some point, these become helpful cues, but these models don't take them into account.

"Human memory appears to be subject to interference and decay...." What? Oh, right, thanks, I needed reminding of that. I was tickled by the "forgetting mechanisms".

70% success -- wow!

Christiansen, Allen, & Seidenberg:
Mothers actually say "boobababoobaba"? Why?

What scale are the MSE values (237) measured on? What's a "desired" value here?

Again, wow at the success rates! Should we expect either Brent's or Christiansen's method to ever produce much better than 70% results? It seems like, at some point, word-recognition would be able to kick in and do pretty well, given a good start by these separating methods; especially if these two could work in tandem or parallel.

"it would seem premature and ill-advised to assume that knowledge thereof must necessarily be innate." Christiansen 1, Pinker 0.

Perruchet & Vinter:
"One cannot claim that these regularities are learned inductively without falling into circular reasoning, with word knowledge being simultaneously the prerequisite and the consequence of knowledge of statistical regularities." Huh? I think searching for patterns to help create order and sense out of chaos is basic human nature; it makes sense to me that, confronted with long strings of sound, the young human mind would try to segment it using logic and statistics.

"When asked to read nonsense consonant strings, we read the material not on a regular rhythmic, letter-by-letter basis...." bDan and I learned this the hard way a few weeks ago, when we were forced to memorize "BRFXXCCXXMNPCCCCLLLMMMNPRXVCLMNCKSSQLBBLLLL6" in an odd party game/communal torture experience we subjected ourselves to. Likewise, we unwittingly experimentally proved the bit about the memorizing of syllables, in that we had to memorize such names as "Keoua Kalanikupuapaikalaninui Ahilapalapa" and "Culhwch ap Cilydd ap Cyleddon Wledig, gan Gdeuddydd merch Anlawdd Wledig" (not that any of us are bitter or anything :>).

Somehow, Parser's success seems a lot more mystical to me than the other models. It's just less intuitive. Once again, now that we've seen still another strategy work pretty well in isolation from other methods, I'd like to see strategies combined, and see how good those results would be. Are such collaborations going on, or is each team just continuing to perfect its own methods?


Aaron Carlisle

First, let me note that any questions I have regarding the reading may or may not have been addressed in the Shortlist article. Every time I tried reading the article, I started getting a headache within a couple of minutes.

One thing that confuses me is how all of the researchers to whom we have been exposed seem eager to quantify speech into a featural representation. I was wondering if it could potentially be helpful to use an approach based on raw input being directly connected to the other structures involved, as opposed to either going through a feature layer, having feature output, or, as we read about this week, having feature input.

Box 1 in the Brent article is convincing, although brief, in its establishment of the phoneme being a valid unit to specify. However, as the Christiansen et al. article seems to establish, broadening the input to other cues assists in word delineation. Why, then, does no one seem to take that to an extreme? It could be that there exist hidden, subtle cues that we have yet to recognize in speech, or that the best cues are some odd amalgamation of known cues. Why not take a "power spectra" input like that described in Nakisa and Plunkett last week and hook it up to word segmentation output? Intuitively, it seems that any time we quantify information we lose detail, and this seems especially true when we quantify into easily-understandable chunks. Could the emergent behavior of a less constrained system conceivably lead us to improved categorizations? All of this is speculation, of course, but it would be interesting to see what a less bounded system could produce.

Regarding the Brent article, I was wondering what, if any, "head-to-head" studies have been done regarding the three seemingly accurate strategy categories. Could you potentially deathmatch the three by involving them all in one connectionist network? Would the network learn to favor the most valuable of the three methodologies?


Nori Heikkinen

Was it just me, or was this week's reading extremely long? I didn't have a chance to finish all of it (and not through lack of valiantly trying), so following are my reactions to the Brent and Christiansen et al. articles.

Speech Segmentation ... - Michael R. Brent
p. 297 -- under Word Recognition Strategy, they talk about the "no-overlap principle"--that "if cde is a recognized occurrence of a familiar unit in the utterance abcde then there are only three potential units." If a, b, c, d, and e each represent phonemes, that's not necessarily so, or at least it suggests that the No-Overlap principle may be misguiding. Take, for example, the utterance "lout" -- if the utterances cannot overlap, the word "ablaut" may never get recognized as a whole word; same with "or" and "snore."
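One reading of the quoted "three potential units" claim can be sketched as follows. This is only an interpretation: it assumes the three units are the substrings of whatever is left once the recognized unit is set aside, and the function name `candidate_units` is hypothetical, not Brent's.

```python
def candidate_units(utterance, recognized):
    """List the substrings that could still be units when `recognized`
    is set aside and nothing is allowed to overlap it."""
    start = utterance.find(recognized)
    if start < 0:
        raise ValueError("recognized unit does not occur in utterance")
    end = start + len(recognized)
    # The stretches left over on either side of the recognized unit.
    remainders = [utterance[:start], utterance[end:]]
    units = []
    for piece in remainders:
        for i in range(len(piece)):
            for j in range(i + 1, len(piece) + 1):
                units.append(piece[i:j])
    return units

units = candidate_units("abcde", "cde")
print(units)
```

For "abcde" with "cde" recognized, only "a", "ab", and "b" remain as candidates. Anything that straddles the boundary, such as "bcd" or the whole string "abcde", is never proposed, which is exactly the "lout" inside "ablaut" worry raised above.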

p. 298, top paragraph, about the sentence "that ... isforthedoggy" -- 'that' is clearly a separate utterance, and is then filed away in the child's lexicon, but 'isforthedoggy' is not. The article then says, however, that "the remaining contiguous strings, is and forthedoggy, would be segmented as though they were separate utterances." --why? Is the child assumed to already have 'is' in his/her vocabulary? And sure, it doesn't hurt a child's lexicon when the word "forthedoggy" is stuck in there wholesale, but it can't potentially help it, either. They may parse it later, as is discussed in the discussion of incremental versus batch modelling below in the column. The article says, "Because humans segment each utterance as they hear it, batch algorithms are deemed impossible." Last week in discussion we were talking about this--sometimes someone says something really fast to you, and you have to stop and play back in your head what they just said before you can realize what they meant. It's perhaps not the most common way to go when you're trying to understand someone, but batch algorithms definitely have a model in reality. This article, though, doesn't seem to think that batch algorithms would be a good model of what's in the brain--they're viewed as "psychologically impossible." The penultimate paragraph in the right-hand column on page 298 backs up this view. This raises the (seemingly-)obvious question: If they don't work to model human speech, why are they even being considered?

p. 299 -- I got a little confused on the memory-loss function of the algorithms being worked with in this paper. Are we shooting so carefully to model humans so that, by the time our networks are 80 years old, they should be senile?

Learning to Segment Speech ... - Christiansen, Allen, & Seidenberg
p. 225 - The authors state, "it seems reasonable to assume that children ... have neither a lexicon nor a knowledge of the phonological or rhythmic regularities underlying the words in the language being learned." -- they don't? Is there anything to be said for pre-natal noise? If it's anything it's certainly not much (right?), but it might be enough to start giving babies linguistic cues early.

p. 230 - how relevant really is the idea of word-initial and -final syllable types? It seems that it plays a large role in this linguistic-cues theory of segmentation and learning, but what are some examples? They only give a few, and much later on. I'm curious.

p. 232 - here I start having a problem with the amount of abstraction from the real thing--a speech wave with no white spaces in it. These guys weren't using actual phonetic material, were they? They were just feeding the network symbolized and transcribed representations of speech! I know you have to start somewhere, but if you succeed with something synthetic, how does that account for the obvious complexity of the task of speech perception which initially complicates the task you've been working on? Couldn't you be missing something huge and fundamental in the process of abstraction, especially when some assumptions (~ footnote, previous page) take things to be true that we know to be false (such as a particular word always receiving a particular stress pattern--thir.téen, but thír.teen men)? It's confusing.

All throughout the article, they keep mentioning that the percentage of correct word boundaries is around 40%, which is so ridiculously below what humans can do. They make excuses for it all over--but still, that's so low.

The article throughout challenges the standing assumption in linguistics that we are hard-wired for language acquisition. The last paragraph warns that we should not assume this "until we have exhausted the possibilities of such integration processes as the basis for learning 'linguistic structures for which there is no evidence'." Yet I'm loth to just throw that idea away. What about the fact that people regularly creolize languages, invent them themselves (Simon, whose ASL was better than his parents'; that entire sign language of the kids younger than age 17 that Pinker mentioned somewhere)? Doesn't that somehow suggest that they're not just extrapolating a grammar from the complete language they have, but recreating it every time? And if they do that, how can such a complex mechanism be completely mechanical?


Jeff Wu

From all four articles, the major themes in speech segmentation seem to be the use of multiple cues, psychological plausibility, and various useful strategies to approach the problem. The use of multiple cues has been endorsed by all models except for the PARSER model. Through several models in the Christiansen text, and the Shortlist model, we see that the integration of more and more cues leads to better results.

The Norris text was the most interesting read for me. Norris does a good job of stating the overall advantages Shortlist has over the TRACE model. Interestingly, the Shortlist model has many forms of competition working within the model, vying for the best result. The race model method and inhibition between words are good examples of this competition. Another powerful thing about the Shortlist model is that it models humans pretty well when dealing with mismatches and ambiguity. Instead of relying on backprop to reduce ambiguity in sentences, the role of right-context increases with increasing ambiguity.

The PARSER model, although more psychologically plausible, is ironically less productive when it comes to natural language acquisition. Another thing I find interesting about the model is that it doesn't handle cues well, even though the use of cues in babies would seem to help them overcome poverty of stimulus as noted in the Christiansen text. It seems that if multiple cues were integrated into the model, the language acquisition problem wouldn't be such a problem. Also, I don't see the point of memory decay when the data set is clearly smaller than human brain capacity. It seems to be more detrimental to the model than positive.


Jeff Ebert

I’m interested in the Saffran-Perruchet and Vinter debate over how word segmentation can proceed given a continuous stream of audio input without stress cues. I agree with P & V that Saffran’s findings can be interpreted in a way other than that infants are computing transitional probabilities. Consider the following two utterances:

doggybarks
doggyruns

On Saffran’s account, since every time "dog" appears in the speech stream, "gy" appears in the speech stream, and since "barks" and "runs" each follow "gy" only half the time, a statistically-minded infant would parse the first phrase as "doggy barks," and the second as "doggy runs."
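The transitional probabilities behind this account can be worked out directly for the two utterances. This is a minimal sketch, assuming the syllable split used above ("dog", "gy", then "barks" or "runs"); the function name is hypothetical and the two-utterance corpus is only the toy example, not Saffran's stimuli.

```python
from collections import Counter

def transitional_probabilities(utterances):
    """Estimate P(next syllable | current syllable) from adjacent pairs."""
    pair_counts = Counter()
    first_counts = Counter()
    for syllables in utterances:
        for cur, nxt in zip(syllables, syllables[1:]):
            pair_counts[(cur, nxt)] += 1
            first_counts[cur] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

# Toy corpus: the two utterances above, split into syllables by hand.
corpus = [["dog", "gy", "barks"], ["dog", "gy", "runs"]]
tp = transitional_probabilities(corpus)
for pair, p in sorted(tp.items()):
    print(pair, p)
```

The dip from 1.0 (dog to gy) down to 0.5 (gy to barks, gy to runs) is where a learner tracking transitional probabilities would place the word boundary.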

An alternative and equally plausible explanation of Saffran’s findings can be offered by the implicit learning camp. In the two utterances above, the phrase "doggy" occurs twice, whereas the phrases "gybarks" and "gyruns" occur only once each. When it comes time to test an infant in a familiarity paradigm, "gybarks" and "gyruns" will be more novel than "doggy," and so the infant will attend these stimuli for a longer time. Note that on this account, no transitional probabilities need ever be calculated. Rather, the mechanism is that the more the infant is exposed to a particular substring of speech, the more familiar it becomes.
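The familiarity account can be sketched with nothing but counting. This is an illustration under simplifying assumptions: it tallies raw occurrences of two-syllable chunks and is not PARSER itself (no decay, no interference); the function name and the hand-made syllable split are hypothetical.

```python
from collections import Counter

def chunk_counts(utterances, size=2):
    """Count every run of `size` adjacent syllables across the utterances."""
    counts = Counter()
    for syllables in utterances:
        for i in range(len(syllables) - size + 1):
            counts["".join(syllables[i:i + size])] += 1
    return counts

# Toy corpus: the two utterances above, split into syllables by hand.
corpus = [["dog", "gy", "barks"], ["dog", "gy", "runs"]]
familiarity = chunk_counts(corpus)
print(familiarity)
```

"doggy" ends up with a count of 2 while "gybarks" and "gyruns" each get 1, so the same preference falls out of exposure counts with no conditional probability computed anywhere.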

PARSER, developed by Perruchet and Vinter, implements this implicit memory model well, including improvements such as interference and decay. It seems, however, that with PARSER’s simple algorithm it would be relatively easy to construct languages that PARSER cannot handle.


Sean Lewis

This week's articles were a relief in terms of progress over the TRACE model. In fact, the articles sometimes addressed themselves to the shortcomings of TRACE directly.

In this vein, the coverage of Shortlist by Norris eased some of the worries that TRACE had created. While searching an indexed list is not necessarily an accurate representation of the mind's vocabulary storage, the paper made clear the modular nature of Shortlist. The network served one function and left the rest abstract. Although it didn't take raw sound as input, it also didn't classify sounds by featural description (which is an approach that I did not understand). The focus on solving one bit of the problem better than TRACE and leaving the rest for later is fine with me. The bottom-up approach was a plus as well.

Christiansen, Allen and Seidenberg were cool because they attempted to incorporate stress into their model with varying results. The idea that multiple cues will also curtail overgeneralization by the network seemed reasonable to me. But, unless I misread, the stress information was garnered from a symbolic representation rather than being discerned from a sound stream. While Shortlist also does this, I am more confident in our ability to isolate phonemes from sound than stress. I am somehow of the understanding that stress is actually quite hard to distinguish acoustically. I could be wrong though.

The PARSER model was definitely the coolest in my opinion. I was unsure as to why one had to incorporate degradation of memory, but by doing so, PARSER does not overburden itself and is allowed to learn incrementally. I'd be interested in seeing how this and similar approaches are performing today, 3 years after the article was published.


Henrike Blumenfeld

I was surprised by Norris's opposition to an interactive model such as TRACE on grounds of bidirectional flow of information not being psychologically plausible. I had gotten the idea from previous course work that top-down information flow is very likely in speech recognition.

Another question about plausibility: the idea that larger chunks of speech are first looked at as a single unit, until there is evidence that they can be analyzed into smaller units, makes lots of sense. But do children necessarily do that? The one-word stage is characterized by single words, unless one would argue that they go through the stage of the bigger chunks before they actually speak.