Lab 06

Due 11:59am Monday October 29, 2018

Introduction

In this lab you will investigate the Lexical Sample Task data taken from Senseval 3 and you will build a supervised decision list classifier, trained on the provided training data and tested on the provided test data.

Answers to written questions should be added to a file called Writeup.md in your repository.

Examining the data

The data provided by the Senseval 3 competition is in /data/senseval3/. Training and testing data are in separate subdirectories. Before you begin work on the lab, you need to spend a few minutes looking at the source data. In particular, examine the files /data/senseval3/train/EnglishLS.train and /data/senseval3/test/EnglishLS.test.

When you look through the training data file, you will see the text <lexelt item="activate.v">. This indicates that until the next lexelt tag, all of the training examples are for the ambiguous verb “activate”. There are 228 training examples provided for the lexelt ‘activate.v’. Each training example, called an instance, is given an instance id. The first instance has the id ‘activate.v.bnc.00024693’. The next line in the file indicates the sense label for the word “activate” in the text that follows. The first instance is labeled with sense ‘38201’.

After the instance id and answer are provided, there is a short piece of text labeled with the tag <context>. This text contains the word “activate” and provides a few sentences worth of context to help you determine which sense is the correct sense. In addition, the context indicates the word that you are trying to disambiguate. In the first context, you will find <head>activate</head>. It might seem strange to label the “head” word like this, but there are a few good reasons to do so. First, sometimes the word in the context isn’t the same as the lexelt. For example, in the second instance, the head word is “activated”, not “activate”. Second, it is possible that the word you are trying to disambiguate appears multiple times in the same piece of text, so marking the head word lets you know which one of the words you should be determining the sense of. Finally, sometimes there are multiple head words marked in the same context, indicating that those words all share the same sense.

Spend some time looking through the training data file. You should find that some instances are labeled with two senses, indicating that annotators either disagreed on what the correct sense was, or they believed that both senses were represented. In addition, some instances are labeled with the sense “U”, meaning “Unclear”. This means that the annotators weren’t sure which sense to assign to the word. There are even some cases where the word was given two sense labels but one of them was a “U”, meaning that some of the annotators agreed on one sense but other annotators were unclear on how to assign a sense to it.

Parsing the data

The source data for the Senseval task is tricky to parse correctly. The starter code includes a file, parse.py, that will do this parsing for you. While you are welcome to modify the parse.py file, you should not need to do so. You can use this file directly, as follows:

$ python3
>>> from parse import get_data, get_key
>>> train_fp = open('/data/senseval3/train/EnglishLS.train', 'r')
>>> train_data = get_data(train_fp)
>>> train_data.keys()
dict_keys(['activate.v', 'add.v', 'appear.v', 'argument.n', ...])
>>> train_data['add.v'].keys()
dict_keys(['add.v.bnc.00000134', 'add.v.bnc.00000242', 'add.v.bnc.00000837', ...])
>>> train_data['add.v']['add.v.bnc.00000242']
{'words': ['New', 'approaches', 'are', 'needed', ...], 'heads': [36]}
>>> heads = train_data['add.v']['add.v.bnc.00000242']['heads']
>>> heads
[36]
>>> train_data['add.v']['add.v.bnc.00000242']['words'][heads[0]]
'added'
>>> key_fp = open('/data/senseval3/train/EnglishLS.train.key', 'r')
>>> train_key = get_key(key_fp)
>>> train_key['add.v']['add.v.bnc.00000242']
['42603']

Exploring the training data

You should write the solutions to questions 1-12 of the lab in warmup.py and include your answers in your Writeup.md file.

  1. How many different “lexelts” (lexical elements) are there? The lexelts represent the polysemous word types that you are being asked to disambiguate.
  2. What is the breakdown of part of speech (verbs, nouns and adjectives) of these lexelts? You can determine the part of speech by looking at the suffix attached to the lexelt: activate.v is a verb, shelter.n is a noun, and hot.a is an adjective.
  3. How many instances (training examples) are there for the lexelt organization.n?
  4. The introduction to the lab mentioned that an instance can have multiple head words and that an instance can have multiple answers. Make a small table showing the breakdown of the number of head words per instance. Make another table showing the breakdown of the number of answers per instance. Don’t break this down per lexelt: each table should summarize all of the data.
  5. How many senses of activate.v are represented in the training data? (Do not count “U” as a sense for this question.)
  6. One common baseline for this task is to assume that you guess randomly. However, rather than actually making random guesses, you can just assume that you will get $\frac{1}{n}$ of the instances for a particular lexelt correct, where $n$ is the number of senses for that lexelt. For example, if there are 5 senses for a lexelt $w$, the random baseline will get $\frac{1}{5} = 20\%$ of the instances of the lexelt $w$ correct. It doesn’t matter if the number of instances for $w$ is not a multiple of $n$: if there are 37 instances for a lexelt with 5 senses, you will say that a random guess will give you $\frac{37}{5} = 7.4$ of them correct. The question for you is as follows: what percentage of the instances in the training data would you be able to label correctly if you just guessed randomly? Two important notes:
    • You should ignore “U” when counting how many senses a lexelt has for this question.
    • You aren’t actually checking the answer key in this question. It doesn’t matter what the actual answer is: you are just trying to get an estimate for how many you could get right if you just guessed randomly.
  7. What is the most frequent sense for the lexelt activate.v? Specify this in the form of how the answer is written in the training data, e.g. “38202”. Note that other lexelts have different naming conventions. For example, the senses for the lexelt organization.n look like this: “organization%1:04:02::”.
  8. Another common baseline is the “most frequent sense”, or MFS, baseline. In the MFS baseline, you find the most frequent sense for each lexelt and then label every instance with that most frequent sense. What percentage of the instances in the training data would you label correctly using the most frequent sense baseline? You may need to spend some time thinking about how to actually implement this before you dive in and try to write your solution. Is the MFS baseline higher or lower than you expected? Discuss.

Exploring the test data

  9. How many instances of activate.v are there in the test data?
  10. There are two similar questions that are not repeats of one another:
    1. Are there any senses of organization.n that are represented in the training data but not in the test data? Assuming the answer is yes, is this a problem? Discuss briefly.
    2. Are there any senses of organization.n that are represented in the test data but not in the training data? Assuming the answer is yes, is this a problem? Discuss briefly.
  11. For each lexelt in the training data, find the most frequent sense. Repeat this for the test data. How many (and what percentage of) lexelts have the same most frequent sense in the training data and the test data? You can choose how to handle ties, but state the method that you used. Discuss the implications of the disparities you found between the MFS in the training data and the MFS in the test data.
  12. Use the MFS from the training data to label the test data. What is your accuracy labeling the test data using the MFS baseline? Compare to the random baseline (Question 6) and the MFS baseline (Question 8) as applied to the training data. How does this match your intuition? Discuss. Important scoring notes:
    • For words in the test data with multiple sense labels, count it as correct if you predicted any one of the labels.
    • If an instance is labeled with “U”, then “U” is the correct labeling for the sense. Your system is supposed to be able to predict when “U” is appropriate for the instance, so if you predict an actual sense label (e.g. ‘organization%1:04:02::’) but the true label is “U”, you are incorrect.

Feature extraction

In this week’s lab, you will be writing a decision list classifier. In upcoming labs, you will be using scikit-learn, which has many prebuilt classifiers for you to use. One of your goals this week is to build your decision list classifier with an API similar to that of the classifiers in scikit-learn so that you can compare the performance of your classifier against other classifiers you could have used, and to familiarize yourself with the scikit-learn API.

One very important thing to keep in mind as you work through this lab is that you are not training a classifier for all of the lexelts at once. Rather, you will train a classifier on each lexelt (e.g. activate.v) and then use that classifier to label the test instances for that lexelt. You can then throw that classifier away, train a new classifier on the next lexelt (e.g. add.v), use that classifier to label the test instances for that lexelt, throw it away, and so on.

Feature vectors and labels

In order to train a supervised classifier, you will need training data. For this task, you are provided with multiple instances for each lexelt. You will represent each instance as a feature vector. A feature vector is just a numerical representation for a particular training instance. Each feature vector in the training data will have an accompanying target, a numerical representation of the sense that the instance was labeled with.

In order to conform with the scikit-learn API, you will put all of your feature vectors (for a single lexelt, e.g. 'activate.v') into a numpy array, where each row of the array is an individual feature vector. In the API for scikit-learn, this array of feature vectors is denoted by the variable $X$. If there are $n$ training examples and $m$ features, $X$ will be an $n{\times}m$ array.

Similarly, you will put all of your targets (for a single lexelt) into a 1-dimensional numpy array. The API for scikit-learn refers to this array of targets as $y$. Each element $y[i]$ is the target (often called the class label) for the $i^{th}$ training example in $X$.
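For example (a toy illustration with made-up numbers, not the lab data), three training instances with four features would be stored like this:

import numpy as np

# Hypothetical example: n = 3 training instances, m = 4 features.
X = np.zeros((3, 4))       # one row per training instance, one column per feature
y = np.array([0, 2, 0])    # y[i] is the integer target (sense) for row i of X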

Feature extraction

For the word sense disambiguation problem, you will extract two types of features:

Bag of words features are simply the count for each word within some window size of the head word. For this lab, we will set the window size large enough to include all of the tokens in the context. Don’t throw away things like punctuation, numbers, etc. If your classifier finds those tokens useful, it can use them. If they are just noise, your classifier should be able to ignore them.

For example, for instance id ‘activate.v.bnc.00044852’, the bag of words features would show that the token ‘receptors’ appeared 4 times in the context, the token ‘(‘ appeared 2 times, and the token ‘dimensions’ appeared 1 time. The full expected output for the get_BoW_features function for this instance is provided.

Collocational features are the n-grams that include the head word and their counts. For this lab, you will extract the bigrams and trigrams that include the head word. If there are multiple head words in a single context, you will extract bigrams and trigrams for each one. The instance above, activate.v.bnc.00044852, contains the following text (where activated is the head word):

… upon which parts of the sensory system are activated : stimulation of the retinal receptors …

The collocational features in this instance are the bigrams “are activated” and “activated :”; the trigrams “system are activated”, “are activated :”, and “activated : stimulation”. You can represent these n-grams as a single “word” by joining the tokens in the n-gram together using underscores, such as “system_are_activated”. Just like with bag-of-words features, you’ll want to keep track of the count of each of these n-grams. In this case, the collocation “system_are_activated” appears 1 time. The full expected output for the get_colloc_features function for this instance is provided.

In declist.py you should write two functions: get_BoW_features(instance) to extract the bag of words for the instance, and get_colloc_features(instance) to extract the collocational features. Each function should return a dictionary (or defaultdict or Counter) that maps features to their counts, e.g. 'receptors': 4 and 'system_are_activated': 1.
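Here is a minimal sketch of what these two functions might look like, assuming the instance dictionaries produced by parse.py (with their 'words' and 'heads' keys) shown earlier; your implementation can certainly differ:

from collections import Counter

def get_BoW_features(instance):
    # Count every token in the context, including punctuation and numbers.
    return Counter(instance['words'])

def get_colloc_features(instance):
    # Count the bigrams and trigrams that include each head word,
    # joining the tokens with underscores, e.g. 'system_are_activated'.
    words = instance['words']
    features = Counter()
    for head in instance['heads']:
        for n in (2, 3):
            for start in range(head - n + 1, head + 1):
                if start >= 0 and start + n <= len(words):
                    features['_'.join(words[start:start + n])] += 1
    return features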

You should then write a wrapper function, get_features(data, lexelt), that calls get_BoW_features and get_colloc_features on each instance of the specified lexelt. You should combine the features for each instance into a single data structure. Then, create a dictionary that maps the instance id to the features. Your get_features function should work something like this:

>>> features = get_features(train_data, 'activate.v')
>>> features['activate.v.bnc.00044852']
{'receptors': 4, '(': 2, 'system_are_activated': 1, 'dimensions': 1, ...}

The full expected output for the get_features function for one instance is provided.
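Continuing the sketches above, one possible version of the wrapper merges the two Counters into a single Counter per instance:

def get_features(data, lexelt):
    # Map each instance id of this lexelt to its combined feature counts.
    features = {}
    for instance_id, instance in data[lexelt].items():
        combined = Counter()
        combined.update(get_BoW_features(instance))
        combined.update(get_colloc_features(instance))
        features[instance_id] = combined
    return features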

Storing features in a numpy array

Now that you know what features to extract, you need to store them in a numpy array. To do that, you first need to know the complete set of features that you have (for one lexelt!). The first step, therefore, is to go through each instance of a lexelt and collect the features using your get_features(data, lexelt) function.

Once you have the features for all of the instances of one lexelt, you will need to map each feature to a column in the feature vector. For example, you could map 'receptors' to column 0, '(' to column 1, 'system_are_activated' to column 2, 'the' to column 3, etc. In declist.py, write a function index_features(features) that takes a data structure similar to the one shown above and returns a dictionary that maps individual features to their column number in the feature vectors. A partial expected output for the index_features function for one lexelt is provided. If you want to match the sample output exactly (and to save you headaches when debugging), index your features in alphabetical order, so the first feature alphabetically gets index 0.
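A minimal sketch of index_features, indexing the union of all observed features alphabetically as suggested above:

def index_features(features):
    # Collect every feature seen in any instance of this lexelt,
    # then assign column numbers in alphabetical order.
    all_features = set()
    for counts in features.values():
        all_features.update(counts)
    return {feature: column for column, feature in enumerate(sorted(all_features))}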

Once you have this index, write a function called create_vectors(features, index) that takes your features and the index you just built and returns the numpy array of features. Be sure to have the create_vectors function order the vectors such that the first row of the array corresponds to the first instance id alphabetically, etc. For activate.v, the first row should correspond to instance id 'activate.v.bnc.00024693', the second row to 'activate.v.bnc.00044852', etc. You will want this property to hold so that a) you can debug things if needed, and b) you can be sure that you align your feature vectors properly with the answer key.
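And a sketch of create_vectors, ordering the rows alphabetically by instance id so they line up with the answer key:

import numpy as np

def create_vectors(features, index):
    # One row per instance (alphabetical by instance id),
    # one column per feature in the index.
    instance_ids = sorted(features)
    X = np.zeros((len(instance_ids), len(index)))
    for row, instance_id in enumerate(instance_ids):
        for feature, count in features[instance_id].items():
            X[row, index[feature]] = count
    return X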

In the output below, the features are indexed alphabetically in the feature index and the instance ids are ordered alphabetically in the array $X$:

>>> features = get_features(train_data, 'activate.v')
>>> features['activate.v.bnc.00044852']
Counter({'the': 20, 'of': 14, ',': 10, 'to': 7, 'in': 6, ...})
>>> sorted(features.keys())
['activate.v.bnc.00024693', 'activate.v.bnc.00044852', 'activate.v.bnc.00044866', ...]
>>> findex = index_features(features)
>>> findex['system_are_activated']
6380
>>> X = create_vectors(features, findex)
>>> id44852 = X[1]   # since 'activate.v.bnc.00044852' is 2nd alphabetically
>>> id44852
array([0., 0., 0., ..., 0., 0., 0.])
>>> id44852[findex['receptors']]
4.0
>>> id44852[findex['system_are_activated']]
1.0
>>> id44852[findex['the']]
20.0

A partial expected output for the create_vectors function for one lexelt is provided.

Storing answers in a numpy array

The data that we are working with occasionally has instances labeled with multiple correct answers. That’s a challenge to work with, so we’re going to simplify the problem and keep only the first listed answer for each instance. As with the array $X$ you created above, you first need the complete set of possible answers. Once you have that, you’re going to map each answer to an integer and then store those integers in a one-dimensional numpy array $y$, such that the first entry in the array $y$ is the target for the first vector in $X$.

You’ll want to create two functions:

  • index_targets(targets), which maps each sense label for the lexelt to an integer index, and
  • create_targets(targets, tindex), which returns the one-dimensional numpy array $y$, ordered so that $y[i]$ is the target for the $i^{th}$ row of $X$ (i.e. alphabetical by instance id).

Be sure to include “U” as one of the targets. Note: You will only be keeping the first listed answer for each instance, but where you filter out the extra answers is up to you. The output below assumes that the filtering is done in create_targets.

>>> targets = train_key['activate.v']
>>> tindex = index_targets(targets)
>>> tindex
{'38201': 0, '38202': 1, '38203': 2, '38204': 3, 'U': 4}
>>> y = create_targets(targets, tindex)
>>> y
array([0, 0, 2, 0, 2, 2, 0, 0, 0, 1, 0, ...])

A full expected output for the index_targets function and the create_targets function for one lexelt is provided.
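One possible sketch of these two functions, indexing the sense labels alphabetically (which reproduces the sample index above) and doing the first-answer filtering inside create_targets:

import numpy as np

def index_targets(targets):
    # Map each sense label for this lexelt (including 'U') to an integer.
    senses = set()
    for answers in targets.values():
        senses.update(answers)
    return {sense: i for i, sense in enumerate(sorted(senses))}

def create_targets(targets, tindex):
    # One target per instance, ordered alphabetically by instance id
    # to match the rows of X; keep only the first listed answer.
    instance_ids = sorted(targets)
    return np.array([tindex[targets[iid][0]] for iid in instance_ids])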

How does test data fit into this?

The four functions you just wrote above (index_features, create_vectors, index_targets, create_targets) were designed to take the training data and make the $X$ and $y$ arrays needed for training.

Once your classifier is trained, you’ll need to pass test data into the classifier, and you’ll need to convert that test data into feature vectors, too. Fortunately, you’ve written almost all of the code needed to handle the test data. The one important thing is that when you read in the test data, you have to use the feature index that you already made. For any test feature that you haven’t seen in the training data, just throw it away. Your classifier doesn’t know how to classify instances based on that new feature anyway, so just ignore it. You might need to modify your create_vectors function so that it handles that properly, but it should be a very quick fix!
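For example, the create_vectors sketch above only needs one extra check to ignore features that never occurred in the training data:

import numpy as np

def create_vectors(features, index):
    # Same as before, but skip any feature that is not in the (training) index.
    instance_ids = sorted(features)
    X = np.zeros((len(instance_ids), len(index)))
    for row, instance_id in enumerate(instance_ids):
        for feature, count in features[instance_id].items():
            if feature in index:           # unseen test features are ignored
                X[row, index[feature]] = count
    return X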

The scikit-learn API also calls the test data $X$, which is a bit unfortunate for trying to remember which variable is holding the training data and which is holding the test data.

A partial expected output from these functions on the test data for one lexelt is provided.

How supervised decision list classifiers work

Supervised decision list classifiers are trained to recognize the features that are high-quality indicators of a particular sense. Once trained, the classifier can be used to label unknown senses in test data.

Making decisions

At this point, you should have already extracted the features from training data and stored them in $X$, and extracted the targets and stored them in $y$ as described above.

First, let’s look at a simplified version of the problem where you are trying to decide between only two senses. When a classifier only has to decide between two choices, it is called a binary classifier. In a binary classifier, you need to decide whether an unknown head word should be labeled as $\text{sense}_1$ or $\text{sense}_2$.

For each feature $f$, which could be a word in the bag of words or a collocation, you compute the log-odds ratio of the probabilities of each sense given the feature to see how good that feature is at distinguishing between the two senses:

\begin{align} \text{score}(f) &= \log \frac{P(\text{sense}_1|f)}{P(\text{sense}_2|f)}
\end{align}

A high magnitude score (e.g. +10 or -10 vs +.02 or -.02) for feature $f$ indicates that the feature is good at distinguishing the senses. A low magnitude score indicates that the feature is not good at distinguishing the senses. A positive score means that $\text{sense}_1$ is more likely; a negative score means that $\text{sense}_2$ is more likely.

Probabilities can be estimated using maximum likelihood estimation as:

\begin{align} P(\text{sense}_1|f) &= \frac{\text{count(}f\text{ is present in sense}_1)}{\text{count(}f\text{ is present in either sense})}
\end{align}

Since you’re dividing the two probabilities by one another and their denominators are the same, the denominators cancel out, leaving us with:

\begin{align} \text{score}(f) = \log \frac{P(\text{sense}_1|f)}{P(\text{sense}_2|f)} &= \log \frac{\text{count(}f\text{ is present in sense}_1)}{\text{count(}f\text{ is present in sense}_2)}
\end{align}

Smoothing

In some cases, the numerator or denominator might be zero, so you will use Laplace smoothing (it’s easy!) but instead of $+1$, you will use $+\alpha$ smoothing, adding $\alpha = 0.1$ to both the numerator and denominator in order to avoid division by zero errors, and to avoid taking the log of 0.

\begin{align} \text{score}(f) &= \log \frac{\text{count(}f\text{ is present in sense}_1) + \alpha}{\text{count(}f\text{ is present in sense}_2) + \alpha}
\end{align}

Classification with more than two senses

In the lexical sample data you have, you can’t directly use the formulation above because the problem isn’t binary: you have to decide which sense label to assign when there might be many more than 2 choices. Instead, you use the following modification for each sense:

\begin{align} \text{score}(i,f) &= \log\frac{P(\text{sense}_i|f)}{P(\text{NOT sense}_i|f)}
\end{align}

Like before, high positive values will be strong indicators of $\text{sense}_i$. Low positive values will be weak indicators of $\text{sense}_i$. Negative values will be indicators that it is not $\text{sense}_i$, but since we don’t classify words as not being a particular sense, your decision list classifier won’t make use of negative scores.

As shown above, you can rewrite the ratio of probabilities as follows:

\begin{equation} \text{score}(i,f) = \log \frac{P(\text{sense}_i|f)}{P(\text{NOT sense}_i|f)} = \log \frac{\text{count(}f\text{ is present in sense}_i)}{\text{count(}f\text{ is present in all other senses except sense}_i)} \end{equation}

Finally, you will use Laplace smoothing:

\begin{equation} \text{score}(i,f) = \log \frac{\text{count(}f\text{ is present in sense}_i) + \alpha}{\text{count(}f\text{ is present in all other senses except sense}_i) + \alpha} \end{equation}
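As a concrete (hypothetical) helper, the smoothed score for a single (sense $i$, feature $f$) pair could be computed from the two counts like this:

import math

def score(count_in_sense, count_in_other_senses, alpha=0.1):
    # Smoothed log-ratio: how strongly feature f indicates sense_i
    # versus all other senses.
    return math.log((count_in_sense + alpha) / (count_in_other_senses + alpha))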

Writing the decision list classifier

As mentioned earlier, your decision list classifier needs to conform to the scikit-learn API. In order to do that, you will write a class called DecisionList (stored in the file DecisionList.py) that has two methods: fit(X,y), which will be used to train the classifier, and predict(X) which will be used to predict senses in the test data. The constructor (the __init__ method) for the DecisionList will take one optional parameter, alpha, that determines the value to be used for Laplace smoothing. Set alpha to $1$ if it is not specified.

Training the decision list classifier

Your fit(X,y) method will take two parameters: $X$, the numpy array of feature vectors for the training data, and $y$, the numpy array of targets from the training data.

To train a decision list from the features (of a single lexelt!), you will first calculate a score for each (sense $i$, feature $f$) pair. For some senses, a particular feature might be a good indicator; for another sense of the same word, it might not be. Sort these (sense $i$, feature $f$) pairs so that the pair with the highest positive score comes first. Throw away any scores that are less than zero.

The fit method doesn’t return anything. All of the state of the decision list is stored in the object.

You are almost done! Let’s look at how you use the decision list on test data first and then come back to wrap up training.

A partial expected output from the fit function for one lexelt is provided.
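To make the description concrete, here is a rough sketch of what the constructor and fit might look like. It treats “$f$ is present in sense$_i$” as the number of training instances of that sense in which the feature occurs at least once, which is one reasonable reading; the default_target parameter is explained under “Default classification” below, and your implementation may differ.

import math
import numpy as np

class DecisionList:
    def __init__(self, alpha=1, default_target=None):
        self.alpha = alpha
        self.default_target = default_target   # see "Default classification" below
        self.decision_list = []

    def fit(self, X, y):
        # Score every (sense, feature) pair, throw away scores below zero,
        # and sort so the highest-scoring pair comes first.
        n_features = X.shape[1]
        self.decision_list = []
        for sense in set(y):
            in_sense = X[y == sense]       # rows labeled with this sense
            in_other = X[y != sense]       # rows labeled with any other sense
            for f in range(n_features):
                count_in = (in_sense[:, f] > 0).sum()
                count_out = (in_other[:, f] > 0).sum()
                s = math.log((count_in + self.alpha) /
                             (count_out + self.alpha))
                if s >= 0:
                    self.decision_list.append((s, sense, f))
        self.decision_list.sort(reverse=True)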

Using your decision list on test data

Your predict(X) method will take an array of feature vectors representing the test data.

For each instance in the test data (a single row in the matrix $X$), you just walk down the decision list until you find a matching (sense $i$, feature $f$) pair. Since the decision list is sorted from most predictive to least predictive, you will end up classifying the instance based on the presence of the best matching feature. As you walk down the (sense $i$, feature $f$) decision list, if an instance contains feature $f$, then you will label that instance as sense $i$. If the instance does not contain $f$, proceed to the next (sense, feature) pair. Continue this process until you have exhausted all features in your decision list. Your predict(X) method will return a one-dimensional numpy array containing the predicted targets. The resulting numpy array should have the same format as the array $y$ that you passed into the fit method.

A partial expected output from the predict function for one lexelt is provided.
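Continuing the class sketch above, predict might walk down the list like this (falling back on the default_target, which the next subsection explains, when nothing matches):

    def predict(self, X):
        # For each test instance, take the sense of the first (highest-scoring)
        # rule whose feature is present; otherwise use the default target.
        predictions = []
        for row in X:
            label = self.default_target
            for s, sense, f in self.decision_list:
                if row[f] > 0:
                    label = sense
                    break
            predictions.append(label)
        return np.array(predictions)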

Default classification

What happens if you get to the bottom of the decision list but didn’t match any features? In that case you want to label the instance with some default classification. To complete the training of your decision list classifier, you’ll want to add a (default sense, feature) pair to the bottom of your decision list. Since this entry needs to match every instance, write it as a special case so that if you make it all the way to the bottom of your decision list, you label that instance with the default classification. Since you have already seen that the most frequent sense is a good baseline classifier, use the MFS for that word as the default sense. Pass the default target in as another optional parameter, default_target, to the __init__ method.

Important: In case it was not obvious from all the reminders, you need to create a separate decision list for each lexelt (e.g. one decision list for bank.n, another for activate.v, etc.)
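Putting the pieces together, the per-lexelt workflow might look roughly like this (using the helper functions sketched earlier and the train_data, train_key, and test_data structures from parse.py; scoring against the test key is left out):

from collections import Counter

for lexelt in train_data:
    # Build training vectors and targets for this lexelt only.
    train_features = get_features(train_data, lexelt)
    findex = index_features(train_features)
    X_train = create_vectors(train_features, findex)
    tindex = index_targets(train_key[lexelt])
    y_train = create_targets(train_key[lexelt], tindex)

    # Most frequent sense in the training data, used as the default target.
    mfs = Counter(y_train).most_common(1)[0][0]

    # Train on this lexelt, predict this lexelt's test instances, then move on.
    clf = DecisionList(alpha=0.1, default_target=mfs)
    clf.fit(X_train, y_train)

    test_features = get_features(test_data, lexelt)
    X_test = create_vectors(test_features, findex)   # reuse the training index
    y_pred = clf.predict(X_test)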

Questions

Continue to put your answers to these questions in Writeup.md.

  1. Implement the decision list described above, apply it to the test data, and report your accuracy. Does your decision list outperform the MFS baseline?

  2. Examine the rules in the decision list for activate.v.
    • What is problematic about some of the rules you have generated? Give some examples.
  3. Choose two (or more) of the following and report your results. Provide some discussion: were you surprised by the result? Why do you think you got that result?
    • Add code to your decision list generator so that very common words (also known as “stopwords”) are excluded from being used as part of features. How you decide what “very common” means is up to you, but it’s probably best if it’s automatically generated rather than hand-entered. Does this help or hurt your performance? Report some numbers!
    • Implement case-folding (making everything lower-case) for the bag-of-words and collocations. Does this always make sense to do? Does this help or hurt performance? Report some numbers!
    • What if you used lemmas in your features? When you get the data using the get_data(fp) function, you can pass an optional argument use_spacy=True. This will add a new key to each instance so that in addition to having words, you will now have doc, a spacy document. You can make a new function called get_BoL_features(instance) that gets “bag of lemmas”. To get lemmas from a spacy doc:
      >>> doc = instance['doc']
      >>> lemmas = [token.lemma_ for token in doc]
      
    • Use Tfidf to downweight common features. Assuming your training matrix is called X_train and your test matrix is called X_test:
      >>> from sklearn.feature_extraction.text import TfidfTransformer
      >>> transformer = TfidfTransformer(smooth_idf=False)
      >>> transformer.fit(X_train)
      >>> X_train = transformer.transform(X_train).toarray()
      >>> X_test = transformer.transform(X_test).toarray()
      
  4. Is it possible that stopping before you get to the end of the decision list would improve results? In other words, perhaps instead of making it all the way to the end of your decision list, you stop whenever the score falls below 1.0, or perhaps you stop after seeing 100 rules. Once you reach the end of the rules you’re willing to consider, fall back on the MFS baseline. Try a couple of scenarios, see what happens, and discuss what you see.

  5. What if you used another classifier instead of your decision list classifier? If you wrote things to the scikit-learn API, this question is a piece of cake! This example uses the GaussianNB classifier but you can try any classifier in scikit-learn instead. To do so, simply import the classifier you want (e.g. from sklearn.naive_bayes import GaussianNB). Now find the line where you created your DecisionList object. It probably looks something like this in your code:

    decision_list = DecisionList(alpha=alpha, default_target=mfs)
    

    Simply replace that line with one of the classifiers from scikit-learn:

    decision_list = GaussianNB()
    

    You shouldn’t have to change anything else! Your code will now run using the classifier you chose from scikit-learn. Play around and see if you discover anything interesting!

  6. (Optional) Is there a correlation between the score a rule has and the accuracy of the prediction? Do high scoring rules work well? Do low scoring rules work equally well? A nice visualization could show a bar chart of the accuracy of the rules (y-axis) plotted against the score of the rule (x-axis). You’re welcome to come up with whatever you’d like here. It’s optional!