CS21 Lab 9: Sorting

Due Saturday night, April 14


Make sure all programs are saved to your cs21/labs/09 directory. Files outside this directory will not be graded.

$ update21
$ cd ~/cs21/labs/09

Topics for this assignment

Overview

This is a two part assignment:

  1. Sentiment Analysis - In sentiment.py, you will write a program that reads in movie reviews from Rotten Tomatoes and conducts sentiment analysis by sorting words based on how positive or negative they are.
  2. Sort Visualization - In visualize_sort.py, you will write a sorting algorithm with the goal of visualizing how the algorithm sorts a list of numbers.

1. Sentiment Analysis

A popular area of research in natural language processing is sentiment analysis. Typically, the goal is to develop algorithms to read natural text and classify its sentiment (e.g., negative vs. positive; 5-star vs 3-star product). For this lab, you will analyze data extracted from Rotten Tomatoes, a website that aggregates movie reviews from film critics. Your goal is to read a file containing movie reviews (made available by Stanford University’s NLP group and Kaggle) and use sorting to determine which words are most associated with positive reviews and which words are most associated with negative reviews.

Movie Review Data

The file of reviews is located in /usr/local/doc/movieReviews.txt:

$ head /usr/local/doc/movieReviews.txt
1 A series of escapades demonstrating the adage that what is good for the goose
 is also good for the gander , some of which occasionally amuses but none of
 which amounts to much of a story .
4 This quiet , introspective and entertaining independent is worth seeking .
1 Even fans of Ismail Merchant 's work , I suspect , would have a hard time
 sitting through this one .
3 A positively thrilling combination of ethnography and all the intrigue ,
 betrayal , deceit and murder of a Shakespearean tragedy or a juicy soap opera .
1 Aggressive self-glorification and a manipulative whitewash .
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by
 the title character undergoing midlife crisis .
1 Narratively , Trouble Every Day is a plodding mess .
3 The Importance of Being Earnest , so thick with wit it plays like a reading
 from Bartlett 's Familiar Quotations
1 But it does not leave you with much .
1 You could hate it for the same reason .

Each line is one review; the first element is an integer star rating from 0 through 4 with 0 being very negative and 4 being a very positive review. Each element after that is a word in the review.

Program Requirements

[ [word1, score1], [word2, score2], ... ]

or parallel lists

[word1, word2, ...]
[score1, score2, ...]

Processing Words

Your program must read the file and process the data in order to obtain the list of lists described above (word/rating pairs). The next two sections provide some details that will help you design your solution correctly. As you process each word in each line, keep track of the following:

>>> word = 'Film'
>>> newWord = word.lower()
>>> newWord
'film'
>>> from words import getStopWords
>>> ignoreWords = getStopWords()
>>> 'and' in ignoreWords
True
>>> 'film' in ignoreWords
False
>>> word = '.'
>>> word.isalpha()
False
>>> word = 'film'
>>> word.isalpha()
True

Sentiment Score

Your sentiment score is the cumulative rating of all reviews that contain that word. Keep in mind the following details:

For example, assume we have the following three reviews (available in /usr/local/doc/smallReviews.txt):

$ cat /usr/local/doc/smallReviews.txt
1 A terrible waste of talent on a bloated plot
4 The cast is full of talent and memorable performances
3 While the run time is bloated , this independent film is anything but run of
 the mill

You should obtain the following scores:

2 cast
2 full
2 memorable
2 performances
2 run
1 talent
1 time
1 independent
1 film
1 anything
1 mill
0 bloated
-1 terrible
-1 plot
-1 waste

The three reviews have adjusted ratings of -1, 2, and 1. 'run' appears twice in a review with an adjusted 1 score (third review) and thus has a cumulative score of 2. 'talent' appears in two different reviews and has a cumulative score of 1 (-1 from review 1, +2 from review 2). We suggest that use this smaller file (or make your own) for development and testing.

Here is some pseudocode that can help guide you:

for each word, rating pair read from file
  if word has not been seen
     append to unique-word list
     append rating to unique-word-rating
  else
     locate index of word in unique-word list
     add rating to the unique-word-rating[index]

There are many ways to implement this. As mentioned above, you can do this is parallel lists or a list of lists. The above pseudocode matches a parallel list implementation. The second option is to use a list of lists where each item is a list containing a string (the unique word) and a number (the current sentiment score). Finding if a new word is unique isn’t as easy as using the in/index() operators; you’ll have to do a linear search over the list of lists as you did last week.

Here is an example of what your list(s) might look like on the smallReviews.txt example.

Sorting multiple lists

Your sorting algorithm will sort words based on their sentiment value. This is different than what we did in class, where items were sorted based on their own value. You have two options, depending on if you use parallel lists or a list of lists:

  1. If you have parallel lists e.g., unique-words and sentiments, your algorithm will use sentiments for comparisons; e.g., sorting largest to smallest for selection sort will require looking for the index of the largest sentiment. The major change from the algorithm in class is the swap stage: you will swap the sentiment values as dictated in class, but you will at the same time swap the corresponding items in unique-words (i.e., swap in parallel).
#Swap i and j
tempSent = sentiments[i]
sentiments[i] = sentiments[j]
sentiments[j] = tempSent
tempWord = unique_words[i]
unique_words[i] = unique_words[j]
unique_words[j] = tempWord
  1. If you have a list of lists, your comparisons will be the part that changes. Since indices in the list are a list themselves, you’ll need to double index into the part that is the sentiment score (this is very similar to last week’s lab)

Final Output and Extensions

We have provide a total of three files to evaluate on:

We highly recommend testing and debugging in the order of the files listed above. Your code may take anywhere from 30 seconds to 2 minutes on the full data set, which is not practical for developing and testing. That has been provided to see what sentiment is found using all available data. Here are sample outputs for the larger two files (the output of smallReviews.txt is provided above):

Your final submission must use movieReviews.txt. It is okay if you have differences in case of ties (e.g., there are multiple words with sentiment of 19.00).

You’ll notice that the final output has some words that don’t really make sense, particularly amongst negative reviews (e.g., feels, movie, like). This is because we used a fairly naive scoring metric to keep the implementation simpler. As an extension (not required, and should be attempted only if the rest of the lab is complete), try other scoring metrics and see if you can come up with something better. Prof. Wicentowski, our local NLP (natural language processing) expert suggests the following formula for each unique word, w:

That is, for each unique word, keep track of the sum of ratings for that word (as above) and the total number of occurrences of that word. The first part of the equation is the average rating for a unique word (the write-up above only uses the sum) and the right term is the log of the number of occurrences of that word. The average and log help adjust for the frequency of words. Here are the (more sensible) results:


2. Visualizing a Sorting Algorithm

Part 2 of this assignment is more open-ended than our typical lab programs. Your task is to write a program, visual_sort.py, that displays a visualization of how one particular sorting algorithm works. This visualization can work by printing strings to the terminal, or — for the extra challenge — by creating an animation using Zelle graphics. Your program should allow its users to see how a particular sorting algorithm changes a list from unsorted to sorted, one swap at a time.

Requirements

Here are two examples of visualizing selection sort:

We would like to see if you can come up. Note that the graphics example is not the expectation - this is more of a challenge/extension if you are interested. The goal is to help your users understand the inner workings of your chosen algorithm, however you see fit. Along the way, hopefully you will also deepen your own understanding.

TIP: it is a good idea to pause your program after each major step. In the graphics example above, you can use getMouse() or getKey() to make the program wait until the user is ready to proceed. For the command-line versions, you can use input() e.g.,

#output some result

input("Hit Enter to Continue...")

#resume execution of program

It doesn’t matter what the user inputs, so I have not bothered to save the value in a variable.


3. Submit

Once you are satisfied with your programs, fill out the questionnaire in QUESTIONS-09.txt. Then run handin21 a final time to make sure we have access to the most recent versions of your file.

Acknowledgments

Thank you to Prof. Eric D. Manley and Timothy M. Urness at Drake University for the data and inspiration for the sentiment analysis portion of the lab.