CS 33: Lab #10

CS 33: Computer Organization

 

LAB 10: Creating and Using a Suffix Tree

Due 11:59pm Wednesday, December 10.

The program handin33 will only submit files in the cs33/lab/10 directory. (You should run update33 first to set up the directory and create any necessary files.)

Your program must follow these guidelines:

  • the file should be named using the name provided,
  • your code should be compiled using a Makefile,
  • your code should be adequately commented,
  • your program should follow all input and output guidelines exactly,
  • your program should gracefully report errors and exit, instead of crashing, and
  • your program should run through valgrind with no memory leaks! (We'll talk about valgrind in class on Thursday, 12/04.)


In this lab, you will write code which creates and makes use of a suffix tree. We began discussing and creating the suffix tree in class on Tuesday. Given a word (e.g. "banana"), a suffix tree is a tree that stores all of that word's suffixes (e.g. "banana", "anana", "nana", "ana", "na", and "a").
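As a small illustration of what "all of that word's suffixes" means in C, the sketch below relies on the fact that a suffix of a string is just a pointer into that same string. The names suffixAt and demoPrintSuffixes are hypothetical helpers made up for this example; they are not part of suffixtree.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not in suffixtree.c): returns the suffix of word
 * that starts at index start. Nothing is copied; the result points into
 * the original string. */
const char *suffixAt(const char *word, size_t start) {
  assert(word != NULL && start <= strlen(word));
  return word + start;
}

/* Hypothetical helper: prints every non-empty suffix, longest first. */
void demoPrintSuffixes(const char *word) {
  size_t i, len = strlen(word);
  for (i = 0; i < len; i++) {
    printf("%s\n", suffixAt(word, i));
  }
}
```

Calling demoPrintSuffixes("banana") walks through the six suffixes listed above, one per line.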

The tree has the following properties:

  • The root node (level 0) of the tree stores the character '^'.
  • Each path from the root of the tree to a leaf of the tree "spells out" the characters of exactly one suffix.
Here is a sketch of the suffix tree for "banana". Notice that each of the leaf nodes in the tree contains the special character '$'.

To implement this tree, we will use a representation called "first child/next sibling". As discussed in class, this method requires that each node in the tree have two pointers: one to its first child (a node below it in the tree) and one to its next sibling (a node on the same level in the tree). Open arrow heads (such as the arrows below and to the right of each '$' node) indicate that the pointer points to NULL. Arrows pointing to the right are "next sibling" pointers. Arrows pointing down (or diagonally down) are "first child" pointers.


Each node is represented using the following structure:
typedef struct node_t {
  char data;
  struct node_t *fc; /* first child */
  struct node_t *ns; /* next sibling */
} Node;
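To make the two pointers concrete, here is a minimal sketch of how nodes might be linked and searched in a first child/next sibling tree. The functions demoNewNode, demoAddChild, and demoFindChild are hypothetical stand-ins written for this example; the makeNode and getChild functions provided in suffixtree.c may behave differently (for instance, in the order they keep siblings).

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node_t {
  char data;
  struct node_t *fc; /* first child */
  struct node_t *ns; /* next sibling */
} Node;

/* Hypothetical stand-in for makeNode(): allocates a node with no children
 * and no siblings. Returns NULL if malloc fails. */
Node *demoNewNode(char letter) {
  Node *n = malloc(sizeof(Node));
  if (n != NULL) {
    n->data = letter;
    n->fc = NULL;
    n->ns = NULL;
  }
  return n;
}

/* Hypothetical helper: links child in as the new first child of parent.
 * The previous first child becomes the new node's next sibling. */
void demoAddChild(Node *parent, Node *child) {
  assert(parent != NULL && child != NULL);
  child->ns = parent->fc;
  parent->fc = child;
}

/* Hypothetical stand-in for getChild(): walks the sibling list below
 * parent and returns the child holding letter, or NULL if none does. */
Node *demoFindChild(Node *parent, char letter) {
  Node *cur;
  assert(parent != NULL);
  for (cur = parent->fc; cur != NULL; cur = cur->ns) {
    if (cur->data == letter) {
      return cur;
    }
  }
  return NULL;
}
```

Note that a parent only ever points at one child directly; every other child is reached by following ns pointers from that first child.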
In cs33/labs/10/suffixtree.c (which is the code we wrote in class with some minor tweaks), you will find the above structure defined, as well as the complete definition of these four functions which we wrote in class on Tuesday:
Node *makeSuffixTree(); 
Node *makeNode(char letter); 
Node *getChild(Node *node, char letter); 
void insertSuffix(Node *node, char *suffix);
I am also including two functions to read data which you will use at the end:
char *readFile(char *filname, int size); 
void safe(char *str, int length); 
Comments describing all six of the above functions are in the suffixtree.c file.

Your job is to write the following 6 functions:

void insertAllSuffixes(Node *tree, char *word); 
int countLeaves(Node *node);
Node *findNode(Node *tree, char *search);
int numAppears(Node *tree, char *search);
void freeTree(Node *tree);       /* OK, I LIED, THIS ISN'T OPTIONAL - IT'S NOT THAT HARD... */
void printSuffixes(Node *tree);  /* OPTIONAL, BUT USEFUL */
The comments in the suffixtree.c file will tell you exactly what each function should do.

Once you've implemented those 6 functions, you can substitute your main() function with this one (which is also provided in a comment at the end of the suffixtree.c file):

int main(int argc, char **argv) { 
  Node *tree; 
  char *buffer, *search; 
 
  if (argc != 2) { 
    fprintf(stderr, "Requires a text file as an argument.\n"); 
    exit(1); 
  } 
 
  buffer = readFile(argv[1], MAXBUFFER); 
 
  tree = makeSuffixTree(); 
  insertAllSuffixes(tree, buffer); 
  free(buffer); 
 
  search = malloc(sizeof(char) * MAXBUFFER); 
  do { 
    printf("Type a string to search for (an empty line to quit):\n"); 
    safe(search, MAXBUFFER); 
    if (search[0] != '\0') { 
      printf("Number of occurrences: %d\n", numAppears(tree, search)); 
    } 
  } while (search[0] != '\0'); 

  free(search); 
  freeTree(tree); 
 
  return 0; 
} 
This main() function reads a text file (provided as a command-line argument) into a very large string, then forms a suffix tree of this string. (Note that this string could have spaces, commas, etc., so it's not just a single word like we've seen so far.) After the suffix tree is formed, the program repeatedly asks you to enter a search string. After each search string, the number of occurrences of that string is reported.
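One way to sanity-check the counts your program reports is to compare them against a brute-force scan of the same text. The sketch below is not the suffix tree approach and is not part of suffixtree.c; naiveCount is a hypothetical helper that uses strstr and counts overlapping matches, which is an assumption about how occurrences should be counted (it agrees with counting one match per suffix that begins with the search string).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical cross-check helper: counts how many offsets in text begin
 * an occurrence of search, including overlapping occurrences. */
int naiveCount(const char *text, const char *search) {
  int count = 0;
  const char *hit;
  assert(text != NULL && search != NULL);
  if (search[0] == '\0') {
    return 0; /* treat the empty search string as matching nothing */
  }
  hit = strstr(text, search);
  while (hit != NULL) {
    count++;
    hit = strstr(hit + 1, search); /* +1 so overlapping matches count */
  }
  return count;
}
```

For example, naiveCount("banana", "ana") finds the matches at offsets 1 and 3. This scan rereads the text for every query, which is exactly the cost the suffix tree is meant to avoid.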

Run your program through valgrind and be sure you have no memory leaks. (We will talk about valgrind on Thursday.)


Extra Credit

  1. Assuming you only add one string (and all its suffixes) to the suffix tree (as you do in this assignment), there is a relatively straightforward extension that allows you to not only know how many times your search string appears in the text, but also allows you to report the locations of each of the matches (as offsets into the original text).

    The idea is simple: in each leaf node you will store the starting position of that suffix. For example, let's use our "banana" example from above. In the leaf node (the '$' node) corresponding to the suffix "nana$", you would store 2, since "nana" is the suffix you'd get by starting at position 2 in "banana" (counting from 0). You will have to add an extra field to the structure to accommodate this. (For non-leaves, this field is meaningless.)

    Now that you've added that extension, write this function:

      int *posAppears(Node *tree, char *search);
    
    This function returns an array of all of the starting positions of the search string.

    In numAppears, you returned the number of leaf nodes below the search string. Here, you will return a list of the positions stored in the leaf nodes below the search string.
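    The prototype above does not say how the caller learns how many positions were returned. One possible convention, assumed here purely for illustration, is to end the array with -1. The brute-force sketch below (naivePositions is a hypothetical helper, not the tree-based posAppears) shows that convention and can serve as a reference when checking your results; a tree-based version may report the same positions in a different order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference helper: returns a malloc'd array holding every
 * offset in text where search begins, terminated by -1 (an assumed
 * convention). The caller must free the array. Returns NULL if malloc
 * fails. */
int *naivePositions(const char *text, const char *search) {
  size_t n = 0, cap;
  int *out;
  const char *hit;
  assert(text != NULL && search != NULL);
  cap = strlen(text) + 1; /* at most one match per offset, plus sentinel */
  out = malloc(cap * sizeof(int));
  if (out == NULL) {
    return NULL;
  }
  if (search[0] != '\0') {
    for (hit = strstr(text, search); hit != NULL;
         hit = strstr(hit + 1, search)) {
      out[n++] = (int)(hit - text);
    }
  }
  out[n] = -1; /* sentinel marks the end of the list */
  return out;
}
```

    A caller would loop until it sees -1 and then free the array, which also keeps valgrind quiet.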

  2. Another straightforward extension is to allow case-insensitive searching. This allows you to match "swarthMORE" to "Swarthmore". To facilitate this, add a second command-line argument which, if it is equal to -i, turns on case-insensitive searching. For example, a user wanting to search Jabberwocky would type:
    ./suffixtree jabberwocky.txt -i
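    One possible approach, sketched below under the assumption that lowercasing both the file contents and each search string is acceptable, is to detect the flag and normalize the strings before they reach the tree. Both wantsIgnoreCase and lowercaseInPlace are hypothetical names invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a second argument is present and is
 * exactly "-i", and 0 otherwise. */
int wantsIgnoreCase(int argc, char **argv) {
  return argc == 3 && strcmp(argv[2], "-i") == 0;
}

/* Hypothetical helper: converts every character of str to lowercase.
 * The cast to unsigned char avoids undefined behavior in tolower() for
 * characters with negative values. */
void lowercaseInPlace(char *str) {
  assert(str != NULL);
  for (; *str != '\0'; str++) {
    *str = (char)tolower((unsigned char)*str);
  }
}
```

    With this approach, the argc check in main() would also need to accept three arguments, and the same lowercasing would be applied to the buffer before building the tree and to each search string before looking it up.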