Naive Bayes Example

You can create a truncated data set called small_house_votes.data that consists of the first 10 lines of the original house_votes.data file.

-1,1,-1,1,1,1,-1,-1,-1,1,0,1,1,1,-1,1,-1
-1,1,-1,1,1,1,-1,-1,-1,-1,-1,1,1,1,-1,0,-1
0,1,1,0,1,1,-1,-1,-1,-1,1,-1,1,1,-1,-1,1
-1,1,1,-1,0,1,-1,-1,-1,-1,1,-1,1,-1,-1,1,1
1,1,1,-1,1,1,-1,-1,-1,-1,1,0,1,1,1,1,1
-1,1,1,-1,1,1,-1,-1,-1,-1,-1,-1,1,1,1,1,1
-1,1,-1,1,1,1,-1,-1,-1,-1,-1,-1,0,1,1,1,1
-1,1,-1,1,1,1,-1,-1,-1,-1,-1,-1,1,1,0,1,-1
-1,1,-1,1,1,1,-1,-1,-1,-1,-1,1,1,1,-1,1,-1
1,1,1,-1,-1,-1,1,1,1,-1,-1,-1,-1,-1,0,0,1

Then you can compare your results on this data set to those given below to verify that you are calculating the probabilities correctly. Note that in all cases the probabilities within each group should sum to 1.0.

Naive Bayes depends on having accurate estimates of two kinds of probability: the probability of a label within a given data set, and the probability of a feature having a particular value. Because we take the product of these probabilities to determine the most likely label for a test case, we want to ensure that none of them is too small; a single estimate of zero would make the whole product zero. We therefore estimate the probabilities with the following formula:

                          nc + pm
estimate of probability = -------
                           n + m
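where nc is the number of examples that match the case being counted, n is the total number of examples considered, p is the prior estimate of the probability, and m is a constant that determines how heavily the prior is weighted. All of the results below use m = 10.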

Below are the probabilities you should be calculating for the labels. Notice that, since there are only two possible labels, the prior p is 1/2.

Label:  1 Count:  6 Prior:0.50 Prob:0.55
Label: -1 Count:  4 Prior:0.50 Prob:0.45
Sum: 1.00
Thus, for the label 1, plugging nc = 6, p = 0.5, m = 10, and n = 10 into the formula gives:
6 + 0.5*10      11
----------  =  ----  = 0.55
 10 + 10        20
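
The label calculation can be checked with a minimal sketch like the one below. It assumes the label is the last field of each line in the data set (which is consistent with the counts shown) and uses m = 10 as in the worked example; the function name m_estimate and the variable names are illustrative only, not part of the assignment.

```python
from collections import Counter


def m_estimate(nc, n, p, m):
    """Smoothed probability estimate (nc + p*m) / (n + m)."""
    return (nc + p * m) / (n + m)


# Labels of the 10 rows, ASSUMING the label is the last field of each line.
labels = [-1, -1, 1, 1, 1, 1, 1, -1, -1, 1]

M = 10  # weight on the prior; the worked example uses 10
counts = Counter(labels)
prior = 1 / len(counts)  # uniform prior over the distinct labels

label_probs = {
    label: m_estimate(count, len(labels), prior, M)
    for label, count in counts.items()
}

# Print in the same layout as the listing above.
for label in sorted(label_probs, reverse=True):
    print(f"Label: {label:2d} Count: {counts[label]:2d} "
          f"Prior:{prior:.2f} Prob:{label_probs[label]:.2f}")
print(f"Sum: {sum(label_probs.values()):.2f}")
```

Because the prior is uniform and every example carries exactly one label, the estimates add up to 1 for any choice of m.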

Similarly, we need to compute these probabilities for every possible combination of label, dimension, and value in the data set.

Initially, we look for all unique values in the entire data set that are found at a particular dimension and initialize the counts for all possible combinations of label, dimension, and value to 0.

Next we update the counts based on just the training data. For some dimensions there is only one possible value found in the data, giving a prior of 1/1. For others, two possible values were found, giving a prior of 1/2, and for still others, three possible values were found, giving a prior of 1/3.

Label: -1 Dim:  0 Value: -1 Count:  4 Prior:0.33 Prob:0.52
Label: -1 Dim:  0 Value:  0 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim:  0 Value:  1 Count:  0 Prior:0.33 Prob:0.24
Sum: 1.00
Label: -1 Dim:  1 Value:  1 Count:  4 Prior:1.00 Prob:1.00
Sum: 1.00
Label: -1 Dim:  2 Value: -1 Count:  4 Prior:0.50 Prob:0.64
Label: -1 Dim:  2 Value:  1 Count:  0 Prior:0.50 Prob:0.36
Sum: 1.00
Label: -1 Dim:  3 Value: -1 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim:  3 Value:  0 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim:  3 Value:  1 Count:  4 Prior:0.33 Prob:0.52
Sum: 1.00
Label: -1 Dim:  4 Value: -1 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim:  4 Value:  0 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim:  4 Value:  1 Count:  4 Prior:0.33 Prob:0.52
Sum: 1.00
Label: -1 Dim:  5 Value: -1 Count:  0 Prior:0.50 Prob:0.36
Label: -1 Dim:  5 Value:  1 Count:  4 Prior:0.50 Prob:0.64
Sum: 1.00
Label: -1 Dim:  6 Value: -1 Count:  4 Prior:0.50 Prob:0.64
Label: -1 Dim:  6 Value:  1 Count:  0 Prior:0.50 Prob:0.36
Sum: 1.00
Label: -1 Dim:  7 Value: -1 Count:  4 Prior:0.50 Prob:0.64
Label: -1 Dim:  7 Value:  1 Count:  0 Prior:0.50 Prob:0.36
Sum: 1.00
Label: -1 Dim:  8 Value: -1 Count:  4 Prior:0.50 Prob:0.64
Label: -1 Dim:  8 Value:  1 Count:  0 Prior:0.50 Prob:0.36
Sum: 1.00
Label: -1 Dim:  9 Value: -1 Count:  3 Prior:0.50 Prob:0.57
Label: -1 Dim:  9 Value:  1 Count:  1 Prior:0.50 Prob:0.43
Sum: 1.00
Label: -1 Dim: 10 Value: -1 Count:  3 Prior:0.33 Prob:0.45
Label: -1 Dim: 10 Value:  0 Count:  1 Prior:0.33 Prob:0.31
Label: -1 Dim: 10 Value:  1 Count:  0 Prior:0.33 Prob:0.24
Sum: 1.00
Label: -1 Dim: 11 Value: -1 Count:  1 Prior:0.33 Prob:0.31
Label: -1 Dim: 11 Value:  0 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim: 11 Value:  1 Count:  3 Prior:0.33 Prob:0.45
Sum: 1.00
Label: -1 Dim: 12 Value: -1 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim: 12 Value:  0 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim: 12 Value:  1 Count:  4 Prior:0.33 Prob:0.52
Sum: 1.00
Label: -1 Dim: 13 Value: -1 Count:  0 Prior:0.50 Prob:0.36
Label: -1 Dim: 13 Value:  1 Count:  4 Prior:0.50 Prob:0.64
Sum: 1.00
Label: -1 Dim: 14 Value: -1 Count:  3 Prior:0.33 Prob:0.45
Label: -1 Dim: 14 Value:  0 Count:  1 Prior:0.33 Prob:0.31
Label: -1 Dim: 14 Value:  1 Count:  0 Prior:0.33 Prob:0.24
Sum: 1.00
Label: -1 Dim: 15 Value: -1 Count:  0 Prior:0.33 Prob:0.24
Label: -1 Dim: 15 Value:  0 Count:  1 Prior:0.33 Prob:0.31
Label: -1 Dim: 15 Value:  1 Count:  3 Prior:0.33 Prob:0.45
Sum: 1.00
Label:  1 Dim:  0 Value: -1 Count:  3 Prior:0.33 Prob:0.40
Label:  1 Dim:  0 Value:  0 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim:  0 Value:  1 Count:  2 Prior:0.33 Prob:0.33
Sum: 1.00
Label:  1 Dim:  1 Value:  1 Count:  6 Prior:1.00 Prob:1.00
Sum: 1.00
Label:  1 Dim:  2 Value: -1 Count:  1 Prior:0.50 Prob:0.38
Label:  1 Dim:  2 Value:  1 Count:  5 Prior:0.50 Prob:0.62
Sum: 1.00
Label:  1 Dim:  3 Value: -1 Count:  4 Prior:0.33 Prob:0.46
Label:  1 Dim:  3 Value:  0 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim:  3 Value:  1 Count:  1 Prior:0.33 Prob:0.27
Sum: 1.00
Label:  1 Dim:  4 Value: -1 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim:  4 Value:  0 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim:  4 Value:  1 Count:  4 Prior:0.33 Prob:0.46
Sum: 1.00
Label:  1 Dim:  5 Value: -1 Count:  1 Prior:0.50 Prob:0.38
Label:  1 Dim:  5 Value:  1 Count:  5 Prior:0.50 Prob:0.62
Sum: 1.00
Label:  1 Dim:  6 Value: -1 Count:  5 Prior:0.50 Prob:0.62
Label:  1 Dim:  6 Value:  1 Count:  1 Prior:0.50 Prob:0.38
Sum: 1.00
Label:  1 Dim:  7 Value: -1 Count:  5 Prior:0.50 Prob:0.62
Label:  1 Dim:  7 Value:  1 Count:  1 Prior:0.50 Prob:0.38
Sum: 1.00
Label:  1 Dim:  8 Value: -1 Count:  5 Prior:0.50 Prob:0.62
Label:  1 Dim:  8 Value:  1 Count:  1 Prior:0.50 Prob:0.38
Sum: 1.00
Label:  1 Dim:  9 Value: -1 Count:  6 Prior:0.50 Prob:0.69
Label:  1 Dim:  9 Value:  1 Count:  0 Prior:0.50 Prob:0.31
Sum: 1.00
Label:  1 Dim: 10 Value: -1 Count:  3 Prior:0.33 Prob:0.40
Label:  1 Dim: 10 Value:  0 Count:  0 Prior:0.33 Prob:0.21
Label:  1 Dim: 10 Value:  1 Count:  3 Prior:0.33 Prob:0.40
Sum: 1.00
Label:  1 Dim: 11 Value: -1 Count:  5 Prior:0.33 Prob:0.52
Label:  1 Dim: 11 Value:  0 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim: 11 Value:  1 Count:  0 Prior:0.33 Prob:0.21
Sum: 1.00
Label:  1 Dim: 12 Value: -1 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim: 12 Value:  0 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim: 12 Value:  1 Count:  4 Prior:0.33 Prob:0.46
Sum: 1.00
Label:  1 Dim: 13 Value: -1 Count:  2 Prior:0.50 Prob:0.44
Label:  1 Dim: 13 Value:  1 Count:  4 Prior:0.50 Prob:0.56
Sum: 1.00
Label:  1 Dim: 14 Value: -1 Count:  2 Prior:0.33 Prob:0.33
Label:  1 Dim: 14 Value:  0 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim: 14 Value:  1 Count:  3 Prior:0.33 Prob:0.40
Sum: 1.00
Label:  1 Dim: 15 Value: -1 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim: 15 Value:  0 Count:  1 Prior:0.33 Prob:0.27
Label:  1 Dim: 15 Value:  1 Count:  4 Prior:0.33 Prob:0.46
Sum: 1.00
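
The listing above can be reproduced with a sketch along the following lines. It assumes the label is the last field of each line, uses m = 10 as in the worked example, and treats the 10 rows as both the entire data set and the training data. The names conditional_probs and m_estimate are illustrative and not part of the assignment.

```python
def m_estimate(nc, n, p, m):
    """Smoothed probability estimate (nc + p*m) / (n + m)."""
    return (nc + p * m) / (n + m)


# The 10 rows of small_house_votes.data shown earlier.
RAW = """\
-1,1,-1,1,1,1,-1,-1,-1,1,0,1,1,1,-1,1,-1
-1,1,-1,1,1,1,-1,-1,-1,-1,-1,1,1,1,-1,0,-1
0,1,1,0,1,1,-1,-1,-1,-1,1,-1,1,1,-1,-1,1
-1,1,1,-1,0,1,-1,-1,-1,-1,1,-1,1,-1,-1,1,1
1,1,1,-1,1,1,-1,-1,-1,-1,1,0,1,1,1,1,1
-1,1,1,-1,1,1,-1,-1,-1,-1,-1,-1,1,1,1,1,1
-1,1,-1,1,1,1,-1,-1,-1,-1,-1,-1,0,1,1,1,1
-1,1,-1,1,1,1,-1,-1,-1,-1,-1,-1,1,1,0,1,-1
-1,1,-1,1,1,1,-1,-1,-1,-1,-1,1,1,1,-1,1,-1
1,1,1,-1,-1,-1,1,1,1,-1,-1,-1,-1,-1,0,0,1
"""

data = [[int(x) for x in line.split(",")] for line in RAW.split()]
M = 10  # weight on the prior, as in the worked example


def conditional_probs(all_rows, train_rows, m=M):
    """Estimate P(value | label) for every (label, dim, value) combination."""
    num_dims = len(all_rows[0]) - 1  # ASSUMPTION: last field is the label
    # Unique values per dimension come from the entire data set.
    values = [sorted({row[d] for row in all_rows}) for d in range(num_dims)]
    labels = sorted({row[-1] for row in all_rows})
    # Initialize every (label, dim, value) count to 0.
    counts = {(lab, d, v): 0
              for lab in labels for d in range(num_dims) for v in values[d]}
    label_counts = {lab: 0 for lab in labels}
    # Update the counts from the training data only.
    for row in train_rows:
        lab = row[-1]
        label_counts[lab] += 1
        for d in range(num_dims):
            counts[(lab, d, row[d])] += 1
    # The prior for a dimension is 1 / (number of values seen there).
    probs = {}
    for (lab, d, v), nc in counts.items():
        prior = 1 / len(values[d])
        probs[(lab, d, v)] = m_estimate(nc, label_counts[lab], prior, m)
    return values, counts, probs


values, counts, probs = conditional_probs(data, data)

# Print the first group (label -1, dimension 0) in the listing's layout.
for v in values[0]:
    key = (-1, 0, v)
    print(f"Label: -1 Dim:  0 Value: {v:2d} Count: {counts[key]:2d} "
          f"Prior:{1 / len(values[0]):.2f} Prob:{probs[key]:.2f}")
```

Here n is the number of training examples carrying the given label, so the estimates for each label and dimension sum to 1 across that dimension's values.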