*“Your product sucks!”*

It does not take a genius to know this is a bad review.

If I saw such a review on Amazon, I would probably be cautious about buying the product.

In this article, we will look at how to teach a machine to classify good and bad reviews using the naive Bayes model.

Let’s get started.

**Data Sets**

In this article, we will use the Sentiment Labelled Sentences data sets provided in the UCI machine learning repository.

Click on the link above and go to: **Download Data Folder**

Download the zip file and unzip it.

There are three data files in the zipped file.

We will look at the **amazon_cells_labelled.txt** file.

Place this file in your SAS University Edition shared folder.

Run the code below to import the data:

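** Fileref for the data file (the path assumes the default SAS University Edition shared folder; adjust it to your setup) **;

filename amazon '/folders/myfolders/amazon_cells_labelled.txt';

data amazon_raw;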

infile amazon dsd dlm='09'x;

input text_temp : $1000. label;

run;

The data set is imported!

**Understand the Data**

Before we start building the model, we must first understand the data on hand.

Let's take a quick look at the data set. There are 1,000 observations, with two columns:

The Text_temp column contains the written reviews.

The Label column identifies whether the review is positive (1) or negative (0).

Let's run a frequency table on the Label column:

proc freq data=amazon_raw;

table label;

run;

**Quick Introduction to the Naive Bayes Model**

Let's look at a very simple example of how it works.

Suppose we have seven labelled reviews. Three of them are positive:

- Great product!
- Excellent stuff
- Work great

There are four negative reviews:

- Did not work at all
- Refund requested
- Not that great
- Bad stuff

There is also one unknown review:

*"Great stuff"*

A human can easily identify this as a positive review.

But how can a machine evaluate the nature of the review?

The machine can estimate how likely the review is to be positive or negative, based on the seven known reviews that we have.

For example, if "great stuff" is a positive review, we would expect these two words to appear in the other positive reviews as well.

If that's not the case, we would assume this is not a positive review.

Mathematically, we are looking at this probability:

**P ( Positive | Review = “Great Stuff” )**

Intuitively, this is the probability that the review "great stuff" is positive.

This is also a conditional probability with two components:

- **The probability that the review is positive**
- **On condition that the comment is "great stuff"**

A conditional probability can be calculated with the following formula.

The conditional probability of B given A is written as:

P (B|A)

= P (B and A) / P (A)

= P (B) * P (A|B) / P (A)

Similarly, the conditional probability of a positive review, given that the review is "great stuff", can be written as:

P(Positive | "Great stuff")

= P (Positive and "Great stuff" ) / P ("Great stuff")

**= P (Positive) x P ("Great stuff" | Positive) / P ("Great stuff")**

The formula has three components:

- **Prior probability:** the probability of a positive review
- **Likelihood:** how likely the review is to contain the words "great stuff", given that it is a positive review
- **Marginal probability:** the probability that the review is "great stuff"

Note: we do not need to calculate the **marginal probability**. This will be explained shortly.

Let's look at each component, one by one.

**Prior Probability -** *P (Positive)*

The prior probability is the overall probability of getting a positive review, regardless of what the review says.

In our example, there are three positive reviews out of seven.

The prior probability, **P (Positive)**, would be 3/7 = **0.4286**.

**Likelihood -** *P ("Great stuff" | Positive)*

The second component is the likelihood that the review is "great stuff", given that it is a positive review.

What does that even mean?

It means: given that it is a positive review, what is the probability that the review contains "great stuff"?

How do we know?

We will look at the three positive reviews that we have, and find out how many of them contain the words "great" and "stuff".

Positive reviews:

- **Great** product!
- Excellent **stuff**
- Work **great**

The word "great" appears in two out of three positive reviews, while the word "stuff" appears in one out of three positive reviews.

The likelihood that a positive review contains the word "great" is **2/3.**

Similarly, the likelihood that a positive review contains the word "stuff" is **1/3**.

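The naive Bayes model assumes that, within a class, the words of a review occur independently of each other (this is the "naive" part). Under that assumption, the likelihood that a positive review contains the phrase "great stuff" is the product of the two: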

P ("Great stuff" | positive)

= P ("Great" | positive) * P ("stuff" | positive)

= (2/3) * (1/3)

= **0.2222**
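If you would like to check these counts in SAS, here is a minimal sketch. The data set name POS_REVIEWS and the variable names are made up for this illustration; it simply counts how many of the three positive toy reviews contain each word and multiplies the two proportions.

```sas
/* Toy data: the three positive reviews from the example above */
data pos_reviews;
  infile datalines truncover;
  input review $char40.;
  datalines;
Great product!
Excellent stuff
Work great
;
run;

data _null_;
  set pos_reviews end=eof;
  /* FINDW returns 0 when the word is absent; the 'i' modifier ignores case */
  n_great + (findw(review, 'great', ' !', 'i') > 0);
  n_stuff + (findw(review, 'stuff', ' !', 'i') > 0);
  if eof then do;
    like_great  = n_great / _n_;              /* share of reviews with "great" */
    like_stuff  = n_stuff / _n_;              /* share of reviews with "stuff" */
    like_phrase = like_great * like_stuff;    /* naive independence assumption */
    put like_great= 6.4 like_stuff= 6.4 like_phrase= 6.4;
  end;
run;
```

The log should report values close to 2/3, 1/3 and 0.2222, matching the hand calculation.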

**Marginal Probability -** *P ("great stuff")*

The marginal probability is the probability that the review contains "great stuff".

We do not need to calculate the marginal probability when building a naive Bayes model.

Here's why.

The objective of this exercise is to find out how likely it is that the review is positive vs. negative.

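The two probabilities can be calculated using the formula we have just discussed:

P (Positive | "Great stuff") = P (Positive) x P ("Great stuff" | Positive) / P ("Great stuff")

P (Negative | "Great stuff") = P (Negative) x P ("Great stuff" | Negative) / P ("Great stuff")

You will notice that the two probabilities have the exact same denominator.

If our goal is to evaluate which probability is higher, we can simply compare the numerators when the denominators are the same.

Now, let's compare the two numerators:

P (Positive) x P ("Great stuff" | Positive) vs. P (Negative) x P ("Great stuff" | Negative)

On the left-hand side, we have already calculated the prior probability and the likelihood:

P (Positive) x P ("Great stuff" | Positive) = (3/7) x (2/9) = **0.0952**

Similarly, we can calculate the numerator value for the negative-review scenario. Four of the seven reviews are negative, and the words "great" and "stuff" each appear in one of those four:

P (Negative) x P ("Great stuff" | Negative) = (4/7) x (1/4) x (1/4) = **0.0357**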

Bingo!

We have a clear winner. The probability of the positive review scenario is higher than that of the negative review scenario (0.0952 vs. 0.0357).

We conclude that this review is positive.
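The whole comparison can also be sketched in a single DATA step. The counts below are read off the seven toy reviews, and the variable names are invented for this illustration only.

```sas
data _null_;
  /* Counts read off the seven toy reviews */
  n_pos = 3;  n_neg = 4;
  great_pos = 2;  stuff_pos = 1;   /* positive reviews containing each word */
  great_neg = 1;  stuff_neg = 1;   /* negative reviews containing each word */

  prior_pos = n_pos / (n_pos + n_neg);
  prior_neg = n_neg / (n_pos + n_neg);

  /* Numerator = prior x product of the per-word likelihoods */
  score_pos = prior_pos * (great_pos / n_pos) * (stuff_pos / n_pos);
  score_neg = prior_neg * (great_neg / n_neg) * (stuff_neg / n_neg);

  pred = (score_pos > score_neg);   /* 1 = positive, 0 = negative */
  put score_pos= 6.4 score_neg= 6.4 pred=;
run;
```

Because only the ordering of the two scores matters, the shared denominator P ("Great stuff") never has to be computed.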


**Classifying Amazon Reviews**

Now that you have a basic understanding of how the naive Bayes model works, let's begin classifying the Amazon reviews.

Before we start implementing the model, there are a number of text processing tasks that we need to do.

**Removing Punctuation Marks**

Punctuation marks usually don't have any positive or negative meaning. So, we will remove them from our text.

data amazon_temp1;

set amazon_raw;

** Remove Punctuation marks **;

orig_text = translate(text_temp,' ',',.;:?+=-!@#$%^&*(){}[]\|"/><');

orig_text = ' ' || trim(orig_text) || ' ';

** Replace space by @#@# **;

orig_text2 = upcase(tranwrd(orig_text, ' ', '@#@#'));

drop text_temp;

run;

In this data step, we have done two things:

- We have removed all the punctuation marks using the TRANSLATE function
- We have replaced every space with the string '@#@#'
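To see what these two operations produce, here is a small sketch on a single made-up review. The review text and the variable names RAW and TOKENS are assumptions for illustration, and only three punctuation marks are listed to keep it short.

```sas
data _null_;
  length raw tokens $ 200;
  raw = 'Great phone, works well!';   /* made-up review */

  /* 1. TRANSLATE turns the listed punctuation marks into blanks.
     2. The trimmed text is wrapped in one leading and one trailing blank.
     3. TRANWRD swaps every blank for '@#@#' and UPCASE upper-cases it. */
  tokens = upcase(tranwrd(' ' || trim(translate(raw, ' ', ',.!')) || ' ',
                          ' ', '@#@#'));
  put tokens=;
run;
```

In the resulting string every word is enclosed by '@#' on both sides (for example @#GREAT@#), which is what makes whole-word matching possible in the next step.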

Using '@#@#' in place of a space allows us to easily split the words.

**Removing Stop words**

Stop words are words that carry little meaning on their own, such as "a", "an", "while" and "until".

Removing them from our data helps to reduce the noise and should give us a better model.

** Stop words **;

data stopwords;

infile datalines dsd;

input stopwords : $30. @@;

datalines;

I,ME,MY,MYSELF,WE,OUR,OURS,OURSELVES,YOU,YOUR,YOURS,YOURSELF,YOURSELVES

HE,HIM,HIS,HIMSELF,SHE,HER,HERS,HERSELF,IT,ITS,ITSELF,THEY,THEM,THEIR,THEIRS

THEMSELVES,WHAT,WHICH,WHO,WHOM,THIS,THAT,THESE,THOSE,AM,IS,ARE,WAS,WERE,BE

BEEN,BEING,HAVE,HAS,HAD,HAVING,DO,DOES,DID,DOING,A,AN,THE,AND,BUT,IF,OR

BECAUSE,AS,UNTIL,WHILE,OF,AT,BY,FOR,WITH,ABOUT,AGAINST,BETWEEN,INTO,THROUGH

DURING,BEFORE,AFTER,ABOVE,BELOW,TO,FROM,UP,DOWN,IN,OUT,ON,OFF,OVER,UNDER

AGAIN,FURTHER,THEN,ONCE,HERE,THERE,WHEN,WHERE,WHY,HOW,ALL,ANY,BOTH,EACH,FEW

MORE,MOST,OTHER,SOME,SUCH,NO,NOR,ONLY,OWN,SAME,SO,THAN,TOO,VERY,S,T,CAN

WILL,JUST,DON,SHOULD,NOW

;

run;

proc sql noprint;

select stopwords into : stopwords separated by ' ' from stopwords;

select count(stopwords) into : num_stopwords from stopwords;

quit;

%put &stopwords &num_stopwords;

data amazon_temp2;

set amazon_temp1;

stopwords = "&stopwords";

text = orig_text2;

do i = 1 to &num_stopwords;

word = '@#' || scan(stopwords, i) || '@#';

text = tranwrd(text, trim(word), '');

end;

text = tranwrd(text, '@#', ' ');

text = tranwrd(text, '@', ' ');

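** Random number used later to shuffle the rows before splitting (the seed 12345 is an assumption) **;

randno = ranuni(12345);

drop orig_text2 i stopwords word;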

run;

The AMAZON_TEMP2 data set now contains four columns:

- Label
- Orig_text (original text)
- Text (cleaned text)
- Randno (random number generated to split the data into training/test/real-life sets)
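The stop-word loop can be traced on a single string. This is only a sketch with a made-up, already tokenized review and two stop words written out by hand. Note that TRANWRD substitutes a single blank when the replacement string is empty, which is why some extra blanks are left behind; COMPBL is added here only to tidy them up for display.

```sas
data _null_;
  length text $ 200;
  text = '@#@#THE@#@#PHONE@#@#IS@#@#GREAT@#@#';   /* made-up tokenized review */

  /* Remove two stop words as whole tokens, delimiters included */
  text = tranwrd(text, '@#THE@#', '');
  text = tranwrd(text, '@#IS@#', '');

  /* Turn the remaining markers back into blanks and squeeze repeats */
  text = tranwrd(text, '@#', ' ');
  text = compbl(text);
  put text=;
run;
```

Only PHONE and GREAT should remain. Because each stop word is matched together with its surrounding '@#' markers, a word such as THEIR is not damaged when THE is removed.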

**Splitting Data into Training/Test/Real-life Sets**

Our next step is to split the data into three sets:

- Training (810 records)
- Test (90 records)
- Real-life (100 records)

We will use the Training set to build the model, and apply the model to the Test set.

Finally, we will check whether the model can be generalized to the 'unseen' real-life data that we have set aside.

proc sort data=amazon_temp2; by randno; run;

data amazon ;

set amazon_temp2;

if _n_ <= 810 then group = 'Training';

else if _n_ <= 900 then group = 'Test';

else group = 'Realdata';

run;

data training_amazon (drop=randno) test_amazon (drop=randno) real_data (drop=randno) ;

set amazon;

if group = 'Training' then output training_amazon;

else if group = 'Test' then output test_amazon;

else output real_data;

run;

We now have three data sets:

- TRAINING_AMAZON
- TEST_AMAZON
- REAL_DATA
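Before moving on, it can be worth checking the split. A quick sketch, assuming the AMAZON data set created above:

```sas
/* Cross-tabulate the split group against the review label */
proc freq data=amazon;
  tables group * label / norow nocol nopercent;
run;
```

The row totals should be 810, 90 and 100, and each group should contain a reasonable mix of positive and negative reviews.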


**Finding Unique Words**

Now we are going to put together the list of unique words in the data set.

We will make use of a macro for this task.

proc sql noprint;

select count(*) into: num_rows from amazon;

quit;

%put &num_rows;

** Getting distinct words **;

%macro sentence (num);

data sentence;

word = 'Random';

dummy = 1;

run;

%do num = 1 %to &num_rows;

data sen&num;

set amazon;

if _n_ = &num;

countw = countw(text);

dummy = 0;

do i = 1 to countw;

word = upcase(scan(text, i));

output;

end;

keep word dummy;

run;

proc sql;

create table sentence as

select word, dummy

from sentence union

select word, dummy

from sen&num

order by word;

quit;

proc delete data=sen&num;

quit;

%end;

%mend;

%sentence (&num_rows);
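The macro above processes one review at a time, which can be slow on larger data. As an alternative sketch (not the approach used in the rest of this article), the same list of distinct words could be built in a single pass. The data set names WORDS_LONG and UNIQUE_WORDS are made up for this example.

```sas
/* One output row per word occurrence */
data words_long;
  set amazon;
  length word $ 200;
  do i = 1 to countw(text);
    word = upcase(scan(text, i));
    output;
  end;
  keep word;
run;

/* NODUPKEY keeps one row for each distinct value of WORD */
proc sort data=words_long out=unique_words nodupkey;
  by word;
run;
```

To feed this list into the likelihood macro in the next section, you would still need to add the row counter N that the macro expects.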

**Calculating the Likelihood of Each Unique Word**

Now that we have the unique words, we need to calculate the likelihood of each unique word, given it is a positive or negative review.

** Likelihood **;

data sentence_all;

set sentence (where=(dummy = 0));

n = _n_;

run;

%macro likeli (in, out);

proc sql noprint;

select count(*) into: num_words from &in;

quit;

%let num_words = &num_words;

%put &num_words;

data word;

length word $ 1000;

word = ' ';

like_pos = .;

like_neg = .;

dummy = 1;

run;

%do num = 1 %to &num_words;

proc sql;

create table word_temp1 as

select a.word, b.group, b.text, b.label

from &in a, amazon b

where a.n = &num

order by word;

quit;

data word_temp2;

set word_temp1;

by word;

if first.word then do;

pos = 0;

neg = 0;

total_pos = 0;

total_neg = 0;

end;

k = find(text, trim(word) || ' ', 'i');

** Count training group only **;

if group = 'Training' then do;

if label = 0 then do;

total_neg+1;

if k^=0 then neg+1;

end;

else if label = 1 then do;

total_pos+1;

if k^=0 then pos+1;

end;

end;

run;

data word_temp3;

set word_temp2;

by word;

if last.word;

pos = sum(pos, 1);

neg = sum(neg, 1);

total_pos = sum(total_pos, 1);

total_neg = sum(total_neg, 1);

like_pos = pos/total_pos;

like_neg = neg/total_neg;

keep word like_pos like_neg;

run;

data word;

set word word_temp3;

if dummy ^= 1;

run;

proc delete data=word_temp1;

proc delete data=word_temp2;

proc delete data=word_temp3;

quit;

%end;

%mend;

%likeli(sentence_all);

The WORD data set has three columns:

- WORD (unique words)
- LIKE_POS (likelihood of the word given a positive review)
- LIKE_NEG (likelihood of the word given a negative review)
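You may have noticed that the macro adds 1 to every count before dividing. Without this adjustment, a word that never appears in, say, the positive training reviews would have a likelihood of zero, and a single zero would wipe out the whole product for that class. The sketch below uses hypothetical counts to show the difference.

```sas
data _null_;
  /* Hypothetical counts: a word never seen in 400 positive training reviews */
  pos = 0;
  total_pos = 400;

  raw_like      = pos / total_pos;               /* zero: would cancel the product */
  adjusted_like = (pos + 1) / (total_pos + 1);   /* small but non-zero, as in the macro */
  put raw_like= adjusted_like=;
run;
```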

**Classification**

We will now use the likelihoods that we have computed, along with the prior probabilities, to classify the reviews in the Test set.

We will make use of the macro below:

%macro predict (base, in, out);

proc freq data=&base noprint;

table label / out=freq_base;

run;

proc transpose data=freq_base out=t_freq prefix=B;

var percent;

id label;

run;

data prior;

set t_freq;

length word $ 1000;

word = 'Prior';

like_pos = B1/100;

like_neg = B0/100;

keep word like_pos like_neg;

run;

proc delete data=freq_base; run;

proc delete data=t_freq; run;

proc sql noprint;

select count(*) into: num_rows from &in;

quit;

%put &num_rows;

data pred;

length text orig_text $ 1000;

n = 0;

pred = .;

label = .;

run;

%do num = 1 %to &num_rows;

data sen&num;

set &in;

if _n_ = &num;

n = &num;

countw = countw(text);

do i = 1 to countw;

word = upcase(scan(text, i));

output;

end;

keep text orig_text word n label;

run;

proc sql;

create table pred_temp_&num as

select a.text, a.orig_text, a.n, a.word, a.label, b.like_pos, b.like_neg

from sen&num a inner join word b

on a.word = b.word

order by a.word;

quit;

data pred&num;

set prior pred_temp_&num end=eof;

retain poss_pos poss_neg 1;

poss_pos = poss_pos * like_pos;

poss_neg = poss_neg * like_neg;

if poss_pos > poss_neg then pred = 1;

else pred = 0;

if eof = 1;

keep text orig_text n pred label;

run;

proc append base=pred data=pred&num;

run;

proc delete data=sen&num;

proc delete data=pred_temp_&num;

proc delete data=pred&num;

%end;

data &out;

set pred;

if n ^= 0;

run;

proc freq data=&out;

table label * pred / nocol nopercent;

run;

%mend;

%predict(training_amazon, test_amazon, predict_amazon);

We classify the reviews correctly 80% of the time.

Now, let's apply the model to the "unseen" real-life data using our previously created macro:

%predict(training_amazon, real_data, predict_real_data);

The correct classification rate is (38+45)/100 = 83%
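Instead of reading the rate off the frequency table, it could also be computed directly. A minimal sketch, assuming the PREDICT_REAL_DATA data set produced by the macro call above:

```sas
proc sql;
  /* (label = pred) evaluates to 1 for a correct prediction and 0 otherwise,
     so its mean is the share of correctly classified reviews */
  select count(*) as n_reviews,
         mean(label = pred) as accuracy format=percent8.1
  from predict_real_data;
quit;
```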

Great! We have successfully built a naive Bayes model that classifies reviews correctly 80% of the time on the Test set and 83% of the time on the unseen data.