
Classify Product Reviews on Amazon Using Naïve Bayes Model in SAS

“Your product sucks!”
 
It does not take a genius to know this is a bad review.

When I see a review like this on Amazon, I would be cautious about buying the product.
 
In this article, we will look at how to teach a machine to classify good and bad reviews using the naive Bayes model.
 
Let’s get started.

Software

Before we continue, make sure you have access to SAS Studio. It's free!

Data Sets

 

In this article, we will use the Sentiment Labelled Sentences data sets provided in the UCI machine learning repository.

Click on the link above and go to: Download Data Folder

Download the zip file and unzip it:

There are three data files in the zipped file.

We will look at the amazon_cells_labelled.txt file:

Place this file in your SAS University Edition shared folder:

Run the code below to import the data:

filename amazon '/folders/myfolders/amazon_cells_labelled.txt';

data amazon_raw;
infile amazon dsd dlm='09'x; ** the file is tab-delimited ('09'x is the tab character) **;
input text_temp : $1000. label;
run;

The data set is imported!

Understand the Data
 

Before we start building the model, we must first understand the data on hand.

Let's take a quick look at the data set. There are 1,000 observations, with two columns:

The Text_temp column contains the written reviews.

The Label column identifies whether the review is positive (1) or negative (0).
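
To take this quick look yourself, a PROC CONTENTS and a PROC PRINT of the first few rows (optional checks, not part of the original import code) will do the job:

** Quick look at the structure and the first few reviews **;
proc contents data=amazon_raw;
run;

proc print data=amazon_raw (obs=5);
run;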

Let's run a frequency table on the Label column:

proc freq data=amazon_raw;
table label;
run;

There are 500 positive and 500 negative reviews.
Quick Introduction to the Naive Bayes Model
 
The naive Bayes model is a machine learning model that is commonly used in text classifications.

Let's look at a very simple example of how it works.

Suppose we have a small set of seven labelled reviews. Three of them are positive:

  • Great product!
  • Excellent stuff
  • Work great

There are four negative reviews:

  • Did not work at all
  • Refund requested
  • Not that great
  • Bad stuff

There is also one unknown review:

"Great stuff"

For a human, we can easily identify this as a positive review.

But how can a machine evaluate the nature of the review?

The machine can estimate how likely the review is positive or negative, based on the seven known reviews that we have.

For example, if "great stuff" is a positive review, we would have expected this phrase or these two words to appear in the other positive reviews, as well.

If that's not the case, we would assume this is not a positive review.


 

Mathematically, we are looking at this probability:

  • P ( Positive | Review = “Great Stuff” )

Intuitively, this is the probability that the review "great stuff" is positive.

This is also a conditional probability with two components:

  1. The probability that the review is positive
  2. On condition that the review is "great stuff"

Why do we want to think of it as a conditional probability?

Because a conditional probability can be calculated with Bayes' theorem.

The conditional probability of B, given A, is written as:

P (B|A)
= P (B and A) / P (A)
= P (B) * P (A|B) / P (A)

Similarly, the conditional probability of a positive review, given that the review is "great stuff", can be written as:

P (Positive | "Great stuff")
= P (Positive and "Great stuff") / P ("Great stuff")
= P (Positive) * P ("Great stuff" | Positive) / P ("Great stuff")

There are three components to this formula:
  • Prior probability: the probability of a positive review
  • Likelihood: how likely the review contains the words "great stuff", given it is a positive review
  • Marginal probability: the probability that the review is "great stuff"


Note: we do not need to calculate the marginal probability. This will be explained shortly.

Let's look at each component, one by one.

Prior Probability - P (Positive) 

The prior probability is the probability of getting a positive review as a whole, regardless of what review it is.

In our example, there are three positive reviews out of seven.

The prior probability, P (Positive), would be 3/7 ≈ 0.4286.
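
If you would like to reproduce this toy example in SAS, the data step below simply re-creates the seven reviews (the data set name toy_reviews is just for illustration), and PROC FREQ returns the same prior:

** Re-create the seven labelled reviews from the example **;
data toy_reviews;
input label review & $30.;
datalines;
1 Great product!
1 Excellent stuff
1 Work great
0 Did not work at all
0 Refund requested
0 Not that great
0 Bad stuff
;
run;

proc freq data=toy_reviews;
table label;
run;

The percentage reported for label = 1 (42.86%) is exactly the prior probability 3/7.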

Likelihood - P ("Great stuff" | Positive)

The second component is the likelihood that the review is "great stuff", given that it is a positive review.

What does that even mean?

It means: given that a review is positive, what is the probability that it contains the words "great stuff"?

How do we know?

We will look at the three positive reviews that we have and find out how many of them contain the words "great" and "stuff".

Positive reviews:

  • Great product!
  • Excellent stuff
  • Work great


The word "great" appears in two out of three positive reviews, while the word "stuff" appears in one out of three positive reviews.

The likelihood that a positive review contains the word "great" is 2/3. 

Similarly, the likelihood that a positive review contains the word "stuff" is 1/3.

The likelihood that a positive review contains the phrase "great stuff" is the product of the two (treating the words as independent is the "naive" assumption in the naive Bayes model):

P ("Great stuff" | Positive)
= P ("Great" | Positive) * P ("Stuff" | Positive)
= (2/3) * (1/3)
= 0.2222

Marginal Probability - P ("great stuff")

The marginal probability is the probability that the review contains "great stuff".

We do not need to calculate the marginal probability when building a naive Bayes model.

Here's why.

The objective of this exercise is to find out how likely it is that the review is positive vs. negative.

We will classify the review as positive or negative, based on which probability is higher.

The two probabilities can be calculated using the formula we have just discussed:

P (Positive | "Great stuff") = P (Positive) * P ("Great stuff" | Positive) / P ("Great stuff")
P (Negative | "Great stuff") = P (Negative) * P ("Great stuff" | Negative) / P ("Great stuff")

You will notice that the two probabilities have the exact same denominator.

We have the same marginal probability on each side.

If our goal is to evaluate which probability is higher, we can simply compare the numerators when the denominators are the same.

Now, let's compare the two numerators.

For the positive scenario, we have already calculated the prior probability and the likelihood:

P (Positive) * P ("Great stuff" | Positive) = (3/7) * (2/3) * (1/3) = 0.0952

Similarly, we can calculate the numerator for the negative scenario. The word "great" appears in one of the four negative reviews ("Not that great"), and so does the word "stuff" ("Bad stuff"):

P (Negative) * P ("Great stuff" | Negative) = (4/7) * (1/4) * (1/4) = 0.0357

Bingo!

We have a clear winner. The probability of the positive review scenario is higher than that of the negative review scenario (0.0952 vs. 0.0357).

We conclude that this review is positive.
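
If you want to double-check the arithmetic in SAS, a quick DATA _NULL_ step (purely illustrative, not part of the model code) reproduces both numerators:

data _null_;
** Positive scenario: prior x likelihood of great x likelihood of stuff **;
score_pos = (3/7) * (2/3) * (1/3);
** Negative scenario **;
score_neg = (4/7) * (1/4) * (1/4);
put score_pos= 6.4 score_neg= 6.4;
run;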


Classifying Amazon Reviews

 

Now that you have a basic understanding of how the naive Bayes model works, let's begin classifying the Amazon reviews.

Before we start implementing the model, there are a number of text processing tasks that we need to do.

Removing Punctuation Marks

Punctuation marks usually don't have any positive or negative meaning. So, we will remove them from our text.

data amazon_temp1;
set amazon_raw;

** Remove Punctuation marks **;
orig_text = translate(text_temp,' ',',.;:?+=-!@#$%^&*(){}[]\|"/><');
orig_text = '  ' || trim(orig_text) || ' ';


** Replace space by @#@# **;
orig_text2 = upcase(tranwrd(orig_text, ' ', '@#@#'));

drop text_temp;
run;

In this data step, we have done two things:

  1. We have removed all the punctuation marks using the TRANSLATE function
  2. We have replaced each space with the marker string '@#@#'


Using '@#@#' in place of a space allows us to easily split the words.
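
To see what these two steps actually do, you can run them on a single made-up review (the sentence below is just an example; the TRANSLATE and TRANWRD calls are copied from the data step above):

** Apply the same cleaning steps to one sample review **;
data _null_;
length text_temp orig_text orig_text2 $ 100;
text_temp = 'Great product, works well!';
orig_text = translate(text_temp,' ',',.;:?+=-!@#$%^&*(){}[]\|"/><');
orig_text = '  ' || trim(orig_text) || ' ';
orig_text2 = upcase(tranwrd(orig_text, ' ', '@#@#'));
put orig_text= / orig_text2=;
run;

Any trailing blanks in the padded variable may also show up as extra markers at the end of orig_text2; they are turned back into spaces by the later TRANWRD calls, so they do not affect the result.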

Removing Stop Words

Stop words are words that carry little or no meaning on their own, such as "a", "an", "while", and "until".

Removing them from our data will help to reduce the noise and give us a better model.

** Stop words **;
data stopwords;
infile datalines dsd;
input stopwords : $30. @@;
datalines;
I,ME,MY,MYSELF,WE,OUR,OURS,OURSELVES,YOU,YOUR,YOURS,YOURSELF,YOURSELVES,HE,HIM,HIS,HIMSELF,SHE,HER,HERS,HERSELF,IT,ITS,ITSELF
THEY,THEM,THEIR,THEIRS,THEMSELVES,WHAT,WHICH,WHO,WHOM,THIS,THAT,THESE,THOSE,AM,IS,ARE,WAS,WERE,BE,BEEN,BEING,HAVE,HAS,HAD,HAVING
DO,DOES,DID,DOING,A,AN,THE,AND,BUT,IF,OR,BECAUSE,AS,UNTIL,WHILE,OF,AT,BY,FOR,WITH,ABOUT,AGAINST,BETWEEN,INTO,THROUGH,DURING
BEFORE,AFTER,ABOVE,BELOW,TO,FROM,UP,DOWN,IN,OUT,ON,OFF,OVER,UNDER,AGAIN,FURTHER,THEN,ONCE,HERE,THERE,WHEN,WHERE,WHY,HOW
ALL,ANY,BOTH,EACH,FEW,MORE,MOST,OTHER,SOME,SUCH,NO,NOR,ONLY,OWN,SAME,SO,THAN,TOO,VERY,S,T,CAN,WILL,JUST,DON,SHOULD,NOW
 ;
run;

proc sql noprint;
select stopwords into : stopwords separated by ' ' from stopwords;
select count(stopwords) into : num_stopwords from stopwords;
quit;

%put &stopwords &num_stopwords;

data amazon_temp2;
set amazon_temp1;
stopwords = "&stopwords";

text = orig_text2;

do i = 1 to &num_stopwords;
word = '@#' || scan(stopwords, i) || '@#';
text = tranwrd(text, trim(word), '');
end;

text = tranwrd(text, '@#', ' ');
text = tranwrd(text, '@', ' ');

** Random number used later to split the data into training/test/real-life sets (the seed is arbitrary) **;
randno = ranuni(1234);

drop orig_text2 i stopwords word;

run;

The text processing is now complete! The AMAZON_TEMP2 data set has four columns:
  • Label
  • Orig_text (original text)
  • Text (cleaned text)
  • Randno (random number generated to split the data into training/test/real-life sets)
Splitting Data into Training/Test/Real-life Sets
 

Our next step is to split the data into three sets:

  • Training (810 records)
  • Test (90 records)
  • Real-life (100 records)

We will use the training set to build the model and apply the model to the Test set.

Finally, we will check whether the model can be generalized to the "unseen" real-life data that we have set aside.

proc sort data=amazon_temp2; by randno; run;

data amazon ;
set amazon_temp2;

if _n_ <= 810 then group = 'Training';
else if _n_ <= 900 then group = 'Test';
else group = 'Realdata';
run;

data training_amazon (drop=randno) test_amazon (drop=randno)  real_data (drop=randno) ;
set amazon;

if group = 'Training' then output training_amazon;
else if group = 'Test' then output test_amazon;
else output real_data;
run;

The data have now been split into the three data sets:
  • TRAINING_AMAZON
  • TEST_AMAZON
  • REAL_DATA
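
To confirm the split sizes, an optional PROC FREQ on the GROUP variable does the trick:

** Check the split sizes **;
proc freq data=amazon;
table group;
run;

You should see 810 Training, 90 Test, and 100 Realdata records.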


Finding Unique Words
 

Now we are going to put together the list of unique words in the data set.

We will make use of a macro for this task.

proc sql noprint;
select count(*) into: num_rows from amazon;
quit;

%put &num_rows;

** Getting distinct words **;

%macro sentence (num);

data sentence;
length word $ 1000; ** declare a long length up front so that longer words are not truncated as new words are appended **;
word = 'Random';
dummy = 1;
run;

%do num = 1 %to &num_rows;

data sen&num;
set amazon;
if _n_ = &num;
countw = countw(text);
dummy = 0;

do i = 1 to countw;
word = upcase(scan(text, i));
output;
end;

keep word dummy;
run;

proc sql;
create table sentence as
select word, dummy
from sentence union 
select word, dummy
from sen&num
order by word;
quit;

proc delete data=sen&num;
quit;

%end;

%mend;

%sentence (&num_rows);

In total, we have 1,776 unique words.

[Note: using a macro loop is not the most efficient way to get the unique words. However, due to the capacity limit of SAS Studio, this approach is used for this task.]
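
For reference, a more compact way to build the same list (a sketch, assuming your session can hold the intermediate long data set in one pass; the data set names below are arbitrary) is a single data step followed by PROC SORT with the NODUPKEY option:

** Alternative: explode every review into one row per word, then de-duplicate **;
data words_long;
set amazon;
do i = 1 to countw(text);
word = upcase(scan(text, i));
output;
end;
keep word;
run;

proc sort data=words_long out=unique_words nodupkey;
by word;
run;
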
Calculating the Likelihood of Each Unique Word
 

Now that we have the unique words, we need to calculate the likelihood of each unique word, given it is a positive or negative review.

** Likelihood **;
data sentence_all;
set sentence (where=(dummy = 0));
n = _n_;
run;

%macro likeli (in);

proc sql noprint;
select count(*) into: num_words from &in;
quit;

%let num_words = &num_words;

%put &num_words;

data word;
length word $ 1000;
word = ' ';
like_pos = .;
like_neg = .;
dummy = 1;
run;

%do num = 1 %to &num_words;

proc sql;
create table word_temp1 as
select a.word, b.group, b.text, b.label
from &in a, amazon b
where a.n = &num
order by word;
quit;

data word_temp2;
set word_temp1;
by word;
if first.word then do;
pos = 0;
neg = 0;
total_pos = 0;
total_neg = 0;
end;

k = find(text, trim(word) || ' ', 'i');

** Count training group only **;
if group = 'Training' then do;

if label = 0 then do;
total_neg+1;
if k^=0 then neg+1;
end;

else if label = 1 then do;
total_pos+1;
if k^=0 then pos+1;
end;

end;

run;

data word_temp3;
set word_temp2;
by word;
if last.word;

** Add 1 to each count (a simple add-one adjustment) so that a word unseen in one class does not force its likelihood to zero **;
pos = sum(pos, 1);
neg = sum(neg, 1);
total_pos = sum(total_pos, 1);
total_neg = sum(total_neg, 1);

like_pos = pos/total_pos;
like_neg = neg/total_neg;

keep word like_pos like_neg;

run;

data word;
set word word_temp3;
if dummy ^= 1;
run;

proc delete data=word_temp1;
proc delete data=word_temp2;
proc delete data=word_temp3;
quit;


%end;

%mend;

%likeli(sentence_all);

The WORD data set has three columns:

  • WORD (unique words)
  • LIKE_POS (likelihood of the word given a positive review)
  • LIKE_NEG (likelihood of the word given a negative review)
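
If you want a quick sanity check before moving on, print the likelihoods for a few hand-picked words (the words in the WHERE clause are just examples):

** Inspect the estimated likelihoods for a few words **;
proc print data=word;
where word in ('GREAT', 'BAD', 'EXCELLENT');
var word like_pos like_neg;
run;
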
Classification
 

We will now use the likelihood that we have computed, along with the prior probability to classify the reviews in the Test set.

We will make use of the macro below:

%macro predict (base, in, out);

proc freq data=&base noprint;
table label / out=freq_base;
run;

proc transpose data=freq_base out=t_freq prefix=B;
var percent;
id label;
run;

data prior;
set t_freq;
length word $ 1000;
word = 'Prior';
** PROC FREQ reports percents on a 0-100 scale, so divide by 100 to get the prior probabilities **;
like_pos = B1/100;
like_neg = B0/100;
keep word like_pos like_neg;
run;

proc delete data=freq_base; run;
proc delete data=t_freq; run;


proc sql noprint;
select count(*) into: num_rows from &in;
quit;

%put &num_rows;

data pred;
length text orig_text $ 1000;
n = 0;
pred = .;
label = .;
run;


%do num = 1 %to &num_rows;

data sen&num;
set &in;
if _n_ = &num;
n = &num;
countw = countw(text);

do i = 1 to countw;
word = upcase(scan(text, i));
output;
end;

keep text orig_text word n label;
run;

proc sql;
create table pred_temp_&num as 
select a.text, a.orig_text, a.n, a.word, a.label, b.like_pos, b.like_neg
from sen&num a inner join word b
on a.word = b.word
order by a.word;
quit;

data pred&num;
set prior pred_temp_&num end=eof;

** Multiply the prior (the first row, from the PRIOR data set) by the likelihood of each matched word **;
retain poss_pos poss_neg 1;
poss_pos = poss_pos * like_pos;
poss_neg = poss_neg * like_neg;

** Classify as positive when the positive score is higher **;
if poss_pos > poss_neg then pred = 1;
else pred = 0;

if eof = 1;

keep text orig_text n pred label;

run;

proc append base=pred data=pred&num;
run;

proc delete data=sen&num;
proc delete data=pred_temp_&num;
proc delete data=pred&num;

%end;

data &out;
set pred;
if n ^= 0;
run;

proc freq data=&out;
table label * pred / nocol nopercent;
run;

%mend;

%predict(training_amazon, test_amazon, predict_amazon);

The correct classification rate on the Test set is (36+36)/90 = 80%.

This isn't bad!

We classify the reviews correctly 80% of the time.

Now, let's apply the model to the "unseen" real-life data using our previously created macro:

%predict(training_amazon, real_data, predict_real_data);

The correct classification rate is (38+45)/100 = 83%.

Great! We have successfully built a naive Bayes model that classifies reviews correctly at least 80% of the time.
