Can you tell that the two fish below are different?

Bream

Perch

For a human, it is very easy to spot the difference between the two fish.

Bream have a rounded body and are greyish in color. Perch, on the other hand, have a long body and are dark gold in color.

To a machine, distinguishing the two fish species could be a challenging job.

In this article, we will explain how to train a machine to identify the fish species.

We will build a K-nearest neighbor (KNN) model that enables the machine to correctly identify the fish species.

Let’s get started!

**Data Sets**

In this article, we will use the Fish data set from the SASHELP library to illustrate the machine learning model.

The Fish data set contains many species. In this exercise, we will look at only bream and perch.

In terms of the variables (or features), only the Weight and Height will be used.

You can copy and run the code below to create the Fish data set in SAS Studio.

data fish;

set sashelp.fish;

where species in ('Bream', 'Perch');

keep species weight height;

run;

**Understand the Data**

Before we start building the model, we must first understand the data on hand.

Let’s run a quick Proc Means to look at the distribution of the data.

proc means data=fish nmiss n mean std min max;

var weight height;

class species;

run;

The Proc Means step above computes the summary statistics of the two features that we have:

- Weight
- Height

In the Fish data set, we have 35 bream and 56 perch.

Bream, on average, are bigger with an average weight of 626 grams, and an average height of 15.1 cm.

Perch are smaller. The average weight and height are 382 grams and 7.8 cm, respectively.

We also notice that one bream has a missing weight. This will need to be handled before we build the model.

**Visualizing the Data**

Visualizing the data usually helps us to better understand the data.

Let’s plot a scatterplot for the fish.

ods graphics on / attrpriority=none;

proc sgplot data=fish;

scatter y=height x=weight / group=species;

styleattrs datasymbols=(circlefilled Triangle);

run;

The Proc SGPLOT step above plots the fish on a scatterplot:

The blue dots represent bream, and the red triangles represent perch.

You can see quite a distinct distribution between the two types of fish.

**Cleaning up the data**

As mentioned earlier, there is a missing weight for a bream. This could happen in practice because of measurement errors.

The K-nearest neighbor model cannot handle observations with missing values. We will replace the missing value with the average weight of bream.

data fish2;

set fish;

if species = 'Bream' and weight = . then weight = 626;

keep species weight height;

run;

As discussed earlier, the average weight of a bream is 626 grams.

We have replaced the missing weight with 626:
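
The hard-coded 626 works here, but it has to be looked up and typed in by hand. As a minimal sketch of an alternative, assuming Proc Stdize is available in your SAS installation, the step below fills each missing weight with the mean weight of its own species. The data set names FISH_SORTED and FISH2_ALT are hypothetical and are not used elsewhere in this article.

```sas
/* Sketch only: FISH_SORTED and FISH2_ALT are illustrative names. */
/* BY-group processing requires the data to be sorted by species. */
proc sort data=fish out=fish_sorted;
  by species;
run;

/* REPONLY replaces missing values only and leaves other values unchanged. */
/* METHOD=MEAN uses the mean, computed separately within each BY group.    */
proc stdize data=fish_sorted out=fish2_alt reponly method=mean;
  by species;
  var weight;
run;
```

Note that this fills in the unrounded species mean, so the imputed value may differ slightly from the rounded 626 used above.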

We are now ready to build the model.

**What is the K-Nearest Neighbor (KNN) Model?**

The K-nearest neighbor (KNN) model is a machine learning model that is commonly used to solve classification problems.

It classifies a data point based on the classes of its k nearest points.

Let’s run the entire code below on SAS Studio:

** Illustration **;

data illu1 illu2;

set fish2;

where 500 < weight < 800;

if _n_ = 9 then do;

Species = 'Unknown';

output illu1;

end;

else if _n_ = 19 then do;

Species = 'Unknown';

output illu2;

end;

else do;

output illu1;

output illu2;

end;

run;

proc sort data=illu1; by species; run;

proc sort data=illu2; by species; run;

ods graphics on / attrpriority=none;

proc sgplot data=illu1;

scatter y=height x=weight / group=species;

styleattrs datasymbols=(circlefilled Triangle square );

xaxis min=400 max=850;

yaxis min=8 max=18;

run;

ods graphics on / attrpriority=none;

proc sgplot data=illu2;

scatter y=height x=weight / group=species;

styleattrs datasymbols=(circlefilled Triangle square );

xaxis min=400 max=850;

yaxis min=8 max=18;

run;

The code above creates two graphs.

**Graph 1**

In this graph, the blue dots and the red triangles represent the bream and perch, respectively.

The green square represents an unknown fish.

The unknown fish is either a bream or a perch.

If you were to guess, what would it be?

It would be a bream, right?

Most of its closest points are bream. We would classify this unknown fish as a bream simply because of its proximity to other bream.

**Graph 2**

In Graph 2, the unknown fish is surrounded by perch instead:

If we were to guess, the fish would be a perch.

But here we have a problem.

How do you determine how many neighbors to look at?

For our example in Graph 2, if we look at only the three closest neighbors, we would conclude that the unknown fish is a perch.

However, when looking at the 20 closest fish, there are more bream (14) than perch (6).

We would conclude that the unknown fish is a bream, instead!

In practice, selecting the optimal “k” (i.e. number of nearest neighbors) is not an easy task.

In this example, we will simplify the process, and build the KNN model with k = 3.

This means we will use the three closest neighbors to determine which fish it is.
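
We fix k = 3 in this article, but one common way to compare candidate values of k is to look at a cross-validated error rate for each. The sketch below is only an illustration of that idea: the macro name TRY_K and the candidate values 1, 3, 5, 7 and 9 are assumptions, and it relies on the built-in Proc Discrim (introduced at the end of this article), whose distance calculation may differ from a plain Euclidean distance.

```sas
/* Sketch only: TRY_K and the list of k values are illustrative choices. */
%macro try_k;
  %do k = 1 %to 9 %by 2;
    title "Cross-validated error rates for k = &k";
    /* METHOD=NPAR with K= requests a nearest-neighbor classifier.        */
    /* CROSSVALIDATE classifies each fish with that fish left out.        */
    /* SHORT trims part of the default printed output.                    */
    proc discrim data=fish2 method=npar k=&k crossvalidate short;
      class species;
      var weight height;
    run;
  %end;
  title;  /* clear the title once the loop is finished */
%mend try_k;

%try_k
```

Comparing the cross-validation error rates across the runs gives a rough sense of which k behaves best on this data.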

**Splitting the data into Training and Test Set**

The first step in building a machine learning model is to split the data into training and test sets.

The training set is used to build the machine learning model, and the test set is used to check how the model performs.

We will do an 80-20 split for the training set and test set:

data fish_train fish_test_temp1;

set fish2;

rand = ranuni(100);

if rand <= 0.8 then output fish_train;

else output fish_test_temp1;

run;

data fish_test;

set fish_test_temp1;

num = _n_;

run;

We have created the FISH_TRAIN data set, which has 66 observations.

The FISH_TEST data set has 25 observations.
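
Because the RANUNI approach assigns each fish independently, the split is only approximately 80-20 (66 of 91 fish is roughly 73% here). If an exact sampling rate matters, Proc Surveyselect is one option. The sketch below is an illustration under that assumption; the data set names FISH_SPLIT, FISH_TRAIN_ALT and FISH_TEST_ALT are hypothetical, and the rest of this article keeps using the RANUNI split.

```sas
/* Sketch only: FISH_SPLIT, FISH_TRAIN_ALT and FISH_TEST_ALT are illustrative names. */
/* METHOD=SRS draws a simple random sample; SAMPRATE=0.8 samples 80% of the rows.    */
/* OUTALL keeps every row and adds the flag SELECTED (1 = sampled, 0 = not sampled). */
proc surveyselect data=fish2 out=fish_split
                  method=srs samprate=0.8 seed=100 outall;
run;

data fish_train_alt fish_test_alt;
  set fish_split;
  if selected = 1 then output fish_train_alt;  /* sampled rows form the training set */
  else output fish_test_alt;                   /* remaining rows form the test set   */
  drop selected;
run;
```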

**Manual Method: Predicting the Fish in the Test Set**

To illustrate how the KNN model works, we will first build the model manually.

As discussed earlier, we will test the model using k = 3.

For each of the 25 fish in the test set, we will first find the Euclidean distance to every fish in the training set.

The three closest fish will be used to predict the species.

Let’s run the code below on SAS Studio:

proc sql;

create table combine_temp1 as

select a.num, a.species as species_true,

b.species as species_neighbor,

sqrt((a.weight - b.weight)**2 +

(a.height - b.height)**2

) as distance

from fish_test a, fish_train b

order by a.num, distance;

quit;

data combine;

set combine_temp1;

by num distance;

if first.num then i = 0;

i + 1;

if i <= 3;

run;

The code above finds the three closest training fish for each fish in the test set:

If the three closest fish consist of two or more perch, then we predict the fish to be a perch. Otherwise, it would be a bream.
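
The voting rule above can be written compactly. The notation is introduced here for illustration only: $y_1, y_2, y_3$ are the species of the three nearest training fish, and $\hat{y}$ is the predicted species.

$$
\hat{y} = \arg\max_{c \in \{\text{Bream},\, \text{Perch}\}} \sum_{i=1}^{3} \mathbf{1}[\, y_i = c \,]
$$

With two species and an odd k such as 3, one species always receives at least two of the three votes, so a tie cannot occur.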

Let’s run the code below:

proc freq data=combine noprint;

table species_neighbor / out = fish_freq;

by num species_true;

run;

proc sort data=fish_freq; by num count; run;

data fish_freq2;

set fish_freq;

by num count;

if last.num;

if species_true = species_neighbor then match = "Y";

else match = "N";

run;

The SPECIES_NEIGHBOR column represents the prediction that we make.

The MATCH column represents whether our prediction matches the correct species of the fish:

Let’s look at how our model performs:

proc freq data=fish_freq2;

table species_true*match / norow nocol nopercent;

run;

Out of the 25 fish in the test set, we predicted 22 of them correctly.

We predicted 3 fish incorrectly.
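
The same result can be summarized as a single accuracy figure. As a minimal sketch, assuming the FISH_FREQ2 data set created above, the query below counts the matches; the column names N_TEST, N_CORRECT and ACCURACY are illustrative.

```sas
/* Sketch only: the output column names are illustrative. */
proc sql;
  select count(*)         as n_test,     /* number of test fish                  */
         sum(match = 'Y') as n_correct,  /* a true comparison evaluates to 1     */
         mean(match = 'Y') as accuracy format=percent8.1  /* share of matches    */
  from fish_freq2;
quit;
```

With the counts reported above, 22 correct predictions out of 25 works out to an accuracy of 88%.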

Is this a good model? Not really. We can refine it further.

**Feature Scaling**

When building a KNN model, it is common to apply what is called feature scaling.

Feature scaling is used to standardize the features so that the Euclidean distance calculation will not be dominated by features with a larger scale.

Let’s look at the formula for the Euclidean distance:
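
For a test fish $a$ and a training fish $b$, the distance computed in the Proc SQL step earlier can be written as follows. The symbols $w$ (weight) and $h$ (height) are notation introduced here for illustration.

$$
d(a, b) = \sqrt{(w_a - w_b)^2 + (h_a - h_b)^2}
$$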

The Euclidean distance consists of two components:

- The weight difference and
- The height difference.

The weight difference looks at the difference in weights between the fish.

In our example, the fish weight ranges from 5.9 to 1100 (grams). The difference could be more than **1000** grams.

The height, on the other hand, ranges from 2.112 to 19 (cm) across both species of fish. The maximum height difference is no more than **17** cm.

Because of the wider scale, the weight will dominate the distance calculation.
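
To see how strongly the weight can dominate, consider a hypothetical pair of fish (illustrative numbers, not taken from the data set) that differ by 100 grams in weight and 5 cm in height:

$$
d = \sqrt{100^2 + 5^2} = \sqrt{10025} \approx 100.12
$$

A 5 cm gap is close to 30% of the full height range, yet it adds only about 0.12 to a distance that would be 100 from the weight difference alone.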

We will need to rescale the features to ensure each feature will have equal or similar weight when calculating the Euclidean distance.

proc standard data=fish2 out=fish3 mean=0 std=1;

var weight height;

run;

** Split into train and test **;

data fish_train fish_test_temp1;

set fish3;

rand = ranuni(100);

if rand <= 0.8 then output fish_train;

else output fish_test_temp1;

run;

data fish_test;

set fish_test_temp1;

num = _n_;

run;

The Proc Standard step above rescales the Weight and Height columns so that each feature has a mean of zero and a standard deviation of one:
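
One quick way to confirm the rescaling, sketched here as an optional check on the FISH3 data set created above, is to rerun Proc Means on the standardized columns:

```sas
/* Optional check: after standardization, MEAN should be (near) 0 and STD should be 1. */
proc means data=fish3 n mean std min max maxdec=3;
  var weight height;
run;
```

Tiny nonzero means can appear because of floating-point rounding.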

We will then split the data into training and test sets, and repeat what we have done above:

proc sql;

create table combine_temp1 as

select a.num, a.species as species_true,

b.species as species_neighbor,

sqrt((a.weight - b.weight)**2 +

(a.height - b.height)**2

) as distance

from fish_test a, fish_train b

order by a.num, distance;

quit;

data combine;

set combine_temp1;

by num distance;

if first.num then i = 0;

i + 1;

if i <= 3;

run;

proc freq data=combine noprint;

table species_neighbor / out = fish_freq;

by num species_true;

run;

proc sort data=fish_freq; by num count; run;

data fish_freq2;

set fish_freq;

by num count;

if last.num;

if species_true = species_neighbor then match = "Y";

else match = "N";

run;

proc freq data=fish_freq2;

table species_true*match;

run;

We now have 100% accuracy!

Our prediction matches the actual fish species for all 25 fish in the test set.

Of course, when solving real-life machine learning problems, it is very rare, if ever, that you can get 100% accuracy.

Our fish species have very distinct features that allow the machine to predict accurately.

Real-life data is usually messier to deal with.

**Proc Discrim Method: Predicting the Fish in the Test Set**

In practice, you don’t have to write the code manually just to build the model.

The built-in Proc Discrim can be used to run a KNN model with just a few lines of code.

Let’s look at the example below:

proc discrim data = fish_train test = fish_test

testout = _score1 method = npar k = 3 testlist;

class species;

var weight height;

run;

The Proc Discrim step does the prediction automatically:

It also computes the error rate to give you a sense of how the model performs:
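
The predictions are also saved in the TESTOUT= data set, here _SCORE1. As a minimal sketch, assuming the predicted class is stored in the variable _INTO_ that Proc Discrim adds to its output data sets, the actual and predicted species can be cross-tabulated like this:

```sas
/* Sketch only: cross-tabulate the actual species against the predicted class. */
/* _INTO_ is assumed to hold the class assigned by Proc Discrim.               */
proc freq data=_score1;
  tables species * _into_ / norow nocol nopercent;
run;
```

By default, Proc Discrim bases its nearest-neighbor distances on a covariance-adjusted (Mahalanobis) metric rather than the plain Euclidean distance, so its predictions need not match the manual method in every case.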

The KNN model is great at solving simple classification problems.

In the next machine learning article, we will look at how to use the KNN model to predict the Titanic survivors.
