AWS Big Data Blog

Building a Multi-Class ML Model with Amazon Machine Learning

Guy Ernest is a Solutions Architect with AWS

This post builds on our earlier post Building a Numeric Regression Model with Amazon Machine Learning.

We often need to assign an object (product, article, or customer) to its class (product category, article topic or type, or customer segment). For example, which category of products is most interesting to this customer? Because of the massive scale of some businesses and the short lifespan of articles or customer visits, it’s essential to be able to assign an object to its class at scale and speed to ensure successful business transactions.

This blog post shows how to build a multiclass classification model that:

  • Helps automate the process of predicting object assignment to one of more than two classes, at scale and speed
  • Can be used in a simple and scalable way to accommodate classes and objects that constantly evolve
  • Requires minimal help from machine learning experts
  • Can be extended to many aspects of your business

In this post, you learn how to address multiclass classification problems by using cartographic information to predict which of seven types of forest cover will occur on a land segment. Similar multiclass classification machine learning (ML) problems could include determining recommendations such as which product in an e-commerce store or on a video streaming service is most relevant for a visiting user.

As in my previous blog post on numerical regression, I show how to build a multiclassification model based on a data set that is publicly available on Kaggle. Kaggle is a community site on which companies and researchers post their data, which data scientists then use to compete to solve data science problems. While building the model, you should think about how to use Amazon Machine Learning (Amazon ML) to solve similar problems in your domain.

Preparing the data to build the ML model

The most important part of building a successful ML model is finding the most relevant data to feed it. The rule of thumb is GIGO, or Garbage In, Garbage Out (or Gold In, Gold Out, based on your perspective). Domain knowledge helps to identify what might be relevant. It’s also important to have access to good data for training the model and the prediction process.

In this example, you might consider variables such as the elevation of the area, slope, and soil type as good predictors for the type of trees you would find in an area. Other parameters could include the distance to a water source or to a road. The organizers of the forest cover type prediction competition on the Kaggle site prepared this data:

Elevation - Elevation in meters
Aspect - Aspect in degrees azimuth
Slope - Slope in degrees
Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway
Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation

You probably wouldn’t modify the data attributes for the first training of the model, but you should shuffle the rows to remove any artificial order that might come from the source of the data, because such order can bias the model training.

For this example, you need to download the data from the Kaggle competition site after you sign in to the site. If you want to follow the commands in this post, you need a terminal window on a Linux or macOS machine to run bash commands. If you are a Windows user, you can launch an EC2 instance and run all of the commands from that instance.

Run the following commands:

# Download the training data from Kaggle: http://www.kaggle.com/c/forest-cover-type-prediction/download/train.csv.zip
unzip train.csv.zip
mv train.csv forest_cover_train.csv
# Shuffle all lines except the first header line
# (gshuf is GNU shuf from coreutils on macOS; on Linux, use shuf)
tail -n+2 forest_cover_train.csv | gshuf -o forest_cover_train_shuffle.csv
# Add the header line from the original file as the first line of the shuffled file
head -1 forest_cover_train.csv | cat - forest_cover_train_shuffle.csv > temp && mv temp forest_cover_train_shuffle.csv

Then upload the shuffled file to Amazon S3, using the AWS CLI, to create the data source:

aws s3 cp forest_cover_train_shuffle.csv s3://<BUCKET_NAME>/ML/input/ForestCover/ --region us-east-1

Create a new data source and ML model using the AWS Management Console.

Creating a new data source and machine learning model

Continue to follow the wizard, making minimal changes to ensure that binary variables are identified as binary and not as numeric (mainly the soil type variables).

Follow the machine learning wizard

You can filter the long list of 56 variables by the Name prefix and see that you have 40 binary variables for soil type. This representation of the soil type variable, with 40 different Boolean variables instead of a single variable with 40 possible values, is common when preparing data for a specific ML algorithm. It might not be the simplest way of representing the data, but continue with the data as it is for now.
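For example, a row whose soil type is type 4 looks like the following in the two representations; both carry exactly the same information, one as a single categorical value and one as 40 one-hot binary flags:

# single categorical attribute
Soil_Type = 4
# equivalent one-hot encoding: 40 binary attributes, with only Soil_Type4 set
Soil_Type1,Soil_Type2,Soil_Type3,Soil_Type4,Soil_Type5,...,Soil_Type40 = 0,0,0,1,0,...,0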

The next step is to choose the prediction target, which is a special attribute in the training data that contains the information that Amazon ML attempts to predict. Choose Cover_Type as the target.

Choosing the prediction target
Because Cover_Type is a categorical variable, Amazon ML automatically identifies the model as a multiclass classification type.

Filter to find the Id variable in the data source and choose it as the row identifier.

Choosing the row identifier

Now you can launch the training process, which can take a few minutes to finish. The time it takes depends on the size of the training data: the number of attributes and rows.
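If you prefer scripting to clicking through the console, the same steps are available through the AWS CLI. The following is a minimal sketch, not a definitive walkthrough: the data source ID, model ID, and schema file location are placeholder names, and the schema JSON (which declares each attribute's type and marks Cover_Type as the target) must be written to match the CSV.

# create an Amazon ML data source from the shuffled file on S3
aws machinelearning create-data-source-from-s3 \
    --data-source-id forest-cover-train-ds \
    --data-spec '{"DataLocationS3":"s3://<BUCKET_NAME>/ML/input/ForestCover/forest_cover_train_shuffle.csv","DataSchemaLocationS3":"s3://<BUCKET_NAME>/ML/input/ForestCover/forest_cover.csv.schema"}' \
    --compute-statistics
# train a multiclass model on that data source
aws machinelearning create-ml-model \
    --ml-model-id forest-cover-model \
    --ml-model-type MULTICLASS \
    --training-data-source-id forest-cover-train-ds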

Evaluating model accuracy

By default, Amazon ML splits the data so that 70% is used for training the model and 30% is used to evaluate the accuracy of the model. You can use your own split of training vs. evaluation data, but it is important that you never use training data to evaluate accuracy because the evaluation should estimate model accuracy from new data only.
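If you do want a custom split, one simple approach is to cut the shuffled file locally and create a separate data source from each part. Here is a minimal sketch with standard shell tools (the output file names are arbitrary):

# split the shuffled file into 70% training / 30% evaluation, keeping the header in both parts
total=$(($(wc -l < forest_cover_train_shuffle.csv) - 1))
train=$((total * 70 / 100))
head -1 forest_cover_train_shuffle.csv > header.csv
tail -n+2 forest_cover_train_shuffle.csv | head -n "$train" | cat header.csv - > forest_cover_train70.csv
tail -n+2 forest_cover_train_shuffle.csv | tail -n+$((train + 1)) | cat header.csv - > forest_cover_eval30.csv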

Let’s see how Amazon ML makes it easy to understand the model that it created.

Seeing how Amazon Machine Learning makes it easy to understand the model

For multiclass models, the service uses the F1-measure, a statistical measure that combines the precision and recall of each class, averaged across all of the classes in the model. The F1 score ranges from 0 to 1; the higher the score, the better the overall accuracy of the model. This ML model received a score of 0.69, which is much better than the random baseline score of 0.03.
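As a refresher, the per-class F1 score is the harmonic mean of that class's precision and recall. The numbers in the example line below are illustrative only:

precision = true positives / (true positives + false positives)
recall    = true positives / (true positives + false negatives)
F1        = 2 * precision * recall / (precision + recall)
# e.g., precision 0.87 and recall 0.92 give F1 = 2*0.87*0.92 / (0.87+0.92) ≈ 0.89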

But it is relatively hard to understand how the model is behaving for each of the classes. To see this, look at the confusion matrix by choosing Explore model performance.

Confusion matrix

The confusion matrix gives some insights about the performance of the model as a way to visualize the accuracy of multiclass classification predictive models. The confusion matrix illustrates in a table the number or percentage of correct and incorrect predictions for each class by comparing an observation’s predicted class and its true class. For more information about evaluating multiclass models, see Multiclass Model Insights.
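To make the layout concrete, here is a miniature two-class excerpt with made-up counts. Rows are true classes and columns are predicted classes, so the off-diagonal cells count the model's confusions:

                  predicted 3    predicted 6
true class 3          90             10
true class 6          35             65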

The best possible confusion matrix shows all blue diagonally across the matrix, which means that all of the predictions were correct. But such an accuracy rate is very rare in machine learning. In this model, you can see that the model handles class 7 very well, correctly classifying 576 out of 629 cases of this class in the evaluation data set, an accuracy of more than 91%. You can also see on the right side of the matrix that the F1 score for this class is very high, at 0.89. Class 4 also has a high F1 score.

However, some cells have shades of red, which means that the model confused them with another class. For example, class 6 received a relatively low F1 score of 0.55, with an orange cell in the confusion matrix under class 3. This means that the model confuses these two classes and tends to predict class 3 when the correct answer is class 6.

To understand why the ML model confuses the two, go back to the problem domain, which is classes of trees. Here are the types of trees that you are trying to predict:

1 - Spruce/Fir
2 - Lodgepole Pine
3 - Ponderosa Pine
4 - Cottonwood/Willow
5 - Aspen
6 - Douglas-fir
7 - Krummholz

Class 6 is “Douglas-fir” and class 3 is “Ponderosa Pine.” These trees are very close botanically. You can see the similarity between the two species by comparing the images Google presents for each.

Comparing Google images of the species

When you have classes that are similar, you can expect to find human errors in the classifications used for training. Even small errors in classification can reduce the model’s ability to distinguish accurately between classes. To mitigate this type of problem, you can ask the domain expert to evaluate how important it is to distinguish between very close classes. Maybe you can combine them into a single class (for example, Northwest Coast Pine), recheck the manual classifications you used for the problematic classes, or think about collecting data that distinguishes between the two classes. For example, needle length could help differentiate tree classes.

How can you improve ML model accuracy?

At this stage, you need to decide if you want to improve the ML model or use it as it is. As you may remember, the data set included 40 Boolean variables for each of the soil types, which is a way to improve the performance of some machine learning algorithms. With Amazon ML, you don’t really need to use this kind of trick. Instead, it’s better to focus on adding knowledge to the model.

For example, you can replace the long list of binary numbers with the textual description of soil type and allow Amazon ML to run textual analysis on these descriptions. Review the descriptions of the soil types as they appear on the competition site.

1 Cathedral family - Rock outcrop complex, extremely stony.
2 Vanet - Ratake families complex, very stony.
3 Haploborolis - Rock outcrop complex, rubbly.
4 Ratake family - Rock outcrop complex, rubbly.
5 Vanet family - Rock outcrop complex complex, rubbly.
6 Vanet - Wetmore families - Rock outcrop complex, stony.
7 Gothic family.
8 Supervisor - Limber families complex.
9 Troutville family, very stony.
10 Bullwark - Catamount families - Rock outcrop complex, rubbly.
11 Bullwark - Catamount families - Rock land complex, rubbly.
12 Legault family - Rock land complex, stony.
13 Catamount family - Rock land - Bullwark family complex, rubbly.

Notice that terms like rubbly and stony are repeated across the descriptions. Also notice that there are different levels for the term stony, such as extremely stony, very stony, or simply stony. Because these terms add knowledge to the system, you can assume that they will improve the ML model. Amazon ML can pick up patterns in textual descriptions; for example, soil types that include the word rock may be similar.

Go back to the data files you downloaded from the competition site and modify them to include the textual description for each row. You can use a long but simple AWK script to add the description to each line of the training data (and later, to the evaluation data).

BEGIN { FS = "," } ;
NR == 1 {
    for (i = 1; i <= NF; i++) headers[i] = $i;
    next
}
{
    # Choose the last bit flag
    description = ""
    if ($1 ==1) description = ""1 Cathedral family - Rock outcrop complex, extremely stony.""
    else if ($2 ==1) description = ""2 Vanet - Ratake families complex, very stony.""
    else if ($3 ==1) description = ""3 Haploborolis - Rock outcrop complex, rubbly.""
    else if ($4 ==1) description = ""4 Ratake family - Rock outcrop complex, rubbly.""
    else if ($5 ==1) description = ""5 Vanet family - Rock outcrop complex complex, rubbly.""
    else if ($6 ==1) description = ""6 Vanet - Wetmore families - Rock outcrop complex, stony.""
    else if ($7 ==1) description = ""7 Gothic family.""
    else if ($8 ==1) description = ""8 Supervisor - Limber families complex.""
    else if ($9 ==1) description = ""9 Troutville family, very stony.""
    ...
    else if ($38 ==1) description = ""38 Leighcan - Moran families - Cryaquolls complex, extremely stony.""
    else if ($39 ==1) description = ""39 Moran family - Cryorthents - Leighcan family complex, extremely stony.""
    else if ($40 ==1) description = ""40 Moran family - Cryorthents - Rock land complex, extremely stony.""
    else description = ""
    printf "%s", description
    printf "n"
}

Apply it to the training data file and paste the description file into the original file, using the following commands in your terminal:

# extract the soil type fields and map each row to its textual description
cut -d"," -f16- forest_cover_train_shuffle.csv | awk -f soil_type_translation.awk > train_soil_description.csv
# add a header line to the newly created csv file
echo 'Description' | cat - train_soil_description.csv > temp && mv temp train_soil_description.csv
# append the textual description as the last field of the training data file
paste -d, forest_cover_train_shuffle.csv train_soil_description.csv > forest_cover_train_with_description.csv

Then remove the 40 Boolean fields from the data file before uploading it to Amazon S3.

# remove fields 16 to 55 (the soil type bits), keeping the new Description field
cut -d, -f1-15,56- forest_cover_train_with_description.csv > forest_cover_train_with_description_no_soil.csv
# copy the new training file to S3
aws s3 cp forest_cover_train_with_description_no_soil.csv s3://<BUCKET_NAME>/ML/input/ForestCover/ --region us-east-1

Defining a custom recipe for the ML model

So far, you’ve used the default values in most of the wizards, but for this model you can use a simple custom recipe. Recipes are preformatted instructions for common transformations. Here you can apply the insight gained from reviewing the textual descriptions of the soil types: a single term like stony is not distinctive enough, so you instruct the service to also look at pairs of adjacent words, such as very stony or extremely stony. This is called an n-gram transformation with a window size of 2.

After you upload the new training file to Amazon S3, create the data source as you did before, with the new textual description marked as a textual attribute, not as a category. Now you can use the option to provide a custom recipe.

Providing a custom recipe

Add the following line to the default recipe:

"ngram(lowercase(Description),2)"

This line tells the service to lowercase the Description text field and then create n-grams of length 2 from the result.
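To get a feel for what this feeds the model, consider one of the descriptions. With a window size of 2, the transformation produces word pairs in addition to single words (the token list below is abbreviated and illustrative; the exact tokenization is up to Amazon ML):

lowercase(Description) : "2 vanet - ratake families complex, very stony."
tokens include         : "vanet", "families", "complex", "very", "stony",
                         "families complex", "very stony", ...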

Providing a custom recipe

The rest of the process is similar to the evaluation you performed earlier. When you review the results of the evaluation for the newly created model, you can see that the overall F1 score, 0.71, is slightly better.

Reviewing results for the new model

The confusion matrix and the per-class F1 scores also changed, which suggests further experiments with adding meaningful attributes to the dataset.

Also note the artificial distribution of the Cover_Type attribute in the training data.

Artificial distribution of Cover_Type in the training data

The administrators of the Kaggle competition ensured that an equal number of samples was provided for each forest cover type, which is not realistic. In most real-life cases, some classes are more common than others, and it is best to keep this proportion in the training data. Having an equal number of samples for each cover type also complicates the comparison of different ML models, because each default 70/30 split creates a random skew in the data. In real data, the inherent skew would most likely be preserved.
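You can check the class distribution of a training file yourself with a quick one-liner (Cover_Type is the last field of the training data):

# count training rows per Cover_Type value
tail -n+2 forest_cover_train_shuffle.csv | awk -F, '{count[$NF]++} END {for (c in count) print "class", c, ":", count[c]}'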

What did you learn?

In this exercise, you learned the importance of getting the right data into the simplest form. Keeping things simple can mean downloading data from a competition site and using it as it is, using the textual description of the soil type instead of a long array of binary soil type flags, or not forcing the data to have an equal number of observations for each class (if that is not how the classes are distributed in the real world).

For fast and efficient ML model creation with Amazon ML, being a domain expert is more important than being a machine-learning expert. The ability to build a data source and ML model quickly, and to evaluate performance both overall and for each class, allows you to streamline ML model creation.

You can consult machine-learning experts to discuss the meaning of F1 scores and the confusion matrix and to get advice on shuffling the training data and using the n-gram function on the textual description. Aside from that, you can do everything yourself.

You also learned to use the evaluation summary to compare a number of different models based on overall F1 scores. You dove deeper into the confusion matrix to see how well each model is able to identify a specific class and to identify common mistakes. You used this insight to tune the model to fit the business requirements better. You might not win the Kaggle machine learning competition, but you can improve business performance with streamlined Amazon ML model creation and usage.

If you have questions or suggestions, please leave a comment below.

———————————————————

Related:

Building a Numeric Regression Model with Amazon Machine Learning