Using machine learning to predict breast cancer

An introduction to breast cancer screening and Fine Needle Aspiration

R Code

You can get the full R code for this project on my GitHub page, here.

The Wisconsin Diagnostic Breast Cancer dataset

Background

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset is a publicly available dataset of features computed from digitized images of fine needle aspirates (FNA) of breast masses.

The dataset contains two classes: a benign or malignant diagnosis of the mass. There are ten real-valued attributes, all of which are computed for each nucleus and describe the characteristics of the cell nuclei. These are radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. For each attribute, the mean, standard error, and "worst" (or largest) values are computed, resulting in 30 features in total for predicting the diagnosis.
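As a quick check on the feature count, the 30 columns can be enumerated as every combination of the ten attributes and three summary statistics. This is a small Python sketch for illustration only (the project itself is in R); the column names such as radius_mean are hypothetical and may not match the names used in the actual data file.

```python
# The ten base attributes described above, written as plain identifiers.
ATTRIBUTES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# The three summary statistics computed for each attribute.
STATISTICS = ["mean", "se", "worst"]


def feature_names(attributes=ATTRIBUTES, statistics=STATISTICS):
    """Return one (hypothetical) column name per statistic/attribute pair."""
    return [f"{attr}_{stat}" for stat in statistics for attr in attributes]


names = feature_names()
print(len(names))  # 10 attributes x 3 statistics = 30
```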

Classification algorithms

This tutorial will compare two types of classification algorithms, classification and regression trees (CART) and C4.5 decision trees, to evaluate the effectiveness of machine learning when attempting to predict whether a mass is benign or malignant.

Data Preparation

Data preparation for this dataset involves removing any unnecessary attributes (for example, patient_ID, which has no bearing on the class attribute), evaluating outliers, and exploring potential imbalance in the class attribute.

An imbalanced class design can impede an algorithm's ability to accurately predict the minority class, resulting in overfitting or low accuracy rates.

There are 357 benign samples and 212 malignant samples in the dataset, which is a fairly imbalanced design (see right, top).
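To put a number on the imbalance, the class counts quoted above can be turned into proportions. The Python sketch below is illustrative only; the counts are taken from the text and nothing is read from the dataset itself.

```python
# Class counts quoted in the text for the raw dataset.
counts = {"benign": 357, "malignant": 212}


def class_proportions(counts):
    """Return each class's share of the total number of samples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


props = class_proportions(counts)
for label, share in props.items():
    print(f"{label}: {share:.1%}")

# Ratio of the majority class to the minority class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

On these counts the split is roughly 63% benign to 37% malignant, a majority-to-minority ratio of about 1.68.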

To counteract this, a technique known as the Synthetic Minority Oversampling TEchnique (SMOTE) can be applied. SMOTE generates synthetic minority class observations by picking an existing minority sample and interpolating between it and one of its nearest minority class neighbours (in this case, its 5 nearest neighbours). In this way, the bias towards the majority class is lessened, and yet the 'new' samples in the minority class are representative of the pre-existing values.
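The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (plain Euclidean distance, no feature scaling, a hypothetical function name smote), not the R implementation used in this project.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows from a list of minority-class rows.

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: sq_dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two made-up features per row.
toy_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
                [1.2, 2.5], [1.8, 1.9], [2.1, 2.4]]
new_rows = smote(toy_minority, n_new=4)
print(len(new_rows))
```

Because every synthetic row is an interpolation between two real rows, its values always stay within the range already covered by the minority class.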

The SMOTEd dataset contains 357 benign samples and 356 malignant samples (see right).

In the downstream analysis, CART and C4.5 trees will be trained on both the raw and SMOTEd datasets, and the accuracy obtained on each will be compared.

CART

Recursive Partitioning ('rpart'), a CART algorithm, will be used in this tutorial, using 10-fold cross-validation.
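For readers unfamiliar with 10-fold cross-validation, the sketch below shows how row indices can be divided so that every sample is used for testing exactly once. It is written in Python for illustration only; the function name k_fold_indices is hypothetical, and the R tooling used in the project handles this step internally.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx


# 569 rows, the size of the raw dataset (357 benign + 212 malignant).
splits = list(k_fold_indices(569, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

The model is trained ten times, each time on nine folds and evaluated on the remaining one, and the results are pooled across folds.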

rpart trees are built using a multi-step process. First, the single variable is found which best splits the data into two groups. Second, the data are separated, and the process is repeated recursively on each subgroup until the subgroups either reach a minimum size of 5 or no further improvements can be made.
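The first step, finding the single variable and threshold that best split the data into two groups, can be sketched as follows. This Python example assumes Gini impurity as the measure of a good split and uses made-up toy data; it illustrates the general CART idea rather than rpart's actual implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Toy data: feature 0 separates the classes cleanly, feature 1 does not.
toy_rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
toy_labels = ["B", "B", "M", "M"]
print(best_split(toy_rows, toy_labels))  # (0, 2.5, 0.0)
```

A full tree would apply best_split again to each of the two resulting subgroups, stopping when a subgroup becomes too small or no split reduces the impurity.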

Conducting rpart on the raw dataset produces the following confusion matrix:

Prediction    Reference: Benign    Reference: Malignant
Benign        58.2                 6.7
Malignant     4.6                  30.6

The overall accuracy of this model is 88.75% with a false negative rate (FNR) of 6.7%. Can this be improved by using the SMOTEd dataset?
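The headline figures can be recomputed from the four cells of the confusion matrix. The Python sketch below assumes the cells are percentages of all samples (they sum to roughly 100), so small differences from the reported 88.75% come down to rounding. Note that the 6.7% quoted above corresponds to false negatives as a share of all samples; the more conventional definition, false negatives as a share of actual malignant cases, is also computed for comparison.

```python
# Cell values from the rpart confusion matrix on the raw dataset,
# read as percentages of all samples (an assumption: they sum to ~100).
tn, fn = 58.2, 6.7   # predicted benign:    actually benign / actually malignant
fp, tp = 4.6, 30.6   # predicted malignant: actually benign / actually malignant


def summarise(tn, fn, fp, tp):
    """Return accuracy and two views of the false negative rate."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        "fn_share_of_all": fn / total,            # the figure quoted in the text
        "fn_share_of_malignant": fn / (fn + tp),  # conventional FNR
    }


metrics = summarise(tn, fn, fp, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

On these numbers, roughly 18% of the malignant masses are predicted as benign, which is why reducing false negatives matters more than the headline accuracy suggests.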

Running rpart with the SMOTEd dataset results in the following confusion matrix:

The overall accuracy of this model is 89.34% with a False Negative rate of 4.9%. While this is better than the result on the raw dataset, the accuracy may be improved further by using a different classification algorithm.

C4.5

C4.5 decision trees are similar to CART trees, but use a different splitting criterion ('gain ratio') and prune using a bottom-up strategy known as 'error-based' pruning (4).

4. https://www.quora.com/What-are-the-differences-between-ID3-C4-5-and-CART
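To make the gain ratio criterion concrete, the Python sketch below computes it for a candidate split: the information gain of the split divided by the entropy of the split itself (the 'split information'), which penalises splits that carve off tiny branches. It is a textbook-style illustration with made-up labels, not the C4.5 implementation used in this project.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(labels, partitions):
    """Gain ratio of a split; partitions holds one label list per branch."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions if p)
    info_gain = entropy(labels) - remainder
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in partitions if p)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy example: 4 benign (B) and 4 malignant (M) samples.
toy_labels = ["B"] * 4 + ["M"] * 4
perfect = [["B"] * 4, ["M"] * 4]           # each branch is pure
uneven = [["B"], ["B"] * 3 + ["M"] * 4]    # one tiny branch, one mixed branch

print(gain_ratio(toy_labels, perfect))
print(gain_ratio(toy_labels, uneven))
```

The perfect split scores 1.0, while the uneven split scores far lower, so a tree builder using this criterion would prefer the first.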

Executing the C4.5 algorithm on the raw dataset with 10-fold cross-validation gives a confusion matrix of:

Using a C4.5 tree increases the overall accuracy to 93.67% and decreases the FNR to 3%, a marked improvement over CART. Does the SMOTEd dataset further build upon this improvement?

After executing the C4.5 algorithm on the SMOTEd dataset, using 10-fold cross-validation, the resulting matrix is:

Of the four models, this gives us the highest accuracy (94.81%) and the lowest FNR (2.9%). We'll select this model to apply further in-depth analysis.

Exploring the final decision tree

The final step in the analysis is to explore the decision tree composed by the selected model (C4.5 SMOTEd):

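Confusion matrices for the other three models:

rpart on the SMOTEd dataset:

Prediction    Reference: Benign    Reference: Malignant
Benign        44.3                 4.9
Malignant     5.8                  45.0

C4.5 on the raw dataset:

Prediction    Reference: Benign    Reference: Malignant
Benign        59.4                 3.0
Malignant     3.3                  34.3

C4.5 on the SMOTEd dataset:

Prediction    Reference: Benign    Reference: Malignant
Benign        48.4                 2.9
Malignant     2.3                  46.4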

Breast cancer diagnosis and machine learning

Breast cancer is a leading cause of death among women in Canada, with associated healthcare costs of upwards of $450 million (CAD) a year (1). These high mortality rates and rising healthcare costs provide an incentive to develop ways of detecting malignant tumours effectively and accurately.

To detect breast cancer, Fine Needle Aspiration is commonly employed. This technique involves passing a 23-27 gauge needle through the skin into the area of the breast abnormality and extracting cells for analysis. After cells are extracted through the needle, they are preserved, spun, smeared, fixed, stained, and evaluated based on cell characteristics by a lab technician. While quick, there are significant false negative rates (FNR) associated with lab results, particularly when technicians are inexperienced (2).

Machine learning and classification algorithms have the potential to reduce these costly FNRs, resulting in higher accuracy in pathological results, and ultimately may help patients begin any necessary medical procedures as quickly as possible.

1. https://www.ncbi.nlm.nih.gov/pubmed/10762744

2. http://breast-cancer.ca/2d-biopsy/

The most important features are the worst (or largest) concave points, followed equally by the worst (or largest) area and the standard error of the area of the mass.

Summary

Through this tutorial, two decision tree algorithms, CART and C4.5, have been explored and the influence of a balanced class on each algorithm has been investigated. The final model, the C4.5 decision tree using the SMOTEd dataset, produced an accuracy of 94.81% and used the worst (or largest) concave points as the attribute with the highest gain ratio when classifying a breast mass as malignant or benign.

As always, if you have any questions regarding this analysis, please contact me.