Understanding Metrics
Learn what each metric measures, the math behind it, and when to use it in real-world problems.
Regression Metrics
Metrics for evaluating models that predict continuous numeric values.
MSE — Mean Squared Error
Formula
MSE = (1/m) · Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)²
m = number of samples, yᵢ = actual value, ŷᵢ = predicted value
Measures the average of the squared differences between predicted and actual values. Larger errors are penalized more heavily due to squaring.
How to interpret
MSE is always ≥ 0. Lower is better. MSE = 0 means perfect predictions. The units are squared (e.g., dollars²), making it harder to interpret directly.
When to use
Use MSE when large errors are particularly undesirable. For example, in energy consumption forecasting for power grids, a large prediction error could lead to blackouts or wasted energy — MSE ensures the model is heavily penalized for those dangerous outliers.
When NOT to use
Avoid MSE when you need an interpretable error in the same units as your target. For instance, if you're telling a client their house price estimate is off by '250,000 dollars²', that's meaningless. Use RMSE or MAE instead for human-readable reports.
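The formula above can be sketched in a few lines of plain Python. The house-price numbers below are hypothetical, purely for illustration:

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)

# Hypothetical house prices in dollars (illustrative only).
y_true = [200_000, 150_000, 320_000]
y_pred = [210_000, 140_000, 300_000]

print(mse(y_true, y_pred))  # → 200000000.0 — note the units are dollars², not dollars
```

Notice how the single $20,000 miss contributes four times as much as each $10,000 miss: that is the squaring at work.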
RMSE — Root Mean Squared Error
Formula
RMSE = √MSE = √( (1/m) · Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)² )
Same as MSE but with the square root, restoring original units
The square root of MSE, bringing the error back to the same units as the target variable. Still penalizes large errors more than small ones.
How to interpret
RMSE is always ≥ 0, in the same units as the prediction. Lower is better. An RMSE of $5,000 on house prices means a typical prediction error is on the order of $5,000, with large misses weighted more heavily than small ones (so RMSE is always ≥ MAE on the same data).
When to use
Use RMSE when you want an interpretable error metric that still penalizes outliers. For salary prediction in HR analytics, RMSE tells you the typical dollar amount your model is off by, while still flagging if it wildly misses some predictions.
When NOT to use
Avoid RMSE when your data has many outliers that aren't your fault (e.g., noisy sensor data). A few extreme readings will inflate RMSE disproportionately, making a good model look bad. Use MAE for a more robust alternative.
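A minimal sketch of RMSE, contrasted against MAE on the same toy data to show how a single outlier inflates it:

```python
import math

def rmse(y_true, y_pred):
    """Square root of MSE: the error expressed back in the target's units."""
    m = len(y_true)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / m)

# Three perfect predictions and one miss of 10 units (toy data).
print(rmse([0, 0, 0, 0], [0, 0, 0, 10]))  # → 5.0
# The plain average error (MAE) on the same data would be only 2.5:
# the one large miss dominates RMSE.
```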
MAE — Mean Absolute Error
Formula
MAE = (1/m) · Σᵢ₌₁ᵐ |yᵢ − ŷᵢ|
Absolute value treats positive and negative errors equally
Measures the average of the absolute differences between predicted and actual values. Treats all errors equally regardless of size.
How to interpret
MAE is always ≥ 0, in the same units as the prediction. Lower is better. MAE = 0 means perfect predictions. MAE is more robust to outliers than MSE/RMSE.
When to use
Use MAE when all errors matter equally and outliers shouldn't dominate the evaluation. For delivery time estimation, whether you're off by 5 or 50 minutes matters proportionally — MAE gives you a fair average error without letting rare extreme delays skew the picture.
When NOT to use
Avoid MAE when large errors have disproportionate consequences. In predicting structural load capacity for bridges, a large error could be catastrophic — MAE would treat a 1-ton error and a 50-ton error too similarly. Use MSE/RMSE to heavily penalize those dangerous large errors.
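MAE translates directly into code. The delivery-time minutes below are made up for illustration:

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)

# Hypothetical delivery times in minutes; one delivery is badly late.
print(mae([30, 45, 25, 60], [35, 40, 25, 110]))  # → 15.0
```

The 50-minute miss contributes exactly ten times as much as each 5-minute miss, no more: that is the "all errors matter proportionally" behavior described above.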
R² — Coefficient of Determination
Formula
R² = 1 − SS_res / SS_tot, where SS_res = Σᵢ (yᵢ − ŷᵢ)² and SS_tot = Σᵢ (yᵢ − ȳ)²
SS_res = residual sum of squares, SS_tot = total sum of squares, ȳ = mean of y
Measures the proportion of variance in the target variable that the model explains. Compares your model against a baseline that always predicts the mean.
How to interpret
R² ranges from -∞ to 1. R² = 1 means the model explains all variance (perfect). R² = 0 means the model is no better than predicting the mean. R² < 0 means the model is worse than the mean.
When to use
Use R² to compare models on the same dataset. In real estate pricing, R² = 0.85 tells you the model explains 85% of price variation — an intuitive way to communicate model quality to non-technical stakeholders.
When NOT to use
Avoid relying on R² alone when your dataset is small or when comparing models across different datasets. On the training data, R² never decreases when you add more features (even irrelevant ones), which can be misleading. In financial forecasting with few data points, a high R² might indicate overfitting rather than genuine predictive power.
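The formula can be checked with a small sketch (toy numbers, chosen only to illustrate the two baselines):

```python
def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot: variance explained relative to a predict-the-mean baseline."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

# A model that tracks the target closely:
print(r2_score([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # ≈ 0.98

# A "model" that always predicts the mean scores exactly 0:
print(r2_score([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # → 0.0
```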
Classification Metrics
Metrics for evaluating models that predict discrete categories or classes.
Accuracy
Formula
Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives
The proportion of all predictions that were correct. Simply: how often was the model right?
How to interpret
Accuracy ranges from 0 to 1 (or 0% to 100%). Higher is better. Accuracy = 1 means every prediction was correct.
When to use
Use accuracy when classes are balanced (roughly equal number of examples per class). For sentiment analysis of product reviews where positive and negative reviews are equally common, accuracy gives a straightforward picture of model performance.
When NOT to use
Never use accuracy as the primary metric on imbalanced datasets. Imagine a rare disease screening where only 1% of patients are sick: a model that always predicts 'healthy' achieves 99% accuracy while missing every single sick patient. That 99% accuracy is useless — and dangerous. Use Precision, Recall, or F1 instead.
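The rare-disease trap described above is easy to reproduce. The 1%-sick dataset below is synthetic, built just to demonstrate the failure mode:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Synthetic screening data: 1 sick patient (1) among 99 healthy (0).
y_true = [1] + [0] * 99
always_healthy = [0] * 100  # a "model" that never predicts sick

print(accuracy(y_true, always_healthy))  # → 0.99, yet it misses every sick patient
```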
Precision
Formula
Precision = TP / (TP + FP)
Of all predicted positives, how many were actually positive?
Of all the cases the model predicted as positive, how many were actually positive? Precision measures the quality of positive predictions.
How to interpret
Precision ranges from 0 to 1. Higher means fewer false positives. Precision = 1 means every positive prediction was correct (no false alarms).
When to use
Use precision when false positives are costly. In email spam filtering, a false positive means a legitimate email goes to spam — the user misses an important message. High precision ensures that when the model says 'spam', it's almost certainly spam.
When NOT to use
Don't rely on precision alone when missing positives is dangerous. In emergency room triage, a model classifying patients for the red room (critical care): high precision means 'when we send someone to the red room, they really need it' — but it says nothing about patients who needed the red room but were sent home. A model that only flags the most obvious cases has great precision but lets critical patients die. You need Recall here.
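A minimal sketch of precision over binary labels (1 = positive). The spam labels below are toy data:

```python
def precision(y_true, y_pred):
    """TP / (TP + FP): how trustworthy the model's positive calls are."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if tp + fp else 0.0  # convention: 0 when no positive predictions

# Toy spam labels: 1 = spam, 0 = legitimate.
print(precision([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # → 0.666... (2 TP, 1 FP)
```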
Recall (Sensitivity)
Formula
Recall = TP / (TP + FN)
Of all actual positives, how many did the model find?
Of all the actual positive cases, how many did the model correctly identify? Recall measures the model's ability to find all positives.
How to interpret
Recall ranges from 0 to 1. Higher means fewer false negatives. Recall = 1 means the model found every positive case (no misses).
When to use
Use recall when missing a positive case has severe consequences. In cancer screening, a missed diagnosis (false negative) means a patient with cancer goes untreated. High recall ensures the model catches as many true cases as possible, even at the cost of some false alarms.
When NOT to use
Don't rely on recall alone when false positives have high costs. In fraud detection for a bank that freezes accounts on suspicion: maximizing recall catches all fraud, but if it also flags thousands of legitimate transactions, customers are locked out of their accounts. The bank faces angry customers and operational overload. Balance recall with precision using F1.
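Recall only changes the denominator relative to precision: it divides by the actual positives instead of the predicted ones. Same toy labels as a sketch:

```python
def recall(y_true, y_pred):
    """TP / (TP + FN): the fraction of actual positives the model found."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0  # convention: 0 when no actual positives

# Toy labels: 1 = positive case. One real positive was missed (FN).
print(recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # → 0.666... (2 TP, 1 FN)
```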
F1-Score
Formula
F1 = 2 · (Precision · Recall) / (Precision + Recall)
Harmonic mean — low if either precision or recall is low
The harmonic mean of Precision and Recall. Provides a single score that balances both concerns — avoiding false positives and avoiding false negatives.
How to interpret
F1 ranges from 0 to 1. Higher is better. F1 = 1 means perfect precision AND recall. The harmonic mean ensures F1 is low if either precision or recall is low.
When to use
Use F1 when you need to balance precision and recall. In content moderation for social media, you want to remove harmful content (high recall) without censoring legitimate posts (high precision). F1 gives a single number that reflects both concerns.
When NOT to use
Avoid F1 when precision and recall are not equally important for your problem. In airport security screening, recall is far more important than precision (missing a threat is catastrophic, extra bag checks are just inconvenient). F1 would unfairly penalize the system for having many false positives. Use recall with a precision threshold instead.
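The harmonic mean's punishing behavior is worth seeing with numbers. A sketch, taking precision and recall directly as inputs:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced scores reward the model:
print(f1_score(0.8, 0.8))  # → 0.8

# A lopsided model is punished: the arithmetic mean of 0.9 and 0.1
# would be 0.5, but the harmonic mean collapses toward the weak side.
print(f1_score(0.9, 0.1))  # ≈ 0.18
```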
Confusion Matrix
Structure
Rows = actual class, Columns = predicted class
A table showing the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). The foundation from which Accuracy, Precision, Recall, and F1 are all derived.
How to interpret
The diagonal (TP, TN) shows correct predictions. Off-diagonal (FP, FN) shows errors. Ideally, all values are on the diagonal. The matrix reveals WHERE the model fails, not just how often.
When to use
Always examine the confusion matrix when doing classification. In medical diagnosis with multiple conditions, the matrix reveals if the model consistently confuses pneumonia with bronchitis — information that a single metric like accuracy would hide.
When NOT to use
The confusion matrix is not a single metric — you can't easily compare models using it alone. When you need a quick comparison between 10 models, use it alongside summary metrics (F1, AUC). The matrix is for diagnosis, not for ranking.
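A multi-class confusion matrix is just a table of (actual, predicted) counts. A sketch with made-up diagnosis labels, following the rows = actual, columns = predicted convention above:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows = actual class, columns = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, pred)] for pred in labels] for actual in labels]

# Hypothetical diagnoses (illustrative only).
y_true = ["pneumonia", "bronchitis", "pneumonia", "healthy", "bronchitis"]
y_pred = ["bronchitis", "bronchitis", "pneumonia", "healthy", "pneumonia"]
labels = ["healthy", "bronchitis", "pneumonia"]

for row in confusion_matrix(y_true, y_pred, labels):
    print(row)
# The off-diagonal counts reveal that bronchitis and pneumonia
# are being confused with each other — exactly what accuracy alone would hide.
```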
ROC Curve / AUC
True Positive Rate (TPR)
TPR = TP / (TP + FN)
Same as Recall — proportion of actual positives correctly identified
False Positive Rate (FPR)
FPR = FP / (FP + TN)
Proportion of actual negatives incorrectly classified as positive
AUC
Area under the ROC curve — the probability that the model ranks a random positive higher than a random negative
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (Recall) against the False Positive Rate at every possible classification threshold. AUC (Area Under the Curve) summarizes the entire ROC curve as a single number.
How to interpret
AUC ranges from 0 to 1. AUC = 0.5 means the model is no better than random. AUC = 1.0 means perfect separation between classes. AUC > 0.9 is considered excellent, 0.7-0.9 is good, < 0.7 is poor.
When to use
Use AUC when you want to evaluate a model's discriminative ability independently of any specific threshold. In credit scoring, the bank will set its own risk threshold later — AUC tells you how well the model separates good from bad borrowers across all possible cutoff points.
When NOT to use
Avoid AUC on highly imbalanced datasets. In rare fraud detection (0.01% fraud rate), the FPR denominator (true negatives) is enormous, so even thousands of false positives barely move FPR. The ROC curve looks great while the model floods the fraud team with false alerts. Use Precision-Recall curves instead.
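The ranking interpretation of AUC (the probability that a random positive scores higher than a random negative, with ties counting half) can be computed directly, without building the curve. The scores below are made up:

```python
def auc(scores_pos, scores_neg):
    """Probability a random positive outranks a random negative; ties count 0.5."""
    wins = sum(
        (sp > sn) + 0.5 * (sp == sn)
        for sp in scores_pos
        for sn in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores for positive and negative examples.
print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))  # → 0.888... (8 of 9 pairs ranked correctly)
```

This pairwise form is O(n·m) and fine for a sketch; for real datasets the curve is usually built by sorting scores once.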