titanic-gaussian-regression

Gaussian process regression on the Titanic dataset

Gaussian processes are a flexible, probabilistic machine learning technique that can be applied to both regression and classification tasks. In this article, we explore their application to the Titanic dataset. Our goal is to predict the survival status of passengers based on their age and fare. Because survival is a binary outcome, the model fitted below is a Gaussian process classifier.

Dataset

The dataset is from the Titanic prediction competition. We will use the training set, which contains data on 891 passengers in CSV format.

Directory

| # | Column | Null | Type | Definition |
| --- | --- | --- | --- | --- |
| 0 | PassengerId | non-null | int64 | |
| 1 | Survived | non-null | int64 | Survival |
| 2 | Pclass | non-null | int64 | Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd |
| 3 | Name | non-null | string | |
| 4 | Sex | non-null | string | Gender "male" or "female" |
| 5 | Age | non-null | float64 | Age in years |
| 6 | SibSp | non-null | int64 | # of siblings / spouses aboard the Titanic |
| 7 | Parch | non-null | int64 | # of parents / children aboard the Titanic |
| 8 | Ticket | non-null | string | Ticket number |
| 9 | Fare | non-null | float64 | Passenger fare |
| 10 | Cabin | non-null | string | Cabin number |
| 11 | Embarked | non-null | string | Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton |

Exploration

Let's explore the key metrics of our dataset, with a specific focus on the Age, Fare, and Survived features. Understanding the distribution and statistics of these features can provide valuable insights into the dataset.

With pandas, the describe function gives us these statistics in a single call.

import pandas as pd

df = pd.read_csv("./data/titanic.csv")
df.describe()

This gives us a general description of the dataset.

| Metric | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| Mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| Standard Dev | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| Min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25th %ile | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| Median (50%) | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75th %ile | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| Max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |

We will briefly discuss the metrics on the features we will later use in the regression.
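
As a minimal sketch of how these per-feature numbers can be pulled out, the snippet below restricts describe to the three columns of interest. The `sample` frame is a hypothetical stand-in with made-up rows, not the real Titanic data; with the article's `df` the same calls apply.

```python
import pandas as pd

# Hypothetical stand-in rows; in the article `df` comes from ./data/titanic.csv.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, float("nan"), 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.2833, 8.4583, 53.1, 51.8625, 21.075],
    "Survived": [0, 1, 0, 1, 0, 0],
})

# Summary statistics restricted to the columns used later in the model.
summary = sample[["Age", "Fare", "Survived"]].describe()
print(summary)

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = sample["Survived"].mean()
print(f"Survival rate: {survival_rate:.3f}")

# describe() skips missing values, so a lower "count" reveals missing ages.
missing_age = int(sample["Age"].isna().sum())
print(f"Rows without an age: {missing_age}")
```

The same reading applies to the table above: the Age count of 714 against 891 rows signals missing ages, and the Survived mean of 0.383838 is the overall survival rate.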

Passenger Age

Fare

Survival

Normalization and transforming

Before applying Gaussian process regression, we need to prepare the dataset. Luckily, the dataset is already very consistent, so only the following minor steps are necessary.

Column: Sex

This column is currently stored as a string. We remap it to an integer value (female = 0, male = 1).

Inconsistent entries

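To make sure every entry has an Age, Fare and Sex value, we drop any row where one of them is missing. The summary above shows that only 714 of the 891 passengers have a recorded age, so this step removes a noticeable share of the rows.

We can implement these steps using Python and pandas: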

# Normalize "Sex": female -> 0, male -> 1 (any other value becomes NaN)
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})

# Remove inconsistent entries: rows missing any of the features we need
df = df.dropna(subset=["Age", "Fare", "Sex"])

# With the missing values gone, "Sex" can be stored as an integer
df = df.astype({"Sex": "int32"})

df.info()

Regression

Data preparation

We will use the "Age" and "Fare" columns to predict passenger survival. Let's prepare the data for regression:

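import numpy as np

# Stack the two feature columns, then transpose to (n_samples, n_features)
X = np.array([df["Age"], df["Fare"]])
X = np.transpose(X)
# Target vector: 1 = survived, 0 = did not survive
y = np.array(df["Survived"])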

Note that scikit-learn estimators expect the input data to have the shape (n_samples, n_features). Building the array from the two columns yields the shape (n_features, n_samples), so we flip it using numpy's transpose function.
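
The shape requirement can be checked on a tiny example. The `toy` frame below is a hypothetical three-row stand-in for the cleaned data. The sketch also shows that selecting both columns at once and calling `to_numpy()` produces the same array without a transpose, which is an alternative rather than what the article's code does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned Titanic frame.
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})

# Stacking the two Series gives one row per feature: (n_features, n_samples) ...
stacked = np.array([toy["Age"], toy["Fare"]])
# ... so a transpose is needed to get (n_samples, n_features).
X_transposed = np.transpose(stacked)

# Selecting both columns at once yields the target shape directly.
X_direct = toy[["Age", "Fare"]].to_numpy()

print(stacked.shape, X_transposed.shape, X_direct.shape)
# prints: (2, 3) (3, 2) (3, 2)
```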

We also need to split the data into training and test sets for evaluation:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Model

We will use scikit-learn's Gaussian process classifier, which is based on the Laplace approximation. As the kernel we use RBF, also known as the squared-exponential kernel, which gives good results when used in a classifier.

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

kernel = 1.0 * RBF(1.0)
gpc = GaussianProcessClassifier(kernel=kernel, random_state=0).fit(X_train, y_train)
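
Fitting does more than store the data: scikit-learn tunes the kernel hyperparameters during `fit` and exposes the result as `kernel_`. The sketch below illustrates this on synthetic data, not the Titanic set; `X_demo`, `y_demo` and `gpc_demo` are hypothetical names used only here.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in data (NOT the Titanic set): two features, noisy binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
noise = rng.normal(scale=0.5, size=60)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

kernel = 1.0 * RBF(1.0)
gpc_demo = GaussianProcessClassifier(kernel=kernel, random_state=0).fit(X_demo, y_demo)

# `kernel` keeps the initial hyperparameters; `kernel_` holds the fitted ones.
print("Initial kernel:", kernel)
print("Fitted kernel: ", gpc_demo.kernel_)

# theta stores the hyperparameters on a log scale:
# [log(constant_value), log(length_scale)] for this product kernel.
print("theta:", gpc_demo.kernel_.theta)
```

This is why the evaluation code passes `gpc.kernel_.theta` rather than the initial kernel's values.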

Results

Evaluation

To evaluate the model's performance we will use log marginal likelihood, accuracy and log-loss.

from sklearn.metrics import accuracy_score, log_loss

print(
    "Log Marginal Likelihood: %.3f"
    % gpc.log_marginal_likelihood(gpc.kernel_.theta)
)
print(
    "Accuracy: %.3f (full dataset) %.3f (test dataset)"
    % (
        accuracy_score(y, gpc.predict(X)),
        accuracy_score(y_test, gpc.predict(X_test))
    )
)
print(
    "Log-loss: %.3f (full dataset)" % log_loss(y, gpc.predict_proba(X)[:, 1])
)

With our model we get the following metrics.

| Metric | Value |
| --- | --- |
| Log Marginal Likelihood | 292.969 |
| Accuracy (Full Dataset) | 0.724 |
| Accuracy (Test Dataset) | 0.669 |
| Log-loss | 0.571 |

Log Marginal Likelihood

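The log marginal likelihood measures how well the model, with its fitted kernel hyperparameters, explains the training data. A higher (less negative) value indicates a better fit, and it is the quantity scikit-learn maximizes when tuning the kernel. For a classifier the Laplace-approximated value cannot be positive, so the 292.969 reported above is best read as a magnitude. The number is hard to interpret in isolation; it is mainly useful for comparing kernels or hyperparameter settings on the same training data.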
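
A minimal sketch of such a comparison, on synthetic data rather than the Titanic set (all `*_demo` names are hypothetical): it evaluates the log marginal likelihood at the fitted hyperparameters and at the initial ones, passing theta on the log scale as scikit-learn expects.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in data (NOT the Titanic set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
noise = rng.normal(scale=0.5, size=60)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

gpc_demo = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=0)
gpc_demo.fit(X_demo, y_demo)

# Log marginal likelihood at the optimized hyperparameters.
lml_fitted = gpc_demo.log_marginal_likelihood(gpc_demo.kernel_.theta)

# Same quantity at the initial hyperparameters (constant 1.0, length scale 1.0).
lml_initial = gpc_demo.log_marginal_likelihood(np.log([1.0, 1.0]))

print(f"fitted:  {lml_fitted:.3f}")
print(f"initial: {lml_initial:.3f}")
```

Both values are non-positive, and the fitted one should be at least as high as the initial one, since the optimizer starts from the initial hyperparameters.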

Accuracy

Accuracy is a commonly used metric for classification tasks. It measures the proportion of correctly predicted labels. The model reaches an accuracy of approximately 72.4% on the full dataset and approximately 66.9% on the held-out test set. The full-dataset figure includes the rows the model was trained on, so the test figure is the better estimate of performance on new, unseen passengers. For context, the overall survival rate of 38.4% means that always predicting "did not survive" would already score roughly 61.6% on the unfiltered data, so the model offers only a modest improvement over that baseline.
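
To put an accuracy figure in perspective, it helps to compare it with a majority-class baseline. The sketch below uses ten hypothetical labels, not the real data, to show the computation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of 10 passengers did not survive.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# A baseline that always predicts the most frequent class.
majority_class = np.bincount(y_true).argmax()
y_baseline = np.full_like(y_true, majority_class)

baseline_accuracy = accuracy_score(y_true, y_baseline)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
# prints: Majority-class baseline accuracy: 0.600
```

Applied to the article's data, the same idea would use `y_test` in place of `y_true`.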

Log-loss

Log-loss measures how well the predicted probabilities of a probabilistic classifier match the true outcomes; confident but wrong predictions are penalized most heavily. A lower log-loss value indicates better performance. A log-loss of 0.571 on the full dataset is better than the ln 2 ≈ 0.693 obtained by always predicting a probability of 0.5. Because it is computed on data that includes the training rows, it is likely somewhat optimistic.
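
For binary labels the definition is short enough to compute by hand. The sketch below uses four made-up predictions to check a manual calculation against scikit-learn's `log_loss`, and shows the reference value for a constant prediction of 0.5.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted survival probabilities, for illustration only.
y_true = np.array([1, 0, 1, 0])
p_survived = np.array([0.9, 0.2, 0.6, 0.5])

# Binary log-loss: mean of -[y * log(p) + (1 - y) * log(1 - p)].
manual = -np.mean(
    y_true * np.log(p_survived) + (1 - y_true) * np.log(1 - p_survived)
)
print(f"manual:    {manual:.4f}")
print(f"sklearn:   {log_loss(y_true, p_survived):.4f}")

# Reference point: always predicting 0.5 gives ln(2), about 0.693.
coin_flip = log_loss(y_true, np.full(4, 0.5))
print(f"coin flip: {coin_flip:.4f}")
```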

Visualizing survival probabilities

To gain deeper insights into the results obtained from the Gaussian process regression model, we can visualize the survival probability. These visualizations provide a clear representation of how passenger age and fare influence the predicted survival probabilities.

Surface plot

In this visualization, we use passenger Age as the x-axis and Fare as the y-axis. The Gaussian process regression model's predictions for survival probability are mapped to the z-axis. The result is a three-dimensional surface plot that illustrates how the survival probability varies across different combinations of age and fare values.

survival-probability-surface.png

Heatmap

The heatmap is another valuable visualization that simplifies the understanding of survival probability. In this heatmap, we represent passenger age and fare as coordinates. The color intensity in each cell of the heatmap corresponds to the predicted survival probability for passengers with specific age and fare values.

survival-probability-heatmap.png

The heatmap provides a two-dimensional representation of the survival probability. It visually emphasizes areas of interest, such as higher survival rates for specific age groups or fare ranges. For instance, it can reveal that children tend to have a higher survival rate, as indicated by brighter regions on the heatmap.
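
Both figures can be produced from one grid of predicted probabilities. The following is a sketch under stated assumptions, not the code used for the images above: it trains on synthetic age/fare data (not the Titanic set), and the grid ranges, resolution and colormap are arbitrary choices. It assumes matplotlib is installed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in for (Age, Fare) -> Survived; NOT the Titanic data.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(0, 80, 80), rng.uniform(0, 100, 80)])
y_demo = (X_demo[:, 1] - X_demo[:, 0] + rng.normal(0, 20, 80) > 0).astype(int)
gpc_demo = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=0)
gpc_demo.fit(X_demo, y_demo)

# Evaluate the predicted probability of class 1 on a regular age/fare grid.
ages = np.linspace(0, 80, 40)
fares = np.linspace(0, 100, 40)
age_grid, fare_grid = np.meshgrid(ages, fares)
grid_points = np.column_stack([age_grid.ravel(), fare_grid.ravel()])
proba = gpc_demo.predict_proba(grid_points)[:, 1].reshape(age_grid.shape)

fig = plt.figure(figsize=(11, 4))

# Left: 3D surface with the probability on the z-axis.
ax_surface = fig.add_subplot(1, 2, 1, projection="3d")
ax_surface.plot_surface(age_grid, fare_grid, proba, cmap="viridis")
ax_surface.set_xlabel("Age")
ax_surface.set_ylabel("Fare")
ax_surface.set_zlabel("P(survived)")

# Right: heatmap of the same grid, with color encoding the probability.
ax_heat = fig.add_subplot(1, 2, 2)
mesh = ax_heat.pcolormesh(age_grid, fare_grid, proba, cmap="viridis", shading="auto")
fig.colorbar(mesh, ax=ax_heat, label="P(survived)")
ax_heat.set_xlabel("Age")
ax_heat.set_ylabel("Fare")
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk. With the article's fitted `gpc`, the same grid evaluation applies once the grid ranges are matched to the observed ages and fares.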

Discussion

The results of our Gaussian process regression model on the Titanic dataset are promising but leave room for further improvement. The accuracy scores, both on the full dataset and the test dataset, indicate that the model is reasonably effective at predicting passenger survival based on age and fare. However, there are several points to consider:

In summary, while the Gaussian process regression model shows promise in predicting passenger survival, there are avenues for further exploration and enhancement to achieve even better results. This includes feature engineering, hyperparameter tuning, and addressing class imbalance to create a more robust predictive model.