# Predicting who will survive on the Titanic (Take 2)

This notebook is based on a Kaggle competition where the goal is to predict survival on the Titanic from real passenger data. We previously used a decision tree to predict survival. Here we use a support vector machine (SVM) instead. The content is adapted from this [notebook](https://github.com/agconti/kaggle-titanic/blob/master/Titanic.ipynb).

## Preprocessing
We perform the same preprocessing steps as in the decision tree example: remove the name and ticket variables, then drop any samples that contain NaN values.

In [None]:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn import model_selection
import matplotlib.pyplot as plt
from sklearn import metrics
%matplotlib inline

titanic = pd.read_csv('data/titanic.csv')
titanic.head()

In [None]:
# let's drop name and ticket
titanic.drop(titanic.columns[[3, 8]], axis=1, inplace=True)
# for ease let's drop na
titanic = titanic.dropna()
titanic.head()

In [None]:
# dummy-code the categorical variables; .iloc[:, 1:] drops the first level of each as the baseline
cabinDummies = pd.get_dummies(titanic.Cabin, prefix='Cabin').iloc[:, 1:]
embarkedDummies = pd.get_dummies(titanic.Embarked, prefix='Embarked').iloc[:, 1:]
sexDummies = pd.get_dummies(titanic.Sex).iloc[:, 1:]
# concatenate the dummy variables, then drop the original categorical columns they replace
titanicDF = pd.concat([titanic, cabinDummies, embarkedDummies, sexDummies], axis=1)
titanicDF.drop(titanicDF.columns[[3, 8, 9]], axis=1, inplace=True)
titanicDF.columns
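
To make the `.iloc[:, 1:]` step in the cell above more concrete, here is a minimal sketch on a toy column. The values in `embarked` are made up for illustration and are not taken from `data/titanic.csv`. It shows that slicing off the first dummy column keeps one fewer indicator than there are categories, and that (as far as we can tell) this matches what `pd.get_dummies(..., drop_first=True)` produces.

```python
import pandas as pd

# Toy stand-in for titanic.Embarked (illustrative values only)
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per category, ordered alphabetically
full = pd.get_dummies(embarked, prefix='Embarked')

# Drop the first indicator so that category becomes the implicit baseline
reduced = full.iloc[:, 1:]

# Built-in alternative that should give the same columns
dropped = pd.get_dummies(embarked, prefix='Embarked', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(list(dropped.columns))
```

Dropping one level avoids carrying a redundant column: a passenger with zeros in every remaining indicator must belong to the baseline category.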

In [None]:
# split into 60-40 train/test
y = titanicDF.Survived.values
X = titanicDF.drop(titanicDF.columns[[1]], axis=1)
trainX, testX, trainY, testY = model_selection.train_test_split(X, y, test_size=0.4, random_state=5)

### Explore the effect of the four standard kernels

In [None]:
kernelTypes = ['linear', 'rbf', 'poly', 'sigmoid']

results = []

for kernel in kernelTypes:
    # fit an SVC with this kernel, leaving all other hyperparameters at their defaults
    clf = svm.SVC(kernel=kernel)
    clf.fit(trainX, trainY)
    yTrainHat = clf.predict(trainX)
    yTestHat = clf.predict(testX)
    trainACC = metrics.accuracy_score(trainY, yTrainHat)
    testACC = metrics.accuracy_score(testY, yTestHat)
    results.append([trainACC, testACC])

pd.DataFrame(results, index=kernelTypes, columns=["Train", "Test"])

In [None]:
clf = svm.SVC(kernel='poly', degree=4)
clf.fit(trainX, trainY)
yTestHat = clf.predict(testX)
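print(metrics.accuracy_score(testY, yTestHat))
# number of support vectors for each class
print(clf.n_support_)
# indices of the support vectors within the training set
clf.support_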
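
The `n_support_` and `support_` attributes inspected above can be easier to read on a small example. The sketch below fits a linear SVC to six hand-made points (the arrays `toyX` and `toyY` are hypothetical and unrelated to the Titanic data) and checks how the two attributes relate.

```python
import numpy as np
from sklearn import svm

# Six hand-made 2D points in two well-separated classes (illustrative only)
toyX = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
toyY = np.array([0, 0, 0, 1, 1, 1])

toyClf = svm.SVC(kernel='linear')
toyClf.fit(toyX, toyY)

# n_support_ holds one count per class; support_ holds row indices into toyX
print(toyClf.n_support_)
print(toyClf.support_)

# The rows of toyX that ended up as support vectors
print(toyX[toyClf.support_])
```

The per-class counts in `n_support_` add up to the length of `support_`, and indexing the training data with `support_` returns the same rows as `support_vectors_`.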