# Linear Discriminant Analysis

* Adapted from a Jupyter notebook by [Sebastian Raschka](https://github.com/rasbt/pattern_classification/blob/master/dimensionality_reduction/projection/linear_discriminant_analysis.ipynb)

## Dataset
This notebook uses the famous "Iris" dataset that can be found on the [UCI machine learning repository](https://archive.ics.uci.edu/ml/datasets/Iris). It contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset:
* Iris-setosa (n=50)
* Iris-versicolor (n=50)
* Iris-virginica (n=50)

The four features of the Iris dataset:
* sepal length in cm
* sepal width in cm
* petal length in cm
* petal width in cm

![iris](http://joyceho.github.io/images/iris_petal_sepal.png)

## Reading the dataset

In [None]:
import pandas as pd

# the raw file has no header row, so column names are assigned below
df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',',
    )
feature_dict = {i:label for i,label in zip(
                range(4),
                  ('sepal length in cm', 
                  'sepal width in cm', 
                  'petal length in cm', 
                  'petal width in cm', ))}
df.columns = [l for i,l in sorted(feature_dict.items())] + ['class label']
df.dropna(how="all", inplace=True) # to drop the empty line at file-end

df.tail()

Since it is more convenient to work with numerical values, we will use the `LabelEncoder` from the `scikit-learn` library to convert the class labels into the numbers `1`, `2`, and `3`.

In [None]:
from sklearn.preprocessing import LabelEncoder

# the columns were renamed above, so select the four features by position
X = df.iloc[:, 0:4].values
y = df['class label'].values

enc = LabelEncoder()
label_encoder = enc.fit(y)
y = label_encoder.transform(y) + 1

label_dict = {1: 'Setosa', 2: 'Versicolor', 3:'Virginica'}
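
To make the encoding step concrete, here is a minimal sketch of the same mapping using only NumPy. It assumes, as `LabelEncoder` does, that the integer codes follow the sorted order of the unique labels; the short `labels` array is a made-up sample in the format of the Iris file, not the real data.

```python
import numpy as np

# Hypothetical sample of class labels in the same string format as the Iris file
labels = np.array(['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])

# np.unique returns the sorted unique labels and, for each entry,
# the 0-based index of its label in that sorted array
classes, codes = np.unique(labels, return_inverse=True)

# shift to 1-based codes, mirroring the `+ 1` used in the cell above
codes_1based = codes + 1

print(classes)
print(codes_1based)
```

Because the labels are sorted alphabetically, setosa maps to 1, versicolor to 2, and virginica to 3, which matches `label_dict`.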

## Histograms and feature selection

To get a rough idea of how the samples of our three classes $\omega_1$, $\omega_2$ and $\omega_3$ are distributed, let us visualize the distributions of the four features in 1-dimensional histograms.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import math

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,6))

for ax,cnt in zip(axes.ravel(), range(4)):  
    
    # set bin sizes
    min_b = math.floor(np.min(X[:,cnt]))
    max_b = math.ceil(np.max(X[:,cnt]))
    bins = np.linspace(min_b, max_b, 25)
    
    # plotting the histograms
    for lab,col in zip(range(1,4), ('blue', 'red', 'green')):
        ax.hist(X[y==lab, cnt],
                   color=col, 
                   label='class %s' %label_dict[lab], 
                   bins=bins,
                   alpha=0.5,)
    ylims = ax.get_ylim()
    
    # plot annotation
    leg = ax.legend(loc='upper right', fancybox=True, fontsize=8)
    leg.get_frame().set_alpha(0.5)
    ax.set_ylim([0, max(ylims)+2])
    ax.set_xlabel(feature_dict[cnt])
    ax.set_title('Iris histogram #%s' %str(cnt+1))
    
    # hide axis ticks (tick_params expects booleans, not "on"/"off" strings)
    ax.tick_params(axis="both", which="both", bottom=False, top=False,
            labelbottom=True, left=False, right=False, labelleft=True)

    # remove axis spines
    ax.spines["top"].set_visible(False)  
    ax.spines["right"].set_visible(False) 
    ax.spines["bottom"].set_visible(False) 
    ax.spines["left"].set_visible(False)    
 
axes[0][0].set_ylabel('count')
axes[1][0].set_ylabel('count')
    
fig.tight_layout()       
        
plt.show()

From these histograms alone, we can already tell that petal length and petal width are likely better suited than the sepal measurements to separate the three flower classes. In practice, a good alternative to reducing the dimensionality via a projection (here: LDA) would be a feature selection technique. For low-dimensional datasets like Iris, a glance at the histograms is already very informative. For datasets with more features, feature selection algorithms are a simple but very useful option; they are described in more detail in Sebastian Raschka's article [Feature Selection Algorithms in Python](http://sebastianraschka.com/Articles/2014_sequential_sel_algos.html).
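
The visual impression from the histograms can also be put into numbers. The sketch below computes, for each feature, the ratio of between-class to within-class variance (often called a Fisher score); a higher value suggests the feature separates the classes better on its own. The function name `fisher_score` and the toy arrays are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def fisher_score(X, y):
    """Ratio of between-class to within-class variance for each feature."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_c = X[y == label]
        class_mean = X_c.mean(axis=0)
        # spread of the class mean around the overall mean, weighted by class size
        between += X_c.shape[0] * (class_mean - overall_mean) ** 2
        # spread of the samples around their own class mean
        within += ((X_c - class_mean) ** 2).sum(axis=0)
    return between / within

# Toy data (made up, not Iris): feature 0 separates the classes, feature 1 does not
X_toy = np.array([[1.0, 5.0], [1.2, 3.0], [3.0, 5.1], [3.2, 2.9]])
y_toy = np.array([1, 1, 2, 2])

scores = fisher_score(X_toy, y_toy)
print(scores)
```

Calling `fisher_score(X, y)` on the Iris arrays defined earlier should give one score per feature, which can then be compared against the histograms.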

## LDA via scikit-learn


In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA
sklearn_lda = LinearDiscriminantAnalysis(n_components=2)
X_lda_sklearn = sklearn_lda.fit_transform(X, y)
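
For intuition about what the projection computes, the following is a rough NumPy sketch of the classic eigen-decomposition view of LDA: build the within-class and between-class scatter matrices and project onto the leading eigenvectors of $S_W^{-1} S_B$. This is an illustration under stated assumptions, not scikit-learn's implementation; its default solver is SVD-based, so the scaling and sign of the axes may differ from `X_lda_sklearn`. The helper `lda_project` and the random toy data are hypothetical.

```python
import numpy as np

def lda_project(X, y, n_components=2):
    """Project X onto the leading eigenvectors of inv(S_W) @ S_B."""
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((n_features, n_features))  # within-class scatter
    S_B = np.zeros((n_features, n_features))  # between-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        centered = X_c - mean_c
        S_W += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)
    # solve(S_W, S_B) equals inv(S_W) @ S_B; the result is not symmetric, so use eig
    eig_vals, eig_vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    # keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eig_vals.real)[::-1][:n_components]
    W = eig_vecs[:, order].real
    return X @ W

# Toy data (random, not Iris): three classes with shifted means in four dimensions
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1]], dtype=float)
X_toy = np.vstack([rng.normal(loc=m, scale=1.0, size=(10, 4)) for m in means])
y_toy = np.repeat([1, 2, 3], 10)

X_proj = lda_project(X_toy, y_toy)
print(X_proj.shape)
```

With three classes, the between-class scatter has rank at most two, which is why at most two discriminant axes carry information and `n_components=2` is the natural choice for Iris.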

In [None]:
def plot_scikit_lda(X, title):
    
    ax = plt.subplot(111)
    for label,marker,color in zip(
        range(1,4),('^', 's', 'o'),('blue', 'red', 'green')):
        
        plt.scatter(x=X[:,0][y == label],
                    y=X[:,1][y == label] * -1, # flip the figure
                    marker=marker,
                    color=color,
                    alpha=0.5,
                    label=label_dict[label])
  
    plt.xlabel('LD1')
    plt.ylabel('LD2')

    leg = plt.legend(loc='upper right', fancybox=True)
    leg.get_frame().set_alpha(0.5)
    plt.title(title)
    
    # hide axis ticks (tick_params expects booleans, not "on"/"off" strings)
    plt.tick_params(axis="both", which="both", bottom=False, top=False,
            labelbottom=True, left=False, right=False, labelleft=True)

    # remove axis spines
    ax.spines["top"].set_visible(False)  
    ax.spines["right"].set_visible(False) 
    ax.spines["bottom"].set_visible(False) 
    ax.spines["left"].set_visible(False)    
 
    plt.grid()
    plt.tight_layout()
    plt.show()

In [None]:
plot_scikit_lda(X_lda_sklearn, title='Default LDA via scikit-learn')