# Introduction to Pandas

pandas is a Python library for data analysis. Its central object, the DataFrame, is similar to the data frame in R. This notebook collects common code snippets for loading data, transforming data, indexing rows and columns, and making quick plots.


In [None]:
# Preliminaries to import stuff we'll need
import pandas as pd
import matplotlib
%matplotlib inline

## Reading and understanding data

We will load a portion of Lahman's Baseball Database. This database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. We will focus on team statistics, which are found in the Teams.csv file. The full database can be found [here](http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip).

In [None]:
# load the file
teams = pd.read_csv('Teams.csv')
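
If Teams.csv is not at hand, the same call can be tried on an in-memory sample. The sketch below is illustrative only: the rows and team codes are invented, and only the column names mirror ones used later in this notebook.

```python
import io
import pandas as pd

# Hypothetical sample standing in for Teams.csv; all values are invented.
sample_csv = io.StringIO(
    "yearID,teamID,W,R\n"
    "1984,AAA,80,700\n"
    "1985,AAA,85,720\n"
    "1985,BBB,70,650\n"
)

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(sample_csv)

# usecols restricts parsing to the named columns
sample_csv.seek(0)
subset = pd.read_csv(sample_csv, usecols=["yearID", "teamID"])
```

Passing `usecols` can be a convenient way to skip columns you already know you will not need.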

## Basic Analysis in Pandas

pandas has several methods and attributes that let you quickly inspect a dataset: the type and amount of data you are dealing with, along with some summary statistics.

* .head() - returns the first 5 rows by default
* .shape - returns the row and column count of a dataset as a tuple
* .describe() - returns summary statistics about the numerical columns in a dataset
* .columns - returns the column names
* .dtypes - returns the data type of each column


In [None]:
# get the first 5 rows
teams.head()

In [None]:
# get the column names
teams.columns

In [None]:
# get the shape of the data
teams.shape

In [None]:
# describe the data
teams.describe()
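
The list above also mentions .dtypes, which the cells so far do not exercise. As a rough sketch, here it is next to .shape and .describe() on a small toy frame. The frame and its values are made up for illustration and are not taken from the Lahman data.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "yearID": [1985, 1986, 1987],
    "teamID": ["AAA", "AAA", "BBB"],
    "R": [700, 650, 720],
})

print(toy.shape)   # (rows, columns) tuple
print(toy.dtypes)  # one dtype per column

# describe() summarises only the numeric columns by default,
# so the text column teamID is left out of the result
summary = toy.describe()
print(summary)
```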

## Selecting parts of the data

pandas lets you subset data by filtering rows on the values in a particular column, or by selecting specific columns (features).

In [None]:
# keep only the seasons from 1985 onward
teams = teams[teams['yearID'] >= 1985]
# keep only the columns we are interested in
teams = teams[['yearID', 'teamID', 'Rank', 'R', 'RA', 'G', 'W', 'H', 'BB', 'HBP', 'AB', 'SF', 'HR', '2B', '3B']]

In [None]:
teams.head()
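
Row filtering and column selection can also be done in one step with .loc, and several conditions can be combined with & (each condition in its own parentheses). A minimal sketch on an invented toy frame, not the real Teams data:

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "yearID": [1984, 1985, 1986],
    "teamID": ["AAA", "AAA", "BBB"],
    "R": [700, 650, 720],
    "RA": [680, 690, 640],
})

# A boolean mask: True for the rows to keep
mask = toy["yearID"] >= 1985

# .loc takes the row selector first, then the column selector
recent = toy.loc[mask, ["yearID", "teamID", "R"]]

# Combine conditions with & and wrap each one in parentheses
both = toy[(toy["yearID"] >= 1985) & (toy["teamID"] == "AAA")]
```

Note that the filtered frames keep the original row labels, so `recent` is indexed by 1 and 2 rather than starting again at 0.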

## Plotting data

pandas data frames have a built-in plot method that, by default, draws each numeric column as a line against the index.

In [None]:
teams.plot()

In [None]:
# plot the distribution of the number of runs for Atlanta
teams[teams["teamID"] == "ATL"]["R"].plot(kind="hist")

In [None]:
# number of runs per year for Atlanta
teams[teams["teamID"] == "ATL"].plot(x="yearID", y="R")
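
The plot method returns a matplotlib Axes object, which can be used to add labels afterwards. The following is a sketch on invented toy data. The `matplotlib.use("Agg")` line is only there so the example runs outside a notebook, and it can be dropped when `%matplotlib inline` is active.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "yearID": [1985, 1986, 1987, 1988],
    "R": [700, 650, 720, 690],
})

# plot() returns the Axes it drew on
ax = toy.plot(x="yearID", y="R", kind="line", title="Runs per season (toy data)")

# Label the axes through the returned object
ax.set_xlabel("Season")
ax.set_ylabel("Runs scored")
```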