Showing posts with label Sklearn. Show all posts
Showing posts with label Sklearn. Show all posts

Tuesday 17 January 2017

Dealing With Missing Values Using Sklearn and Pandas

Real world data sets contain missing values. These values encoded as blanks, NaNs or other placeholders. we are used to skip that entire row/columns containing missing value while describing a machine learning model. However this comes at the price of losing data.

A good strategy is to fill out the missing values using known part of the data. The Imputer class provides basic strategies for imputing missing values, either using the mean, the median or the most frequent value of the row or column in which the missing values are located. Let's Check

first import our module and grab our data 

import pandas as pd

data= pd.read_csv(r'titanic.csv')

Let's Grab some column data which has some missing values. Have a look.

 

 

data['Age'].values[1:30]

Output :

array([ 38.,  26.,  35.,  35.,  nan,  54.,   2.,  27.,  14.,   4.,  58.,
        20.,  39.,  14.,  55.,   2.,  nan,  31.,  nan,  35.,  34.,  15.,
        28.,   8.,  38.,  nan,  19.,  nan,  nan])

Here one can see missing values are represented as 'nan' (nan in Numpy array and NaN in pandas dataframe).Below are some method by which missing value can be handled.

  1. Using sklearn.preprocessing Imputer function

  2. Using Pandas fillna() method

1. Using Sklearn Imputer Function

import imputer class from sklearn.preprocessing and define our imputer. here we will describe missing_value placeholder, strategy used to fill out the value and axis.

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
print imp

Output

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Here some additional variables axis, verbose are set to default. Now let's fit our data to our Imputer.

imp.fit(data['Age'].values.reshape(-1,1))

Output:

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Now transform our data replacing Nan with appropriate mean value.

age_reformed= imp.transform(data['Age'].values.reshape(-1,1))

Now Check the transformed data.

print(age_reformed[1:10])

Output:

array([[ 38.        ],
       [ 26.        ],
       [ 35.        ],
       [ 35.        ],
       [ 29.69911765],
       [ 54.        ],
       [  2.        ],
       [ 27.        ],
       [ 14.        ]])

Now these "nan" values has been transformed into mean value(29.69911765). The data after transformation can be used to machine learning model as it doesn't have missing value.

2. Using Pandas fillna()

Python's pandas library provide a direct way to deal with missing value.

First create a new data frame with column value as age.

df= pd.DataFrame(columns=['age'])
df['age']= data['Age'].values

We have missing values in newly created Dataframe.

print(df.head(10))
# output
 	age
0 	22.0
1 	38.0
2 	26.0
3 	35.0
4 	35.0
5 	NaN
6 	54.0
7 	2.0
8 	27.0
9 	14.0

Pandas fillna method is applied to dataframe as

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Now apply this to our data and put a value equal to 25 where nan occurs.

filled_value=df.fillna(25)
print(fill_value.head(10))

# output
 	age
0 	22.0
1 	38.0
2 	26.0
3 	35.0
4 	35.0
5 	25.0
6 	54.0
7 	2.0
8 	27.0
9 	14.0

Here value in 6th row(5th indexed) has been changed from 'NaN' to 25.0. More information about the default variable can be checked from docs.

Pandas's DataFrame fillan:-

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Pandas's Series fillna:-

 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html

Comment and let me know what are you using.