Tuesday, 17 January 2017

Dealing With Missing Values Using Sklearn and Pandas

Real-world data sets often contain missing values, encoded as blanks, NaNs or other placeholders. A common workaround is to skip the entire rows or columns containing missing values while building a machine learning model. However, this comes at the price of losing data.

A better strategy is to fill in the missing values using the known part of the data. The Imputer class provides basic strategies for imputing missing values, using either the mean, the median or the most frequent value of the row or column in which the missing values are located. Let's check.

First, import our module and grab our data:

import pandas as pd

data= pd.read_csv(r'titanic.csv')

Let's grab a column that has some missing values and have a look:

data['Age'].values[1:30]

Output:

array([ 38.,  26.,  35.,  35.,  nan,  54.,   2.,  27.,  14.,   4.,  58.,
        20.,  39.,  14.,  55.,   2.,  nan,  31.,  nan,  35.,  34.,  15.,
        28.,   8.,  38.,  nan,  19.,  nan,  nan])

Here one can see that missing values are represented as nan (nan in a NumPy array, NaN in a pandas DataFrame). Below are two methods by which missing values can be handled; a quick way to count the missing entries first is shown right after the list.

  1. Using sklearn.preprocessing Imputer function

  2. Using Pandas fillna() method
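
Before picking a method, it helps to know how many entries are actually missing. A quick check using pandas (isnull and sum are standard pandas methods):

print(data['Age'].isnull().sum())   # counts the NaN entries in the 'Age' column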

1. Using Sklearn Imputer Function

Import the Imputer class from sklearn.preprocessing and define our imputer. Here we specify the missing-value placeholder, the strategy used to fill in the values, and the axis:

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imp)

Output:

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Here the additional parameters copy and verbose are left at their defaults. Now let's fit our data to our Imputer:

imp.fit(data['Age'].values.reshape(-1,1))

Output:

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Now transform our data, replacing nan with the appropriate mean value:

age_reformed= imp.transform(data['Age'].values.reshape(-1,1))

Now check the transformed data:

print(age_reformed[1:10])

Output:

array([[ 38.        ],
       [ 26.        ],
       [ 35.        ],
       [ 35.        ],
       [ 29.69911765],
       [ 54.        ],
       [  2.        ],
       [ 27.        ],
       [ 14.        ]])

The nan values have now been replaced with the mean value (29.69911765). The transformed data can be fed to a machine learning model, as it no longer has missing values.
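
Fitting and transforming can also be done in one step, and the strategy is easy to swap. A minimal sketch using the same Imputer API ('median' is one of its built-in strategies):

imp_median = Imputer(missing_values='NaN', strategy='median', axis=0)
age_median = imp_median.fit_transform(data['Age'].values.reshape(-1,1))  # fit and transform in one call
print(age_median[1:10])  # nan entries are now filled with the column median instead of the mean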

2. Using Pandas fillna()

Python's pandas library provides a direct way to deal with missing values.

First, create a new DataFrame with a column called age:

df= pd.DataFrame(columns=['age'])
df['age']= data['Age'].values

The newly created DataFrame still has missing values:

print(df.head(10))
# output
 	age
0 	22.0
1 	38.0
2 	26.0
3 	35.0
4 	35.0
5 	NaN
6 	54.0
7 	2.0
8 	27.0
9 	14.0

The pandas fillna method is applied to a DataFrame as:

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Now apply this to our data and put in a value of 25 wherever nan occurs:

filled_value=df.fillna(25)
print(filled_value.head(10))

# output
 	age
0 	22.0
1 	38.0
2 	26.0
3 	35.0
4 	35.0
5 	25.0
6 	54.0
7 	2.0
8 	27.0
9 	14.0

Here the value in the 6th row (index 5) has been changed from NaN to 25.0. More information about the default parameters can be found in the docs.
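
A constant is rarely the best fill value. fillna also accepts computed values and a method argument, so we can reproduce the mean imputation from section 1 or carry the last valid observation forward. A short sketch (value and method='ffill' are both standard fillna options):

filled_mean = df['age'].fillna(df['age'].mean())   # same result as the sklearn 'mean' strategy
filled_ffill = df.fillna(method='ffill')           # propagate the last valid value forward
print(filled_mean.head(10))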

Pandas's DataFrame fillna:-

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Pandas's Series fillna:-

 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html

Comment and let me know what you are using.

Tuesday, 3 January 2017

Domain name configuration from Github to digital-Ocean

GitHub is a good place to start things out. Writing some basic code, putting it on GitHub and checking it at a default URL is pretty cool. And the default URL can be changed into a custom domain name.
When I started my blog, I used GitHub as a hosting provider; you can use it for free with a public repository, or if you are a student. But later I moved to another hosting provider. The major issue then was how to move the domain name configuration from GitHub to DigitalOcean (which is what I chose). There were a lot of articles about how to set up a domain name with DigitalOcean, but none of them described what to do if you're moving from GitHub.

Now let's check what the issues are and how I solved them. First, when you configure a domain name with GitHub, you have to do some setup: enter one or two IP addresses into an A record, and add a CNAME record whose value is your GitHub Pages URL.
I have an apex domain (no www; sometimes called a root or naked domain), so I had to configure an ALIAS and an A record with my DNS provider.

A record for GitHub:-
While creating an A record for GitHub, we have to specify a single IP address that looks like this:-

(screenshot: the A record pointing to GitHub Pages' IP address, 192.30.252.153 at the time)
Sometimes you have to put in 2 IP addresses, where the additional IP address would be 192.30.252.154.

This is explained on GitHub help: tips-for-configuring-an-a-record-with-your-dns-provider.
And a CNAME (alias) record that looks like this …
This was my GoDaddy account's DNS setting for GitHub Pages. Then I moved from GitHub to DigitalOcean, so I had to reconfigure my DNS settings. At first I was terrified of how I would configure it all again. Many questions were going through my head: setting up an A record, changing the alias record (how?). I had an IP address but didn't know how to set it up, plus the guide to configuring DNS settings on DigitalOcean is a bit confusing if you are moving from GitHub. In this case I used the approach below, which helped me; I hope it will be helpful for you too.
First, I erased all the records from the DNS settings, leaving it empty as for a new domain. It looked like below:-

(screenshot: the emptied DNS record list)
Now find the fields called “Domain Name Server”.

(screenshot: the Domain Name Server fields in the GoDaddy settings)
Point your name servers to DigitalOcean by filling in the three Domain Name Server fields. Once done, save your changes and exit.
The DigitalOcean name servers are:-
ns1.digitalocean.com
ns2.digitalocean.com
ns3.digitalocean.com
After setting up the above values, it looks like below:-

(screenshot: the name server fields pointing to DigitalOcean)
Now go back to your DigitalOcean account, select the droplet for which you want to add your domain name and click on “Add a domain”, as below.
A new page will open; put in your domain name and the droplet IP, then click on “Create record”. And you are done.

These changes may take some time to take effect. You can test your new domain by pinging it:

ping website_name

The response would be something like this:

Pinging website_name [droplet IP address] with 32 bytes of data.
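
You can also check the resolution from Python. A minimal sketch using the standard socket module ('yourdomain.com' below is a placeholder for your own domain):

import socket

# resolve the domain and compare the result against your droplet's IP
resolved_ip = socket.gethostbyname('yourdomain.com')
print(resolved_ip)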



Friday, 23 December 2016

Feature Scaling with Python and Sklearn


Sometimes, before applying a machine learning algorithm to your dataset, you first have to do some pre-processing of your data. For this purpose sklearn provides a package to deal with such scenarios; here we will use it to make our work easy. Pre-processing may include standardization, normalization, binarization, imputation of missing values, etc. Now get ready and start importing!
First, import pandas and numpy:
import pandas as pd 
import numpy as np
Nothing special, just loading our data:
data=pd.read_csv(r'train.csv')
Fare= data['Fare'].values[:40]
Check our input:
print(Fare)
Output:
[   7.25     71.2833    7.925    53.1       8.05      8.4583   51.8625
   21.075    11.1333   30.0708   16.7      26.55      8.05     31.275
    7.8542   16.       29.125    13.       18.        7.225    26.       13.
    8.0292   35.5      21.075    31.3875    7.225   263.        7.8792
    7.8958   27.7208  146.5208    7.75     10.5      82.1708   52.
    7.2292    8.05     18.       11.2417]
This is our input data, which is going to be scaled.
Now let's start: first import preprocessing from sklearn. All the other magic tools live under preprocessing.
from sklearn import preprocessing

scale() Function:

The function scale provides a quick way to standardize an array: it transforms the data to zero mean and unit variance. (The values are not strictly bounded; most land near -1 to 1, but outliers can fall well outside that range, as we will see.)
Apply the function to our input data (Fare) and store the output in another variable called fare_scaled.
fare_scaled= preprocessing.scale(Fare)
Check the scaled input:
print(fare_scaled)
Output:
[-0.51857041  0.88523827 -0.50377232  0.4866039  -0.50103193 -0.49208073
  0.45947406 -0.2154835  -0.43343642 -0.01826765 -0.31139708 -0.09545451
 -0.50103193  0.00813216 -0.50532447 -0.32674325 -0.03900252 -0.39251257
 -0.28289705 -0.51911849 -0.10751222 -0.39251257 -0.50148793  0.10075727
 -0.2154835   0.01059851 -0.51911849  5.08826339 -0.5047764  -0.50441247
 -0.06978694  2.5346778  -0.50760886 -0.44732033  1.12392606  0.46248848
 -0.51902641 -0.50103193 -0.28289705 -0.43105996]
The data has been standardized: it now has (approximately) zero mean and unit variance. Note that the outlier fare of 263.0 maps to 5.09 rather than lying within -1 to 1.
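We can verify this directly; a quick check (mean and std are plain NumPy array methods):
print(fare_scaled.mean())   # ~0.0, up to floating point error
print(fare_scaled.std())    # ~1.0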

Scaling features to a range

Sometimes data should be scaled between some minimum and maximum value, often between zero and one. This can be achieved using MinMaxScaler or MaxAbsScaler. Let's try that.
First define our min_max_scaler; let's say we want to scale our data to the range 0 to 5:
min_max_scaler= preprocessing.MinMaxScaler(feature_range=(0,5)) 
print(min_max_scaler)
Output:
MinMaxScaler(copy=True, feature_range=(0, 5))
Now transform our data using min_max_scaler:
fare_transformed= min_max_scaler.fit_transform(Fare.reshape(-1,1))

print(fare_transformed.reshape(1,-1))  # using reshape(1,-1) just for a better view
Output:
[[  4.88710781e-04   1.25223927e+00   1.36839019e-02   8.96784283e-01
    1.61274558e-02   2.41090802e-02   8.72593099e-01   2.70745773e-01
    7.64011338e-02   4.46599550e-01   1.85221386e-01   3.77773434e-01
    1.61274558e-02   4.70139771e-01   1.22998729e-02   1.71537484e-01
    4.28110644e-01   1.12892190e-01   2.10634347e-01   0.00000000e+00
    3.67021797e-01   1.12892190e-01   1.57208484e-02   5.52731893e-01
    2.70745773e-01   4.72338970e-01   0.00000000e+00   5.00000000e+00
    1.27885837e-02   1.31130877e-02   4.00660737e-01   2.72301437e+00
    1.02629264e-02   6.40211123e-02   1.46507282e+00   8.75281009e-01
    8.21034112e-05   1.61274558e-02   2.10634347e-01   7.85201838e-02]]
The scaled data now lies in the range 0 to 5. Here we can provide any range in which we want our data to be scaled.
The transformation for MinMaxScaler() is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min, max = feature_range.
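As a sanity check, we can reproduce the scaler's output by applying this formula with plain NumPy (the 5 and 0 below come from our chosen feature_range of (0, 5)):
X_std = (Fare - Fare.min()) / (Fare.max() - Fare.min())
X_scaled = X_std * (5 - 0) + 0
print(X_scaled[:3])   # matches the first three values printed by min_max_scaler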
Now let's move to our other scaling function, MaxAbsScaler. It scales each feature by its maximum absolute value.
First define our maxabs scaler:
maxabs_scaler= preprocessing.MaxAbsScaler()
print(maxabs_scaler)
Output:
MaxAbsScaler(copy=True)
Now transform the data using maxabs_scaler.fit_transform(). The output is stored in another variable called fare_maxabs_scaled:
fare_maxabs_scaled= maxabs_scaler.fit_transform(Fare.reshape(-1,1))
print(fare_maxabs_scaled.reshape(1,-1))  # reshape is just for a pretty view
Output:
[[ 0.02756654  0.27103916  0.03013308  0.20190114  0.03060837  0.03216084
   0.19719582  0.08013308  0.04233194  0.11433764  0.0634981   0.10095057
   0.03060837  0.11891635  0.02986388  0.0608365   0.11074144  0.04942966
   0.06844106  0.02747148  0.09885932  0.04942966  0.03052928  0.13498099
   0.08013308  0.11934411  0.02747148  1.          0.02995894  0.03002205
   0.10540228  0.55711331  0.02946768  0.03992395  0.3124365   0.19771863
   0.02748745  0.03060837  0.06844106  0.04274411]]
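Since every fare here is positive and the maximum is 263.0, each scaled value is simply the fare divided by 263.0; a one-line NumPy check:
print(Fare / np.abs(Fare).max())   # reproduces the MaxAbsScaler output, with 263.0 mapping to 1.0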

Normalization

Normalization is the process of scaling individual samples to have unit norm. The function normalize() provides a quick and easy way to perform normalization on an array-like dataset, using either the l1 or l2 norm.
Check the example below; normalization using normalize():
fare_normalized = preprocessing.normalize(Fare.reshape(1,-1), norm="l1")
Check the output:
print(fare_normalized)
Output:
[[ 0.00586493  0.057665    0.00641097  0.04295552  0.00651209  0.00684239
   0.04195444  0.01704873  0.00900634  0.02432593  0.01350955  0.02147776
   0.00651209  0.02530007  0.0063537   0.01294328  0.02356082  0.01051642
   0.01456119  0.0058447   0.02103284  0.01051642  0.00649526  0.02871791
   0.01704873  0.02539108  0.0058447   0.21275522  0.00637392  0.00638735
   0.02242489  0.11852876  0.0062694   0.00849403  0.0664725   0.04206567
   0.0058481   0.00651209  0.01456119  0.00909403]]
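With the l1 norm, each value in a sample is divided by the sum of absolute values, so our all-positive row must sum to 1 after normalization; a quick check:
print(fare_normalized.sum())   # ~1.0, since all fares are positive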

Binarization

Binarization is the process of thresholding numerical features to get boolean values. It binarizes data (sets feature values to 0 or 1) according to a threshold: values greater than the threshold map to 1, while values less than or equal to the threshold map to 0.
Check the example below. First check what our Fare data looks like:
print(Fare)
Output:
[   7.25     71.2833    7.925    53.1       8.05      8.4583   51.8625
   21.075    11.1333   30.0708   16.7      26.55      8.05     31.275
    7.8542   16.       29.125    13.       18.        7.225    26.       13.
    8.0292   35.5      21.075    31.3875    7.225   263.        7.8792
    7.8958]
Now let's binarize it. Divide it into two groups: fares of 15.0 or less, and fares above 15.0.
Put a threshold value equal to 15.0 so that the binarizer converts fares of 15.0 or less to 0 and fares above 15.0 to 1.
Let's do this.
First define our binarizer:
binarizer = preprocessing.Binarizer(threshold=15.0)
print(binarizer)
Output:
Binarizer(copy=True, threshold=15.0)
Now transform our data and store it in a variable called fare_binarized:
fare_binarized=binarizer.transform(Fare.reshape(-1,1))
print(fare_binarized.reshape(1,-1))
Output:
[[ 0.  1.  0.  1.  0.  0.  1.  1.  0.  1.  1.  1.  0.  1.  0.  1.  1.  0.
   1.  0.  1.  0.  0.  1.  1.  1.  0.  1.  0.  0.  1.  1.  0.  0.  1.  1.
   0.  0.  1.  0.]]
Here, values above 15.0 have been converted to 1 and values less than or equal to 15.0 have been converted to 0.
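The same result can be produced with a plain NumPy comparison, which makes the thresholding explicit:
print((Fare > 15.0).astype(float).reshape(1,-1))   # 1.0 where the fare is above 15.0, else 0.0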

Custom transformers

Sklearn also provides a way to apply custom transformations to data. If a function of your choice isn't built in, you can create your own transformer using FunctionTransformer().
Check the example below.
First define our input:
x=np.array([[121,424],[986,361]])
print(x)
Output:
[[121 424]
 [986 361]]
Let's say we want to take the square root of our input. We will do this with the help of sklearn.
Call FunctionTransformer() with the appropriate function (here we want the square root of the given input, so we will use np.sqrt):
custom_transformer = preprocessing.FunctionTransformer(np.sqrt)
Now transform our input using custom_transformer:
x_transformed= custom_transformer.transform(x)
Let's check what we have transformed:
print(x_transformed)
Output:
[[ 11.          20.59126028]
 [ 31.40063694  19.        ]]
As we can see, the output is the exact square root of the input x. More about custom transformations can be learned from the docs.
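FunctionTransformer also accepts an inverse_func, which lets you undo the transformation later. A small sketch (np.square reverses np.sqrt here):
invertible = preprocessing.FunctionTransformer(func=np.sqrt, inverse_func=np.square)
x_back = invertible.inverse_transform(invertible.transform(x))
print(x_back)   # recovers the original x, up to floating point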
In feature scaling, imputation of missing values and encoding of categorical features are also important. I have written separately on these two topics; you should read them to complete feature scaling.
  1. Imputation of Missing values using Sklearn
  2. Predictive Modeling with Categorical Variable
Thanks for reading.