Saturday 12 January 2019

F-1 Score, Precision and Recall

Hi,

Initially, when I started to learn machine learning, I was not able to grasp the terms F-1 score, precision and recall, along with False Positive and False Negative (yes, these two are more confusing than True Positive and True Negative).

I read post after post and still could not get it into my head. So one day I sat down and worked it out. Here are the details. Let's start with the abbreviations.
Let's create the basic table which we see in every post.

                Predicted
                Yes    No
Actual   Yes    TP     FN
         No     FP     TN

TP - True Positive   ----> Predicted is Positive and that is True
TN - True Negative   ----> Predicted is Negative and that is True
FP - False Positive  ----> Predicted is Positive and that is False
FN - False Negative  ----> Predicted is Negative and that is False

So in TP/TN/FP/FN, the ending (Positive or Negative) tells us about the prediction, and the initial True or False tells us "Is that prediction correct or not?"

So False Positive means that the prediction is "Positive", and False indicates that it is wrong. For example, a data point belongs to class 0 but our classifier predicted class 1.

The same goes for False Negative. I am leaving that up to you.
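
The four counts can be tallied directly from paired labels. Here is a minimal sketch in plain Python, assuming class 1 means "Yes" and class 0 means "No"; the function name confusion_counts and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN for a binary problem (hypothetical helper)."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # predicted Positive, and that is True
        elif a == positive and p != positive:
            fn += 1   # predicted Negative, and that is False
        elif a != positive and p == positive:
            fp += 1   # predicted Positive, and that is False
        else:
            tn += 1   # predicted Negative, and that is True
    return tp, fn, fp, tn


# Made-up labels, only to exercise the function.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 1, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn)  # 3 1 1 3
```

The second pair (actual 1, predicted 0) is the False Negative, and the fourth pair (actual 0, predicted 1) is the False Positive.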

Now here come Precision and Recall.

Precision

precision = TP/(TP+FP)

          = (Correct Yes predicted by model) / (Total Yes predicted by model [TP+FP])

This tells us "How many did our model predict correctly out of the total positives predicted by the model (TP+FP)?"

Recall

recall = TP/(TP+FN)

       = (Correct Yes predicted by model) / (Total Yes in actual data [TP+FN])

This tells us "How many did our model predict correctly out of the total actual positives in the data (TP+FN)?"


Recall addresses the question: "Given a positive example, will the classifier detect it?"
Precision addresses the question: "Given a positive prediction from the classifier, how likely is it to be correct?"
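
The two formulas can be checked with a few lines of Python. This is only a sketch: the counts are made up, the function name is hypothetical, and returning 0.0 when a denominator is zero is a convention I am assuming here. The post's title mentions the F-1 score but the text above does not spell it out; the usual definition, the harmonic mean of precision and recall, is included for completeness.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # precision = TP / (TP + FP): correct Yes out of all predicted Yes
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall = TP / (TP + FN): correct Yes out of all actual Yes
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-1 = harmonic mean of precision and recall (standard definition)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(p, r, round(f1, 4))  # 0.75 0.6 0.6667
```

Here the model is right on 30 of its 40 positive predictions (precision 0.75) but finds only 30 of the 50 actual positives (recall 0.6).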

The image is just for my personal purpose.

Saturday 4 August 2018

Python Generator and Iterator

Iterator

An iterator is an object which allows us to traverse all the elements of a collection.

In Python, an iterator is an object which implements the iterator protocol.

The iterator protocol consists of two methods.

  1. The __iter__() method, which must return the iterator object,
  2. the __next__() method (named next() in Python 2), which returns the next element from a sequence.

Iterators have several advantages:

  • can work with infinite sequences
  • save resources

Python has several built-in objects, for example lists, tuples, strings, dictionaries and files, which support this protocol: calling iter() on them returns an iterator.
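
A quick sketch of the protocol on a built-in list, using the built-in iter() and next() functions (Python 3). The variable names are only for illustration.

```python
numbers = [10, 20, 30]

it = iter(numbers)   # calls numbers.__iter__() and returns a list iterator
first = next(it)     # calls it.__next__()
second = next(it)
third = next(it)

# Once the elements run out, the iterator signals it with StopIteration.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, third, exhausted)  # 10 20 30 True
```

Note that the list itself is not consumed; only the iterator object keeps track of the position.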

## Define Custom Iterator using Iterator Protocol.

class krange(object):

    def __init__(self, n):
        self.i = 0
        self.n = n

    def __iter__(self):
        # an iterator returns itself from __iter__
        return self

    def __next__(self):  # named next() in Python 2
        if self.i < self.n:
            i = self.i
            self.i += 1
            return i
        else:
            raise StopIteration()

k = krange(5)
print(next(k))  # next(k) calls k.__next__()
print(next(k))
print(next(k))
print(next(k))
0
1
2
3

Example: Write an iterator class reverse_iter that takes a list and iterates over it in reverse order.

class reverse_iter(object):

    def __init__(self, alist):
        self.alist = alist
        self.n = len(alist) - 1  # index of the last element

    def __iter__(self):
        return self

    def __next__(self):  # named next() in Python 2
        # >= 0 so that the element at index 0 is returned too
        if self.n >= 0:
            item = self.alist[self.n]
            self.n -= 1
            return item
        else:
            raise StopIteration()

abc = reverse_iter([1, 2, 3, 4, 5])
print(next(abc))
print(next(abc))
print(next(abc))
print(next(abc))
5
4
3
2
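
The examples above call next by hand. A for loop does the same steps for you: it asks for an iterator, keeps calling next, and stops quietly on StopIteration. Here is a rough sketch of that behaviour; manual_for is a made-up helper, and the built-in reversed() stands in for the reverse_iter class so the snippet runs on its own.

```python
def manual_for(iterable):
    """Collect items the way a for loop would (illustrative helper)."""
    collected = []
    it = iter(iterable)          # step 1: get an iterator
    while True:
        try:
            item = next(it)      # step 2: ask for the next element
        except StopIteration:    # step 3: stop when the iterator is done
            break
        collected.append(item)
    return collected


by_hand = manual_for(reversed([1, 2, 3, 4, 5]))
with_for = [item for item in reversed([1, 2, 3, 4, 5])]
print(by_hand)   # [5, 4, 3, 2, 1]
print(with_for)  # [5, 4, 3, 2, 1]
```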

Python Generators

A generator is a special routine that can be used to control the iteration behavior of a loop. It is similar to a function that returns a list.
Generators are defined like functions and share their properties: they take parameters and can be called. But unlike a function, which returns the whole list at once, a generator yields one value at a time. This requires less memory.

Generators in Python:

  • Use the yield keyword, and may use several yield statements
  • Return an iterator

Generators simplify the creation of iterators. A generator is a function that produces a sequence of results instead of a single value.

Calling a generator function returns a generator object, which is also an iterator. Here you don’t have to worry about the iterator protocol.

def gen(alist):
    for k in alist:
        yield k

g = gen([1, 2, 3, 4, 5])
print(next(g))
print(next(g))
print(g.__next__())  # same as next(g); g.next() in Python 2
print(g.__next__())
1
2
3
4
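
To see how much a generator saves, here is the krange counter from the iterator section sketched as a generator function. The name krange_gen is made up for this comparison; the class needed __init__, __iter__ and a next method, while this version needs none of them.

```python
def krange_gen(n):
    """Yield 0, 1, ..., n-1, like the krange iterator class."""
    i = 0
    while i < n:
        yield i   # hand back one value and pause here
        i += 1    # resume from this line on the following next() call


values = list(krange_gen(5))
print(values)  # [0, 1, 2, 3, 4]

g = krange_gen(2)
# A generator object implements the iterator protocol by itself:
print(iter(g) is g)  # True
```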

Can you think about how it is working internally?

When a generator function is called, it returns a generator object without even beginning execution of the function. When next() is called for the first time, the function starts executing until it reaches a yield statement. The yielded value is returned by the next() call. Each later call resumes right after that yield and runs until the following one.

Let’s check an example:

def foo():
    print("start Foo")
    for i in range(5):
        print(f"Before Yield i -- {i}")
        yield i
        print(f"After Yield i -- {i}")
    print("End Foo")
f = foo()
print(f.__next__())
start Foo
Before Yield i -- 0
0
print(f.__next__())
After Yield i -- 0
Before Yield i -- 1
1
print(f.__next__())
After Yield i -- 1
Before Yield i -- 2
2
print(f.__next__())
After Yield i -- 2
Before Yield i -- 3
3
print(f.__next__())
After Yield i -- 3
Before Yield i -- 4
4
print(f.__next__())
After Yield i -- 4
End Foo



---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-72-2f4b6c2bfb23> in <module>
----> 1 print(f.__next__())

StopIteration: 
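
The traceback above is simply how an exhausted generator reports that it is done. Two common ways to avoid seeing it are sketched below: pass a default as the second argument of the built-in next(), or let a for loop handle StopIteration. The countdown generator is made up for this illustration.

```python
def countdown(n):
    """Yield n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1


c = countdown(2)
first = next(c)
second = next(c)
# With a default, next() returns it instead of raising StopIteration.
after_end = next(c, None)
print(first, second, after_end)  # 2 1 None

# A for loop catches StopIteration internally and just ends.
seen = []
for value in countdown(3):
    seen.append(value)
print(seen)  # [3, 2, 1]
```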

Python generator expression

A generator expression is created with round brackets.

aa = (a for a in range(1000))  # round brackets: this returns a generator
ab = [b for b in range(100)]   # square brackets: this returns a list
print(next(aa))
print(next(aa))
print(next(aa))
print(next(aa))

print(next(aa))
print(next(aa))
print(next(aa))
print(next(aa))
0
1
2
3
4
5
6
7
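
The practical difference between the two bracket styles is when the values are produced. A small sketch, with made-up variable names: the list holds all 1000 results at once, while the generator expression produces them on demand and can be consumed only once. Note that sys.getsizeof measures only the container object, not the integers it refers to, so treat the byte counts as a rough indication.

```python
import sys

squares_list = [n * n for n in range(1000)]   # all values stored now
squares_gen = (n * n for n in range(1000))    # nothing computed yet

list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(gen_bytes < list_bytes)  # True: the generator object stays small

# Functions such as sum() accept a generator directly.
total = sum(squares_gen)
print(total)  # 332833500

# The generator is now exhausted, so a second pass yields nothing.
second_pass = sum(squares_gen)
print(second_pass)  # 0
```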

Further Reading: http://www.dabeaz.com/generators-uk/

Tuesday 17 January 2017

Dealing With Missing Values Using Sklearn and Pandas

Real-world data sets contain missing values. These values are encoded as blanks, NaNs or other placeholders. A common habit is to skip the entire row or column containing a missing value when building a machine learning model. However, this comes at the price of losing data.

A good strategy is to fill in the missing values using the known part of the data. The Imputer class provides basic strategies for imputing missing values, using the mean, the median or the most frequent value of the row or column in which the missing values are located. Let's check.

First, import our module and grab our data.

import pandas as pd

data = pd.read_csv(r'titanic.csv')

Let's grab a column that has some missing values and have a look.

data['Age'].values[1:30]

Output :

array([ 38.,  26.,  35.,  35.,  nan,  54.,   2.,  27.,  14.,   4.,  58.,
        20.,  39.,  14.,  55.,   2.,  nan,  31.,  nan,  35.,  34.,  15.,
        28.,   8.,  38.,  nan,  19.,  nan,  nan])

Here one can see that missing values are represented as 'nan' (nan in a NumPy array and NaN in a pandas DataFrame). Below are some methods by which missing values can be handled.

  1. Using the sklearn.preprocessing Imputer class

  2. Using the pandas fillna() method

1. Using Sklearn Imputer Class

Import the Imputer class from sklearn.preprocessing and define our imputer. Here we specify the missing_values placeholder, the strategy used to fill in the values, and the axis. (This is the API as of scikit-learn 0.19; the class was deprecated in 0.20 and removed in 0.22.)

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imp)

Output

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Here the additional parameters copy and verbose are left at their defaults. Now let's fit our Imputer to our data.

imp.fit(data['Age'].values.reshape(-1,1))

Output:

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Now transform our data, replacing NaN with the mean value.

age_reformed = imp.transform(data['Age'].values.reshape(-1,1))

Now check the transformed data.

print(age_reformed[1:10])

Output:

array([[ 38.        ],
       [ 26.        ],
       [ 35.        ],
       [ 35.        ],
       [ 29.69911765],
       [ 54.        ],
       [  2.        ],
       [ 27.        ],
       [ 14.        ]])

Now the "nan" values have been replaced with the mean value (29.69911765). The transformed data can be fed to a machine learning model, as it no longer has missing values.
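
On scikit-learn 0.22 and later, sklearn.preprocessing.Imputer is no longer available. The same mean imputation can be sketched with sklearn.impute.SimpleImputer, which works column-wise and has no axis argument. The small age array below is typed in by hand (the first ten Age values shown in this post) so the snippet runs without titanic.csv; its mean therefore differs from the 29.699 computed on the full column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# First ten Age values from the post, with the one NaN at index 5.
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0])

# missing_values is the np.nan object here, not the string 'NaN'.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform expects a 2-D array: one column, one row per passenger.
age_filled = imp.fit_transform(ages.reshape(-1, 1))

print(imp.statistics_)     # the learned column mean, 253/9 for this sample
print(age_filled.ravel())  # NaN at index 5 replaced by that mean
```

Other strategies accepted by SimpleImputer include "median" and "most_frequent", matching the options described above.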

2. Using Pandas fillna()

Python's pandas library provides a direct way to deal with missing values.

First create a new data frame with a single column, age.

df = pd.DataFrame(columns=['age'])
df['age'] = data['Age'].values

We have missing values in the newly created DataFrame.

print(df.head(10))
# output
 	age
0 	22.0
1 	38.0
2 	26.0
3 	35.0
4 	35.0
5 	NaN
6 	54.0
7 	2.0
8 	27.0
9 	14.0

The pandas fillna method is applied to a DataFrame as

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Now apply this to our data and put the value 25 wherever NaN occurs.

filled_value = df.fillna(25)
print(filled_value.head(10))

# output
 	age
0 	22.0
1 	38.0
2 	26.0
3 	35.0
4 	35.0
5 	25.0
6 	54.0
7 	2.0
8 	27.0
9 	14.0

Here the value in the 6th row (index 5) has been changed from 'NaN' to 25.0. More information about the default arguments can be found in the docs.

Pandas's DataFrame fillna:-
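
A fixed number like 25 is the simplest choice. Two other common fills are sketched below on a small hand-typed column, so it runs without titanic.csv: the column mean, and the previous valid value (forward fill). The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})

# Fill with the column mean; mean() skips NaN by default.
mean_filled = df["age"].fillna(df["age"].mean())

# Forward fill: copy the last valid value downwards.
forward_filled = df["age"].ffill()

print(mean_filled.tolist())     # [22.0, 38.0, 37.25, 35.0, 37.25, 54.0]
print(forward_filled.tolist())  # [22.0, 38.0, 38.0, 35.0, 35.0, 54.0]
```

Neither call changes df itself; like fillna(25) above, each returns a new object unless inplace=True is passed.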

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Pandas's Series fillna:-

 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html

Comment and let me know what you are using.