Saturday, 17 April 2021

Introduction to Python re module

The re module in Python is used to work with regular expressions. It provides functions to search for patterns in strings, and to perform substitutions and splits. Some common functions include:

search(): searches for a match to a pattern in a string
findall(): returns all non-overlapping matches of a pattern in a string
sub(): replaces all occurrences of a pattern in a string with a replacement string
split(): splits a string by a specified pattern

The re module also includes several functions for compiling and working with regular expression patterns, including:

compile(): compiles a regular expression pattern into a pattern object
match(): attempts to match a pattern at the start of a string
fullmatch(): attempts to match a pattern against all of a string
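
The difference between search(), match() and fullmatch() is easiest to see side by side. Here is a small sketch (the sample text and variable names are just for illustration) that applies all three to the same string:

```python
import re

text = "The cat is in the hat"

found_anywhere = re.search("cat", text)          # scans the whole string
at_start = re.match("cat", text)                 # only looks at the start
starts_with_the = re.match("The", text)
whole_is_the = re.fullmatch("The", text)         # must cover the entire string
whole_pattern = re.fullmatch(r"The.*hat", text)

print(bool(found_anywhere))    # True
print(bool(at_start))          # False
print(bool(starts_with_the))   # True
print(bool(whole_is_the))      # False
print(bool(whole_pattern))     # True
```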

Regular expressions are a powerful tool, mostly used to match or find patterns in strings. You can use special characters and sets to define patterns, and groups and flags to modify the behavior of the match.

Regular expressions can be quite complex and hard to read, so it's always a good idea to comment your patterns.
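
One way to comment a pattern is the re.VERBOSE flag, which lets you spread the pattern over several lines and add # comments inside it. A minimal sketch, assuming a made-up date pattern and sample sentence:

```python
import re

# Hypothetical pattern for illustration: a simple YYYY-MM-DD date.
date_pattern = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,  # whitespace and comments inside the pattern are ignored
)

match = date_pattern.search("Posted on 2021-04-17 by the author")
if match:
    print(match.group("year"), match.group("month"), match.group("day"))
    # Output: 2021 04 17
```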

Here are a few examples of how the re module can be used in Python:

Finding all occurrences of a pattern in a string:

 import re

 text = "The cat is in the hat"

 # Find all occurrences of "at" in the text
 matches = re.findall("at", text)

 print(matches) 
 # Output: ['at', 'at']

Replacing all occurrences of a pattern in a string:

 import re

 text = "The cat is in the hat"
 # Replace all occurrences of "cat" with "dog"
 new_text = re.sub("cat", "dog", text)
 print(new_text) 
 # Output: "The dog is in the hat"

Splitting a string by a pattern:

 import re

 text = "The,cat,is,in,the,hat"

 # Split the text by ","
 parts = re.split(",", text)

 print(parts) 
 # Output: ['The', 'cat', 'is', 'in', 'the', 'hat']

Matching a pattern at the start of a string:

 import re

 text = "The cat is in the hat"

 # Check if the text starts with "The"
 match = re.match("The", text)

 if match:
     print("Text starts with 'The'")
 else:
     print("Text does not start with 'The'")

 # Output: Text starts with 'The'

Using groups to extract parts of a match:

 import re

 text = "The cat is in the hat"

 # Capture the characters that come before "at" in each word.
 # With a capturing group, findall() returns the group, not the whole match.
 matches = re.findall(r"(\w+)at", text)

 print(matches) 
 # Output: ['c', 'h']

Using a flag to make the search case-insensitive:

 import re

 text = "The Cat is in the Hat"

 # Find all occurrences of "cat" or "Cat"
 matches = re.findall("cat", text, re.IGNORECASE)

 print(matches) 
 # Output: ['Cat']

Using the search() function to find a match:

 import re

 text = "The cat is in the hat"

 # Search for the first occurrence of "cat"
 match = re.search("cat", text)

 if match:
     print("Found a match:", match.group())
 else:
     print("No match found.")

 # output: Found a match: cat

Using the compile() function to create a pattern object:

 import re

 text = "The cat is in the hat"

 # Compile a regular expression pattern
 pattern = re.compile("cat")

 # Search for the first occurrence of the pattern in the text
 match = pattern.search(text)

 if match:
     print("Found a match:", match.group())
 else:
     print("No match found.")

Using the finditer() function to find all matches and iterate over them:

 import re

 text = "The cat is in the hat. The bat is in the mat."

 # Find all occurrences of "at"
 matches = re.finditer("at", text)

 # Iterate over the matches
 for match in matches:
     print("Found a match:", match.group())

 #Output:
 # Found a match: at
 # Found a match: at
 # Found a match: at
 # Found a match: at

Using the escape() function to escape special characters in a string:

 import re

 # A raw string keeps the backslash as a literal character
 text = r"The .*+?^$[]{}\|() cat is in the hat"

 # Escape special characters in the text (spaces are escaped too)
 escaped_text = re.escape(text)

 print(escaped_text) 
 # Output: The\ \.\*\+\?\^\$\[\]\{\}\\\|\(\)\ cat\ is\ in\ the\ hat

Using the purge() function to clear the regular expression cache:
```
 import re

 re.purge()
```

These are just a few examples of how the re module can be used in Python. There are many more functions and options available in the re module, so I recommend reading the official documentation for more information and examples: https://docs.python.org/3/library/re.html

I hope these examples help you understand the basics of working with regular expressions in Python.

Saturday, 20 March 2021

Python strptime() vs Python strftime()

The strftime() method returns a string representing a date and time, given a date, time or datetime object. In other words, it converts a date-time object into a formatted string.

Let’s take a look at an example:

    from datetime import datetime
    t = datetime.now()
    print('current-time--->', t)

output:-->

    current-time---> 2023-01-10 23:16:42.511274

Now let's play around with strftime.

    print(t.strftime('%y-%m-%d'))

output:-->

    23-01-10

    print(t.strftime('%Y-%m-%d'))

output:-->

    2023-01-10

    print(t.strftime('%y-%m-%d %H:%M'))

output:-->

    23-01-10 23:16

So basically we can extract any part of a date-time object as a formatted string.

The strptime() method creates a datetime object from a given string. It is used to convert a string into a Python datetime object.

In the same way, let’s look at an example for strptime.

    from datetime import datetime
    t_date = '2023-01-10'
    t_date_obj = datetime.strptime(t_date, '%Y-%m-%d')

output:-->
    datetime.datetime(2023, 1, 10, 0, 0)

Let's add a time as well.

    t_date_time = '2023-01-10T10:10'
    t_date_time_obj = datetime.strptime(t_date_time, '%Y-%m-%dT%H:%M')

output:-->
    datetime.datetime(2023, 1, 10, 10, 10)

Here t_date_obj and t_date_time_obj are now regular Python datetime objects, and you can treat them as such in the rest of your program.

This is really helpful for time formatting, ingesting data in different time formats, date-to-string conversion, etc.
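
The two methods are often used together: parse a string in one format with strptime(), then write it out in another with strftime(). Here is a small sketch, assuming a made-up input string in day/month/year order:

```python
from datetime import datetime

# Hypothetical input in day/month/year order
raw = "10/01/2023 23:16"

# String -> datetime
parsed = datetime.strptime(raw, "%d/%m/%Y %H:%M")

# datetime -> string in a different layout
iso_text = parsed.strftime("%Y-%m-%dT%H:%M")
print(iso_text)  # 2023-01-10T23:16

# A format that does not match the string raises ValueError
try:
    datetime.strptime(raw, "%Y-%m-%d")
except ValueError as exc:
    print("Could not parse:", exc)
```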

Do check out the official Python documentation for the full list of format directives. The first few are:

Directive	Meaning	Example
%a	Weekday as locale’s abbreviated name.	Sun, Mon, …, Sat (en_US);So, Mo, …, Sa (de_DE)
%A	Weekday as locale’s full name.	Sunday, Monday, …, Saturday (en_US);Sonntag, Montag, …, Samstag (de_DE)
%w	Weekday as a decimal number, where 0 is Sunday and 6 is Saturday.	0, 1, …, 6
%d	Day of the month as a zero-padded decimal number.	01, 02, …, 31

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Thursday, 4 March 2021

Python Debugging Techniques for Beginners

Debugging is an important part of the software development process, and Python provides several techniques for beginners to debug their code. Here are some common techniques:

Print statements: One of the simplest and most widely used techniques is to insert print statements in your code to check the values of variables and expressions at different points in the execution. This can help you identify where the problem is occurring and what the state of the program is at that point.
The pdb library: Python includes a built-in library called pdb that provides a command-line interface for debugging. You can insert the statement import pdb; pdb.set_trace() at the point where you want to start debugging, and the program will enter the pdb interactive mode, allowing you to step through the code and inspect variables.
The ipdb library: ipdb is a third-party, improved version of pdb. It offers the same interface as pdb, with added features such as syntax highlighting and tab completion.
The breakpoint() function: Python 3.7 introduced a built-in function, breakpoint(), which by default behaves like pdb.set_trace() but does not require importing pdb.
IDEs and text editors with debugging support: Many IDEs and text editors support debugging Python code, either built in or through plugins, such as PyCharm, VSCode, and Sublime Text. These tools provide a graphical user interface that allows you to set breakpoints, step through code, and inspect variables.
Assert statements: An assert statement checks that a condition is true at runtime and raises an AssertionError if it isn't. You can use this to check that the code is working as expected.
logging: Logging is a way to record information about your code's execution. It can be used to record messages that help you understand what is happening in your code.
try-except block: Using try-except blocks to catch and handle specific errors can help you isolate and fix an issue, and lets you give the user a meaningful message.
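
Several of these techniques fit in a few lines. The sketch below is only an illustration: the function names and the logger name are made up, and it shows logging, assert and try-except working together, with breakpoint() left as a comment.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")
logger = logging.getLogger("debug_demo")  # hypothetical logger name


def average(values):
    # logging: record what the function received
    logger.debug("average() called with %r", values)
    # assert: fail fast if an assumption does not hold
    assert len(values) > 0, "values must not be empty"
    return sum(values) / len(values)


def safe_ratio(a, b):
    # try-except: handle one specific, expected error
    try:
        return a / b
    except ZeroDivisionError:
        logger.error("safe_ratio(%r, %r): division by zero", a, b)
        return None


print(average([2, 4, 6]))   # 4.0
print(safe_ratio(1, 0))     # None

try:
    average([])
except AssertionError as exc:
    print("Assertion failed:", exc)

# Uncomment to pause here and inspect variables in pdb (Python 3.7+):
# breakpoint()
```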

While debugging can be a frustrating process, these techniques can help you quickly identify and fix errors in your code, allowing you to move on to the next step in your development process. It's important to try different techniques to find the one that works best for you and your specific needs.

Do comment with your favorite!

Sunday, 28 February 2021

Python’s Inbuilt Modules to Make Life Easier

Python has a number of built-in modules and packages that are included with the standard distribution. Here are some popular ones:

math: A module that provides mathematical functions and constants, such as trigonometric functions, logarithms, and the constant pi.
random: A module that provides various random number generators and probability distributions.
os: A module that provides a way to interact with the operating system, such as navigating the file system, creating and removing directories, and executing shell commands.
time: A module that provides functions for working with time and dates, such as measuring time intervals and parsing and formatting date and time strings.
datetime: A module that provides classes for working with dates and times, such as date, time, datetime, timedelta, etc.
json: A module for working with JSON data. It provides functions for encoding and decoding JSON.
re: A module that provides regular expression matching operations.
sys: A module that provides access to variables and functions used or maintained by the Python interpreter, such as the interpreter version and the command-line arguments passed to a script.
statistics: A module for mathematical statistics functions, such as mean, median and variance.
urllib: A package for opening and reading URLs, with support for the basic HTTP and FTP protocols.
argparse: A module that makes it easy to write user-friendly command-line interfaces.
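
To give a feel for a few of these, here is a short sketch that touches math, statistics, os, datetime and json. The values and the file name are made up for illustration.

```python
import json
import math
import os
import statistics
from datetime import date, timedelta

print(math.sqrt(16))                       # 4.0
print(statistics.mean([2.5, 3.5]))         # 3.0
print(os.path.join("data", "input.csv"))   # separator depends on the OS

# datetime: date arithmetic with timedelta
next_week = date(2021, 2, 28) + timedelta(days=7)
print(next_week.isoformat())               # 2021-03-07

# json: encode a dict to a string and decode it back
encoded = json.dumps({"module": "json", "ok": True})
print(encoded)                             # {"module": "json", "ok": true}
decoded = json.loads(encoded)
print(decoded["module"])                   # json
```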

These are just a few examples of the built-in modules and packages available in Python. Each of them provides a specific set of functionalities that can be used to solve different problems.

Do comment with your favorite module.

Monday, 1 February 2021

Python Tips and Tricks for Beginners - Part 1

Here are a few tips and tricks for Python beginners:

Use list comprehensions instead of for loops: List comprehensions are a concise way to create lists and are often faster than for loops.

They have the syntax: [expression for item in iterable if condition]. Example:
```
original_list = [1, 2, 3, 4, 5]
squared_list = [x**2 for x in original_list]
print(squared_list)
```
output:
```
[1, 4, 9, 16, 25]
```
Use the "with" statement when working with files: The "with" statement is used when working with files and automatically takes care of closing the file after it's been used. This eliminates the need to explicitly call the close() method and can help prevent errors.

Example:
```
with open('example.txt', 'r') as file:
    contents = file.read()
    print(contents)
```
Once the code inside the with block is finished executing, the file will automatically be closed. This ensures that the file is closed even if an exception is raised inside the with block. The same can be used when writing to a file.
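
Writing works the same way. Here is a small sketch, assuming a throwaway file in a temporary directory so that nothing on your system is overwritten:

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory, for illustration
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as file:
    file.write("first line\n")
    file.write("second line\n")

# The file is already closed once the with block has ended
print(file.closed)  # True

with open(path) as file:
    lines = file.read().splitlines()

print(lines)  # ['first line', 'second line']
```
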
Use the built-in function enumerate(): The enumerate() function is used to loop over a list while keeping track of the index of the current item. It returns both the index and the value of the current item.

Example:
```
fruits = ['apple', 'banana', 'orange']
for index, value in enumerate(fruits):
    print(f"{index}: {value}")
```
Use the ternary operator for simple if-else statements: The ternary operator allows you to write simple if-else statements in a single line of code. The syntax is:
```
value_if_true if condition else value_if_false
```
Example:
```
x = 10
y = 20

bigger = x if x > y else y
print(bigger)
```
Output:
```
20
```
A more complex example:
```
age = 20
status = "minor" if age < 18 else "adult" if age < 65 else "senior"
print(status)
```
output:
```
adult
```
Get familiar with modules and packages: Python has a large number of modules and packages, many of which are designed to perform specific tasks. It is important to familiarize yourself with the available modules and packages and to use the ones that are appropriate for your task.

Some modules to start with (read more: Python's Inbuilt Modules for Beginners):
1. random
2. os
3. datetime and time
4. json
Get into the habit of writing documentation: Good documentation is essential for any code you write. This is even more important for beginners, who are still learning the best practices for writing code.
Practice, practice, practice: write more code, and get involved in open-source projects that will help you improve your skills.
Make use of Python tutors and mentors, who can guide you and help you overcome any issues you face while learning.
Get any function's docstring in the IPython shell (or a Jupyter notebook) by adding ? after its name and pressing Enter.

Example:
```
sum?
```

output:

Signature: sum(iterable, /, start=0)
Docstring:
Return the sum of a 'start' value (default: 0) plus an iterable of numbers

When the iterable is empty, return the start value.
This function is intended specifically for use with numeric values and may
reject non-numeric types.
Type:      builtin_function_or_method

By applying these tips and tricks, you can write more efficient and readable Python code and improve your coding skills.

Do check out Python’s official documentation from time to time, and comment with your favorite tricks.

Friday, 29 January 2021

Python Beginners Resource

Regardless of the topic, there is a lot of information on the internet. If you’ve just started or are just about to begin learning Python, you will find a lot of articles, courses, books, and YouTube channels.

But “what is best for you?” is a good question to ask yourself, because if you try to follow everything the internet says, you will end up nowhere. Spending time figuring out “Functional programming vs Parallel vs Async programming” is also a waste of time.

There are many resources available for learning Python; here I am listing some of them for the absolute beginner. No prior programming experience is required, only a little familiarity with computers.

1. Codecademy's "Learn Python" course: This interactive course is a great way to get started with the basics of Python. It includes exercises and quizzes to help you practice and reinforce what you've learned. There is nothing to install on your system; everything runs in the browser.

2. LearnPython.org: This website offers a variety of tutorials and exercises for learning Python, including a beginner-friendly tutorial that covers the basics of the language.

3. https://www.python.org/'s "BeginnersGuide/Programmers" page: This page, provided by the creators of Python, provides a comprehensive guide to the basics of Python programming, as well as links to more advanced resources.

4. "Automate the Boring Stuff with Python" by Al Sweigart: This popular book is aimed at beginners who want to learn how to use Python to automate repetitive tasks. Get the book or read it free online on their website automatetheboringstuff.com/.

I would suggest starting with Codecademy's "Learn Python" course and then checking out the others.

Codecademy’s course gives you a very basic understanding of Python. With resources 2 and 3 you will learn more about Python. In “Automate the Boring Stuff” you will learn some of the things you can do with Python. Checking out the 4th is a must.

Along with learning Python, you can give “Data Structures and Algorithms (DSA)” a try in parallel. It’s not that you can’t learn Python without knowing DSA, but trust me, it will add a lot of value to your programming career. Read more about DSA resources: data-structures-and-algorithms-learning-resource.

Let us know how you are starting your Python journey.

Monday, 3 February 2020

Running a Python or Django Application with Supervisor

Making use of Supervisor

I had been looking around for an example of how to get my Python application working with Supervisor, could not find any good ones, and so decided to write one. It will be brief.

Install supervisor

Go with

pip install supervisor

and it’s done. If not, please check the official documentation.

Configure the Supervisor

Supervisor has 3 components.

supervisord
supervisorctl
Web server

supervisord (the supervisor daemon) runs in the background and processes all your configuration.
supervisorctl is a CLI that can be used to get information about running processes, apply configuration changes, restart the supervisor daemon, etc.
The web server lets you check the status and logs of the running services.

After installing the supervisor run

echo_supervisord_conf

It will show a sample of supervisor config. Read through it to get some insight.

Let’s create a configuration file that can be used to start the supervisor. (You can copy-paste the echo_supervisord_conf output and play around with it.)

In the installation folder (/etc/supervisor on Ubuntu), do the following:

Create a supervisor.conf file (vim supervisor.conf) with the following content.

Let’s go through it one section at a time:

unix_http_server

Start of configuration

; supervisor config file
[unix_http_server]
file=/var/run/supervisor.sock
chmod=0700

supervisord

This section contains the configuration of the supervisor daemon itself. Here we have configured the following:
a. logfile --> path of the file where the supervisor daemon writes its log
b. pidfile --> file in which the daemon's process id is stored
c. childlogdir --> directory for the log files of the child processes run by the supervisor daemon
```
[supervisord]
logfile=/var/log/supervisor/supervisord.log
pidfile=/var/run/supervisord.pid 
childlogdir=/var/log/supervisor 
```
inet_http_server and rpcinterface:supervisor:

This section defines the web interface of the supervisor. Here we are defining the port and the rpcinterface.
```
 [inet_http_server]
 port = 127.0.0.1:9001
 
 [rpcinterface:supervisor]
 supervisor.rpcinterface_factory=supervisor.rpcinterface:make_main_rpcinterface
```
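
With the rpcinterface enabled, the same information is also reachable from Python over XML-RPC. The sketch below is not part of the setup itself. It assumes the inet_http_server above is listening on 127.0.0.1:9001 without authentication, and process_states is a made-up helper name. supervisor.getAllProcessInfo is one of the methods exposed by Supervisor's XML-RPC API.

```python
from xmlrpc.client import Error, ServerProxy

# Assumes the [inet_http_server] section above (127.0.0.1:9001, no password)
SERVER_URL = "http://127.0.0.1:9001/RPC2"


def process_states(url=SERVER_URL):
    """Return {process_name: state_name} for every supervised process."""
    server = ServerProxy(url)
    return {
        info["name"]: info["statename"]
        for info in server.supervisor.getAllProcessInfo()
    }


if __name__ == "__main__":
    try:
        for name, state in process_states().items():
            print(name, state)
    except (OSError, Error) as exc:
        # supervisord is not running, or the web interface is disabled
        print("Could not reach supervisord:", exc)
```
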
supervisorctl

Configuration for supervisorctl. Let’s define the server URL for now.
```
[supervisorctl]
serverurl = http://127.0.0.1:9001
```
Include section

The include section pulls in the additional program/process configuration files that we are going to run using the supervisor.
```
 [include]
 files = /etc/supervisor/conf.d/*.ini
```

Running a Python Script using Supervisor

Create a simple python script (sampleapp.py)

import logging
from random import randint
from time import sleep

def main():
    while True:
        i = randint(1,100)
        # Crash whenever the number is a multiple of 10
        if i in [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]:
            logging.error('Generated {}. Application Crashing'.format(i))
            raise Exception('Application Crashing')
        else:
            print('Generated {}. Sleeping for {}'.format(i, str(i//10)))
        sleep(i//10)

if __name__ == "__main__":
    print('Starting the sample app')
    main()

Now define a sampleapp.ini file:

define a program using [program:program_name]
directory - the directory where your script lives (Supervisor changes to this directory before running the program)
command - what to run

[program:sample_app]
directory=/path_to_script_containing_folder/
command=python3 sampleapp.py

In the include section we specified that /etc/supervisor/conf.d/*.ini will be loaded. So either put this sampleapp.ini file in /etc/supervisor/conf.d or create a symlink to it there.

Now run

supervisord

It will start and run in background.

Now Run

supervisorctl

If everything is all right, you will see a response similar to this and enter the CLI utility of the supervisor.

sajjan# supervisorctl
sample_app   RUNNING   pid 25366, uptime 0:01:34
supervisor> 
supervisor> help
default commands (type help <topic>):
=====================================
add    exit      open  reload  restart   start   tail   
avail  fg        pid   remove  shutdown  status  update 
clear  maintail  quit  reread  signal    stop    version

Play around with it. Now, to check the web interface of the supervisor, visit http://127.0.0.1:9001

Running a Django Application using Supervisor

Similarly, we create a django.ini file for our Django application.

[program:application_name]
directory=/path_to_your_project/
command=/path_to_your_virtual_environment/python3 manage.py runserver

Restart the daemon and it’s done. Play around with the different parameters in the configuration file.

Saturday, 4 August 2018

Python Generator and Iterator

Iterator

An iterator is an object that lets you traverse all the elements of a collection.

In Python, an iterator is an object which implements the iterator protocol.

The iterator protocol consists of two methods.

The __iter__() method, which must return the iterator object,
the __next__() method (named next() in Python 2), which returns the next element from a sequence.

Iterators have several advantages:

can work with infinite sequences
save resources

Python has several built-in objects, for example lists, tuples, strings, dictionaries and files, which support iteration through this protocol: calling iter() on them returns an iterator.
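
Here is a small sketch of the protocol at work on a plain list, using the built-in iter() and next() functions. The list contents are just an illustration.

```python
numbers = [10, 20, 30]

it = iter(numbers)   # calls numbers.__iter__()
print(next(it))      # 10, calls it.__next__()
print(next(it))      # 20
print(next(it))      # 30

# Once the items run out, __next__() raises StopIteration
try:
    next(it)
except StopIteration:
    print("No more items")

# A for loop performs the same steps behind the scenes
for n in numbers:
    print(n)
```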

## Define Custom Iterator using Iterator Protocol.

class krange(object):

    def __init__(self, n):
        self.i = 0
        self.n = n

    def __iter__(self):
        # An iterator returns itself from __iter__()
        return self

    def __next__(self):  # named next() in Python 2
        if self.i < self.n:
            i = self.i
            self.i += 1
            return i
        else:
            # Signal that there are no more items
            raise StopIteration()

k = krange(5)

print(next(k))
print(next(k))
print(next(k))
print(next(k))

Example: Write an iterator class reverse_iter that takes a list and iterates over it in reverse.

class reverse_iter(object):

    def __init__(self, alist):
        self.alist = alist
        self.n = len(alist) - 1

    def __iter__(self):
        return self

    def __next__(self):  # named next() in Python 2
        # >= 0 so that the first element (index 0) is returned too
        if self.n >= 0:
            item = self.alist[self.n]
            self.n -= 1
            return item
        else:
            raise StopIteration()

abc = reverse_iter([1,2,3,4,5])

print(next(abc))
print(next(abc))
print(next(abc))
print(next(abc))

Python Generators

A generator is a special routine that can be used to control the iteration behavior of a loop. It is similar to a function returning an array.
Generators are defined like functions and share their properties: they take parameters and can be called. But unlike functions, which return a whole array, a generator yields one value at a time. This requires less memory.

Generators in Python:

Use the yield keyword, possibly several times
Return an iterator

Generators simplify the creation of iterators. A generator is a function that produces a sequence of results instead of a single value.

So a generator is also an iterator. Here you don’t have to worry about the iterator protocol.

def gen(alist):
    for k in alist:
        yield k

g = gen([1,2,3,4,5])

print(next(g))
print(next(g))
print(g.__next__()) # g.next() in python2
print(g.__next__())
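
To see how much a generator saves you, here is the krange iterator from earlier rewritten as a generator function, plus an infinite sequence that a list could never hold. This is only a sketch: krange_gen and naturals are made-up names.

```python
from itertools import islice


def krange_gen(n):
    """Generator version of the krange iterator (hypothetical name)."""
    i = 0
    while i < n:
        yield i
        i += 1


def naturals():
    """Infinite generator: 0, 1, 2, ..."""
    i = 0
    while True:
        yield i
        i += 1


print(list(krange_gen(5)))           # [0, 1, 2, 3, 4]

# islice() takes only the first three values of the endless sequence
print(list(islice(naturals(), 3)))   # [0, 1, 2]
```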

Can you think about how it is working internally?

When a generator function is called, it returns a generator object without even beginning execution of the function. When next() is called for the first time, the function runs until it reaches a yield statement. The yielded value is returned by the next() call.

Let’s check an example:

def foo():
    print("start Foo")
    for i in range(5):
        print(f"Before Yield i -- {i}")
        yield i
        print(f"After Yield i -- {i}")
    print("End Foo")

f = foo()
print(f.__next__())

start Foo
Before Yield i -- 0
0

print(f.__next__())

After Yield i -- 0
Before Yield i -- 1
1

print(f.__next__())

After Yield i -- 1
Before Yield i -- 2
2

print(f.__next__())

After Yield i -- 2
Before Yield i -- 3
3

print(f.__next__())

After Yield i -- 3
Before Yield i -- 4
4

print(f.__next__())

After Yield i -- 4
End Foo



---------------------------------------------------------------------------

StopIteration                             Traceback (most recent call last)

<ipython-input-72-2f4b6c2bfb23> in <module>
----> 1 print(f.__next__())


StopIteration:

Python generator expression

A generator expression is created with round brackets.

aa = (a for a in range(1000)) # this returns a generator
ab = [b for b in range(100)] # this returns a list

print(next(aa))
print(next(aa))
print(next(aa))
print(next(aa))

print(next(aa))
print(next(aa))
print(next(aa))
print(next(aa))
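
The memory difference can be checked with sys.getsizeof(). This is a rough sketch: the exact byte counts depend on the Python version, and getsizeof() only measures the container, not the integers inside it.

```python
import sys

as_list = [n * n for n in range(1000)]   # all 1000 values exist at once
as_gen = (n * n for n in range(1000))    # values are produced on demand

list_size = sys.getsizeof(as_list)
gen_size = sys.getsizeof(as_gen)
print(list_size > gen_size)   # True on CPython

# Both produce the same values
same_total = sum(as_gen) == sum(as_list)
print(same_total)             # True
```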

Further Reading: http://www.dabeaz.com/generators-uk/

Tuesday, 17 January 2017

Dealing With Missing Values Using Sklearn and Pandas

Real-world data sets contain missing values. These values are encoded as blanks, NaNs or other placeholders. A common approach is to skip the entire row or column containing a missing value when building a machine learning model. However, this comes at the price of losing data.

A better strategy is to fill in the missing values using the known part of the data. The Imputer class provides basic strategies for imputing missing values, using the mean, the median or the most frequent value of the row or column in which the missing values are located. Let's check.

First import our module and grab our data.

import pandas as pd

data= pd.read_csv(r'titanic.csv')

Let's grab a column that has some missing values and have a look.

data['Age'].values[1:30]

Output :

array([ 38.,  26.,  35.,  35.,  nan,  54.,   2.,  27.,  14.,   4.,  58.,
        20.,  39.,  14.,  55.,   2.,  nan,  31.,  nan,  35.,  34.,  15.,
        28.,   8.,  38.,  nan,  19.,  nan,  nan])

Here one can see that missing values are represented as 'nan' (nan in a NumPy array and NaN in a pandas DataFrame). Below are some methods for handling missing values.

Using the sklearn.preprocessing Imputer class
Using the pandas fillna() method

1. Using Sklearn Imputer Class

Import the Imputer class from sklearn.preprocessing and define our imputer. Here we specify the missing_values placeholder, the strategy used to fill in the values, and the axis. (Imputer was removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer instead.)

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imp)

Output

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Here the additional parameters copy and verbose are left at their defaults. Now let's fit our Imputer to our data.

imp.fit(data['Age'].values.reshape(-1,1))

Output:

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

Now transform our data, replacing NaN with the mean value.

age_reformed= imp.transform(data['Age'].values.reshape(-1,1))

Now Check the transformed data.

print(age_reformed[1:10])

Output:

array([[ 38.        ],
       [ 26.        ],
       [ 35.        ],
       [ 35.        ],
       [ 29.69911765],
       [ 54.        ],
       [  2.        ],
       [ 27.        ],
       [ 14.        ]])

Now these "nan" values have been replaced by the mean value (29.69911765). The transformed data can be fed to a machine learning model, as it no longer has missing values.
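
What the mean strategy does can be sketched in plain Python. This is not how sklearn implements it, only an illustration on the first few ages from the array above, so the mean here (37.6) differs from the mean of the full column.

```python
import math
from statistics import mean

# First few ages from the array above, including one missing value
ages = [38.0, 26.0, 35.0, 35.0, float("nan"), 54.0]

# Mean of the known values only
known = [a for a in ages if not math.isnan(a)]
fill = mean(known)
print(fill)     # 37.6

# Replace every NaN with that mean
imputed = [fill if math.isnan(a) else a for a in ages]
print(imputed)  # [38.0, 26.0, 35.0, 35.0, 37.6, 54.0]
```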

2. Using Pandas fillna()

Python's pandas library provides a direct way to deal with missing values.

First create a new DataFrame with a single column, age.

df= pd.DataFrame(columns=['age'])
df['age']= data['Age'].values

The newly created DataFrame has missing values.

print(df.head(10))
# output
 	age
0 	22.0
1 	38.0
2 	26.0
3 	35.0
4 	35.0
5 	NaN
6 	54.0
7 	2.0
8 	27.0
9 	14.0

The pandas fillna method is applied to a DataFrame as

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Now apply this to our data and put a value of 25 wherever NaN occurs.

filled_value=df.fillna(25)
print(filled_value.head(10))

# output
 	age
0 	22.0
1 	38.0
2 	26.0
3 	35.0
4 	35.0
5 	25.0
6 	54.0
7 	2.0
8 	27.0
9 	14.0

Here the value in the sixth row (index 5) has been changed from 'NaN' to 25.0. More information about the default parameters can be found in the docs.

Pandas's DataFrame fillna:-

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Pandas's Series fillna:-

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html

Comment and let me know which one you are using.

Friday, 23 December 2016

Feature Scaling with Python and Sklearn

Sometimes, before applying a machine learning algorithm to your dataset, you first have to pre-process your data. Sklearn provides a package for exactly this, and we will use it here to make our work easy. Pre-processing may include standardization, normalization, binarization, imputation of missing values, etc. Now get ready and start importing!

Importing pandas and numpy:
import pandas as pd 
import numpy as np

Nothing special, Just loading our data

data=pd.read_csv(r'train.csv')
Fare= data['Fare'].values[:40]

Check our input

print(Fare)

Output:

[   7.25     71.2833    7.925    53.1       8.05      8.4583   51.8625
   21.075    11.1333   30.0708   16.7      26.55      8.05     31.275
    7.8542   16.       29.125    13.       18.        7.225    26.       13.
    8.0292   35.5      21.075    31.3875    7.225   263.        7.8792
    7.8958   27.7208  146.5208    7.75     10.5      82.1708   52.
    7.2292    8.05     18.       11.2417]

This is the input data that we are going to scale.
Now let's start: first import preprocessing from sklearn. All the other magic tools live under the preprocessing module.

from sklearn import preprocessing

scale() Function:

The scale function gives a basic overview of pre-processing. It standardizes your data: the mean is subtracted and the result is divided by the standard deviation, so the output has zero mean and unit variance.
Apply the function to our input data (Fare) and store the output in another variable called fare_scaled.

fare_scaled= preprocessing.scale(Fare)

Check the scaled input :

print(fare_scaled)

Output:

[-0.51857041  0.88523827 -0.50377232  0.4866039  -0.50103193 -0.49208073
  0.45947406 -0.2154835  -0.43343642 -0.01826765 -0.31139708 -0.09545451
 -0.50103193  0.00813216 -0.50532447 -0.32674325 -0.03900252 -0.39251257
 -0.28289705 -0.51911849 -0.10751222 -0.39251257 -0.50148793  0.10075727
 -0.2154835   0.01059851 -0.51911849  5.08826339 -0.5047764  -0.50441247
 -0.06978694  2.5346778  -0.50760886 -0.44732033  1.12392606  0.46248848
 -0.51902641 -0.50103193 -0.28289705 -0.43105996]

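The data has been standardized: it now has zero mean and unit variance. Note that the values are not confined to the range -1 to 1; the largest fare (263) maps to about 5.09.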
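
The standardization can be reproduced by hand with the statistics module. This is a sketch on the first five fares only, so the numbers differ from the output above, which uses all 40 values. It uses the population standard deviation (pstdev), the same definition as NumPy's default std().

```python
from statistics import mean, pstdev

# First five fares from the input above
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]

mu = mean(fares)
sigma = pstdev(fares)   # population standard deviation

# Subtract the mean, then divide by the standard deviation
scaled = [(x - mu) / sigma for x in fares]
print(scaled)

# The result has (numerically) zero mean and unit standard deviation,
# but individual values can still exceed 1
print(abs(mean(scaled)) < 1e-9)
print(abs(pstdev(scaled) - 1) < 1e-9)
print(max(scaled) > 1)
```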

Scaling features to a range

Sometimes data should be scaled between a given minimum and maximum value, often between zero and one. This can be achieved using MinMaxScaler or MaxAbsScaler. Let’s try that.
First define our min_max_scaler, and let’s say we want to scale our data to the range 0 to 5.

min_max_scaler= preprocessing.MinMaxScaler(feature_range=(0,5)) 
print(min_max_scaler)

Output:

MinMaxScaler(copy=True, feature_range=(0, 5))

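Now transform our data using min_max_scaler. The scaler expects a 2D array, so Fare is reshaped into a single column with reshape(-1, 1).

fare_transformed = min_max_scaler.fit_transform(Fare.reshape(-1, 1))

print(fare_transformed.reshape(1, -1))  # reshape(1, -1) is only for a more compact view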

Output:

[[  4.88710781e-04   1.25223927e+00   1.36839019e-02   8.96784283e-01
    1.61274558e-02   2.41090802e-02   8.72593099e-01   2.70745773e-01
    7.64011338e-02   4.46599550e-01   1.85221386e-01   3.77773434e-01
    1.61274558e-02   4.70139771e-01   1.22998729e-02   1.71537484e-01
    4.28110644e-01   1.12892190e-01   2.10634347e-01   0.00000000e+00
    3.67021797e-01   1.12892190e-01   1.57208484e-02   5.52731893e-01
    2.70745773e-01   4.72338970e-01   0.00000000e+00   5.00000000e+00
    1.27885837e-02   1.31130877e-02   4.00660737e-01   2.72301437e+00
    1.02629264e-02   6.40211123e-02   1.46507282e+00   8.75281009e-01
    8.21034112e-05   1.61274558e-02   2.10634347e-01   7.85201838e-02]]

The scaled data now lies in the range 0 to 5. We can pass any range to feature_range to control where the scaled data ends up.
The transformation for MinMaxScaler() is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.
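
As a rough check of the formula above, the sketch below applies it by hand with NumPy and compares the result with MinMaxScaler. The names fare_sample, range_min and range_max are illustrative assumptions; the sample is the first five fares from the data above.

```python
import numpy as np
from sklearn import preprocessing

# First five fares, reshaped into one column (illustrative sample)
fare_sample = np.array([7.25, 71.2833, 7.925, 53.1, 8.05]).reshape(-1, 1)
range_min, range_max = 0, 5  # same feature_range as in the text

# Apply the two-step formula by hand
X_std = (fare_sample - fare_sample.min(axis=0)) / (
    fare_sample.max(axis=0) - fare_sample.min(axis=0)
)
manual_scaled = X_std * (range_max - range_min) + range_min

# Same transformation through scikit-learn
scaler = preprocessing.MinMaxScaler(feature_range=(range_min, range_max))
library_scaled = scaler.fit_transform(fare_sample)

print(np.allclose(manual_scaled, library_scaled))  # both approaches agree
```

The smallest value in the sample maps to range_min and the largest to range_max; everything else is placed linearly in between.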

Now let’s move on to our other scaling class, MaxAbsScaler. It scales each feature by dividing it by its maximum absolute value.
First define our maxabs_scaler:

maxabs_scaler = preprocessing.MaxAbsScaler()
print(maxabs_scaler)

Output:

MaxAbsScaler(copy=True)

Now transform the data using maxabs_scaler.fit_transform(). The output is stored in another variable called fare_maxabs_scaled.

fare_maxabs_scaled = maxabs_scaler.fit_transform(Fare.reshape(-1, 1))
print(fare_maxabs_scaled.reshape(1, -1))  # reshape is only for a prettier view

Output:

[[ 0.02756654  0.27103916  0.03013308  0.20190114  0.03060837  0.03216084
   0.19719582  0.08013308  0.04233194  0.11433764  0.0634981   0.10095057
   0.03060837  0.11891635  0.02986388  0.0608365   0.11074144  0.04942966
   0.06844106  0.02747148  0.09885932  0.04942966  0.03052928  0.13498099
   0.08013308  0.11934411  0.02747148  1.          0.02995894  0.03002205
   0.10540228  0.55711331  0.02946768  0.03992395  0.3124365   0.19771863
   0.02748745  0.03060837  0.06844106  0.04274411]]
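
MaxAbsScaler can be reproduced by hand in one line, which is a handy way to see what it does. This is a small sketch on the first five fares; fare_sample is an illustrative name, not part of the original walkthrough.

```python
import numpy as np
from sklearn import preprocessing

# First five fares, reshaped into one column (illustrative sample)
fare_sample = np.array([7.25, 71.2833, 7.925, 53.1, 8.05]).reshape(-1, 1)

# Divide each column by its largest absolute value
manual_scaled = fare_sample / np.abs(fare_sample).max(axis=0)

# Same transformation through scikit-learn
library_scaled = preprocessing.MaxAbsScaler().fit_transform(fare_sample)

print(np.allclose(manual_scaled, library_scaled))  # both approaches agree
```

Because the data is only divided and never shifted, zeros stay zeros, and the largest absolute value in each column becomes exactly 1.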

Normalization

Normalization is the process of scaling individual samples to have unit norm. The function normalize() provides a quick and easy way to perform normalization on an array-like dataset, using either the l1 or l2 norm.
In the example below, Fare is reshaped into a single row so that the whole array is treated as one sample and normalized with the l1 norm:

fare_normalized = preprocessing.normalize(Fare.reshape(1, -1), norm="l1")

Check the output:

print(fare_normalized)

Output:

[[ 0.00586493  0.057665    0.00641097  0.04295552  0.00651209  0.00684239
   0.04195444  0.01704873  0.00900634  0.02432593  0.01350955  0.02147776
   0.00651209  0.02530007  0.0063537   0.01294328  0.02356082  0.01051642
   0.01456119  0.0058447   0.02103284  0.01051642  0.00649526  0.02871791
   0.01704873  0.02539108  0.0058447   0.21275522  0.00637392  0.00638735
   0.02242489  0.11852876  0.0062694   0.00849403  0.0664725   0.04206567
   0.0058481   0.00651209  0.01456119  0.00909403]]
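
To see what the l1 norm does here, the following sketch divides a row by the sum of its absolute values and compares the result with normalize(). It uses the first five fares as a single sample; fare_sample is an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

# First five fares as ONE sample (a single row), illustrative data
fare_sample = np.array([7.25, 71.2833, 7.925, 53.1, 8.05]).reshape(1, -1)

# l1 normalization by hand: divide each row by the sum of its absolute values
manual_l1 = fare_sample / np.abs(fare_sample).sum(axis=1, keepdims=True)

# Same operation through scikit-learn
library_l1 = preprocessing.normalize(fare_sample, norm="l1")

print(np.allclose(manual_l1, library_l1))  # both approaches agree
print(library_l1.sum())                    # entries of the row add up to 1
```

With norm="l2" the row would instead be divided by the square root of the sum of its squared values.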

Binarization

Binarization is the process of thresholding numerical features to get boolean values. It sets feature values to 0 or 1 according to a threshold: values greater than the threshold map to 1, while values less than or equal to the threshold map to 0.
Check the example below. First, let’s look at our Fare data again.

print(Fare)

Output:

[   7.25     71.2833    7.925    53.1       8.05      8.4583   51.8625
   21.075    11.1333   30.0708   16.7      26.55      8.05     31.275
    7.8542   16.       29.125    13.       18.        7.225    26.       13.
    8.0292   35.5      21.075    31.3875    7.225   263.        7.8792
    7.8958   27.7208  146.5208    7.75     10.5      82.1708   52.
    7.2292    8.05     18.       11.2417]

Now let’s binarize it by dividing it into two groups: fares up to 15.0 and fares above 15.0.
Set the threshold to 15.0 so that the binarizer converts fares less than or equal to 15.0 to 0 and fares greater than 15.0 to 1.
First define our binarizer:

binarizer = preprocessing.Binarizer(threshold=15.0)
print(binarizer)

Output:

Binarizer(copy=True, threshold=15.0)

Now transform our data and store it in a variable called fare_binarized.

fare_binarized = binarizer.transform(Fare.reshape(-1, 1))
print(fare_binarized.reshape(1, -1))

Output:

[[ 0.  1.  0.  1.  0.  0.  1.  1.  0.  1.  1.  1.  0.  1.  0.  1.  1.  0.
   1.  0.  1.  0.  0.  1.  1.  1.  0.  1.  0.  0.  1.  1.  0.  0.  1.  1.
   0.  0.  1.  0.]]

Here values above 15.0 are converted to 1, and values less than or equal to 15.0 are converted to 0.
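
The thresholding rule can be written as a plain comparison, which makes the behaviour at the threshold itself easy to check. In this sketch, fare_sample is an illustrative name: four of its values come from the Fare data, and the 15.0 is a hypothetical fare added only to show the boundary case.

```python
import numpy as np
from sklearn import preprocessing

# Four fares from the data above plus a hypothetical fare of exactly 15.0
fare_sample = np.array([7.25, 71.2833, 16.0, 13.0, 15.0]).reshape(-1, 1)

# Binarize by hand: strictly greater than the threshold becomes 1, the rest 0
manual_binarized = (fare_sample > 15.0).astype(float)

# Same operation through scikit-learn (fit does nothing here, the transformer is stateless)
library_binarized = preprocessing.Binarizer(threshold=15.0).fit_transform(fare_sample)

print(library_binarized.ravel().tolist())  # [0.0, 1.0, 1.0, 0.0, 0.0]
```

Note that the value equal to the threshold ends up as 0, because only values strictly greater than the threshold map to 1.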

Custom transformers

Sklearn also supports custom transformations of data. If the transformation you need isn’t provided, you can wrap your own function in a transformer using FunctionTransformer().
Check the example below.
First define our input. Let’s say it is:

import numpy as np

x = np.array([[121, 424], [986, 361]])
print(x)

Output:

[[121 424]
 [986 361]]

Let’s say we want to take the square root of our input. We will do this with the help of sklearn.
Call FunctionTransformer() with the appropriate function (here we want the square root of the input, so we use np.sqrt):

custom_transformer = preprocessing.FunctionTransformer(np.sqrt)

Now transform our input using custom_transformer:

x_transformed = custom_transformer.transform(x)

Let’s check what we have transformed.

print(x_transformed)

Output:

[[ 11.          20.59126028]
 [ 31.40063694  19.        ]]

As we can see, the output is the element-wise square root of the input x. You can learn more about custom transformations from the docs.
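
FunctionTransformer is not limited to NumPy built-ins; you can also pass a function you wrote yourself, optionally together with its inverse. The sketch below is only an illustration: log_plus_one and exp_minus_one are hypothetical helper names, and the log transform is just one possible choice.

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[121, 424], [986, 361]])

def log_plus_one(values):
    """Hypothetical user-defined transformation: log(1 + value), element-wise."""
    return np.log1p(values)

def exp_minus_one(values):
    """Inverse of log_plus_one: exp(value) - 1, element-wise."""
    return np.expm1(values)

# Wrap the function and its inverse in a transformer
log_transformer = preprocessing.FunctionTransformer(
    func=log_plus_one, inverse_func=exp_minus_one
)

x_logged = log_transformer.fit_transform(x)
x_restored = log_transformer.inverse_transform(x_logged)

print(np.allclose(x_restored, x))  # the inverse brings back the original input
```

Supplying inverse_func is optional, but it lets you map transformed values back to the original scale with inverse_transform().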
Alongside feature scaling, imputation of missing values and encoding of categorical features are also important. I have written about these two topics separately; you should read those posts to round out your pre-processing.

Thanks for reading.