Cleaning data for data visualisation

Nikita sharma
3 min read · Oct 18, 2018

This short post explains how to clean data by handling the missing values in a dataframe.

Data cleaning is the process of making sure your data is correct and usable: you identify errors or missing values in the data, then correct or delete them.
Cleaning is the first and most important step in preparing data for visualization, because it ensures the data meets the quality the charts depend on.

We use Python and the pandas library to clean the data in this post.

Reading the data

First, we import the libraries needed to read and load the data.
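The loading code is not shown in this post, so here is a minimal sketch of what it might look like, assuming pandas and a CSV source. The sample rows and the file name mentioned in the comment are made up for illustration; only the column names Age and Embarked come from the examples later in the post.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk. In practice you would
# pass a path, e.g. pd.read_csv("train.csv") (the file name is an assumption).
csv_text = io.StringIO(
    "PassengerId,Age,Embarked\n"
    "1,22,S\n"
    "2,,C\n"
    "3,26,\n"
)

data = pd.read_csv(csv_text)  # empty fields are read as NaN
print(data.head())            # first rows of the loaded dataframe
print(data.shape)             # (number of rows, number of columns)
```

Empty fields in the CSV become NaN in the dataframe, which is exactly the kind of missing data the rest of the post deals with.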

After loading the data, our first step is to check the column names of the dataframe.

They can be listed with the following command:

data.columns

Handle Missing data

In statistics, missing data, or missing values, occur when no data value is stored/provided for the variable in an observation. Missing data is a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Let's check whether the dataframe has missing data:

data.isnull().sum()  # count the missing values in each column
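To make the check concrete, here is a small sketch on made-up sample data (the values are invented; the column names mirror the ones used later in the post). It shows the missing count per column and, as one possible extra, the share of rows affected.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration only.
data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = data.isnull().sum()         # missing values per column
missing_share = data.isnull().mean() * 100  # same, as a percentage of rows

print(missing_count)
print(missing_share)
```

Looking at the percentage as well as the raw count helps when deciding later whether dropping rows is affordable.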

The goal of cleaning operations is to prevent problems caused by missing data that can arise when training a model.

How to deal with missing data

Before trying to deal with missing values in an analysis, we need to understand which variables contain them, and we need to examine the patterns of missing values. In the output above we can see two kinds of columns with missing data: one is numeric, whereas the other is of object type.
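One way to see which variables contain missing values, and what type they are, is to filter the dtypes by a missing-value mask. This is only a sketch on made-up data, not the dataset used in the post. Note that pandas stores a numeric column containing NaN as float by default.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration only.
data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "C"],
    "Fare": [7.25, 71.28, 7.92],
})

# Boolean mask: True for columns with at least one missing value.
has_missing = data.isnull().any()

# Data types of only those columns.
missing_dtypes = data.dtypes[has_missing]
print(missing_dtypes)
```

Knowing the type matters because the filling strategies below differ: numeric columns can take a mean or median, while object columns need a string or the mode.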

There are several ways to deal with missing data:

Ignore the data row

We can delete (drop) the rows that contain missing or NaN values. However, if the percentage of such rows is high, we throw away a lot of data and the results suffer.

data.dropna(inplace = True)

Sometimes we want to drop a row only if all of its values are NaN:

data.dropna(how='all')  # returns a new dataframe; add inplace=True to modify data itself
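The difference between these options is easiest to see by comparing row counts. Here is a small sketch on made-up data; the subset variant is an extra that the post does not cover, shown because it is often a useful middle ground.

```python
import numpy as np
import pandas as pd

# Made-up sample data: row 2 is entirely missing, row 1 only partly.
data = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

dropped_any = data.dropna()                # drop rows with any missing value
dropped_all = data.dropna(how="all")       # drop rows where every value is missing
dropped_age = data.dropna(subset=["Age"])  # drop rows only if Age is missing

print(len(data), len(dropped_any), len(dropped_all), len(dropped_age))
```

None of these calls uses inplace=True, so the original dataframe stays untouched and the variants can be compared side by side.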

Use fillna() to fill missing values

Instead of deleting the missing values, we can fill them. Missing data can be filled by propagating the last valid value forward, or the next valid value backward, along a Series. A NaN value will remain NaN even after forward or backward filling if there is no valid previous or next value to copy from.

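data.fillna(method='ffill')  # forward fill; newer pandas versions prefer data.ffill()
data.fillna(method='bfill')  # backward fill; newer pandas versions prefer data.bfill()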
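Here is a small sketch of why some values stay NaN after filling. It uses the Series.ffill() and Series.bfill() methods, which do the same thing as the fillna calls above, on a made-up Series.

```python
import numpy as np
import pandas as pd

# Made-up Series with gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # each gap takes the last valid value before it
backward = s.bfill()  # each gap takes the next valid value after it
both = s.ffill().bfill()  # chaining the two closes the remaining gaps

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

The first value survives forward filling because nothing comes before it, and the last value survives backward filling because nothing comes after it. Chaining both is one possible way to close the leftovers.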

Use a constant value to fill in for missing values

This technique is used because sometimes it just doesn’t make sense to try and predict the missing value.

# This fills all the null values with 0.
data.fillna(value=0, inplace=True)

To fill missing data in an object-type column, we can simply use a placeholder string such as 'missing' and handle that value later in code if necessary.

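# Assign the result back; calling fillna(inplace=True) on a selected column
# triggers a chained-assignment warning in recent pandas versions.
data['Embarked'] = data['Embarked'].fillna('missing')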
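When different columns need different constants, fillna also accepts a dictionary that maps column names to fill values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration only.
data = pd.DataFrame({
    "Age": [22.0, np.nan],
    "Embarked": ["S", None],
})

# One constant per column: 0 for the numeric column,
# a placeholder string for the object column.
filled = data.fillna({"Age": 0, "Embarked": "missing"})
print(filled)
```

This avoids writing 0 into a string column or 'missing' into a numeric one, which a single fillna(value=...) call on the whole dataframe would do.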

Replace with mean, median and mode value

The mean is the average of the values in the data set, the median is the middle value when the values are sorted, and the mode is the value that occurs most often. The mode can also be used to fill missing data in object-type columns.

# Option 1: fill with the median
data['Age'] = data['Age'].fillna(data['Age'].median())

# Option 2: fill with the mean (use this instead of the median, not after it)
data['Age'] = data['Age'].fillna(data['Age'].mean())

# For object columns, fill with the mode (the most frequent value)
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
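To see how the three statistics differ, here is a small worked sketch on made-up data. One large Age value pulls the mean up, while the median stays in the middle.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration only.
data = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 70.0],
    "Embarked": ["S", "C", "S", None],
})

age_mean = data["Age"].mean()               # (20 + 30 + 70) / 3, NaN is skipped
age_median = data["Age"].median()           # middle of 20, 30, 70
embarked_mode = data["Embarked"].mode()[0]  # most frequent value

# Fill a copy so the original dataframe stays available for comparison.
filled = data.copy()
filled["Age"] = filled["Age"].fillna(age_median)
filled["Embarked"] = filled["Embarked"].fillna(embarked_mode)
print(filled)
```

When a column has outliers, the median is usually the safer choice of the two numeric options, since a few extreme values do not shift it.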

Finally, we can check that all our missing data has been filled or removed by running data.isnull().sum() again.

Hope this post was useful. Cheers!

Originally published at confusedcoders.com on October 18, 2018.

Written by Nikita sharma

Data Scientist | Python programmer
