Data cleaning in data science, machine learning & business analytics is one of the most under-rated tasks. The key fuel to rich analysis & insights is data, and unclean or irrelevant data could lead to business leaders taking uninformed decisions; thereby negatively impacting business performance.
What is Data cleaning?
Data cleaning (or cleansing; can be used interchangeably) refers to the end-to-end process of detecting & fixing any issues or errors in connection to a dataset. There are invariably four aspects that fall under the activities of data cleaning in data science – data completeness, data correctness, data accuracy, and data relevance.
Therefore, data cleaning refers to the series of steps that ensures that the underlying data used for high-end analysis or modeling is complete, correct, accurate, and relevant.
8 Effective Data Cleaning Techniques
As commonly said, a typical data scientist spends as much as 80% of his/her time cleaning data using various data cleaning techniques & methods. The rest 20% of the time in preparing statistical models or performing value-add analysis.
This is indicative of the fact that once the data is cleaned; analyzing or building models on top of it is a relatively simpler task. With this backdrop, let's discuss the 8 key data cleaning techniques –
1. Random Whitespaces within the Data Content
This is a common issue with many data structures wherein undesired spaces in the middle tend to distort the meaning of the data. For example – ‘this is a cat’ and ‘this is a cat’ would be considered two separate data. This can be catered to using a TRIM function which is predefined for such data cleansing.
2. Remove Blank Data
Blank data is a serious concern for many analysts as they tend to dilute the overall quality of data. For example – a record wherein 5 out of the 8 fields are blank cannot be used for targeted analysis. These blank data should ideally be treated in only the data collection phase wherein they should design intelligent forms with programmed fields so that it doesn’t accept null values.
3. Numbers Getting Converted to Text on Exporting
Data scientists have often faced issues wherein numerical aggregate functions stop working – this is due to the numbers getting auto-converted into texts. The use of the VALUE method can be help cater to such data issues.
4. Highlighting Erroneous Data
In large datasets, there are generally a lot of calculated fields as well wherein the error handling is not done properly. Due to this, there can be data such as #N/A, #VALUE, etc. which tends to spoil the data. Also, if these fields are in turn used in any other calculations; it also invariably throws an error. The best way to handle this is to use the IFERROR operator and assign a default value to the field in case of any errors in calculation.
5. Treatment of Duplicates
There are many times when the same data gets persisted multiple times resulting in duplicate entries. This can also lead to errors in case there is a primary key attached to it. To handle this, the ‘remove duplicates’ option in Excel can be used to remove duplicates and keep a single instance of each record.
6. Same Data in Different Cases
Imagine a situation wherein a single data is entered in multiple forms such as Tokyo, TOKYO, tokyo, etc. While all these entries refer to the same thing; they would be treated as separate. This can be handled in multiple ways – using UCASE or LCASE with data so that all data gets converted to a common case, or handling this during the data entry phase itself.
7. Irrelevant/Inconsistent Formatting
This happens mostly in cases wherein data is exported from different platforms. The best way to handle this is to remove all formatting of data coming from different sources, and then place uniform formatting on the data.
8. Spelling Errors in Text Data
This is mostly relevant in text data wherein unlike Word or PowerPoint, Excel or other statistical tools may not be able to run spell or grammar checks. In such cases, analysts can press the ‘F7’ key in Excel to perform spell & grammar checks on text data.
Here's a video for you to get to know better the concept of Data Cleaning...
To check if data is correct, complete, relevant and accurate, is what is known as the process of data cleaning. It is imperative to businesses' growth to make sure that the data they're using to analyse their content, is clean.
To learn more about this and Data Science, enrol in Board Infinity's Data Science course and gain expertise in all the skills and tools required to become a successful Data Scientist. All with the help of personalised mentoring from top industry experts!
Blog | Board Infinity Newsletter
Join the newsletter to receive the latest updates in your inbox.