Data cleaning in data science, machine learning & business analytics is one of the most under-rated tasks. The key fuel to rich analysis & insights is data; and unclean or irrelevant data could lead to business leaders taking uninformed decisions; thereby negatively impacting business performance.
Before getting to the tactical side of things, it is important to understand the coverage of ‘data cleaning’ – i.e. what are the various metrics or factors which are covered under data cleansing exercise.
What is data cleaning?
Data cleaning (or cleansing; can be used interchangeably) refers to the end to end process of detecting & fixing any issues or errors in connection to a dataset. Now, an immediate follow-up question is ‘what are the factors which are covered under data errors?’
There are invariably four aspects which fall under data cleaning activities in data science – data completeness, data correctness, data accuracy and data relevance.
Therefore, data cleaning refers to the series of steps which ensures that the underlying data used for high-end analysis or modelling is complete, correct, accurate and relevant.
8 ways of using data cleaning techniques
As commonly said, a typical data scientist spends as much as 80% of his/her time in cleaning data using various data cleaning techniques & methods. The rest 20% of the time in preparing statistical models or performing value-add analysis.
This is indicative of the fact that once the data is cleaned; analyzing or building models on top of it is a relatively simpler task. With this backdrop, lets discuss on the 8 key ways of using data cleaning techniques –
- Random whitespaces within the data content – This is a common issue with many data structures wherein undesired spaces in the middle tends to distort the meaning of the data. For example – ‘this is a cat’ and ‘this is a cat’ would be considered as two separate data. This can be catered to using a TRIM function which is predefined for such data cleansing
- Remove blank data – Blank data is a serious concern for many analysts as they tend to dilute the overall quality of data. For example – A record wherein 5 out of the 8 fields are blank cannot be used for targeted analysis. These blank data should ideally be treated in the data collection phase only wherein they should design intelligent forms with programmed fields so that it doesn’t accept null values
- Numbers getting converted to text on exporting – Data scientists have often faced issues wherein numerical aggregate functions stop working – this is due to the numbers getting auto-converted into texts. The use of the VALUE method can be help cater to such data issues
- Highlighting erroneous data – In large datasets, there are generally a lot of calculated fields as well wherein the error handling is not done properly. Due to this, there can be data such as #N/A, #VALUE etc. which tends to spoil the data. Also, if these fields are in turn used in any other calculations; it also invariably throws an error. The best way to handle this is to use IFERROR operator and assign a default value to the field in case of any errors in calculation.
- Treatment of duplicates – There are many times when the same data gets persisted multiple times resulting in duplicate entries. This can also lead to errors in case there is a primary key attached to it. To handle this, the ‘remove duplicates’ option in Excel can be used to remove duplicates, and keep a single instance of each record
- Same data in different cases – Imagine a situation wherein a single data is entered in multiple forms such as Tokyo, TOKYO, tokyo etc. While all these entries refer the same; they would be treated as separate. This can be handled in multiple ways – using UCASE or LCASE with data so that all data gets converted to a common case, or handle this during the data entry phase itself
- Irrelevant/inconsistent formatting – This happens mostly in cases wherein data is exported from different platforms. The best way to handle this is t remove all formatting of data coming from different sources; and then place an uniform formatting on the data
- Spelling errors in text data – This is mostly relevant in text data wherein unlike Word or PowerPoint, Excel or other statistical tools may not be able to run spell or grammar checks. In such cases, analysts can press the ‘F7’ key in Excel to perform spell & grammar checks on text data.