The huge volumes of data being generated and collected day in and day out have made it almost impossible for businesses to rely on a purely qualitative analysis of performance and events. Nowadays, virtually everything has a quantity attached to it – be it campaign effectiveness, operational efficiency, human resource productivity, financial management, sales forecasts and so on. In fact, factors or variables that are not measured quantitatively become very difficult for the business to track. Hence, statistics and mathematics form an integral part of data science.
Mathematics and statistics are not the same, however, and the relevance of each to data science is different. The following sections focus on the key difference between the two concepts and drill down into the implications and usage of statistics in data science.
What is the key difference between mathematics & statistics?
The key difference between mathematics and statistics lies in their concepts and usage: mathematical concepts and models suit concrete, well-defined problems involving precise measures, whereas statistics is more relevant for problems involving uncertainty, where conclusions must be drawn from incomplete or noisy data. Hence, the key demarcation between the two can be explained by the concept of ‘confidence level’: a statistical conclusion is stated with a degree of confidence rather than as an exact result.
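To make the idea of a confidence level concrete, here is a minimal Python sketch that estimates a confidence interval for a mean. It is an illustration under stated assumptions, not a prescribed method: the function name `mean_confidence_interval` and the `daily_sales` figures are made up for this example, and the interval uses a normal approximation, which is rough for small samples (a t-distribution would usually be preferred there).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) bounds for the mean using a normal approximation."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    centre = mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return centre - margin, centre + margin


# Hypothetical daily sales figures, invented purely for illustration.
daily_sales = [102, 98, 110, 95, 105, 99, 101, 108, 97, 104]
low, high = mean_confidence_interval(daily_sales, confidence=0.95)
print(f"95% confidence interval for mean daily sales: {low:.1f} to {high:.1f}")
```

Raising the confidence level (say, from 95% to 99%) widens the interval: the more certain the statement has to be, the less precise it becomes.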
Turning to usage in the field of data science: given the complex nature of the problems that businesses face, applying inductive reasoning, making logical assumptions and drawing plausible conclusions are the need of the hour. Therefore, the importance of statistics in data science is extremely high.
Steps in Data Science – Where is statistics relevant?
The stepwise execution of a typical data science problem is outlined below for reference.
These steps apply to most data science projects, from understanding the business problem to finally deploying and delivering custom solutions to businesses. The highlighted portion shows the steps in which statistics becomes imperative. The data modelling exercise (on top of the cleaned dataset), optimization, and the simulation of certain business scenarios all require the use of statistics.
Statistical concepts which every data science professional must know
There are many statistical concepts and frameworks, and they evolve over time; however, there are certain basic statistical concepts which all data science professionals or aspirants must know –
Classification algorithms – Classification techniques are a must for machine learning & data mining exercises. Amongst all classification algorithms, the most frequently used include logistic regression, decision trees and discriminant analysis.
Regression – Linear regression (simple & multiple) is the most fundamental statistical technique used in data science; it primarily models the relationship between a dependent variable and one or more explanatory variables. Note that a fitted relationship does not by itself establish causation.
Probability distributions & estimation theory – All statistical techniques used in data science depend heavily on probability distributions & estimation techniques.
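As one illustration of the regression item in the list above, the following Python sketch fits a simple linear regression by ordinary least squares using only the standard library. The function name `fit_simple_linear_regression` and the advertising and sales numbers are assumptions made up for this example; in practice a statistics or machine learning library would typically be used instead.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: advertising spend versus sales, invented for illustration.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 62, 85, 105]
slope, intercept = fit_simple_linear_regression(ad_spend, sales)
print(f"sales is approximately {slope:.2f} * ad_spend + {intercept:.2f}")
```

The slope estimates how much the dependent variable changes, on average, per unit change in the explanatory variable. Judging how reliable that estimate is relies in turn on the probability and estimation concepts listed above.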