Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.
Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.
Hadoop consists of four main modules:
- Hadoop Distributed File System (HDFS) – A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.
- Yet Another Resource Negotiator (YARN) – Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.
- MapReduce – A framework that helps programs do the parallel computation on data. The map task takes input data and converts it into a dataset that can be computed in key-value pairs. The output of the map task is consumed by reduce tasks to aggregate output and provide the desired result.
- Hadoop Common – Provides common Java libraries that can be used across all modules.
Hadoop makes it easier to use all the storage and processing capacity in cluster servers and to execute distributed processes against huge amounts of data.
Hadoop provides the building blocks on which other services and applications can be built.
Applications that collect data in various formats can place data into the Hadoop cluster by using an API operation to connect to the NameNode. The NameNode tracks the file directory structure and placement of “chunks” for each file, replicated across DataNodes. To run a job to query the data, provide a MapReduce job made up of many maps and reduce tasks that run against the data in HDFS spread across the DataNodes. Map tasks run on each node against the input files supplied, and reducers run to aggregate and organize the final output.
Hadoop Use Cases
Hadoop for Telecom Industry:
Let’s look at a couple of examples from the telecom industry where Big Data and Hadoop solved a critical business problem.
China Mobil Guangdong company has an extensive network of customers. Not only do they have to maintain billions of mobile call records, but it is essential to business that they have real-time access to call records and billing information of the customers.
Traditional database management system couldn’t scale in this scenario, so they came up with a cost-effective solution using Hadoop. With the help of Intel tech, they used HBase to store billions of call record details. As per an estimate, nearly 30 Terabytes of data is added to their database on a monthly basis.
Another example of Big Data management in the telecom industry comes from Nokia. They store and analyse the massive volume of data from their manufactured mobile phones. To paint a fair picture of Nokia’s Big Data, they manage 100 TB of structured data along with 500+ TB of semi-structured data. Hadoop Distributed Framework System provided by Cloudera manages all variety of Nokia’s data and processes it in a scale of petabytes.
Hadoop in Retail Sector:
With the ever-increasing interaction over social media and retail channels, customers are comparing products, services and prices for multiple online and store retailers.
With this kind of behaviour, consumers can quickly shift from one retailer to another, which makes it absolutely essential for companies in the retail sector to tap this information and keep track of the leaks in the sales funnel.
It is necessary for a retail company to tap retail big data analytics, using Hadoop to understand customer’s purchase behaviour.
Big Data and Hadoop are being used in the retail industry for the following use cases:
- Retail analytics for inventory forecasting.
- Retail analytics for dynamic pricing of products.
- Retail analytics for supply chain efficiency.
- Retail analytics for targeted and customized promotion and marketing.
- Retail analytics in fraud detection and prevention.
Hadoop in the Healthcare Sector:
No other industry has benefited from the use of Hadoop as much as the Healthcare industry has.
Healthcare industry leverages Big Data for curing diseases, reducing medical cost, predicting and managing epidemics and maintaining the quality of human life by keeping track of large scale health indexes and metrics.
The big data generated in the healthcare sector is mainly due to patient record keeping and regulatory requirements.
McKinsey projected that efficient usage of Big Data and Hadoop in the healthcare industry can reduce the data warehousing expenses by $300-$500 billion globally. The data generated by electronic health devices is difficult to analyse using traditional database management systems. Complexity and volume of the healthcare data is the primary driving force behind the transition from legacy systems to Hadoop in the healthcare industry. Using Hadoop on such scale of data helps in easy and quick data representation, database design, clinical decision analytics, data querying and fault tolerance.
Hadoop in the Financial Sector:
After the economic recession, most of the financial institutions and national monetary associations started maintaining a single Hadoop Cluster containing more than petabytes of financial data aggregated from multiple enterprise and legacy database systems.
Along with aggregating, banks and financial institutions started pulling in other data sources - such as customer call records, chat and weblogs, email correspondence and others. When such unprecedented scale of data is analysed with the assistance of Hadoop, MapReduce and techniques like sentiment analysis, text processing, pattern matching, graph creating; banks were able to identify the same customers across different sources along with accurate risk assessment.
Predicting market behaviour is another classic problem in the financial sector. Different market sources such as stock exchanges, banks, revenue and proceeds department, securities market - hold a massive volume of data themselves but their performance is interdependent. Hadoop provides the technological backbone by creating a platform where data from such sources can be compiled and processed in real-time.
Morgan Stanley with assets over 350 billion is one of the world’s biggest financial services organizations. It relies on the Hadoop framework to make industry critical investment decisions. Hadoop provides scalability and better results through its administrator and can manage petabytes of data which is not possible with traditional database systems.