What is Hadoop?
Apache Hadoop is an open-source framework written in Java for distributed storage and processing of large datasets (also known as big data) across clusters of commodity hardware such as servers. It can handle large volumes of structured, semi-structured, and unstructured data such as images, audio, and video files. Hadoop is not a database, and because its data is largely unstructured it does not use a query language like SQL the way relational databases do. Instead, it uses MapReduce, which allows users to write programs to retrieve and process data. Hadoop is distributed by many vendors, such as Cloudera, that often layer a SQL-like language on top for ease of use.
Understanding Hadoop architecture
Hadoop’s architecture follows a master / worker model and rests on two main components, the Hadoop Distributed File System (HDFS) and MapReduce. These systems work together to store and analyze information.
Hadoop’s Master / Worker Model
In the master / worker model, both the master and each worker run two main components. The master runs a name node and a job tracker. The name node holds metadata about the cluster, such as which files exist and where their blocks are located. The job tracker keeps track of the tasks assigned to each node and coordinates the exchange of information between them.
There are multiple workers in a Hadoop system, and each runs both a data node and a task tracker. The data node stores the actual data, whereas the task tracker executes the tasks assigned to that specific worker.
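As a rough, hypothetical sketch (these are not Hadoop’s actual classes or APIs, and the names are invented for illustration), the division of labor between the name node and the job tracker might look like:

```python
# Hypothetical sketch of the master-side components (not Hadoop's real classes).
class NameNode:
    """Master-side metadata: which data nodes hold which file's blocks."""
    def __init__(self):
        self.block_locations = {}  # file name -> list of data node names

    def register(self, filename, nodes):
        self.block_locations[filename] = nodes

    def locate(self, filename):
        return self.block_locations[filename]

class JobTracker:
    """Master-side scheduling: which worker runs which task."""
    def __init__(self):
        self.assignments = {}  # task id -> worker name

    def assign(self, task_id, worker):
        self.assignments[task_id] = worker

name_node = NameNode()
name_node.register("sales.log", ["worker1", "worker2"])
tracker = JobTracker()
tracker.assign("task-0", "worker1")
```

The point of the split is that the master never touches the data itself: it only knows where blocks live and who is working on what, while the data nodes and task trackers on the workers do the storage and computation.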
2 core technologies of Hadoop
- The HDFS is a distributed file system designed to run on commodity hardware. Data is broken down into blocks and distributed across data nodes. Along with MapReduce, this ensures both scalability and fault tolerance.
- MapReduce is a framework for distributed processing of data on clusters of commodity hardware with two main functions: map and reduce. The map function converts your data into intermediate sets of tuples, such as key-value pairs. Mapping allows a query to be split across parallel nodes and processed in parallel. The reduce function takes the output from the maps and merges the tuples into smaller sets of tuples, so the results can be gathered and delivered. Within the communication architecture, MapReduce performs all the computation on the data: it assigns tasks to individual workers, monitors them, and reruns them in the event of failure.
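The map and reduce phases above can be sketched with the classic word-count example. This is a minimal pure-Python simulation of the model, not Hadoop’s Java API; in a real cluster the shuffle step and the parallelism are handled by the framework:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) key-value tuple for each word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(mapped):
    """Group intermediate tuples by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: merge all values for one key into a smaller result."""
    return (key, sum(values))

lines = ["big data big clusters", "big data"]
mapped = [pair for line in lines for pair in map_fn(line)]
results = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
# results: {'big': 3, 'data': 2, 'clusters': 1}
```

Because each line can be mapped independently and each key can be reduced independently, both phases parallelize naturally across the nodes of a cluster.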
4 Advantages of Hadoop
Hadoop is built for big data demands. It is designed specifically for large datasets, and several of its design elements, such as those noted above, make it a strong choice for your big data needs.
- Linear scalability to meet your data needs. Hadoop is highly scalable: as your data grows, you add commodity hardware to the cluster to match it.
- A cost-effective solution for storing data. Hadoop runs on commodity hardware, which is inexpensive, easy to use, and broadly compatible.
- Highly flexible data types. Hadoop can process both structured and unstructured data, making it adaptable to all your corporate data needs.
- Fault tolerance for the unexpected. By using clusters of multiple servers, data is replicated across data nodes, so if a node fails its data can still be read from another available data node.
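The block-splitting and replication that make fault tolerance possible can be sketched as follows. This is a hypothetical toy model, not the real HDFS API: the tiny block size, the replication factor of 2, and the round-robin placement are simplifications (real HDFS blocks default to 128 MB with three replicas, placed rack-aware):

```python
BLOCK_SIZE = 4   # bytes, tiny for illustration; HDFS blocks default to 128 MB
REPLICATION = 2  # copies of each block; HDFS defaults to 3

def split_into_blocks(data: bytes):
    """Break a file into fixed-size blocks, as HDFS does."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_blocks(blocks, nodes):
    """Assign each block to REPLICATION data nodes (simple round-robin)."""
    return {i: [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
            for i in range(len(blocks))}

def read_block(block_id, placement, failed):
    """Read from the first replica whose node is still up."""
    for node in placement[block_id]:
        if node not in failed:
            return node
    raise IOError("all replicas unavailable")

blocks = split_into_blocks(b"hello hadoop")
placement = place_blocks(blocks, ["node1", "node2", "node3"])
# Block 0 lives on node1 and node2; if node1 fails, node2 serves the read.
survivor = read_block(0, placement, failed={"node1"})
```

Because every block exists on more than one data node, losing a single machine never loses data: reads simply fall back to a surviving replica, and the cluster re-replicates in the background.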