HDFS Architecture
Let's take a look at the HDFS architecture.

HDFS is a foundational piece of the big data stack. HDFS stands for Hadoop Distributed File System, and it is designed to store and manage large volumes of data across a cluster of commodity hardware. Its architecture consists of several components working together to provide a scalable, fault-tolerant, high-throughput storage solution. Here's an overview of the key components:
- NameNode: The NameNode is the master node in the HDFS architecture and is responsible for managing the file system namespace and metadata. It keeps track of the directory tree, file permissions, and file-to-block mappings. The metadata is held in memory for fast access and persisted to disk as the fsimage and edit log. The NameNode does not store the actual data; it only records where each data block is located in the cluster.
- DataNodes: DataNodes are the worker nodes in the HDFS architecture and are responsible for storing the actual data blocks. Each DataNode manages a set of blocks and periodically sends heartbeats and block reports to the NameNode. DataNodes serve read and write requests from clients and replicate blocks to ensure fault tolerance. They also run a block scanner that detects corrupt block replicas so they can be re-replicated.
- Secondary NameNode: The Secondary NameNode is a helper node that periodically merges the NameNode's edit log into the fsimage, producing a fresh checkpoint on disk. This keeps the edit log from growing unbounded and reduces recovery time after a NameNode restart. Despite its name, it is not a standby or backup NameNode.
- Client: Clients interact with the HDFS cluster to read, write, and manipulate files. A client first contacts the NameNode to retrieve file metadata and block locations; once it knows where the blocks live, it communicates directly with the DataNodes to read or write the data itself (a minimal Java sketch follows this list).
- Block: HDFS splits large files into blocks (128 MB by default, often configured to 256 MB) and distributes them across multiple DataNodes in the cluster. Each block is replicated across DataNodes for fault tolerance and data redundancy. The replication factor is configurable and determines how many copies of each block the cluster stores (the second sketch after this list shows how to inspect a file's blocks).
- Rack Awareness: HDFS is rack-aware: it knows the physical network topology of the cluster and places block replicas accordingly. The default placement policy puts the first replica on the writer's node and the remaining replicas on a different rack, balancing fault tolerance against cross-rack write traffic. Because replicas span racks, data remains accessible even after a rack-level failure.
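
To make the client interaction concrete, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode URI (hdfs://namenode-host:8020) and the file path are placeholders for illustration. Note how the client never talks to the NameNode for the bytes themselves: create() and open() fetch metadata from the NameNode, while the data streams directly between the client and the DataNodes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; replace with your NameNode URI.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt"); // hypothetical path

        // Write: the NameNode picks target DataNodes for each block,
        // then the client streams the bytes to those DataNodes directly.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client fetches block locations from the NameNode,
        // then pulls the bytes straight from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```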
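
Building on that, the following sketch (again with a hypothetical NameNode URI and file path) asks the NameNode for a file's block layout and prints which DataNodes hold each replica. The dfs.blocksize and dfs.replication values set here are client-side defaults for newly created files; your cluster's configuration may override them.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // hypothetical
        // Client-side defaults applied to files this client creates.
        conf.set("dfs.blocksize", "134217728"); // 128 MB
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/large-file.bin"); // hypothetical path

        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        // Each entry is one block; getHosts() lists the DataNodes holding
        // its replicas, which rack-aware placement spreads across racks.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```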
Features of HDFS
- It handles very large datasets
- It provides high fault tolerance through block replication
- It delivers high throughput for large, sequential reads and writes
- It follows a write-once, read-many (WORM) access model
- It runs on commodity hardware
- Data is split into blocks and distributed across the cluster