Hadoop DataNodes with Dynamic Storage using LVM

Rohan Parab
Mar 14, 2021

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. This short article covers running Hadoop DataNodes on dynamically allocated storage using Logical Volume Management (LVM).

Hadoop clusters are designed specifically to store and analyze massive amounts of structured and unstructured data in a distributed computing environment. Hadoop clusters consist of a network of connected master and slave nodes running on low-cost commodity hardware, with high availability built in through data replication.

HDFS Architecture:

Logical volume management (LVM) is a form of storage virtualization that offers system administrators a more flexible approach to managing disk storage space than traditional partitioning.

This article focuses on the practical steps for creating an HDFS cluster in which the DataNode's shared directory sits on a volume of a specific size. If needed, the size of that volume can be changed on the fly, so this architecture has near-zero downtime when the shared volume has to be grown or shrunk.

We are using the Oracle VirtualBox hypervisor. For the purpose of this demonstration, we create a single-DataNode architecture. The OS used for both the NameNode and the DataNode is Red Hat Enterprise Linux 8.

Attaching a Virtual Hard Drive to the DataNode

To create dynamic storage, let’s attach a virtual hard drive to the DataNode. The disk we are attaching to the virtual machine is 100 GB.
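The disk can be created and attached from the VirtualBox GUI; roughly the equivalent on the command line is shown below (the VM name datanode1, the controller name "SATA", and the file name are placeholders for this sketch, and --size is given in MB):

$ VBoxManage createmedium disk --filename datanode_disk.vdi --size 102400
$ VBoxManage storageattach datanode1 --storagectl "SATA" --port 1 --device 0 --type hdd --medium datanode_disk.vdi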

The created VDI is now attached to the DataNode.
  • Boot up the machine and check the hard disk.
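A quick way to confirm this is lsblk; here it is assumed that the new 100 GB disk shows up as /dev/sdb (the device name may differ on your machine):

$ lsblk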

Creating the Logical Volume

1. Convert the storage device to a physical volume and display the created physical volume.
$ sudo pvcreate /dev/sdb
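The result can be inspected with pvdisplay:

$ sudo pvdisplay /dev/sdb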

2. Creating and displaying Volume Group

$ sudo vgcreate datanode_lv_vol /dev/sdb
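vgdisplay shows the volume group, including its total size and the number of logical volumes (Cur LV) created from it:

$ sudo vgdisplay datanode_lv_vol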

3. Creating and displaying Logical Volume

A logical volume is created by carving a partition-like slice out of the volume group. We can create any number of logical volumes from a volume group, unlike an MBR-partitioned physical disk, which is limited to four primary partitions.

$ sudo lvcreate --size 50G --name vol_01 datanode_lv_vol
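The new logical volume can be displayed with lvdisplay:

$ sudo lvdisplay /dev/datanode_lv_vol/vol_01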

If we now take a look at the volume group again (sudo vgdisplay datanode_lv_vol), we can see that the Cur LV count is now 1.

4. Format the newly created Logical Volume.

In order to store data on a partition, we first need to format it with a filesystem. We will be using the ext4 filesystem to format the logical volume.

$ sudo mkfs.ext4 /dev/datanode_lv_vol/vol_01

5. Create a directory and mount the logical volume on it

It is assumed that the machine is being configured as a DataNode for the first time. Let’s create a directory to store the data pushed by the client.

$ sudo mkdir -pv /servera_data
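Then mount the logical volume on the new directory (an /etc/fstab entry would make the mount persist across reboots, but that is left out here):

$ sudo mount /dev/datanode_lv_vol/vol_01 /servera_data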

Configure Hadoop to store data in the /servera_data directory
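On the DataNode, point HDFS at the mounted directory in hdfs-site.xml. A minimal sketch of the relevant property, assuming the standard configuration file location for your installation (on Hadoop 1.x the property name is dfs.data.dir; on 2.x and later it is dfs.datanode.data.dir):

<property>
    <name>dfs.datanode.data.dir</name>
    <value>/servera_data</value>
</property>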

Now let’s connect the data node to the cluster.

$ sudo hadoop-daemon.sh start datanode

Check the admin report of the cluster.
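The report can be pulled with dfsadmin (on newer Hadoop releases the equivalent command is hdfs dfsadmin -report):

$ hadoop dfsadmin -report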

Now let’s try to increase the size of the volume to 80 GB.

  • Increase the size of the volume
$ sudo lvextend --size +30G /dev/datanode_lv_vol/vol_01

Size of logical volume datanode_lv_vol/vol_01 changed from 50.00 GiB (12800 extents) to 80.00 GiB (20480 extents).
Logical volume datanode_lv_vol/vol_01 successfully resized.

  • Resize the filesystem
$ sudo resize2fs /dev/datanode_lv_vol/vol_01
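The mounted filesystem should now report roughly 80 GB, which can be confirmed with df:

$ df -h /servera_data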

Let’s check the admin report again.

Check that the size of the storage has increased by almost 30 GB without even restarting the service. resize2fs simply grows the ext4 filesystem onto the newly added extents, so none of the existing data is touched or lost. The volume can also be shrunk in a similar way, although shrinking an ext4 filesystem requires unmounting it first. This is how we create Hadoop DataNodes with dynamic storage. This method is very useful when we are not sure of the scale of the data that the client will push.
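For the shrink path, a rough sketch (the target size of 50G is only an example; the data already on the volume must fit within the new size, and ext4 cannot be shrunk while mounted, so the DataNode is stopped first):

$ sudo hadoop-daemon.sh stop datanode
$ sudo umount /servera_data
$ sudo lvreduce --resizefs --size 50G /dev/datanode_lv_vol/vol_01
$ sudo mount /dev/datanode_lv_vol/vol_01 /servera_data
$ sudo hadoop-daemon.sh start datanode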

Thanks!
