Image by: Unsplash - Amritanshu Sikdar

De-duplication with LVM

selfhosted May 30, 2022

In all honesty, I've never been a fan of LVM, except for incredibly large filesystems, and only then on very old operating systems. I first learned about LVM when I was training for RHCE 5; yep, that's a long time ago.

Back in the olden times, LVM was your best bet for creating or using large filesystems. Later kernels lifted those filesystem size limitations.

But now filesystem sizes no longer pose a problem. Not enough disk space? Just order a new disk from Amazon and move on. But seriously, that's not a sustainable approach.

In the environment where I currently work, we have a few 'repository' systems that hold a lot of data. In the majority of cases, the data is the same content under a different filename, which makes most files redundant. However, we cannot remove them, because installations would fail.
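To get a rough feel for how much of such a repository is duplicated, one option is to group files by checksum. The following is a minimal sketch, not what we actually ran: the function name find_duplicates is made up for illustration, and it assumes GNU find, xargs, sha256sum and uniq are available.

```shell
# Hypothetical helper: print every file under a directory whose SHA-256
# checksum is shared with at least one other file. Groups of identical
# files are separated by an empty line.
find_duplicates() {
    dir="$1"
    find "$dir" -type f -print0 \
        | xargs -0 -r sha256sum \
        | sort \
        | uniq -w64 --all-repeated=separate   # compare only the 64 hex digits
}

# Usage: find_duplicates /path/to/repository
```

Files that show up in the same group are candidates for block-level deduplication, since their content is byte-for-byte identical.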

In an earlier attempt to resolve this, we tried a Java application that could do file de-duplication, but it would only write files out as owned by root, which is not a good thing.


It turns out that as of RedHat 8 there is a solution named lvmvdo.

So let's put it to the test :)

As a test for file de-duplication I have added a second drive to a virtual machine. Once the drive is set up, I will create two directories on it, copy the contents of a CentOS ISO into each, and record the used space. The df command shows the used space as the filesystem sees it. Because deduplication is done at the block level, underneath the filesystem, the filesystem does not know about it. We can, however, see the amount of data 'saved' with a few commands shown below.

Creating the LVM

With dmesg we can verify that the hot-added drive has been detected.

ansible@ansible ~ $ dmesg | tail
[1293943.911414] scsi 0:0:1:0: Direct-Access     VMware   Virtual disk     2.0  PQ: 0 ANSI: 6
[1293943.919149] sd 0:0:1:0: Attached scsi generic sg2 type 0
[1293943.919911] sd 0:0:1:0: [sdb] 209715200 512-byte logical blocks: (107 GB/100 GiB)
[1293943.919941] sd 0:0:1:0: [sdb] Write Protect is off
[1293943.919944] sd 0:0:1:0: [sdb] Mode Sense: 61 00 00 00
[1293943.919974] sd 0:0:1:0: [sdb] Cache data unavailable
[1293943.919977] sd 0:0:1:0: [sdb] Assuming drive cache: write through
[1293943.925458] sd 0:0:1:0: [sdb] Attached SCSI disk
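If you only want the device name out of that output, for example to feed it into a script, the "Attached SCSI disk" lines can be filtered. A small sketch, assuming the kernel message format shown above; the function name attached_scsi_disks is hypothetical.

```shell
# Read kernel log lines on stdin and print the device name (e.g. sdb)
# from every "[sdX] Attached SCSI disk" message.
attached_scsi_disks() {
    sed -n 's/.*\[\([a-z]*\)\] Attached SCSI disk.*/\1/p'
}

# Usage: dmesg | attached_scsi_disks
```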

If it isn't in your case, try sudo partprobe. If that doesn't work, verify that a drive was indeed added and reboot. It should be mentioned that this LVMVDO feature was only added in the RedHat family version 8 and higher.
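On a virtual machine a reboot can often be avoided by asking the SCSI host adapters to rescan their bus through sysfs. A minimal sketch, assuming the standard /sys/class/scsi_host layout; the function name rescan_scsi_hosts and its optional directory argument are hypothetical and only exist so the loop can be tried against a scratch directory first.

```shell
# Write the wildcard "- - -" (any channel, any target, any LUN) to the
# scan file of every SCSI host. Needs root when pointed at the real sysfs.
rescan_scsi_hosts() {
    sysfs_root="${1:-/sys/class/scsi_host}"
    for scan_file in "$sysfs_root"/host*/scan; do
        # Skip the unexpanded glob when no host directories exist.
        [ -e "$scan_file" ] || continue
        echo "- - -" > "$scan_file"
    done
}

# Usage, as root: rescan_scsi_hosts
```

After the rescan, dmesg should show the new disk in the same way as in the output above.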

Install VDO with sudo yum install vdo rsync kmod-kvdo

I'm currently following this document to the letter. It seems that VDO is only applied at the LV level, which means the PV and VG have to be created first.

sudo pvcreate /dev/sdb
sudo vgcreate vg1 /dev/sdb

sudo lvcreate --type vdo --name vdo_repo --size 98GB --virtualsize 150GB vg1

# --virtualsize can be larger than --size (the physical size), creating an 'overbooking'.

sudo mkfs.ext4 -E nodiscard /dev/vg1/vdo_repo
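How far the volume is overbooked follows directly from the two sizes passed to lvcreate. A small sketch of that arithmetic in shell; the function name overbooking_percent is only for illustration, and it rounds down to a whole percent.

```shell
# Print by how many percent the virtual size exceeds the physical size.
# Both arguments are whole numbers in the same unit (here: GB).
overbooking_percent() {
    physical="$1"
    virtual="$2"
    echo $(( (virtual - physical) * 100 / physical ))
}

overbooking_percent 98 150   # prints 53
```

So the 98GB / 150GB combination used here is overbooked by a little over 50%.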

Mounting

Create the mount points and mount the filesystem.

sudo mkdir /mnt/deduptest
sudo mkdir /mnt/iso

sudo mount /dev/vg1/vdo_repo /mnt/deduptest/

# download an iso file and mount it to /mnt/iso

sudo mount -o loop /<path>/CentOS-Stream-8-x86_64-20220525-dvd1.iso /mnt/iso

# create two directories
sudo mkdir -p /mnt/deduptest/{one,two}

Time to put it to the test

Record drive size.

df -h /mnt/deduptest/
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg1-vdo_repo  147G   61M  140G   1% /mnt/deduptest

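Note that the physical size of the disk is 98GB. Overbooking by roughly 50% is generous and possibly even dangerous: if the data deduplicates poorly, the physical space runs out before the filesystem reports itself as full. Depending on what you intend to use the drive for, you should decrease that percentage.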
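Since the filesystem cannot see the physical pool filling up, it may be worth checking the Use% column of the vdostats command (used further down) from a cron job or monitoring check. A minimal sketch, assuming the six-column layout that vdostats --human-readable prints; the function name vdo_usage_over and the threshold of 80 are arbitrary choices for illustration.

```shell
# Read `vdostats --human-readable` output on stdin and print every device
# whose physical Use% is at or above the given threshold.
vdo_usage_over() {
    threshold="$1"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5                  # fifth column is Use%
            sub(/%/, "", use)
            if (use + 0 >= limit) print $1, use "%"
        }'
}

# Usage: sudo vdostats --human-readable | vdo_usage_over 80
```

No output means every VDO pool is below the threshold.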

Rsync

Rsync the data to both directories (one, two).

sudo -i
rsync -aH /mnt/iso/ /mnt/deduptest/one/
vdostats --human-readable /dev/mapper/vg1-vpool0-vpool

#Device                    Size      Used Available Use% Space saving%
#/dev/mapper/vg1-vpool0-vpool     98.0G     14.4G     83.6G  14%           26%

# Do it again
rsync -aH /mnt/iso/ /mnt/deduptest/two/
vdostats --human-readable /dev/mapper/vg1-vpool0-vpool

#Device                    Size      Used Available Use% Space saving%
#/dev/mapper/vg1-vpool0-vpool     98.0G     14.4G     83.6G  14%           58%

While the 22GB copy action is still going, the following commands will show whether it is actually doing deduplication:

sudo lvs -o+vdo_compression,vdo_deduplication
  LV       VG  Attr       LSize  Pool   Origin Data%  Meta%  Move Log Cpy%Sync Convert VDOCompression VDODeduplication
  vdo_repo vg1 vwi-aov--- 97.00g vpool0        10.63                                          enabled          enabled
  vpool0   vg1 dwi------- 98.00g               14.69                                          enabled          enabled


sudo lvs -o+vdo_compression_state,vdo_index_state
  LV       VG  Attr       LSize  Pool   Origin Data%  Meta%  Move Log Cpy%Sync Convert VDOCompressionState VDOIndexState
  vdo_repo vg1 vwi-aov--- 97.00g vpool0        10.63                                   online              online
  vpool0   vg1 dwi------- 98.00g               14.69                                   online              online

Once the two copy actions were complete, it definitely showed a space saving:

Run          Size   Used   Available  Use%  Space saving%
first copy   98.0G  14.4G  83.6G      14%   26%
second copy  98.0G  14.4G  83.6G      14%   58%

So all in all a great success. I can't wait to ask my colleagues to put this to the test.

Enjoy,

Riccardo B.

Riccardo is an all round Linux Systems Engineer with over 20 years of experience and a knack for Automation. Favoring acronyms like NAO, IaC, SRE and more. Also hardly ever writes in third person :)