De-duplication with LVM
In all honesty, I've never been a fan of LVM, except for incredibly large filesystems, and only then on very old operating systems. I first learned about LVM when I was training for RHCE 5; yep, that's a long time ago.
Back in the olden times, LVM was your best bet for building large filesystems. Later kernels obviously resolved these filesystem limitations.
But now filesystem sizes no longer pose a problem. Not enough disk space? Just order a new disk from Amazon and move on. But seriously, that's not a sustainable situation.
In the environment I currently work in, we have a few 'repository' systems that hold a lot of data. In the majority of cases the data is just the same content under a different filename, which makes most files redundant. However, we cannot remove them because installations would fail.
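To get a rough feel for how much of a repository is "same content, different filename", the files can be grouped by checksum. The sketch below is only an illustration: the function name `count_duplicate_files` and the demo filenames are made up, and it counts whole-file duplicates, whereas block-level deduplication may save more or less than this suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VDO): count files under a directory whose
# content is identical to another file in the same tree.
count_duplicate_files() {
  find "$1" -type f -exec sha256sum {} + |
    awk '{ seen[$1]++ }
         END { dup = 0; for (h in seen) dup += seen[h] - 1; print dup }'
}

# Demonstration on throwaway data: two names with one content, plus one unique file.
demo_dir=$(mktemp -d)
printf 'same payload\n'  > "$demo_dir/package-1.0.rpm"
printf 'same payload\n'  > "$demo_dir/package-latest.rpm"
printf 'other payload\n' > "$demo_dir/readme.txt"
count_duplicate_files "$demo_dir"   # prints 1
rm -rf "$demo_dir"
```

Pointing the function at a real repository directory gives the number of files that are redundant copies of another file.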
In an earlier attempt to resolve this, we tried a Java application that could do file de-duplication, but the problem was that it would only write files out as owned by root, which is not a good thing.
It turns out that with RedHat 8 there is now a solution named lvmvdo.
So let's put it to the test :)
As a test for file de-duplication I have added a second drive to a virtual machine. On this drive, once set up, I will create two directories, copy the contents of a CentOS ISO into each, and record the used space. The df command will show the used space as the filesystem sees it. Because deduplication is done at the block level, the filesystem will not know about it. We can, however, see the amount of data 'saved' with a few commands shown below.
Creating the LVM
With dmesg we can verify whether the live-added drive has been detected.
ansible@ansible ~ $ dmesg | tail
[1293943.911414] scsi 0:0:1:0: Direct-Access VMware Virtual disk 2.0 PQ: 0 ANSI: 6
[1293943.919149] sd 0:0:1:0: Attached scsi generic sg2 type 0
[1293943.919911] sd 0:0:1:0: [sdb] 209715200 512-byte logical blocks: (107 GB/100 GiB)
[1293943.919941] sd 0:0:1:0: [sdb] Write Protect is off
[1293943.919944] sd 0:0:1:0: [sdb] Mode Sense: 61 00 00 00
[1293943.919974] sd 0:0:1:0: [sdb] Cache data unavailable
[1293943.919977] sd 0:0:1:0: [sdb] Assuming drive cache: write through
[1293943.925458] sd 0:0:1:0: [sdb] Attached SCSI disk
If it isn't in your case, try sudo partprobe.
If that doesn't work, verify that a drive was indeed added and reboot. It should be mentioned, though, that this LVM VDO feature was only added in the RedHat family, version 8 and higher.
Install VDO with sudo yum install vdo rsync kmod-kvdo
I'm currently following this document to the letter. It seems that the VDO part is only added at the LV level, which means the PV and VG should be created first.
sudo pvcreate /dev/sdb
sudo vgcreate vg1 /dev/sdb
sudo lvcreate --type vdo --name vdo_repo --size 98GB --virtualsize 150GB vg1
# virtualsize can be higher than actual size, creating an 'overbooking'.
sudo mkfs.ext4 -E nodiscard /dev/vg1/vdo_repo
Mounting
Create a temporary directory and mount the filesystem.
sudo mkdir /mnt/deduptest
sudo mkdir /mnt/iso
sudo mount /dev/vg1/vdo_repo /mnt/deduptest/
# download an iso file and mount it to /mnt/iso
sudo mount -o loop /<path>/CentOS-Stream-8-x86_64-20220525-dvd1.iso /mnt/iso
# create two directories
mkdir -p /mnt/deduptest/{one,two}
Time to put it to the test
Record drive size.
df -h /mnt/deduptest/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg1-vdo_repo 147G 61M 140G 1% /mnt/deduptest
Note that the physical size of the disk is 98GB. 50% overbooking is generous and possibly even dangerous. Depending on your intention for the drive you should decrease that percentage.
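To put a number on the overbooking, the ratio can be worked out from the two sizes passed to lvcreate. This is a small sketch using shell integer arithmetic; the function name `overbooking_pct` is made up for illustration and is not an LVM or VDO tool.

```shell
# Hypothetical helper: percentage by which the virtual size exceeds the
# physical size. Arguments: physical size, virtual size (same unit, whole numbers).
overbooking_pct() {
  echo $(( ($2 - $1) * 100 / $1 ))
}

# Values from the lvcreate command above: --size 98GB --virtualsize 150GB
overbooking_pct 98 150   # prints 53
```

A result of 0 means no overbooking at all. The higher the number, the more you depend on deduplication and compression actually delivering the savings.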
Rsync
Rsync the data to both directories (one, two).
sudo -i
rsync -aH /mnt/iso/ /mnt/deduptest/one/
vdostats --human-readable /dev/mapper/vg1-vpool0-vpool
#Device Size Used Available Use% Space saving%
#/dev/mapper/vg1-vpool0-vpool 98.0G 14.4G 83.6G 14% 26%
# Do it again
rsync -aH /mnt/iso/ /mnt/deduptest/two/
vdostats --human-readable /dev/mapper/vg1-vpool0-vpool
#Device Size Used Available Use% Space saving%
#/dev/mapper/vg1-vpool0-vpool 98.0G 14.4G 83.6G 14% 58%
While the 22GB copy action is still going, the following commands will show whether it is actually doing deduplication:
sudo lvs -o+vdo_compression,vdo_deduplication
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert VDOCompression VDODeduplication
vdo_repo vg1 vwi-aov--- 97.00g vpool0 10.63 enabled enabled
vpool0 vg1 dwi------- 98.00g 14.69 enabled enabled
sudo lvs -o+vdo_compression_state,vdo_index_state
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert VDOCompressionState VDOIndexState
vdo_repo vg1 vwi-aov--- 97.00g vpool0 10.63 online online
vpool0 vg1 dwi------- 98.00g 14.69 online online
Once the two copy actions were complete, it definitely showed a space saving:
Run | Size | Used | Available | Use% | Space saving% |
---|---|---|---|---|---|
first copy | 98.0G | 14.4G | 83.6G | 14% | 26% |
second copy | 98.0G | 14.4G | 83.6G | 14% | 58% |
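If you want to track the saving over time (from cron or a monitoring check, for example), the last column can be pulled out of the vdostats output with awk. The sketch below assumes the column layout shown above, with "Space saving%" as the final field, which may differ between VDO versions. The function name `space_saving_pct` is made up for illustration, and the sample lines are copied from the second run rather than produced live.

```shell
# Hypothetical helper: read vdostats --human-readable output on stdin and
# print the space saving percentage (without the % sign) for each device line.
space_saving_pct() {
  awk 'NR > 1 && NF >= 6 { sub(/%$/, "", $NF); print $NF }'
}

# Sample input taken from the second run above; on a real system you would
# pipe the output of vdostats into the function instead.
printf '%s\n' \
  'Device                        Size  Used Available Use% Space saving%' \
  '/dev/mapper/vg1-vpool0-vpool 98.0G 14.4G     83.6G  14%           58%' |
  space_saving_pct   # prints 58
```

On the test machine this would be used as: vdostats --human-readable /dev/mapper/vg1-vpool0-vpool | space_saving_pct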
So all in all a great success; I can't wait to ask my colleagues to put this to the test.
Enjoy,