Tuesday, July 31, 2012

Notes on configuring a two-node Proxmox cluster with DRBD-backed storage

We had a task to deploy two new virtualization servers with live migration and highly available data. The latter means that in case of a physical server failure we don't want the faulted VMs to be powered up automagically on the other node, just that we can bring them up by hand within five minutes.

We decided to use Proxmox VE 2 because it's free, because we have experience maintaining Proxmox 1.9 systems, and because it supports live migration without shared storage.

So, we configured two nodes with 4 additional LVM volume groups each: two for VZ data (n1vz, with one LV mounted on /var/lib/vz on the first node, and n2vz, with one LV mounted on /var/lib/vz on the second) and two for KVM disk storage (n1kvm holding the disks of VMs normally running on the first node, n2kvm those of VMs normally running on the second). Four DRBD volumes in primary-primary configuration were created, one backing each of the 4 volume groups. Using a separate pair of DRBD devices for the VMs' disks makes split-brain recovery easier, as explained here. And note that we can't use a DRBD-mirrored (quasi-shared) disk for VZ storage, because the final step of a VZ migration does an "rm -rf" after rsyncing the container's private area.
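For reference, each of the four DRBD resources looked roughly like the sketch below (DRBD 8.3 syntax; the node names, backing partitions and addresses are made-up examples, not our real layout). The important parts are become-primary-on both and allow-two-primaries, which give the primary-primary mode:

# /etc/drbd.d/n1vz.res - illustrative sketch, adjust names, disks and IPs to your setup
resource n1vz {
        protocol C;
        startup {
                become-primary-on both;              # both nodes go primary on startup
        }
        net {
                allow-two-primaries;                 # required for dual-primary operation
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        on node1 {
                device    /dev/drbd0;
                disk      /dev/sdb1;                 # backing partition, example only
                address   10.0.0.1:7788;
                meta-disk internal;
        }
        on node2 {
                device    /dev/drbd0;
                disk      /dev/sdb1;
                address   10.0.0.2:7788;
                meta-disk internal;
        }
}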

In such a configuration we can do live migration of both KVM VMs and VZ containers. We also have a copy of each VM and container for emergencies (failure of one node).
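For instance, a running KVM guest can be moved to the other node with a single command (the VM ID and node name here are just placeholders):

qm migrate 101 node2 --online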

Some difficulties we met were related to LVM and DRBD startup ordering. The first one was the following: LVM scanned the DRBD backing devices (the PV signatures are visible on them too) and locked them, so DRBD couldn't use them. It was solved with a correct filter in lvm.conf. The other one was more difficult. The volume groups n1vz and n2vz, whose physical volumes are available only over DRBD, couldn't be mounted normally - they have to be activated and mounted after the initial system startup. Usually LVM starts first (its init script runs vgchange -ay, activating the volume groups), then DRBD, and only then do the additional VGs appear, with nothing left to activate them.
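As for the first problem, the lvm.conf filter simply tells LVM to ignore the raw backing devices and to look for physical volumes only on the DRBD devices (plus the local system disk). Something along these lines (the device names are examples, adjust them to your layout):

# /etc/lvm/lvm.conf, devices section
filter = [ "a|^/dev/drbd.*|", "a|^/dev/sda.*|", "r|.*|" ]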

To solve the mounting problem we are supposed to use heartbeat. But I am too lazy to study it. So I adopted tools more familiar to me - the automounter (autofs) to mount /var/lib/vz, and udev to make the volume groups available as soon as the drbd* devices appear. I've added the line "/- /etc/auto.direct" to /etc/auto.master and created the /etc/auto.direct file, containing:

/var/lib/vz              -fstype=ext4            :/dev/mapper/n1vz-data
Configuring udev consisted of creating the /etc/udev/rules.d/80-drbd-lvm.rules file, containing:
ACTION=="add|change", SUBSYSTEM=="block",KERNEL=="drbd*", RUN+="/bin/sh -c /sbin/lvm vgscan; /sbin/lvm vgchange -a y'"

I consider this more elegant than just putting "vgchange -a y && mount ..." into rc.local.