CentOS
Debugging a (particular) failing boot service on Linux
Submitted by ckdake on Thu, 2010-01-21 19:51
At work I recently rolled out a newer version of the Dell OpenManage tools, which included for the first time a build of Openwsman. We didn't specifically need this functionality, but it's good to stay current with the OpenManage tools. To load the (unrelated) new kernel on a test machine, I rebooted it using Cobbler's power management functionality on our administrative system, but after 5 minutes the machine was still not responding to pings, so something was broken. I used remote desktop to hop on the one Windows server in our datacenter that we use to get at the interactive consoles of our servers (thankfully the new DRAC6 cards have console applets that work on Macs!) and pulled up the console for this machine.
The boot process was hung on "Starting openwsman" and didn't seem to be doing anything. Doh!
I restarted the machine again, added an "S" to the boot string at the GRUB boot menu to start the system in single user mode, and booted things up. Then "chkconfig openwsman off" to disable the service, and another reboot to get the machine back up and running so I could troubleshoot a little better. I took a look in /etc/init.d/openwsman to see what might be hanging, and nothing immediately looked suspicious. It was a pretty standard init script, with the extra feature of generating OpenSSL certificates if they didn't exist already:
if [ ! -f "/etc/openwsman/serverkey.pem" ]; then
  if [ -f "/etc/ssl/servercerts/servercert.pem" \
       -a -f "/etc/ssl/servercerts/serverkey.pem" ]; then
    echo "Using common server certificate /etc/ssl/servercerts/servercert.pem"
    ln -s /etc/ssl/servercerts/server{cert,key}.pem /etc/openwsman/
  else
    echo "Generating Openwsman server public certificate and private key"
    FQDN=`hostname --fqdn`
    if [ "x${FQDN}" = "x" ]; then
      FQDN=localhost.localdomain
    fi
cat << EOF | sh /etc/openwsman/owsmangencert.sh > /dev/null 2>&1
--
SomeState
SomeCity
SomeOrganization
SomeOrganizationalUnit
${FQDN}
root@${FQDN}
EOF
  fi
fi
Generating certificates from an init script is a little strange, but not unheard of, and it shouldn't cause any problems. (Puppet and Func, two other systems tools we use, generate their certs in the application, which is a lot more common.)
I extracted the only possible culprit from the owsmangencert.sh script and tried running the openssl command manually:
openssl req -days 365 $@ -config /etc/openwsman/ssleay.cnf \
  -new -x509 -nodes -out cert.out \
  -keyout key.out
and it seemed that this was indeed the problem. It just sat there and didn't complete with the speediness I expect from OpenSSL. Time for strace!
cat << EOF | strace openssl req -days 365 -config ./ssleay.cnf.2 -new -x509 -nodes -out cert.out -keyout key.out
--
SomeState
SomeCity
SomeOrganization
SomeOrganizationalUnit
test
root@test
EOF
This ended up doing a long read with output like:
open("/dev/random", O_RDONLY) = 3
read(3, "\323K\372u_ya'\27\266\320\25\22\373\240\330~'\224\310\243\356\225\350.\245\362\3058\230Zb"..., 1024) = 128
read(3, "K\7:\273Zdr\274\25\227\263\366\260U\337Owp\6y\2333c\361\322\334\217\370.k\375]"..., 896) = 128
read(3, "dH\375V\327\230Bi\221\342\326\26R\301v^Qv5f\347\303g7\2747\345\360\207A!\227"..., 768) = 128
read(3, "X&\254r\331\353<:\36!\333\340\353", 640) = 13
read(3, "\357F\27\347\372atf", 627) = 8
read(3, "\231\347\232\362\345\215n\227", 619) = 8
read(3, "\324\304\323\30\325\10G\332", 611) = 8
Looks like /dev/random wasn't returning random data nearly fast enough, which makes a whole lot of sense! The first read asked for 1024 bytes and got 128 back, and after a few reads the kernel was handing out only 13 and then 8 bytes at a time. /dev/random is "good" random data because it is based on environmental entropy and each bit of entropy is only used once, but on a modern multi-core system doing lots of things, there usually isn't much entropy available. That means that while this command would eventually finish, it could take a very long time.
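One way to confirm this kind of starvation is to look at the kernel's own entropy estimate in /proc/sys/kernel/random/entropy_avail. The following is a minimal sketch, not something from the original debugging session: the function name entropy_status, the ENTROPY_FILE override, and the 200-bit threshold are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Report the kernel's entropy estimate and flag it when it looks low.
# ENTROPY_FILE can be overridden (hypothetical knob, handy for testing).
ENTROPY_FILE="${ENTROPY_FILE:-/proc/sys/kernel/random/entropy_avail}"

entropy_status() {
    # $1: minimum number of bits treated as "enough" (illustrative default)
    threshold="${1:-200}"
    bits=$(cat "$ENTROPY_FILE") || return 2
    if [ "$bits" -lt "$threshold" ]; then
        echo "low: $bits bits available"
        return 1
    fi
    echo "ok: $bits bits available"
}
```

Running entropy_status a few times while the openssl command is stuck should show whether the pool is being drained faster than it refills.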
The fix: using /dev/urandom instead. It is "not quite as good" random data because the output may have less entropy than /dev/random, and it reuses internal entropy bits to generate its output, but it's "good enough" for generating cryptographic keys. And it is non-blocking, which means that a caller will never have to wait insane amounts of time for enough "random" data. (See http://en.wikipedia.org/wiki//dev/random for a longer explanation.)
I replaced the two occurrences of /dev/random, one in /etc/openwsman/ssleay.cnf and one in /etc/openwsman/owsmangencert.sh, and initial startup of openwsman (including key generation) became pretty instantaneous. "chkconfig --level 2345 openwsmand on" to turn it back on, and a reboot (after removing the generated keys and certs) to confirm, and the machine booted up as expected. To make this work everywhere, I customized those two config files and added them to our Puppet system so that all Dell servers would get Openwsman set up properly when the update is run globally:
file {
  "ssleay.cnf":
    path   => "/etc/openwsman/ssleay.cnf",
    source => "puppet://$server/dell/ssleay.cnf",
}
file {
  "owsmangencert.sh":
    path   => "/etc/openwsman/owsmangencert.sh",
    source => "puppet://$server/dell/owsmangencert.sh",
}
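For a one-off machine that isn't managed by Puppet, the same /dev/random to /dev/urandom substitution can be scripted. This is a sketch only: the helper name use_urandom is made up for illustration, and the exact lines that mention /dev/random in the two Openwsman files are not shown here, so the function just rewrites every occurrence it finds.

```shell
#!/bin/sh
# Rewrite every /dev/random reference in a file to /dev/urandom.
# Existing /dev/urandom references do not match the pattern, so running
# this twice leaves the file unchanged the second time.
use_urandom() {
    file="$1"
    tmp=$(mktemp) || return 1
    # '#' is the sed delimiter so the slashes in the paths need no escaping.
    # Writing back with cat keeps the original file's owner and mode.
    sed 's#/dev/random#/dev/urandom#g' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

On the machine above this would be called once for /etc/openwsman/ssleay.cnf and once for /etc/openwsman/owsmangencert.sh, after taking a backup copy of each.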
Problem solved and all machines will automatically get the correct fix, so the next time a machine won't finish starting up, it will be a new and different problem to debug.
VLANs in OpenVZ
Submitted by ckdake on Tue, 2008-10-28 11:38
OpenVZ seems to be the hot open source container-based virtualization tool these days. Instead of virtualizing the hardware and letting each guest operating system run its own kernel, as tools like VMware and Xen do, OpenVZ uses operating system level virtualization. While less flexible and less "secure" in some instances, this allows for better performance of the guests due to lower overhead.
I've been tinkering with using OpenVZ for a project to provide rapidly deployable emergency copies of infrastructure for situations where the primary and secondary hardware go down (DNS servers, LDAP servers, etc). OpenVZ meets the need here because it has command line management tools, is low overhead, and these kinds of services don't depend on a specific kernel or hardware stack as much as some others might. The tricky part for me is that some of these services live on separate VLANs.
In this setup, each machine (including the OpenVZ host) has two Gigabit Ethernet interfaces bonded together to two ports on separate switches that are stacked together. This provides higher throughput and prevents interruption of service if a switch, cable, or interface fails. The hosts typically don't know about VLANs and the interfaces on the switches are in access mode, which automatically tags all traffic to the proper VLAN. However, the OpenVZ host will need access to multiple VLANs so that its guest machines can get to the right places on the network, so some things need to change. It will need its own VLAN as well as the VLAN for each guest machine.
First, the switchports are configured to trunk the right VLANs to both of the ports that the OpenVZ host is plugged into. Note that if you do this, you'll lose access to the machine, so make sure you're connected out-of-band to the console! On the switch in a config shell (Cisco IOS example):
# interface Gi1/0/1
# switchport trunk encapsulation dot1q
# switchport trunk allowed vlan 10,20-30
# switchport mode trunk
# interface Gi2/0/1
# switchport trunk encapsulation dot1q
# switchport trunk allowed vlan 10,20-30
# switchport mode trunk
Then the OpenVZ machine is configured to support VLANs by adding kernel modules and creating a new interface. Note that these instructions are for RHEL5/CentOS5:
- Add "modprobe 8021q" and "modprobe vzethdev" to /etc/rc.modules
- chmod +x /etc/rc.modules
- Manually run /etc/rc.modules. It will be automatically run when the system boots
- Reconfigure /etc/sysconfig/network-scripts/ifcfg-bond0 to have no IP or BOOTPROTO information, "ONBOOT=yes" and "MODE=trunk"
- Create /etc/sysconfig/network-scripts/ifcfg-bond0.10 like the following:
DEVICE=bond0.10
IPADDR=10.0.10.2
NETMASK=255.255.255.0
GATEWAY=10.0.10.1
NETWORK=10.0.10.0
BROADCAST=10.0.10.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
VLAN=yes
PHYSDEV=bond0
- Load the interface with "ifcfg bond0.10 && ifup bond0.10"
- make sure that proxy_arp and forwarding are enabled for bond0.10 in /proc/sys/net/ipv4/conf/bond0.10/. If not, you should reconfigure your system to set these by default. Consult your operating system documentation for instructions on this.
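The last step in the list can be checked and applied in one go. The sketch below assumes the two settings live in files named proxy_arp and forwarding under /proc/sys/net/ipv4/conf/<interface>/, as described above; the function name enable_forwarding and the CONF_ROOT override are illustrative, and the change does not survive a reboot.

```shell
#!/bin/sh
# Turn on proxy_arp and forwarding for one interface and print the result.
# CONF_ROOT is overridable (hypothetical knob) so this can be tried safely
# against a scratch directory before touching /proc. Needs root on /proc.
CONF_ROOT="${CONF_ROOT:-/proc/sys/net/ipv4/conf}"

enable_forwarding() {
    iface="$1"
    for knob in proxy_arp forwarding; do
        path="$CONF_ROOT/$iface/$knob"
        # Only write when the current value is not already 1.
        if [ "$(cat "$path")" != "1" ]; then
            echo 1 > "$path" || return 1
        fi
        echo "$iface $knob=$(cat "$path")"
    done
}
```

For the interface created above the call would be enable_forwarding bond0.10.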
Once this is done, you should be able to use this host on the network (on VLAN 10) like nothing changed! If not, make sure routes are set up right, ifconfig looks right, etc. Assuming it works, you're halfway there! Up next is creating an interface for each VLAN you want mapped. Here's an example for /etc/sysconfig/network-scripts/ifcfg-bond0.20 on VLAN 20:
DEVICE=bond0.20
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
VLAN=yes
PHYSDEV=bond0
Note that it doesn't have any IP information. We'll specify this inside of the OpenVZ instance. Next, we actually create a blank OpenVZ instance (you can use an existing one, but this is provided for completeness' sake) and give it an eth0 interface. I'm using 20 as the ID here because this instance will be on VLAN 20, but this is not a requirement.
vzctl create 20 --ostemplate centos-5-x86_64-default-5.2-20081013 --config vps.basic
vzctl set 20 --onboot no --save
vzctl set 20 --hostname vlan20host.local --save
vzctl set 20 --numothersock 120 --save
vzctl set 20 --nameserver 10.0.10.1 --save
vzctl start 20
vzctl set 20 --netif_add eth0 --save
On each host, use "eth0" as the name of the interface. OpenVZ will automatically create the eth0 interface in the guest and an interface like "veth20.0" on the host, where "20" in the name represents the guest ID and the ".0" indicates that this is the default interface for the guest. You could add an eth0.21 interface to the guest with vzctl if you wanted VLAN 21 also piped into the guest, which would create an eth0.21 on the guest and veth20.21 on the host.
Now that it has the interface, enter the instance with "vzctl enter 20" and set up its networking by creating /etc/sysconfig/network-scripts/ifcfg-eth0:
DEVICE=eth0
IPADDR=10.0.20.1
NETMASK=255.255.255.0
GATEWAY=10.0.20.1
NETWORK=10.0.20.0
BROADCAST=10.0.20.255
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
Then "ifcfg eth0 && ifup eth0". You won't be able to send traffic yet, but configuration inside of the guest here is done. Head back to the OpenVZ host and set up a bridge to connect things. Here, we name the bridge so that it's recognizable as this VLAN and add the VLAN interface and OpenVZ host interface to it. Making one bridge per VLAN is probably the right thing to do, and if multiple guests are on the same VLAN, just add their host interfaces to the same bridge.
brctl addbr vzbr20
brctl addif vzbr20 bond0.20
brctl addif vzbr20 veth20.0
ifup vzbr20 0
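When several guests share a VLAN, the bridge commands follow a simple pattern that can be generated rather than typed. The following sketch only prints the commands so they can be reviewed first; the function name bridge_commands is hypothetical, and it assumes the naming used in this article (bridge vzbrN, VLAN interface bond0.N, and a host-side veth<ID>.0 for each guest's eth0).

```shell
#!/bin/sh
# Print (do not run) the brctl commands that build one bridge for a VLAN.
# Usage: bridge_commands VLAN GUEST_ID [GUEST_ID ...]
bridge_commands() {
    vlan="$1"
    shift
    bridge="vzbr${vlan}"
    echo "brctl addbr ${bridge}"
    echo "brctl addif ${bridge} bond0.${vlan}"
    # One addif per guest: veth<ID>.0 is the host side of the guest's eth0.
    for guest in "$@"; do
        echo "brctl addif ${bridge} veth${guest}.0"
    done
}
```

For example, bridge_commands 20 20 21 prints the commands for a VLAN 20 bridge shared by guests 20 and 21; the output can be piped to sh once it looks right.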
Again, make sure that forwarding and proxy_arp are enabled for the bridge and the veth20.0 interface (created by adding eth0 to the guest). And that's it! You should be able to ping the guest's gateway from the guest. If you can't, run a ping from the guest and run tcpdump from the host to see where packets stop, looking at interfaces in this order:
veth20.0
vzbr20
bond0.20
bond0
Whichever one it stops on, make sure you have forwarding and proxy_arp enabled, and make sure that the bridge has the two things it should as members with "brctl show".
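The interface-by-interface check can be automated while a ping runs in the guest. This is a rough sketch under a few assumptions: the function names capture and first_silent_iface are made up, the 5-second wait is arbitrary, the coreutils timeout command is available on the host, and the filter watches ARP as well as ICMP because a failed ARP lookup means no ICMP is ever sent.

```shell
#!/bin/sh
# Succeeds if at least one ICMP or ARP packet shows up on the interface
# within 5 seconds. Needs root for tcpdump.
capture() {
    timeout 5 tcpdump -n -i "$1" -c 1 'icmp or arp' >/dev/null 2>&1
}

# Walk the interfaces in the order given and stop at the first silent one.
first_silent_iface() {
    for iface in "$@"; do
        if capture "$iface"; then
            echo "$iface: traffic seen"
        else
            echo "$iface: no traffic"
            return 1
        fi
    done
}
```

With the setup above, the call would be first_silent_iface veth20.0 vzbr20 bond0.20 bond0, and the last line printed names the interface to look at.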
Each time the VM is rebooted, you'll need to add its host interface to the bridge again and may need to re-enable forwarding and proxy_arp depending on how your OS is configured. If you are using a newer version of OpenVZ (>3.0.22) there is an easy workaround for this on the Veth wiki page. There is a more complex workaround for older OpenVZ versions there as well, but I should be using the newest version of OpenVZ when it's time to deploy this thing, so I'm waiting!
Hopefully you find this useful. Please let me know with a comment on the original article at ckdake.com or via email if you have any suggestions, comments, or corrections!
Helpful References:
http://wiki.openvz.org/Veth
http://wiki.openvz.org/Venet
http://www.howtoforge.com/installing-and-using-openvz-on-centos5.2-p2

