Hello, we have a serious problem on our GSX server (Linux host, SUSE 10.0, kernel 2.6.13-15 SMP on an Athlon 64 X2 with 4 GB RAM and SW-RAID1). When we inserted a CD into the DVD-ROM drive, the following messages appeared in
/var/log/warn:
Feb 28 15:19:30 baum kernel: rtc: lost some interrupts at 256Hz.
Feb 28 15:19:30 baum kernel: rtc: lost some interrupts at 256Hz.
Feb 28 15:19:44 baum udevd[2373]: get_netlink_msg: no ACTION in payload found, skip event 'mount'
Feb 28 15:19:47 baum udevd[2373]: get_netlink_msg: no ACTION in payload found, skip event 'umount'
Feb 28 15:19:47 baum kernel: rtc: lost some interrupts at 256Hz.
Feb 28 15:20:26 baum kernel: rtc: lost some interrupts at 256Hz.
Feb 28 15:20:45 baum kernel: PCI-DMA: Out of IOMMU space for 225280 bytes at device 0000:00:08.0
Feb 28 15:20:45 baum kernel: end_request: I/O error, dev sda, sector 12248446
Feb 28 15:20:45 baum kernel: Operation continuing on 1 devices
Feb 28 15:20:45 baum kernel: PCI-DMA: Out of IOMMU space for 221184 bytes at device 0000:00:08.0
Feb 28 15:20:45 baum kernel: end_request: I/O error, dev sdb, sector 12248454
Feb 28 15:20:45 baum kernel: printk: 118 messages suppressed.
Feb 28 15:20:45 baum kernel: Buffer I/O error on device md0, logical block 1004928
Feb 28 15:20:45 baum kernel: lost page write due to I/O error on md0
Feb 28 15:20:45 baum kernel: Buffer I/O error on device md0, logical block 1004929
Feb 28 15:20:45 baum kernel: lost page write due to I/O error on md0
As you can see, the kernel reported "PCI-DMA: Out of IOMMU space" and, as a result, lost some writes. That broke the SW-RAID1: md0 lost its partition on sda, and md1 lost its partition on sdb.
We have 7 guests running under GSX Server (6 Linux and 1 Windows).
What exactly happened, why did it happen, and how can we prevent it from happening again? We are currently waiting for the SW-RAID to rebuild.
There have been some kernel updates in the last few weeks, and I will apply them first. But I doubt they will solve this; at least, I am not aware of such an error in the kernel.
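
In case it helps anyone comparing against their own logs, here is a minimal sketch in Python that counts the "Out of IOMMU space" and block I/O error messages per device. The regular expressions are only assumptions based on the message format in the excerpt above, and the function name summarize is made up for this example; other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns assumed from the log excerpt in this post; not taken from kernel docs.
IOMMU_RE = re.compile(
    r"PCI-DMA: Out of IOMMU space for (\d+) bytes at device (\S+)"
)
IOERR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize(lines):
    """Count IOMMU exhaustion messages per PCI device and I/O errors per disk."""
    iommu = Counter()
    io_errors = Counter()
    for line in lines:
        match = IOMMU_RE.search(line)
        if match:
            iommu[match.group(2)] += 1  # key: PCI address, e.g. 0000:00:08.0
            continue
        match = IOERR_RE.search(line)
        if match:
            io_errors[match.group(1)] += 1  # key: block device, e.g. sda
    return iommu, io_errors


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "Feb 28 15:20:45 baum kernel: PCI-DMA: Out of IOMMU space for 225280 bytes at device 0000:00:08.0",
    "Feb 28 15:20:45 baum kernel: end_request: I/O error, dev sda, sector 12248446",
    "Feb 28 15:20:45 baum kernel: PCI-DMA: Out of IOMMU space for 221184 bytes at device 0000:00:08.0",
    "Feb 28 15:20:45 baum kernel: end_request: I/O error, dev sdb, sector 12248454",
]

iommu_counts, io_error_counts = summarize(SAMPLE)
print("IOMMU failures per device:", dict(iommu_counts))
print("I/O errors per disk:", dict(io_error_counts))
```

To run it against the real log instead of the sample, pass an open file object for /var/log/warn to summarize in place of SAMPLE. Keep in mind that the kernel rate-limits these messages ("printk: 118 messages suppressed"), so the counts are only a lower bound.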