Help, my server almost never comes back up...


Norm
11-24-2011, 11:27 PM
Originally posted by Eric:
> This is likely caused by the annoying OVH kernels.

Hi Eric,

I have not tried all of the OVH Linux kernels... but I can absolutely say that the 2.6.38.2-2 grsecurity kernel is inadequate. Whoever configured and compiled it appears to be inexperienced with the PaX/SELinux extensions.

-Norm

Eric
11-20-2011, 12:44 AM
This is likely caused by the annoying OVH kernels.
I find the best way is to install the "VMware Server" OS when reinstalling the server.
Then just uninstall VMware and you're left with a standard CentOS install without all the OVH changes.

Norm
11-12-2011, 06:30 AM
Alright...

I dedicated a few hours to fixing this issue. In case anyone else encounters a similar problem in the future, I will document what I have done.

1.) I used the Netboot Expert mode to boot into Rescue-Pro.

2.) This is actually a very vulnerable time for your server... the root password was just e-mailed in clear text across two continents, an ocean and a dozen routers. I waited for the e-mail to arrive and quickly logged into the rescue shell. I immediately changed the root password. I then began scanning for bad blocks. Using fsck, badblocks and smartctl I was able to see that there were some minor drive problems. Scanning terabyte drives took several hours. Because the bad sectors have now been marked as 'Do Not Use' I might be able to salvage the operating system.

3.) I rebooted from rescue mode with my fingers crossed. Unfortunately the server never came back up... Even though I ran fsck while in rescue mode, I guess it was unable to recover some unreadable sectors.

4.) I went back into the OVH Manager and over to NetBoot and modified my server so that it would boot using the kernel over the network. I then forced a hardware reboot using the service OVH provides. (I am guessing that they are using a remote IP-controlled power supply or something.)

5.) Now I am back inside the server's Linux operating system. However, I am using the network kernel. I obviously do not want to boot from a remote unknown kernel. I ended up rebuilding the /boot partition and downloading a kernel from ftp://ftp.ovh.net/made-in-ovh/bzImage. Using grub and grub-install I was able to rebuild the MBR.

6.) I logged back into the OVH Manager and back to NetBoot. I changed the server settings so that it would go back to booting from the hard drive. I crossed my fingers...

7.) I grabbed an ice cold beer from the fridge... because if this doesn't work, screw it... I'll get inebriated... or, translated into Irish, 'fluthered'.

8.) I rebooted and everything seems to be back to normal. I am able to boot from the hard drive again and so far everything is looking good. I'll add some scripts to monitor the drive for bad blocks and log the S.M.A.R.T. statistics.
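
For anyone wanting to do the same, here is a minimal sketch of the kind of monitoring script mentioned in the last step. It is not Norm's actual script. It assumes smartmontools is installed and that `smartctl -A` prints its usual ten-column attribute table. The watch list, the function names and the sample table are illustrative assumptions, and the sample values are made up rather than taken from the drive in this thread.

```python
import subprocess

# Attributes that commonly point at failing sectors (an assumed watch list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def failing_attributes(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


def read_smart_attributes(device):
    """Run `smartctl -A` on a device such as /dev/sda (needs root)."""
    # smartctl uses a bitmask exit status, so a non-zero code is not treated
    # as a failure of the command itself here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_smart_attributes(result.stdout)


# Made-up sample table, used only to show the parsing.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4521
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(failing_attributes(parse_smart_attributes(SAMPLE)))
```

Run from cron with `read_smart_attributes("/dev/sda")` in place of the sample, anything returned by `failing_attributes` is worth logging and watching for growth over time.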

I'm going to get fluthered anyway. But I guess you already knew that...

-Norm

Norm
11-12-2011, 12:13 AM
Hi,

This is completely unacceptable... over 24 hours have passed and my server is still at runlevel-1 requiring human intervention. I have taken some proactive steps and booted into rescue mode.

root@rescue:~# fsck -fc /dev/sda1
fsck 1.41.3 (12-Oct-2008)
fsck.ext3: /lib/libblkid.so.1: no version information available (required by fsck.ext3)
fsck.ext3: /lib/libuuid.so.1: no version information available (required by fsck.ext3)
e2fsck 1.41.9 (22-Aug-2009)
/: recovering journal
Checking for bad blocks (read-only test): done
/: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/: ***** FILE SYSTEM WAS MODIFIED *****
/: 54430/642112 files (0.8% non-contiguous), 347729/2559744 blocks

The /boot partition apparently had some errors. It is fairly obvious that the hard drive is failing.
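
The transcript above does not show fsck's exit status, but a script that automates this check could use it instead of reading the output by eye. The status is a bitmask, as documented in the fsck(8) man page. The sketch below decodes it; the function and table names are illustrative assumptions.

```python
# Exit status bits as documented in fsck(8).
FSCK_FLAGS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Translate an fsck exit status into a list of human-readable flags."""
    if status == 0:
        return ["no errors"]
    # Several bits can be set at once, so test each one separately.
    return [text for bit, text in FSCK_FLAGS.items() if status & bit]
```

A status with bit 4 set is the case to worry about, since it means fsck could not repair everything it found.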

I have only been a customer here for less than a month and I have spent basically the entire time re-installing the OS over and over again on failing hardware.

jadaka
11-11-2011, 07:21 PM
Really odd. Which server do you have? You might want to boot into rescue mode and run a complete diagnostic.

Norm
11-11-2011, 04:59 AM
Hi,

I have been browsing the Interventions summary and I see that the server went through a "Main HDD replacement" on 2011-11-09 04:43:37 which was just a few days ago. Maybe it is actually other hardware that is failing? Could you guys check into this for me?

[Update]
I also see that my server has an open Incident ticket... number 862169. Looks like it was opened on 2011-11-09 07:07:03 but I am unable to see who opened it or why. I did not create it... maybe it was opened automatically.

Anyway, thanks for looking into this. I'll check back tomorrow.

Norm
11-11-2011, 04:26 AM
Hi,

I feel as if I made a huge mistake by choosing to host my websites here at OVH. It has been a complete nightmare so far...

It takes me several hours to harden/lock-down the Linux operating system. I custom compile and tune my lighttpd and other software. I have been making absolutely zero changes that should affect the booting process...

However... over the past few weeks the server often does not come back up after rebooting. I have been dealing with this by re-installing the operating system and spending several more hours re-installing/re-hardening/re-customizing.

But today is the last straw... I spent 4-5 hours last night configuring the server as I want it... I made sure that it would successfully reboot... and when I had everything perfect... I left my office.

Today... I return and make some minor modifications to the SSH daemon and reboot... server does not come back up and appears to be in Runlevel-1. It responds to ICMP, which tells me that it has booted. However... NONE of the services are running... bind, lighttpd, sshd. This essentially implies that the server is at Runlevel-1 requiring human intervention.
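
The same check (host answers ping, but no service is listening) can be scripted from another machine. This is a minimal sketch that assumes the three services are on their default TCP ports; the port table and function names are illustrative, not taken from the server's configuration.

```python
import socket

# Default TCP ports for the services named above (assumed, not confirmed).
SERVICES = {"sshd": 22, "bind": 53, "lighttpd": 80}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def service_report(host, services=SERVICES):
    """Return {service_name: reachable} for each service in the table."""
    return {name: port_open(host, port) for name, port in services.items()}

# Example call with a placeholder address:
#   service_report("203.0.113.10")
```

If every entry comes back False while ping still answers, the kernel and network are up but nothing past single-user mode has started, which matches the symptom described here.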

I am not going to reinstall the OS again... I need an explanation for why this is occurring. My OVH id is bn33244.

Thanks