Thursday, August 14, 2014

How to replace a faulty primary mirrored root disk on the fly in Solaris with an SVM setup

Here I explain how to replace a faulty primary mirrored root disk on the fly in Solaris under an SVM (Solaris Volume Manager) setup.

A root disk can fail due to either a hardware fault or a software error. If the problem is at the hardware level, such as a disk hard error, there is no way to rectify it other than replacing the faulty disk.

In my case the running/boot disk developed hard errors, so I contacted the vendor to confirm the disk failure and arrange a replacement.

In most cases these days, the Sun/Oracle vendor will not send an FE (field engineer); instead they ship the parts to our DC and we have to arrange the replacement ourselves. This is called the Customer Replaceable Units (CRU) replacement policy.

My case was handled the same way (CRU), so I performed the procedure below to replace the primary mirrored root disk successfully, without rebooting the server.

Technical steps:
1- Identify the faulty disk
2- Completely unconfigure and remove the disk from SVM control, using metadetach, metaclear, and metadb
3- Completely unconfigure the faulty device from the OS using cfgadm
4- Configure the new/replacement device in the OS using cfgadm
5- Reconfigure the disk into SVM using prtvtoc, metadb, metainit, and metattach (a condensed command sketch follows this list)
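For quick reference, here is a condensed sketch of the whole flow, assuming (as in this case) c1t0d0 is the failed half of the mirrors and c1t1d0 is the healthy half; adjust the device and metadevice names to your layout:

    # 1- identify the faulty disk
    iostat -En
    metastat -t | grep Maintenance
    # 2- remove it from SVM control (repeat for each affected mirror/submirror)
    metadetach -f d10 d11
    metadb -d /dev/dsk/c1t0d0s7
    metaclear d11
    # 3- unconfigure the device from the OS
    cfgadm -c unconfigure c1::dsk/c1t0d0
    # ... physically replace the disk ...
    # 4- configure the replacement
    cfgadm -c configure c1::dsk/c1t0d0
    # 5- rebuild SVM on the new disk
    prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
    metadb -afc 3 /dev/rdsk/c1t0d0s7
    metainit d11 1 1 c1t0d0s0
    metattach d10 d11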

Check and identify the status of disks:
Note - we should compare a couple of the outputs below to confirm which disk is faulty.
root@ivmprod /$ echo|format |head
Searching for disks...done

AVAILABLE DISK SELECTIONS:
   0. c1t0d0 (drive type unknown)
      /pci@1c,600000/scsi@2/sd@0,0
   1. c1t1d0 (SEAGATE-ST557703LSUN36G-0307 cyl 24620 alt 2 hd 27 sec 107)
      /pci@1c,600000/scsi@2/sd@1,0
   2. c1t2d0 (SUN36G cyl 24620 alt 2 hd 27 sec 107)
      /pci@1c,600000/scsi@2/sd@2,0
root@ivmprod /$

root@ivmprod /$ iostat -En
c1t0d0          Soft Errors: 881 Hard Errors: 205 Transport Errors: 144
Vendor: SEAGATE  Product: ST557704LSUN36G  Revision: 0307 Serial No: 7378
Size: 36.42GB (36418595328 bytes)
Media Error: 168 Device Not Ready: 0 No Device: 2 Recoverable: 289
Illegal Request: 592 Predictive Failure Analysis: 0
c1t1d0          Soft Errors: 592 Hard Errors: 6 Transport Errors: 0
Vendor: SEAGATE  Product: ST557703LSUN36G  Revision: 0707 Serial No: 7445
Size: 36.42GB (36418595328 bytes)
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 592 Predictive Failure Analysis: 0
root@ivmprod /$
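As an additional check (my addition, not captured above), the kernel normally logs SCSI errors for a failing disk in /var/adm/messages, so grepping for the error text or the device path is a quick way to corroborate the iostat counters:

    grep -i "hard error" /var/adm/messages | tail
    grep "scsi@2/sd@0,0" /var/adm/messages | tail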
Check which disk the OS is currently booted from (optional):
root@ivmprod /$ prtconf -pv |grep -i bootpath
        bootpath:  '/pci@1c,600000/scsi@2/disk@0,0:a'
root@ivmprod /$ 
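The bootpath is a physical device path; to map it back to the logical c#t#d# name, list the /dev/dsk symlink, which points at the corresponding /devices entry (a minimal check, assuming the standard /dev/dsk layout):

    ls -l /dev/dsk/c1t0d0s0    # symlink target should match the bootpath, ending in :a for slice 0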
Collect disk controller information:
root@ivmprod /$ cfgadm -al
Ap_Id                     Type         Receptacle   Occupant     Condition
c1                        scsi-bus     connected    configured   unknown
c1::dsk/c1t0d0            disk         connected    configured   unknown
c1::dsk/c1t1d0            disk         connected    configured   unknown
c1::dsk/c1t2d0            disk         connected    configured   unknown
c2                        scsi-bus     connected    unconfigured unknown
c3                        fc-fabric    connected    configured   unknown
c3::50060e8005437940      disk         connected    configured   failing
c3::50060e8005afca40      disk         connected    configured   failing
c4                        fc-fabric    connected    configured   unknown
c4::50060e8005481140      disk         connected    configured   unknown
c4::50060e8005bb9240      disk         connected    configured   unknown
c5                        fc-fabric    connected    configured   unknown
c5::50060e8005437950      disk         connected    configured   failing
c5::50060e8005afca50      disk         connected    configured   failing
c6                        fc-fabric    connected    configured   unknown
c6::50060e8005481150      disk         connected    configured   unknown
c6::50060e8005bb9250      disk         connected    configured   unknown
root@ivmprod /$
Check metadevice status:
root@ivmprod /$ metastat -t |grep Maintenance
   c1t0d0s0    0  No     Maintenance             Mon Jun 23 20:08:18 2014
   c1t0d0s1    0  No     Maintenance             Wed Jun 25 11:00:10 2014
root@ivmprod /$

root@ivmprod /$ metastat | grep State
 State: Needs maintenance
      State: Okay
    State: Needs maintenance
        Device     Start Block  Dbase State        Hot Spare
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
      State: Needs maintenance
      State: Okay
    State: Needs maintenance
        Device     Start Block  Dbase State        Hot Spare
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
      State: Okay
      State: Okay
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
root@ivmprod /$
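The grep output above does not say which mirror each state line belongs to. To confirm it is the c1t0d0 submirrors that need maintenance, query one metadevice at a time, for example (output not shown):

    metastat d10    # full detail for mirror d10: submirrors d11 (c1t0d0s0) and d12 (c1t1d0s0) with their states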
Check metadb status:
root@ivmprod /$ metadb -i
        flags           first blk       block count
      Wm  p  l          16              1034            /dev/dsk/c1t0d0s7
      W   p  l          1050            1034            /dev/dsk/c1t0d0s7
      W   p  l          2084            1034            /dev/dsk/c1t0d0s7
     a    p  luo        16              1034            /dev/dsk/c1t1d0s7
     a    p  luo        1050            1034            /dev/dsk/c1t1d0s7
     a    p  luo        2084            1034            /dev/dsk/c1t1d0s7
o - replica active prior to last mddb configuration change
u - replica is up to date
l - locator for this replica was read successfully
c - replica's location was in /etc/lvm/mddb.cf
p - replica's location was patched in kernel
m - replica is master, this is replica selected as input
W - replica has device write errors
a - replica is active, commits are occurring to this replica
M - replica had problem with master blocks
D - replica had problem with data blocks
F - replica had format problems
S - replica is too small to hold current data base
R - replica had device read errors
root@ivmprod /$
Collect metadevice information and identify which metadevices have to be detached from the mirrors:
root@ivmprod /$ metastat -p
d10 -m d11 d12 1
d11 1 1 c1t0d0s0
d12 1 1 c1t1d0s0
d20 -m d21 d22 1
d21 1 1 c1t0d0s1
d22 1 1 c1t1d0s1
d30 -m d31 d32 1
d31 1 1 c1t0d0s6
d32 1 1 c1t1d0s6
root@ivmprod /$
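Since only the submirrors built on the faulty disk need to be detached, a simple filter (my addition) narrows the list:

    metastat -p | grep c1t0d0    # d11, d21, and d31 are the submirrors on the faulty disk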
Detach the failed disk from the mirrors / SVM:
root@ivmprod /$ metadetach d10 d11
metadetach: ivmprod: d10: attempt an operation on a submirror that has erred components
root@ivmprod /$
The detach failed because the submirror has errored components, so force it with -f:
root@ivmprod /$ metadetach -f d10 d11
d10: submirror d11 is detached
root@ivmprod /$ metadetach -f d20 d21
d20: submirror d21 is detached
root@ivmprod /$ metadetach d30 d31
d30: submirror d31 is detached
root@ivmprod /$
Check the status of the detached submirrors:
root@ivmprod /$ metastat -p
d10 -m d12 1
d12 1 1 c1t1d0s0
d20 -m d22 1
d22 1 1 c1t1d0s1
d30 -m d32 1
d32 1 1 c1t1d0s6
d11 1 1 c1t0d0s0
d21 1 1 c1t0d0s1
d31 1 1 c1t0d0s6
root@ivmprod /$
Delete the faulted metadb replicas:
root@ivmprod /$ metadb -d /dev/dsk/c1t0d0s7
root@ivmprod /$ metadb -i
        flags           first blk       block count
     a    p  luo        16              1034            /dev/dsk/c1t1d0s7
     a    p  luo        1050            1034            /dev/dsk/c1t1d0s7
     a    p  luo        2084            1034            /dev/dsk/c1t1d0s7
o - replica active prior to last mddb configuration change
u - replica is up to date
l - locator for this replica was read successfully
c - replica's location was in /etc/lvm/mddb.cf
p - replica's location was patched in kernel
m - replica is master, this is replica selected as input
W - replica has device write errors
a - replica is active, commits are occurring to this replica
M - replica had problem with master blocks
D - replica had problem with data blocks
F - replica had format problems
S - replica is too small to hold current data base
R - replica had device read errors
root@ivmprod /$
Clear detached metadevices and check the status:
root@ivmprod /$ metaclear d11 d21 d31
d11: Concat/Stripe is cleared
d21: Concat/Stripe is cleared
d31: Concat/Stripe is cleared
root@ivmprod /$
root@ivmprod /$ metastat -p
d10 -m d12 1
d12 1 1 c1t1d0s0
d20 -m d22 1
d22 1 1 c1t1d0s1
d30 -m d32 1
d32 1 1 c1t1d0s6
root@ivmprod /$
Unconfigure the faulty disk from the OS:
root@ivmprod /$ cfgadm -c unconfigure c1::dsk/c1t0d0
cfgadm: Component system is busy, try again: failed to offline: /devices/pci@1c,600000/scsi@2/sd@0,0
     Resource             Information
------------------  -----------------------
/dev/dsk/c1t0d0s1   dump device (dedicated)
root@ivmprod /$
We got this error because the dump device is configured directly on a dedicated slice of the faulty disk (slice 1, the same slice that backs the swap mirror) rather than on the metadevice.
Identify the dump device:
root@ivmprod /$ dumpadm
      Dump content: kernel pages
       Dump device: /dev/dsk/c1t0d0s1 (dedicated)
Savecore directory: /var/crash/ivmprod
  Savecore enabled: yes
root@ivmprod /$
Identify the swap metadevice:
root@ivmprod /$ swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d20     85,20     16 16779296 16757472
root@ivmprod /$
Reconfigure the dump device to use the metadevice name:
root@ivmprod /$ dumpadm -d /dev/md/dsk/d20
      Dump content: kernel pages
       Dump device: /dev/md/dsk/d20 (swap)
Savecore directory: /var/crash/ivmprod
  Savecore enabled: yes
root@ivmprod /$
Unconfigure the faulty disk from the OS again:
root@ivmprod /$ cfgadm -c unconfigure c1::dsk/c1t0d0
root@ivmprod /$ cfgadm -al
Ap_Id                     Type         Receptacle   Occupant     Condition
c1                        scsi-bus     connected    configured   unknown
c1::dsk/c1t0d0            unavailable  connected    unconfigured unknown
c1::dsk/c1t1d0            disk         connected    configured   unknown
c1::dsk/c1t2d0            disk         connected    configured   unknown
c2                        scsi-bus     connected    unconfigured unknown
c3                        fc-fabric    connected    configured   unknown
c3::50060e8005437940      disk         connected    configured   failing
c3::50060e8005afca40      disk         connected    configured   failing
c4                        fc-fabric    connected    configured   unknown
c4::50060e8005481140      disk         connected    configured   unknown
c4::50060e8005bb9240      disk         connected    configured   unknown
c5                        fc-fabric    connected    configured   unknown
c5::50060e8005437950      disk         connected    configured   failing
c5::50060e8005afca50      disk         connected    configured   failing
c6                        fc-fabric    connected    configured   unknown
c6::50060e8005481150      disk         connected    configured   unknown
c6::50060e8005bb9250      disk         connected    configured   unknown
root@ivmprod /$
Now it is time to ask our DC engineer to pull out the faulty disk and insert the new one. Get confirmation from the DC engineer before configuring the new disk at the OS level. We can also check the console logs to confirm whether the disk was replaced successfully.
Below are the console logs from my case:
sc) showlogs

Log entries since JUL 05 14:31:48
----------------------------------
AUG 13 18:02:04 ivmprod: 0004004f: "Indicator HDD0.OK2RM is now ON"
AUG 13 18:10:12 ivmprod: 00040071: "DISK @ HDD0 has been removed."
AUG 13 18:10:16 ivmprod: 0004004f: "Indicator HDD0.OK2RM is now OFF"
AUG 13 18:10:38 ivmprod: 00060000: "SC Login: User admin Logged on."
AUG 13 18:10:48 ivmprod: 00040072: "DISK @ HDD0 has been inserted."
sc)
Configure the newly added disk in the OS:
root@ivmprod /$ cfgadm -c configure c1::dsk/c1t0d0
root@ivmprod /$ cfgadm -al
Ap_Id                     Type         Receptacle   Occupant     Condition
c1                        scsi-bus     connected    configured   unknown
c1::dsk/c1t0d0            disk         connected    configured   unknown
c1::dsk/c1t1d0            disk         connected    configured   unknown
c1::dsk/c1t2d0            disk         connected    configured   unknown
c2                        scsi-bus     connected    unconfigured unknown
c3                        fc-fabric    connected    configured   unknown
c3::50060e8005437940      disk         connected    configured   failing
c3::50060e8005afca40      disk         connected    configured   failing
c4                        fc-fabric    connected    configured   unknown
c4::50060e8005481140      disk         connected    configured   unknown
c4::50060e8005bb9240      disk         connected    configured   unknown
c5                        fc-fabric    connected    configured   unknown
c5::50060e8005437950      disk         connected    configured   failing
c5::50060e8005afca50      disk         connected    configured   failing
c6                        fc-fabric    connected    configured   unknown
c6::50060e8005481150      disk         connected    configured   unknown
c6::50060e8005bb9250      disk         connected    configured   unknown
root@ivmprod /$ devfsadm -v
root@ivmprod /$ echo|format |head
Searching for disks...done

AVAILABLE DISK SELECTIONS:
   0. c1t0d0 (SUN36G cyl 24620 alt 2 hd 27 sec 107)
      /pci@1c,600000/scsi@2/sd@0,0
   1. c1t1d0 (SEAGATE-ST557703LSUN36G-0307 cyl 24620 alt 2 hd 27 sec 107)
      /pci@1c,600000/scsi@2/sd@1,0
   2. c1t2d0 (SUN36G cyl 24620 alt 2 hd 27 sec 107)
      /pci@1c,600000/scsi@2/sd@2,0
root@ivmprod /$
Copy the VTOC (partition table) from the secondary disk to the replaced primary disk:
root@ivmprod /$ prtvtoc /dev/rdsk/c1t1d0s2 |fmthard -s- /dev/rdsk/c1t0d0s2
fmthard:  New volume table of contents now in place.
root@ivmprod /$
Check and compare both VTOCs:
root@ivmprod /$ prtvtoc /dev/rdsk/c1t1d0s2
* /dev/rdsk/c1t1d0s2 partition map
*
* Dimensions:
*     512 bytes/sector
*     107 sectors/track
*      27 tracks/cylinder
*    2889 sectors/cylinder
*   24622 cylinders
*   24620 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00          0  50332158  50332157
       1      3    01   50332158  16779312  67111469
       2      5    00          0  71127180  71127179
       6      0    00   67111470   3489912  70601381
       7      0    00   70601382    525798  71127179
root@ivmprod /$ prtvtoc /dev/rdsk/c1t0d0s2
* /dev/rdsk/c1t0d0s2 partition map
*
* Dimensions:
*     512 bytes/sector
*     107 sectors/track
*      27 tracks/cylinder
*    2889 sectors/cylinder
*   24622 cylinders
*   24620 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00          0  50332158  50332157
       1      3    01   50332158  16779312  67111469
       2      5    00          0  71127180  71127179
       6      0    00   67111470   3489912  70601381
       7      0    00   70601382    525798  71127179
root@ivmprod /$
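Rather than eyeballing the two tables, a quick diff (my addition; the grep strips the comment header, which embeds the device name and would always differ) confirms they match:

    prtvtoc /dev/rdsk/c1t1d0s2 | grep -v '^\*' > /tmp/vtoc.mirror
    prtvtoc /dev/rdsk/c1t0d0s2 | grep -v '^\*' > /tmp/vtoc.new
    diff /tmp/vtoc.mirror /tmp/vtoc.new && echo "VTOCs match"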
Create the metadb with three replicas on slice 7 (s7) of the new disk:
root@ivmprod /$ metadb -afc 3 /dev/rdsk/c1t0d0s7
root@ivmprod /$ metadb -i
        flags           first blk       block count
     a        u         16              1034            /dev/dsk/c1t0d0s7
     a        u         1050            1034            /dev/dsk/c1t0d0s7
     a        u         2084            1034            /dev/dsk/c1t0d0s7
     a    p  luo        16              1034            /dev/dsk/c1t1d0s7
     a    p  luo        1050            1034            /dev/dsk/c1t1d0s7
     a    p  luo        2084            1034            /dev/dsk/c1t1d0s7
o - replica active prior to last mddb configuration change
u - replica is up to date
l - locator for this replica was read successfully
c - replica's location was in /etc/lvm/mddb.cf
p - replica's location was patched in kernel
m - replica is master, this is replica selected as input
W - replica has device write errors
a - replica is active, commits are occurring to this replica
M - replica had problem with master blocks
D - replica had problem with data blocks
F - replica had format problems
S - replica is too small to hold current data base
R - replica had device read errors
root@ivmprod /$
Create the metadevices using the metainit command:
root@ivmprod /$ metainit d11 1 1 c1t0d0s0
d11: Concat/Stripe is setup
root@ivmprod /$ metainit d21 1 1 c1t0d0s1
d21: Concat/Stripe is setup
root@ivmprod /$ metainit d31 1 1 c1t0d0s6
d31: Concat/Stripe is setup
root@ivmprod /$
Check the created metadevice details:
root@ivmprod /$ metastat -p
d10 -m d12 1
d12 1 1 c1t1d0s0
d20 -m d22 1
d22 1 1 c1t1d0s1
d30 -m d32 1
d32 1 1 c1t1d0s6
d11 1 1 c1t0d0s0
d21 1 1 c1t0d0s1
d31 1 1 c1t0d0s6
root@ivmprod /$
Attach the newly created metadevices to the main mirrors:
root@ivmprod /$ metattach d10 d11
d10: submirror d11 is attached
root@ivmprod /$ metattach d20 d21
d20: submirror d21 is attached
root@ivmprod /$ metattach d30 d31
d30: submirror d31 is attached
root@ivmprod /$
Verify that the attached metadevices are correct:
root@ivmprod /$  metastat -p
d10 -m d11 d12 1
d11 1 1 c1t0d0s0
d12 1 1 c1t1d0s0
d20 -m d21 d22 1
d21 1 1 c1t0d0s1
d22 1 1 c1t1d0s1
d30 -m d31 d32 1
d31 1 1 c1t0d0s6
d32 1 1 c1t1d0s6
root@ivmprod /$
Check the resync status:
root@ivmprod /$ metastat |grep %
    Resync in progress: 1 % done
    Resync in progress: 1 % done
    Resync in progress: 11 % done
root@ivmprod /$
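To watch the resync through to completion, a simple polling loop (my addition, Bourne-shell compatible) will do; it exits once metastat reports no resync lines:

    while true
    do
        metastat | grep % || break
        sleep 60
    done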
After the sync completes, make sure all metadevices are okay by checking the details below:
root@ivmprod /$ metastat -t |grep Maintenance
root@ivmprod /$

root@ivmprod /$ metastat | grep State
      State: Okay
      State: Okay
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
      State: Okay
      State: Okay
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
      State: Okay
      State: Okay
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
    State: Okay
        Device     Start Block  Dbase State        Hot Spare
root@ivmprod /$
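One more step is worth doing here, assuming a SPARC system with a UFS root as in this setup: reinstall the boot block on the new disk so the server can still boot from it if the surviving mirror ever fails:

    installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t0d0s0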
Cool...!
