January 01, 2017

MegaCLI Scripts and Commands


making LSI raid controllers a little easier to work with

MegaCLI is the command line interface (CLI) binary used to communicate with the full LSI family of raid controllers found in Supermicro, DELL (PERC), ESXi and Intel servers. The program is a text based tool made up of a single static binary file. We are not fans of graphical interfaces (GUIs) and appreciate the control a command line program gives over a GUI solution. Using some simple shell scripting we can find out the health of the RAID, email ourselves about problems and work with failed drives.

There are many MegaCLI command pages which simply rehash the same commands over and over, and we wanted to offer something more. For our examples we are using Ubuntu Linux and FreeBSD with the MegaCli64 binary. All of these same scripts and commands work for both the 32-bit and 64-bit binaries.

Installing the MegaCLI binary

In order to communicate with the LSI card you will need the MegaCLI or MegaCLI64 (64-bit) program. The install should be quite easy, but LSI makes us jump through a few hoops. This is what we found:

On our 64-bit Ubuntu Linux and FreeBSD servers we simply copied MegaCli64 to /usr/local/sbin/ . You can put the binary anywhere you want, but we chose /usr/local/sbin/ because it is in root's path. Make sure to secure the binary: make root the owner and chmod it to 700. The install is now done. We would like to see LSI publish an Ubuntu PPA or a FreeBSD ports entry sometime in the future, but this setup was not too bad.
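On a 64-bit system the whole install boils down to three commands (the source path is just an example; copy from wherever you unpacked the LSI download):

cp MegaCli64 /usr/local/sbin/
chown root /usr/local/sbin/MegaCli64
chmod 700 /usr/local/sbin/MegaCli64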

The lsi.sh MegaCLI interface script

Once you have MegaCLI installed, the following script helps in getting information from the raid card. The shell script does nothing more than execute the commands you would normally type on the CLI. The script can show the status of the raid and drives, and you can identify any drive slot by blinking the light on the chassis. It can also help you spot drives which are starting to error out or slow down the raid so you can replace them early. We have included a "setdefaults" method to set up a new raid card to the specs we use for our 400+ raids. Finally, use the "checkNemail" method to check the raid status and mail you a list of drives showing which one is reporting the problem.

You are welcome to copy and paste the following script. We call the script "lsi.sh", but you can use any name you wish. Just make sure to set the full path to the MegaCli binary in the script and make the script executable. We tried to comment every method, so take a look at the script before using it.

#!/bin/bash
#
# Calomel.org 
#     https://calomel.org/megacli_lsi_commands.html
#     LSI MegaRaid CLI 
#     lsi.sh @ Version 0.05
#
# description: MegaCLI script to configure and monitor LSI raid cards.

# Full path to the MegaRaid CLI binary
MegaCli="/usr/local/sbin/MegaCli64"

# The identifying number of the enclosure. Default for our systems is "8". Use
# "MegaCli64 -PDlist -a0 | grep "Enclosure Device"" to see what your number
# is and set this variable.
ENCLOSURE="8"

if [ $# -eq 0 ]
   then
    echo ""
    echo "            OBPG  .:.  lsi.sh $arg1 $arg2"
    echo "-----------------------------------------------------"
    echo "status        = Status of Virtual drives (volumes)"
    echo "drives        = Status of hard drives"
    echo "ident \$slot   = Blink light on drive (need slot number)"
    echo "good \$slot    = Simply makes the slot \"Unconfigured(good)\" (need slot number)"
    echo "replace \$slot = Replace \"Unconfigured(bad)\" drive (need slot number)"
    echo "progress      = Status of drive rebuild"
    echo "errors        = Show drive errors which are non-zero"
    echo "bat           = Battery health and capacity"
    echo "batrelearn    = Force BBU re-learn cycle"
    echo "logs          = Print card logs"
    echo "checkNemail   = Check volume(s) and send email on raid errors"
    echo "allinfo       = Print out all settings and information about the card"
    echo "settime       = Set the raid card's time to the current system time"
    echo "setdefaults   = Set preferred default settings for new raid setup"
    echo ""
   exit
 fi

# General status of all RAID virtual disks or volumes and if PATROL disk check
# is running.
if [ $1 = "status" ]
   then
      $MegaCli -LDInfo -Lall -aALL -NoLog
      echo "###############################################"
      $MegaCli -AdpPR -Info -aALL -NoLog
      echo "###############################################"
      $MegaCli -LDCC -ShowProg -LALL -aALL -NoLog
   exit
fi

# Shows the state of all drives and if they are online, unconfigured or missing.
if [ $1 = "drives" ]
   then
      $MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state://g'
   exit
fi

# Use to blink the light on the slot in question. Hit enter again to turn the blinking light off.
if [ $1 = "ident" ]
   then
      $MegaCli  -PdLocate -start -physdrv[$ENCLOSURE:$2] -a0 -NoLog
      logger "`hostname` - identifying enclosure $ENCLOSURE, drive $2 "
      read -p "Press [Enter] key to turn off light..."
      $MegaCli  -PdLocate -stop -physdrv[$ENCLOSURE:$2] -a0 -NoLog
   exit
fi

# When a new drive is inserted it might have old RAID headers on it. This
# method simply removes old RAID configs from the drive in the slot and makes
# the drive "good." Basically, Unconfigured(bad) to Unconfigured(good). We use
# this method on our FreeBSD ZFS machines before the drive is added back into
# the zfs pool.
if [ $1 = "good" ]
   then
      # set Unconfigured(bad) to Unconfigured(good)
      $MegaCli -PDMakeGood -PhysDrv[$ENCLOSURE:$2] -a0 -NoLog
      # clear 'Foreign' flag or invalid raid header on replacement drive
      $MegaCli -CfgForeign -Clear -aALL -NoLog
   exit
fi

# Use to diagnose bad drives. When no errors are shown only the slot numbers
# will print out. If a drive(s) has an error you will see the number of errors
# under the slot number. At this point you can decide to replace the flaky
# drive. Bad drives might not fail right away and will slow down your raid with
# read/write retries or corrupt data. 
if [ $1 = "errors" ]
   then
      echo "Slot Number: 0"; $MegaCli -PDlist -aALL -NoLog | egrep -i 'error|fail|slot' | egrep -v ' 0'
   exit
fi

# status of the battery and the amount of charge. Without a working Battery
# Backup Unit (BBU) most of the LSI read/write caching will be disabled
# automatically. You want caching for speed so make sure the battery is ok.
if [ $1 = "bat" ]
   then
      $MegaCli -AdpBbuCmd -aAll -NoLog
   exit
fi

# Force a Battery Backup Unit (BBU) re-learn cycle. This will discharge the
# lithium BBU unit and recharge it. This check might take a few hours and you
# will want to always run this in off hours. LSI suggests a battery relearn
# monthly or so. We actually run it every three(3) months by way of a cron job.
# Understand if your "Current Cache Policy" is set to "No Write Cache if Bad
# BBU" then write-cache will be disabled during this check. This means writes
# to the raid will be VERY slow at about 1/10th normal speed. NOTE: if the
# battery is new (new bats should charge for a few hours before they register)
# or if the BBU comes up and says it has no charge try powering off the machine
# and restart it. This will force the LSI card to re-evaluate the BBU. Silly
# but it works.
if [ $1 = "batrelearn" ]
   then
      $MegaCli -AdpBbuCmd -BbuLearn -aALL -NoLog
   exit
fi

# Use to replace a drive. You need the slot number and may want to use the
# "drives" method to show which drive in a slot is "Unconfigured(bad)". Once
# the new drive is in the slot and spun up this method will bring the drive
# online, clear any foreign raid headers from the replacement drive and set the
# drive as a hot spare. We will also tell the card to start rebuilding if it
# does not start automatically. The raid should start rebuilding right away
# either way. NOTE: if you pass a slot number which is already part of the raid
# by mistake the LSI raid card is smart enough to just error out and _NOT_
# destroy the raid drive, thankfully.
if [ $1 = "replace" ]
   then
      logger "`hostname` - REPLACE enclosure $ENCLOSURE, drive $2 "
      # set Unconfigured(bad) to Unconfigured(good)
      $MegaCli -PDMakeGood -PhysDrv[$ENCLOSURE:$2] -a0 -NoLog
      # clear 'Foreign' flag or invalid raid header on replacement drive
      $MegaCli -CfgForeign -Clear -aALL -NoLog
      # set drive as hot spare
      $MegaCli -PDHSP -Set -PhysDrv [$ENCLOSURE:$2] -a0 -NoLog
      # show rebuild progress on replacement drive just to make sure it starts
      $MegaCli -PDRbld -ShowProg -PhysDrv [$ENCLOSURE:$2] -a0 -NoLog
   exit
fi

# Print all the logs from the LSI raid card. You can grep on the output.
if [ $1 = "logs" ]
   then
      $MegaCli -FwTermLog -Dsply -aALL -NoLog
   exit
fi

# Use to query the RAID card and find the drive which is rebuilding. The script
# will then query the rebuilding drive to see what percentage it is rebuilt and
# how much time it has taken so far. You can then guess-ti-mate the
# completion time.
if [ $1 = "progress" ]
   then
      DRIVE=`$MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state://g' | egrep build | awk '{print $3}'`
      $MegaCli -PDRbld -ShowProg -PhysDrv [$ENCLOSURE:$DRIVE] -a0 -NoLog
   exit
fi

# Use to check the status of the raid. If the raid is degraded or faulty the
# script will send email to the address in the $EMAIL variable. We normally add
# this method to a cron job to be run every few hours so we are notified of any
# issues.
if [ $1 = "checkNemail" ]
   then
      EMAIL="raidadmin@localhost"

      # Check if raid is in good condition
      STATUS=`$MegaCli -LDInfo -Lall -aALL -NoLog | egrep -i 'fail|degrad|error'`

      # On bad raid status send email with basic drive information
      if [ "$STATUS" ]; then
         $MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state://g' | mail -s `hostname`' - RAID Notification' $EMAIL
      fi
fi

# Use to print all information about the LSI raid card. Check default options,
# firmware version (FW Package Build), battery back-up unit presence, installed
# cache memory and the capabilities of the adapter. Pipe to grep to find the
# term you need.
if [ $1 = "allinfo" ]
   then
      $MegaCli -AdpAllInfo -aAll -NoLog
   exit
fi

# Update the LSI card's time with the current operating system time. You may
# want to setup a cron job to call this method once a day or whenever you
# think the raid card's time might drift too much. 
if [ $1 = "settime" ]
   then
      $MegaCli -AdpGetTime -aALL -NoLog
      $MegaCli -AdpSetTime `date +%Y%m%d` `date +%H:%M:%S` -aALL -NoLog
      $MegaCli -AdpGetTime -aALL -NoLog
   exit
fi

# These are the defaults we like to use on the hundreds of raids we manage. You
# will want to go through each option here and make sure you want to use them
# too. These options are for speed optimization, build rate tweaks and PATROL
# options. When setting up a new machine we simply execute the "setdefaults"
# method and the raid is configured. You can use this on live raids too.
if [ $1 = "setdefaults" ]
   then
      # Read Cache enabled specifies that all reads are buffered in cache memory. 
       $MegaCli -LDSetProp -Cached -LAll -aAll -NoLog
      # Adaptive Read-Ahead if the controller receives several requests to sequential sectors
       $MegaCli -LDSetProp ADRA -LALL -aALL -NoLog
      # Hard Disk cache policy enabled allowing the drive to use internal caching too
       $MegaCli -LDSetProp EnDskCache -LAll -aAll -NoLog
      # Write-Back cache enabled
       $MegaCli -LDSetProp WB -LALL -aALL -NoLog
      # Continue booting with data stuck in cache. Set Boot with Pinned Cache Enabled.
       $MegaCli -AdpSetProp -BootWithPinnedCache -1 -aALL -NoLog
      # PATROL run every 672 hours or monthly (RAID6 77TB @60% rebuild takes 21 hours)
       $MegaCli -AdpPR -SetDelay 672 -aALL -NoLog
      # Check Consistency every 672 hours or monthly
       $MegaCli -AdpCcSched -SetDelay 672 -aALL -NoLog
      # Enable autobuild when a new Unconfigured(good) drive is inserted or set to hot spare
       $MegaCli -AdpAutoRbld -Enbl -a0 -NoLog
      # RAID rebuild rate to 60% (build quick before another failure)
       $MegaCli -AdpSetProp \{RebuildRate -60\} -aALL -NoLog
      # RAID check consistency rate to 60% (fast parity checks)
       $MegaCli -AdpSetProp \{CCRate -60\} -aALL -NoLog
      # Enable Native Command Queue (NCQ) on all drives
       $MegaCli -AdpSetProp NCQEnbl -aAll -NoLog
      # Sound alarm disabled (server room is too loud anyways)
       $MegaCli -AdpSetProp AlarmDsbl -aALL -NoLog
      # Use write-back cache mode even if BBU is bad. Make sure your machine is on UPS too.
       $MegaCli -LDSetProp CachedBadBBU -LAll -aAll -NoLog
      # Disable auto learn BBU check which can severely affect raid speeds
       OUTBBU=$(mktemp /tmp/output.XXXXXXXXXX)
       echo "autoLearnMode=1" > $OUTBBU
       $MegaCli -AdpBbuCmd -SetBbuProperties -f $OUTBBU -a0 -NoLog
       rm -rf $OUTBBU
   exit
fi

### EOF ###

How do I use the lsi.sh script ?

First, execute the script without any arguments. The script will print out the "help" statement showing all of the available commands and a very short description of each function. Inside the script you can also see that we commented every method in detail.

For example, let's look at the status of the RAID volumes, or what LSI calls virtual drives. Run the script with the "status" argument. This will print the details of the raid volumes and whether PATROL or Check Consistency is running. In our example we have two(2) RAID6 volumes of 18.1TB each. The first array is "Partially Degraded" and the second is "Optimal", which means it is healthy.

calomel@lsi:~# ./lsi.sh status

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 18.188 TB
Sector Size         : 512
Parity Size         : 3.637 TB
State               : Partially Degraded
Strip Size          : 256 KB
Number Of Drives    : 12
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Encryption Type     : None
PI type: No PI

Is VD Cached: No


Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 18.188 TB
Sector Size         : 512
Parity Size         : 3.637 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 12
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAdaptive, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Encryption Type     : None
PI type: No PI

Is VD Cached: No

###############################################
                                     
Adapter 0: Patrol Read Information:

Patrol Read Mode: Auto
Patrol Read Execution Delay: 672 hours
Number of iterations completed: 2 
Current State: Stopped
Patrol Read on SSD Devices: Disabled

Exit Code: 0x00
###############################################
                                     
Check Consistency on VD #0 is not in progress.
Check Consistency on VD #1 is not in progress.

Exit Code: 0x00

Why is the first volume degraded ?

The first virtual disk lost a drive, which was already replaced and is now rebuilding. We can look at the status of all the drives using the lsi.sh script and the "drives" argument. You can see slot number 9 is the drive which is rebuilding.

calomel@lsi:~# ./lsi.sh drives

Slot Number: 0 - Online, Spun Up
Slot Number: 1 - Online, Spun Up
Slot Number: 2 - Online, Spun Up
Slot Number: 3 - Online, Spun Up
Slot Number: 4 - Online, Spun Up
Slot Number: 5 - Online, Spun Up
Slot Number: 6 - Online, Spun Up
Slot Number: 7 - Online, Spun Up
Slot Number: 8 - Online, Spun Up
Slot Number: 9 - Rebuild
Slot Number: 10 - Online, Spun Up
Slot Number: 11 - Online, Spun Up
Slot Number: 12 - Online, Spun Up
Slot Number: 13 - Online, Spun Up
Slot Number: 14 - Online, Spun Up
Slot Number: 15 - Online, Spun Up
Slot Number: 16 - Online, Spun Up
Slot Number: 17 - Online, Spun Up
Slot Number: 18 - Online, Spun Up
Slot Number: 19 - Online, Spun Up
Slot Number: 20 - Online, Spun Up
Slot Number: 21 - Online, Spun Up
Slot Number: 22 - Online, Spun Up
Slot Number: 23 - Online, Spun Up

When will the rebuild be finished ?

The card will only tell us how far along the rebuild is and how long the process has been running. Using the "progress" script argument we see the rebuild is 32% done and has taken 169 minutes so far. Since the rebuild is close enough to 33% done we simply multiply the time taken (169 minutes) by 3 to estimate a total time of 507 minutes, or about 8.45 hours, assuming the load on the raid stays the same until completion.

calomel@lsi:~#./lsi.sh progress
                                     
Rebuild Progress on Device at Enclosure 8, Slot 9 Completed 32% in 169 Minutes.
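If you would rather not do the arithmetic in your head, the same estimate works for any percentage; here is a quick shell calculation using the 32% and 169 minutes from the output above:

# estimated total minutes = minutes so far * 100 / percent done
echo $(( 169 * 100 / 32 ))    # about 528 minutes total, roughly 359 minutes to go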


How does the lsi.sh script check errors and send out email ?

The "checkNemail" argument will check the status of the volumes, also called virtual drives, and if the string degraded or error is found will send out email. Make sure to set the $EMAIL variable to your email address in the script. The output of the email shows slot 9 rebuilding. The first virtual drive in this example contain slots 0 through 11. If the physical drive was bad on the other hand we would see slot 9 as Unconfigured(bad) , Unconfigured(good) or even Missing.

Date: Wed, 20 Feb 2033 17:01:11 -0500
From: root@localhost
To: raidadmin@localhost
Subject: calomel.org - RAID Notification

Slot Number: 0 - Online, Spun Up
Slot Number: 1 - Online, Spun Up
Slot Number: 2 - Online, Spun Up
Slot Number: 3 - Online, Spun Up
Slot Number: 4 - Online, Spun Up
Slot Number: 5 - Online, Spun Up
Slot Number: 6 - Online, Spun Up
Slot Number: 7 - Online, Spun Up
Slot Number: 8 - Online, Spun Up
Slot Number: 9 - Rebuild
Slot Number: 10 - Online, Spun Up
Slot Number: 11 - Online, Spun Up
Slot Number: 12 - Online, Spun Up
Slot Number: 13 - Online, Spun Up
Slot Number: 14 - Online, Spun Up
Slot Number: 15 - Online, Spun Up
Slot Number: 16 - Online, Spun Up
Slot Number: 17 - Online, Spun Up
Slot Number: 18 - Online, Spun Up
Slot Number: 19 - Online, Spun Up
Slot Number: 20 - Online, Spun Up
Slot Number: 21 - Online, Spun Up
Slot Number: 22 - Online, Spun Up
Slot Number: 23 - Online, Spun Up

We prefer to run the script with "checkNemail" in a cron job. This way, when the raid has an issue we get a notification. The following cron job will run the script every two(2) hours. As long as the raid is degraded you will keep getting email. We see this as a reminder to check on the raid if it has not finished rebuilding by morning.

SHELL=/bin/bash
PATH=/bin:/sbin:/usr/bin:/usr/sbin
#
#minute (0-59)
#|   hour (0-23)
#|   |    day of the month (1-31)
#|   |    |   month of the year (1-12 or Jan-Dec)
#|   |    |   |   day of the week (0-6 with 0=Sun or Sun-Sat)
#|   |    |   |   |   commands
#|   |    |   |   |   |
# raid status, check and report 
00   */2  *   *   *   /root/lsi.sh checkNemail
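Before relying on the cron job, it is worth running the check once by hand. Remember the script only sends mail when it finds a fail, degrad or error string in the volume status, so a healthy raid will stay silent:

/root/lsi.sh checkNemail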

Questions?

How do I setup two(2) 12 drive RAID6 arrays in a 24 slot chassis ?

Using two commands we can configure drives 0 through 11 as the first RAID6 array, then do the same for the next virtual drive with drives 12 through 23. The directive "-r6" stands for RAID6, which is a raid with two parity drives and a bit safer than RAID5. Using 2TB drives this will make two(2) 18.1 terabyte raid volumes when formatted with XFS. Initialization takes around 19 hours.

MegaCli64 -CfgLdAdd -r6'[8:0,8:1,8:2,8:3,8:4,8:5,8:6,8:7,8:8,8:9,8:10,8:11]' -a0 -NoLog
MegaCli64 -CfgLdAdd -r6'[8:12,8:13,8:14,8:15,8:16,8:17,8:18,8:19,8:20,8:21,8:22,8:23]' -a0 -NoLog
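Once both commands return, you can confirm the two new virtual drives with the same -LDInfo call the lsi.sh "status" method uses:

MegaCli64 -LDInfo -Lall -aALL -NoLog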

How do I setup raid 1+0 ?

RAID 10 is a stripe of mirrored arrays and requires a minimum of 4 drives. We will set up slots 0 and 1 as one mirror (Array0) and slots 2 and 3 as the second mirror (Array1). Then we stripe (RAID0) across both RAID1 mirrors. In most cases RAID 10 provides better throughput and latency than all other RAID levels except RAID 0 (which wins in throughput, but loses in data safety). RAID10 is the preferred RAID level for I/O-intensive applications such as database, email and web servers as it is fast and provides data integrity.

MegaCli64 -CfgSpanAdd -r10 -Array0[8:0,8:1] -Array1[8:2,8:3] -a0 -NoLog

What do the "Cache Policy" values mean ?

Cache policies control how the raid card uses its on-board RAM to collect data before writing it out to disk or to read data before the system asks for it. Write cache is used when we have a lot of data to write and it is faster to write data sequentially to disk instead of writing small chunks. Read cache is used when the system has asked for some data and the raid card keeps the data in cache in case the system asks for the same data again. It is always faster to read and write to cache than to access spinning disks. Understand that you should only use caching if you have good UPS power to the system. If the system loses power and does not flush the cache it is possible to lose data. No one wants that. You can see which policies are active on your volumes by checking the "Current Cache Policy" line in the status output, as shown below.
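For a quick check, the same -LDInfo command the lsi.sh "status" method runs can be filtered down to just the cache policy lines:

MegaCli64 -LDInfo -Lall -aALL -NoLog | grep 'Cache Policy'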

So how fast is the raid volume with caching enabled and disabled ? A simple test using hdparm shows the difference in disk access. Caching allows this test to run two(2) to three(3) times faster on the exact same hardware. For our machines we prefer to use caching.

## Enable caching on the LSI and disks

$ hdparm -tT /dev/sdb1
  /dev/sdb1:
   Timing cached reads:        18836 MB in  2.00 seconds = 9428.07 MB/sec
   Timing buffered disk reads:  1403 MB in  3.00 seconds =  467.67 MB/sec


## Disable all caching on the LSI card and disks

$ hdparm -tT /dev/sdb1
  /dev/sdb1:
   Timing cached reads:         6743 MB in  2.00 seconds = 3371.76 MB/sec
   Timing buffered disk reads:   587 MB in  3.01 seconds =  198.37 MB/sec
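If you want to reproduce the disabled-cache numbers on your own hardware, reversing the cache options from the "setdefaults" method should do it. The LDSetProp options below are, to the best of our knowledge, the inverses of the ones the script sets (WT = write-through, NORA = no read-ahead, Direct = bypass the controller read cache, DisDskCache = turn off the drive's own cache):

MegaCli64 -LDSetProp WT -LALL -aALL -NoLog
MegaCli64 -LDSetProp NORA -LALL -aALL -NoLog
MegaCli64 -LDSetProp -Direct -LAll -aAll -NoLog
MegaCli64 -LDSetProp DisDskCache -LAll -aAll -NoLog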

How about a FreeBSD ZFS raid-z2 array using the LSI raid card ?

ZFS on FreeBSD is one of the absolute best file systems we have ever used. It is very fast, stable and a joy to use. Let's look at setting up a raid-z2 ZFS pool using 12 separate hard drives, all connected through an LSI MegaRAID controller in a JBOD (Just a Bunch Of Disks) like configuration.

The LSI MegaRAID native JBOD mode does not work very well and we do not recommend using it. If you use LSI JBOD mode then all of the caching algorithms on the raid card are disabled and, for some reason, the drive devices are not exported to FreeBSD. The working solution is to set up each individual drive as a separate RAID0 (raid zero) array and bind them all together using ZFS. We are currently running raids in this setup in live production and they work without issue.

For this example we are going to configure 12 RAID-0 LDs, each consisting of a single disk, and then use ZFS to make the raid-z2 (RAID6) volume. The LSI setup will be as close to JBOD mode as we can get, with the advantage that the caching and optimization algorithms on the raid card can still be used. Here are the RAID-0 LD and ZFS creation commands:

# Set slots 0-11 to 12 individual RAID0 volumes. This is just a simple while
# loop to go through all 12 drives. Use "./lsi.sh status" script to see all the
# volumes afterwards.
 i=0; while [ $i -le 11 ] ; do MegaCli64 -cfgldadd -r0[8:${i}] WB RA Cached CachedBadBBU -strpsz512 -a0 -NoLog ; i=`expr $i + 1`; done

# Create a RAID-Z2 (RAID6) ZFS volume out of 12 drives called "tank". Creation
# time of the ZFS raid is just a few seconds compared to creating a RAID6
# volume through the raid card which initializes in around 19 hours.
 zpool create tank raidz2 mfid0 mfid1 mfid2 mfid3 mfid4 mfid5 mfid6 mfid7 mfid8 mfid9 mfid10 mfid11

# Done! It is that easy. You should now see a drive mounted
# as "tank" using "df -h". Check the volume with "zpool status"

# OPTIONAL: We use two(2) Samsung 840 Pro 256GB SSD drives as L2ARC cache
# drives. The SSD drives are in slots 12 and 13. This means that up to 512GB of
# the most frequently accessed data can be kept in SSD cache and not read from
# spinning media. This greatly speeds up access times. We use two cache drives,
# compared to just one 512GB, so _when_ one SSD dies the other will take on the
# cache load (now up to 256GB) till the failed drive is replaced.
 MegaCli64 -cfgldadd -r0[8:12] WB RA Cached CachedBadBBU -strpsz512 -a0 -NoLog
 MegaCli64 -cfgldadd -r0[8:13] WB RA Cached CachedBadBBU -strpsz512 -a0 -NoLog
 zpool add tank cache mfid12
 zpool add tank cache mfid13

We lost a ZFS drive! How to replace a bad disk

Let's say the drive in slot 5 died, was removed or needs to be replaced due to reported errors. The "zpool status" command shows the "tank" pool is DEGRADED. We just need to pull out the old slot 5 drive, put the new drive in slot 5, configure the new drive for RAID0 on the LSI card and then tell FreeBSD ZFS to replace the old dead drive with the new one we just inserted. Sounds like a lot of steps, but it is really easy!

# First, replace the old drive with the new drive in slot 5. Then check the
# status of slot 5 by running "./lsi.sh drives" 

# OPTIONAL: If the drive comes up as Unconfigured(bad) using "./lsi.sh drives"
# just run "./lsi.sh good 5" to make slot five(5) Unconfigured(good). OR,
# manually run the following two(2) MegaCli64 commands to remove any foreign
# configurations and make the drive in enclosure 8, slot 5 Unconfigured(good) 
 ./lsi.sh good 5
    -OR manually type-
 MegaCli64 -CfgForeign -Clear -aALL -NoLog
 MegaCli64 -PDMakeGood -PhysDrv[8:5] -a0 -NoLog

# Configure the new drive in slot 5 for RAID0 through the LSI controller. 
# Make sure the drive is in Unconfigured(good) status according to "./lsi.sh
# drives" script found at the top of this page.
 MegaCli64 -cfgldadd -r0[8:5] WB RA Cached CachedBadBBU -strpsz512 -a0 -NoLog

# Add the new drive in slot 5 (mfid5) into ZFS. The "zpool replace" command
# will replace the old mfid5 (first mention) with the new mfid5 (second
# mention). Our setup resilvered the tank pool at 1.78GB/s using all 6 CPU
# cores at a load of 4.3. Resilvering 3TB of data takes 28 minutes.
 zpool replace tank mfid5 mfid5

# OPTIONAL: Since we removed the virtual drive (slot 5) and then added a
# virtual drive back in, we need to re-apply the default cache settings to the
# RAID0 volumes on the LSI card. Use "./lsi.sh status" to look at slot 5 and
# compare its values to the other drives if you are interested. Setting our
# preferred defaults is easily done using our lsi.sh script found at the
# beginning of this page and can be applied to active, live raids.
 ./lsi.sh setdefaults

# Done!

How fast is a ZFS RAID through the LSI MegaRAID controller ?

Check out our FreeBSD ZFS Raid Speeds, Safety and Capacity page. We examine more than a dozen different ZFS raid configurations and compare each of them.

What happens when the Battery Backup Unit (BBU) is bad, disabled or missing ?

The LSI raid BBU allows the raid controller to cache data before it is written to the raid disks. Without the battery backup unit the raid card cannot guarantee that data in the card's cache will be written to the physical disks if the power goes out. So, if the BBU is bad, if the raid card is running a "battery relearn" test or if the BBU is disabled, then the cached Write-Back policy is automatically disabled and Write-Through is enabled. The result of the direct-to-disk Write-Through policy is that writes become an order of magnitude slower.

As a test we disabled cached Write-Back on our test raid. The bonnie++ benchmark showed writes of 121MB/sec compared to 505MB/sec with Write-Back enabled.
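The exact bonnie++ options are not critical; a plain run against the raid's mount point, for example the hypothetical invocation below against the /tank pool from the ZFS section, is enough to show the write-cache difference:

bonnie++ -d /tank -u root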

You can check the status of your BBU using the following command.

MegaCli64 -AdpBbuCmd -GetBbuStatus -a0 -NoLog | egrep -i 'charge|battery'

You should be able to find new BBU units for as little as $40 searching online. LSI will sell the same unit to you for well over $100 each. The biggest problem is replacing the battery unit since it is in the case and you will need to unrack the server, pull off the top and replace the battery. Probably not that bad if you have one server in the office, but it is quite a job to unrack a hundred raids in a remote data center. These batteries should have been designed to be hot swappable from the rear of the rack mounted chassis in the first place.

What if you do not want to or cannot replace the BBU ?

Truthfully, the raid will work perfectly fine, just with higher latency, more CPU usage and lower transfer speeds. If you want, you can force the raid card to use write-back cache even if the BBU is dead or missing with the following command. We use the CachedBadBBU option on raid cards which work perfectly fine except that the BBU recharge circuit does not work. Please make sure your system is on a reliable UPS as you do not want to lose any data still in cache and not yet written out to disk. After you execute this command your volume's "Current Cache Policy" should include "Write Cache OK if Bad BBU" instead of "No Write Cache if Bad BBU".

MegaCli64 -LDSetProp CachedBadBBU -LAll -aAll -NoLog
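If you later install a working BBU and want the conservative behavior back, the matching option to turn this off should be NoCachedBadBBU:

MegaCli64 -LDSetProp NoCachedBadBBU -LAll -aAll -NoLog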

You may also want to check out this post, auto-learn tests kill disk performance. Remember, if your virtual disk's "Current Cache Policy" is "No Write Cache if Bad BBU" and the raid card goes into battery relearn mode, all write caching is disabled as the battery is temporarily offline. Take a look at the graphs to see the severe performance degradation they experienced. Of course, if you enabled the CachedBadBBU option then you do not have to worry about when battery relearn mode runs as your cache will always be enabled.

