Why and How to Use Cluster Check in Oracle Solaris Cluster 3.3 5/11

July 2011
By Ed McKnight

Oracle Solaris Cluster 3.3 5/11 brings major enhancements to one of the best-kept secrets of Oracle Solaris Cluster: the configuration checking utility, cluster check.

cluster check (the check subcommand of the cluster(1CL) command) was introduced in Oracle Solaris Cluster 3.2 1/09 to replace the original sccheck(1M) utility. With that transition, configuration checking no longer required running the Oracle Explorer Data Collector as part of the check, which was a relief to many users. However, cluster check still did essentially what sccheck(1M) had done: run a set of checks that test for broken or potentially dangerous configurations and offer recommendations for fixing them.

The latest set of enhancements to cluster check is focused on validating clusters during installation and initial configuration. The intent of these enhancements is to enable administrators to perform the most important of the Enterprise Installation Services (EIS) checks themselves. If EIS personnel are involved, they, too, will benefit from using cluster check, as described here. As a matter of fact, the EIS team as well as the Oracle Support Services team played a vital part in defining and implementing these enhancements.

Note: EIS is a legacy Sun service, and installations now come from Oracle Advanced Customer Services (ACS).

In addition to a new focus on installation-time and configuration-time checking, a seemingly small but quite important change was made to the checks themselves: the way the results are titled. I always recommend running cluster check with the -v flag to turn on verbose progress reporting. In the past, the checks were titled with "problem statements" that described a problem. Many people missed the fact that most or all of the checks were passing, so they were alarmed by the titles and thought that many problems were being discovered on their cluster. Now, all existing checks are titled with "check titles" instead of "problem statements." Most of the titles are phrased as a question and, typically, a "yes" answer means the cluster passed the check.

Now for the big stuff: There are over forty new checks, many of which apply both before clustering is installed (recall that scinstall(1M) runs cluster check before configuring a node) and right after the initial configuration of services. And even though the focus is on initial installation and configuration, these checks are still useful over the entire life of the cluster.

There is a related subcommand in cluster(1CL) called list-checks, which does what you'd expect: It lists the checks that are available to be run on a cluster. With the addition of so many new checks, the output is longer than one screen, so a few new flags were added to the subcommand. The -K flag lists not the checks themselves but the unique keywords that the checks are tagged with. Once you have one or more interesting keywords, you can use them with the -k flag to list the checks that are tagged with them. Listing 1 shows the command for listing all unique keywords, and Listing 2 shows the command for listing all checks that are tagged with the keyword svm.

Listing 1: Listing All Keywords

pb52:/tmp# cluster list-checks -K
  /etc/system
  SolarisCluster3.2
  SolarisCluster3.3
  SolarisCluster3.x
  adapter
  cacao
  clustermode
  eeprom
  functional
...
  svm
...
  zfs
  zones
pb52:/tmp#

Listing 2: Listing Checks that Are Tagged with Keyword svm

pb52:/tmp# cluster list-checks -k svm
  M6984130  :   (Critical)   Are the global device namespace mirrors
    using unique metadevice instance numbers across the cluster?
  S6984126  :   (Critical)   Are the Solaris boot disk mirror definitions
    using DID device names?
  S6994578  :   (High)       Are shared SVM disksets using DID devices?
  S6994590  :   (High)       Are local Solaris Volume Manager replicas
    configured correctly?
  S6998141  :   (Variable)   Are the disk slices for the local SVM
    replicas large enough?
pb52:/tmp#
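
Because each check in this output starts with its ID, a colon, and a severity in parentheses, the listing is easy to reduce to a short summary. The following is a minimal sketch, not part of the cluster CLI: summarize_checks is a hypothetical helper name, and the sample text is copied (abridged) from Listing 2 so that the sketch can be tried without a cluster. It assumes the output layout shown above.

```shell
#!/bin/sh
# Sketch: reduce `cluster list-checks` style output to "ID severity" pairs.
# summarize_checks is a hypothetical helper, not a cluster(1CL) feature.
# Assumption: a POSIX awk is available; on Solaris 10 the legacy /usr/bin/awk
# may lack gsub, in which case nawk is the likely substitute.
summarize_checks() {
    # A check's first line looks like "  S6994578  :   (High)   Title..."
    # Wrapped title lines do not have ":" as their second field, so they
    # are skipped.
    awk '$2 == ":" && $3 ~ /^\(.+\)$/ {
        sev = $3
        gsub(/[()]/, "", sev)   # strip the parentheses around the severity
        print $1, sev
    }'
}

# Sample input copied (abridged) from Listing 2.
sample_output='  M6984130  :   (Critical)   Are the global device namespace mirrors
    using unique metadevice instance numbers across the cluster?
  S6994578  :   (High)       Are shared SVM disksets using DID devices?
  S6998141  :   (Variable)   Are the disk slices for the local SVM
    replicas large enough?'

printf '%s\n' "$sample_output" | summarize_checks
```

On a cluster node, the same function could presumably be fed from the real command, for example: cluster list-checks -k svm | summarize_checks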

Once you discover a check that looks interesting, you can get a lot more detail about it by using the verbose flag plus the -C flag and the check's ID, as shown in Listing 3.

Listing 3: Getting Details for a Check

pb52:/tmp# cluster list-checks -v -C I6994574
  I6994574: (Moderate) Fix for GLDv3 interfaces on cluster transport
  vulnerability applied?
Keywords: single, network, interactive
Applicability: Applies for only Solaris Cluster 3.2 with Solaris 10
when certain patches are not installed and GLDv3 network interfaces are
used for private interconnect and VLAN tagging is not configured. This
check is not applicable when a network switch is not used for Solaris
Cluster private interconnect.
Check Logic: Ask user if switches are in use; if yes, and VLAN tagging
is NOT used then the check is violated
Version: 1.2
Revision Date: 11/01/10
pb52:/tmp#

Most check IDs begin with the letter S. These are single-node checks, which means that the question can be answered by looking only at a single node and it needs to be asked on each node separately. Checks beginning with M are multi-node checks, which require data from more than one node. Often multi-node checks are consistency checks.

But most interesting are two new classes of checks: those that begin with I for interactive, such as in the previous example, and those that begin with F for functional.
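
The ID prefix convention described above can be captured in a few lines of shell, which is handy when a script needs to treat interactive or functional checks differently from the rest. This is only a sketch: check_class is a hypothetical helper name, and the mapping covers just the four prefixes mentioned in this article.

```shell
#!/bin/sh
# Sketch: map a check ID to its class, based on the first letter of the ID.
# check_class is a hypothetical helper, not part of the cluster CLI.
check_class() {
    case "$1" in
        S*) echo "single-node" ;;   # answered by looking at one node at a time
        M*) echo "multi-node" ;;    # needs data from more than one node
        I*) echo "interactive" ;;   # asks the user for data
        F*) echo "functional" ;;    # exercises cluster functionality
        *)  echo "unknown"; return 1 ;;
    esac
}

# IDs taken from the listings in this article.
for id in S6994578 M6984130 I6994574 F6968101; do
    printf '%s %s\n' "$id" "$(check_class "$id")"
done
```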

Previously, all checks fetched data from the live cluster (or an Oracle Explorer Data Collector archive, if you specified one or more of those), but interactive checks are capable of asking you for data that can't be obtained programmatically. There aren't many interactive checks, but they provide a way to get data that wasn't available before.

Finally, the biggest new feature: functional checks. Until now, all checks have been read-only. They would never make a change on a cluster or node other than to use a bit of disk space for writing the results. The new functional checks change that. These checks actually exercise some basic functionality of a cluster. Because of this, the only way to run a functional check is to provide a single functional check ID with the -C flag. Neither functional checks nor interactive checks are included in a "default" run where no check IDs or keywords are specified.
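
A wrapper script can enforce the one-ID-at-a-time rule before anything touches the cluster. The sketch below is an illustration under stated assumptions: functional_cmd is a hypothetical helper, /var/tmp/check-results is a placeholder directory, and the function only prints a command line built from the -v, -o, and -C flags that appear in this article's listings. It never runs cluster check itself.

```shell
#!/bin/sh
# Sketch: build (but do not run) the command line for one functional check.
# functional_cmd is a hypothetical helper, not part of the cluster CLI.
functional_cmd() {
    outdir=$1
    shift
    # Functional checks must be requested one at a time.
    if [ "$#" -ne 1 ]; then
        echo "error: give exactly one functional check ID" >&2
        return 1
    fi
    # Functional check IDs begin with the letter F.
    case "$1" in
        F[0-9]*) ;;
        *) echo "error: $1 is not a functional check ID" >&2
           return 1 ;;
    esac
    printf 'cluster check -v -o %s -C %s\n' "$outdir" "$1"
}

# Placeholder output directory; prints the command for the switchover check.
functional_cmd /var/tmp/check-results F6968101
```

Printing the command instead of executing it leaves the decision to run a disruptive check with the administrator.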

The command shown in Listing 4 lists all available functional checks in nonverbose mode.

Listing 4: Listing All Functional Checks

pb52:/tmp# cluster list-checks -k functional
  F6968101  :   (Critical)   Perform resource group switchover
  F6984120  :   (Critical)   Induce cluster transport failure - single adapter
  F6984121  :   (Critical)   Perform cluster shutdown.
  F6984140  :   (Critical)   Perform node panic
pb52:/tmp#

Caution: I strongly recommend that you do not run any functional checks on a cluster that is actually in service for real users or applications. These checks are intended to be run after the cluster is configured but before it's put into service. A few, such as the switchover check F6968101, can be run on a working cluster, but only with care.

Functional checks display a series of menus and questions, and they always ask for final confirmation before taking any action. In general, a check takes a snapshot of the relevant state, makes the change you requested, and then examines the resulting state to see whether all is well.

The most extreme case is the check that shuts down the entire cluster, which is shown in Listing 5.

Listing 5: Functional Check that Shuts Down the Cluster

pb52:/tmp# cluster check -v -o <some dir that survives a reboot> -C F6984121

After all nodes settle down to the boot prompt, you reboot them and wait for the cluster to re-form. Then you complete this check by entering exactly the same command line that you provided the first time, and the check examines the cluster to see that it has come up in a healthy state.
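
Because the second phase depends on repeating the first command line exactly, it can help to record that command line somewhere that survives the reboot. The following is a minimal sketch of that idea, not a feature of cluster check: save_check_cmd and saved_check_cmd are hypothetical helpers, and the paths are placeholders. The demonstration uses a temporary file so it can be tried anywhere; on a real node the file would need to live on persistent storage.

```shell
#!/bin/sh
# Sketch: record a command line before phase one, read it back for phase two.
# save_check_cmd and saved_check_cmd are hypothetical helper names.
save_check_cmd() {
    cmd_file=$1
    shift
    # "$*" joins the arguments with spaces, so this assumes that no
    # argument itself contains whitespace.
    printf '%s\n' "$*" > "$cmd_file"
}

saved_check_cmd() {
    cat "$1"
}

# Demonstration only: a temporary file stands in for a persistent location.
cmd_file=${TMPDIR:-/tmp}/cluster-check-cmd.$$
save_check_cmd "$cmd_file" cluster check -v -o /var/tmp/shutdown-check -C F6984121
saved_check_cmd "$cmd_file"   # prints the recorded command line
rm -f "$cmd_file"
```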

All other checks write interesting things to the log only if the check is violated. Passing checks just note that they ran and passed. Functional checks, on the other hand, always write quite a bit to the log, and if they detect something suspicious in the evaluation phase of the check, they write suggestions for investigating the issue.

These enhancements to cluster check make this utility more important than ever to the installer and the cluster administrator. This utility is not only run on nodes before clustering is configured. It's also useful over the life of the cluster to watch for trouble that can crop up as configurations are changed, and it's now available as a validation tool for newly configured clusters. For more information, see "How to Validate the Cluster" in Chapter 3, "Establishing the Global Cluster," of the Oracle Solaris Cluster Software Installation Guide.

Revision 1, 07/26/2011