VxVM User's Guide

Description of the Volume Manager

Chapter 1

Introduction

This chapter provides detailed information about the VERITAS Volume Manager (VxVM). The first part of this chapter describes the Volume Manager and its features; the second part provides general information on disk arrays and Redundant Arrays of Inexpensive Disks (RAID).

The following topics are covered in this chapter:

Volume Manager Overview

To use the VERITAS Volume Manager effectively, you must have a fairly good understanding of its principles of operation. This section provides information needed to understand the Volume Manager. The section begins with descriptions of how the Volume Manager works and the objects that VxVM manipulates. Various features of the Volume Manager are discussed later in this section.

How the Volume Manager Works

The Volume Manager builds virtual devices called volumes on top of physical disks. Volumes are accessed by a UNIX file system, a database, or other applications in the same way physical disk partitions would be accessed. Volumes are composed of other virtual objects that can be manipulated to change the volume's configuration. Volumes and their virtual components are referred to as Volume Manager objects. Volume Manager objects can be manipulated in a variety of ways to optimize performance, provide redundancy of data, and perform backups or other administrative tasks on one or more physical disks without interrupting applications. As a result, data availability and disk subsystem throughput are improved.

To understand the Volume Manager, you must first understand the relationships between physical objects and Volume Manager objects.

Physical Objects

To perform disk management tasks using the Volume Manager, you must understand two physical objects:

Physical disks
Partitions

Physical Disks

A physical disk is the underlying storage device (media), which may or may not be under Volume Manager control. A physical disk can be accessed using a device name such as c#b#t#d#, where c# is the controller, b# is the bus, t# is the target ID, and d# is the disk number. The disk in Figure 1 is disk number 0 with a target ID of 0, and it is connected to controller number 0 in the system.

Figure 1 Example of a Physical Disk

Partitions

A physical disk can be divided into one or more partitions. The partition number, or s#, is given at the end of the device name. Note that a partition can take up an entire physical disk, such as the partition shown in Figure 2.

Figure 2 Example of a Partition
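
As a minimal sketch, the device naming scheme described above can be split into its fields with a short Python function. The helper name parse_device_name and the regular expression are illustrative assumptions; they are not part of the Volume Manager.

```python
import re

# Hypothetical helper (not a VxVM interface): splits a device name of the
# form c#b#t#d#, with an optional trailing s#, into its numeric fields.
DEVICE_NAME = re.compile(r"^c(\d+)b(\d+)t(\d+)d(\d+)(?:s(\d+))?$")

def parse_device_name(name):
    match = DEVICE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a c#b#t#d#[s#] device name: {name!r}")
    controller, bus, target, disk, partition = match.groups()
    return {
        "controller": int(controller),
        "bus": int(bus),
        "target": int(target),
        "disk": int(disk),
        # The partition field is None when the name has no s# suffix.
        "partition": None if partition is None else int(partition),
    }

print(parse_device_name("c0b0t0d0s0"))
```

A name without the s# suffix, such as c0b0t0d0, refers to the physical disk as a whole in this sketch.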

The relationship between physical objects and Volume Manager objects is established when you place a partition from a physical disk under Volume Manager control.

Volume Manager Objects

There are several Volume Manager objects that must be understood before you can use the Volume Manager to perform disk management tasks:

VM disks
Disk groups
Subdisks
Plexes
Volumes

The Volume Manager objects are described in the sections that follow.

VM Disks

A VM disk is a contiguous area of disk space from which the Volume Manager allocates storage. When you place a partition from a physical disk under Volume Manager control, a VM disk is assigned to the partition. Each VM disk corresponds to at least one partition. A VM disk is typically composed of a public region (from which storage is allocated) and a private region (where configuration information is stored).

A VM disk is accessed using a unique disk media name. You can supply this name; otherwise, the Volume Manager assigns one that typically takes the form disk##. Figure 3 shows a VM disk with a disk media name of disk01 that is assigned to the partition c0b0t0d0s0.

Figure 3 Example of a VM Disk

With the Volume Manager, applications access volumes (created on VM disks) rather than partitions.

Disk Groups

A disk group is a collection of VM disks that share a common configuration. A configuration consists of a set of records containing detailed information about existing Volume Manager objects, their attributes, and their relationships. The default disk group is rootdg (the root disk group). Additional disk groups can be created, as necessary. Volumes are created within a disk group; a given volume must be configured from disks belonging to the same disk group. Disk groups allow the administrator to group disks into logical collections for administrative convenience. A disk group and its components can be moved as a unit from one host machine to another.

Subdisks

A subdisk is a set of contiguous disk blocks; subdisks are the basic units in which the Volume Manager allocates disk space. A VM disk can be divided into one or more subdisks. Each subdisk represents a specific portion of a VM disk, which is mapped to a specific region of a physical disk. Since the default name for a VM disk is disk## (such as disk01), the default name for a subdisk is disk##-##. So, for example, disk01-01 would be the name of the first subdisk on the VM disk named disk01.

Figure 4 Example of a Subdisk

A VM disk may contain multiple subdisks, but subdisks cannot overlap or share the same portions of a VM disk. The example given in Figure 5 shows a VM disk, with three subdisks, that is assigned to one partition.

Figure 5 Example of Three Subdisks Assigned to One Partition

Any VM disk space that is not part of a subdisk is considered to be free space, which can be used to create new subdisks.

Plexes

The Volume Manager uses subdisks to build virtual entities called plexes (also referred to as mirrors). A plex consists of one or more subdisks located on one or more disks. There are three ways that data can be organized on the subdisks that constitute a plex:

concatenation
striping (RAID-0)
RAID-5

Concatenation is discussed in this section. Details on striping (RAID-0) and RAID-5 are presented later in the chapter.

Concatenation

Concatenation maps data in a linear manner onto one or more subdisks in a plex. If you were to access all the data in a concatenated plex sequentially, you would first access the data in the first subdisk from beginning to end, then access the data in the second subdisk from beginning to end, and so forth until the end of the last subdisk.

The subdisks in a concatenated plex do not necessarily have to be physically contiguous and can belong to more than one VM disk. Concatenation using subdisks that reside on more than one VM disk is also called spanning.

Figure 6 illustrates concatenation with one subdisk.

Figure 6 Example of Concatenation

Concatenation with multiple subdisks is useful when there is insufficient contiguous space for the plex on any one disk. Such concatenation can also be useful for load balancing between disks, and for head movement optimization on a particular disk.

Figure 7 shows how data would be spread over two subdisks in a spanned plex.

Figure 7 Example of Spanning

Since the first six blocks of data (B1 through B6) consumed most or all of the room on the partition that VM disk disk01 is assigned to, subdisk disk01-01 is alone on VM disk disk01. However, the last two blocks of data, B7 and B8, take up only a portion of the room on the partition that VM disk disk02 is assigned to. That means that the remaining free space on VM disk disk02 can be put to other uses. In this example, subdisks disk02-02 and disk02-03 are currently available for some other disk management tasks.
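
The linear mapping used by concatenation can be sketched in a few lines of Python. This is an illustration only: the plex is modeled as an ordered list of subdisk lengths taken from the Figure 7 description, and the subdisk name disk02-01 and the function locate_block are assumptions made for the example.

```python
# Hypothetical model of a concatenated (spanned) plex: an ordered list of
# (subdisk name, length in blocks). The name disk02-01 is assumed for the
# second subdisk; it is not given in the text.
PLEX = [("disk01-01", 6), ("disk02-01", 2)]

def locate_block(plex, block):
    """Map a zero-based plex block number to (subdisk, offset in subdisk)."""
    if block < 0:
        raise ValueError("block number must be non-negative")
    start = 0
    for name, length in plex:
        if block < start + length:
            return name, block - start
        start += length
    raise ValueError("block lies beyond the end of the plex")

# B1 through B8 in the text correspond to zero-based blocks 0 through 7.
for b in range(8):
    print(f"B{b + 1} ->", locate_block(PLEX, b))
```

Blocks B1 through B6 resolve to the first subdisk and B7 and B8 to the second, which mirrors the sequential access order described for a concatenated plex.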

CAUTION! Spanning a plex across multiple disks increases the chance that a disk failure will result in failure of its volume. Use mirroring or RAID-5 (both described later) to substantially reduce the chance that a single disk failure will result in volume failure.

Volumes

A volume is a virtual disk device that appears to applications, databases, and file systems like a physical disk partition, but does not have the physical limitations of a physical disk partition. A volume consists of one or more plexes, each holding a copy of the data in the volume. Due to its virtual nature, a volume is not restricted to a particular disk or a specific area thereof. The configuration of a volume can be changed, using the Volume Manager interfaces, without causing disruption to applications or file systems that are using the volume. For example, a volume can be mirrored on separate disks or moved to use different disk storage.

A volume can consist of up to 32 plexes, each of which contains one or more subdisks. In order for a volume to be usable, it must have at least one associated plex with at least one associated subdisk. Note that all subdisks within a volume must belong to the same disk group.

The Volume Manager uses the default naming conventions of vol## for volumes and vol##-## for plexes in a volume. Administrators are encouraged to select more meaningful names for their volumes.

A volume with one plex (see Figure 8) contains a single copy of the data.

Figure 8 Example of a Volume with One Plex

Note that volume vol01 in Figure 8 has the following characteristics:

it contains one plex named vol01-01
the plex contains one subdisk named disk01-01
the subdisk disk01-01 is allocated from VM disk disk01

A volume with two or more plexes (see Figure 9) is considered "mirrored" and contains mirror images of the data. Refer to "Mirroring (RAID-1)" for more information on mirrored volumes.

Figure 9 Example of a Volume with Two Plexes

Note that volume vol06 in Figure 9 has the following characteristics:

it contains two plexes named vol06-01 and vol06-02
each plex contains one subdisk
each subdisk is allocated from a different VM disk (disk01 and disk02)

Figure 10 shows how a volume would look if it were set up with a simple, concatenated configuration.

Figure 10 Example of a Volume in a Concatenated Configuration

Relationships Between VxVM Objects

Volume Manager objects are of little use until they are combined to build volumes. Volume Manager objects generally have the following relationship:

VM disks are placed under VxVM control and grouped into disk groups
one or more subdisks (each representing a specific portion of a disk) are combined to form plexes
a volume is composed of one or more plexes

The example in Figure 11 illustrates the relationships between (virtual) Volume Manager objects, as well as how they relate to physical disks. This illustration shows a disk group containing two VM disks (disk01 and disk02). disk01 has a volume with one plex and two subdisks and disk02 has a volume with one plex and a single subdisk.

Figure 11 Relationship Between VxVM Objects
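
The object relationships described above can be modeled, very roughly, with a few Python data classes. The class names, field names, subdisk sizes, and the volume name vol02 are assumptions chosen for illustration; they do not reflect the Volume Manager's internal configuration records.

```python
from dataclasses import dataclass, field

@dataclass
class Subdisk:
    name: str
    vm_disk: str   # name of the VM disk the subdisk is allocated from
    offset: int    # starting block on the VM disk (made-up values below)
    length: int    # length in blocks (made-up values below)

@dataclass
class Plex:
    name: str
    subdisks: list = field(default_factory=list)

@dataclass
class Volume:
    name: str
    plexes: list = field(default_factory=list)

    def is_usable(self):
        # Usable only with at least one plex that has at least one subdisk.
        return any(plex.subdisks for plex in self.plexes)

    def is_mirrored(self):
        return len(self.plexes) >= 2

    def in_one_disk_group(self, disk_group):
        # All subdisks of a volume must come from disks of one disk group.
        return all(sd.vm_disk in disk_group
                   for plex in self.plexes for sd in plex.subdisks)

# The layout described for Figure 11: two VM disks in one disk group.
rootdg = {"disk01", "disk02"}
vol01 = Volume("vol01", [Plex("vol01-01", [
    Subdisk("disk01-01", "disk01", 0, 1000),
    Subdisk("disk01-02", "disk01", 1000, 500),
])])
vol02 = Volume("vol02", [Plex("vol02-01", [
    Subdisk("disk02-01", "disk02", 0, 2000),
])])

for vol in (vol01, vol02):
    print(vol.name, vol.is_usable(), vol.is_mirrored(),
          vol.in_one_disk_group(rootdg))
```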

Volume Manager RAID Implementations

A Redundant Array of Inexpensive Disks (RAID) is a disk array (a group of disks that appear to the system as virtual disks, or volumes) that uses part of its combined storage capacity to store duplicate information about the data stored in the array. This duplicate information makes it possible to regenerate the data in the event of a disk failure.

This section focuses on the Volume Manager's implementations of RAID. For a general description of disk arrays and the various levels of RAID, refer to "Disk Array Overview."

The Volume Manager supports the following levels of RAID:

RAID-0 (Striping)
RAID-1 (Mirroring)
RAID-0 plus RAID-1 (Striping and Mirroring)
RAID-5

The sections that follow describe how the Volume Manager implements each of these RAID levels.

Striping (RAID-0)

Striping is a technique of mapping data so that the data is interleaved among two or more physical disks. More specifically, a striped plex contains two or more subdisks, spread out over two or more physical disks. Data is allocated alternately and evenly to the subdisks of a striped plex.

The subdisks are grouped into "columns," with each physical disk limited to one column. Each column contains one or more subdisks and can be derived from one or more physical disks. The number and sizes of subdisks per column can vary. Additional subdisks can be added to columns, as necessary.

Data is allocated in equal-sized units (called stripe units) that are interleaved between the columns. Each stripe unit is a set of contiguous blocks on a disk. The default stripe unit size is 64 kilobytes.

For example, if there are three columns in a striped plex and six stripe units, data is striped over three physical disks, as illustrated in Figure 12. The first and fourth stripe units are allocated in column 1; the second and fifth stripe units are allocated in column 2; and the third and sixth stripe units are allocated in column 3. Viewed in sequence, the first stripe begins with stripe unit 1 in column 1, stripe unit 2 in column 2, and stripe unit 3 in column 3. The second stripe begins with stripe unit 4 in column 1, stripe unit 5 in column 2, and stripe unit 6 in column 3. Striping continues for the length of the columns (if all columns are the same length) or until the end of the shortest column is reached. Any space remaining at the end of subdisks in longer columns becomes unused space.

Figure 12 Striping Across Three Disks (Columns)

A stripe consists of the set of stripe units at the same positions across all columns. In Figure 12, stripe units 1, 2, and 3 constitute a single stripe.
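
The interleaving of stripe units across columns can be expressed as simple integer arithmetic. The following Python sketch assumes equal-length columns; the function names are hypothetical, and the 64-kilobyte constant is the default stripe unit size stated above.

```python
STRIPE_UNIT = 64 * 1024  # default stripe unit size given in the text

def locate_stripe_unit(unit, columns):
    """Map a one-based stripe unit number to (stripe, column), both one-based."""
    if unit < 1 or columns < 2:
        raise ValueError("need unit >= 1 and at least two columns")
    index = unit - 1
    return index // columns + 1, index % columns + 1

def locate_byte(offset, columns, stripe_unit=STRIPE_UNIT):
    """Map a byte offset in the plex to (zero-based column, byte offset in column)."""
    unit, within = divmod(offset, stripe_unit)
    stripe, column = divmod(unit, columns)
    return column, stripe * stripe_unit + within

# The three-column, six-stripe-unit example from the text.
for unit in range(1, 7):
    stripe, column = locate_stripe_unit(unit, columns=3)
    print(f"stripe unit {unit}: stripe {stripe}, column {column}")
```

With three columns, stripe units 1 and 4 fall in column 1, units 2 and 5 in column 2, and units 3 and 6 in column 3, matching the allocation described for Figure 12.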

Striping is useful if you need large amounts of data to be written to or read from the physical disks quickly by using parallel data transfer to multiple disks. Striping is also helpful in balancing the I/O load from multi-user applications across multiple disks.

CAUTION! Striping a volume, or splitting a volume across multiple disks, increases the chance that a disk failure will result in failure of that volume. For example, if five volumes are striped across the same five disks, then failure of any one of the five disks will require that all five volumes be restored from a backup. If each volume were on a separate disk, only one volume would have to be restored. Use mirroring or RAID-5 (both described later) to substantially reduce the chance that a single disk failure will result in failure of a large number of volumes.

Figure 13 shows a striped plex with three equal-sized, single-subdisk columns. There is one column per physical disk.

Figure 13 Example of a Striped Plex with One Subdisk per Column

Although the example in Figure 13 shows three subdisks that consume all of the VM disks, it is also possible for each subdisk in a striped plex to take up only a portion of the VM disk, thereby leaving free space for other disk management tasks.

Figure 14 shows a striped plex with three columns containing subdisks of different sizes. Each column contains a different number of subdisks. There is one column per physical disk. Although striped plexes are usually created using a single subdisk from each of the VM disks being striped across, it is also possible to allocate space from different regions of the same disk or from another disk (if the plex is grown, for instance).

Figure 14 Example of a Striped Plex with Multiple Subdisks Per Column

Figure 15 shows how a volume would look if it were set up for the simple striped configuration given in Figure 13.

Figure 15 Example of a Volume in a Striped Configuration

Mirroring (RAID-1)

Mirroring is a technique of using multiple mirrors (plexes) to duplicate the information contained in a volume. In the event of a physical disk failure, the mirror on the failed disk becomes unavailable, but the system continues to operate using the unaffected mirrors. Although a volume can have a single plex, at least two plexes are required to provide redundancy of data. Each of these plexes should contain disk space from different disks in order for the redundancy to be effective.

When striping or spanning across a large number of disks, failure of any one of those disks will generally make the entire plex unusable. The chance of one out of several disks failing is sufficient to make it worthwhile to consider mirroring in order to improve the reliability (and availability) of a striped or spanned volume.

Striping Plus Mirroring (RAID-0 + RAID-1)

The Volume Manager supports the combination of striping with mirroring. When used together on the same volume, striping plus mirroring offers the benefits of spreading data across multiple disks while providing redundancy of data.

For striping and mirroring to be effective together, the striped plex and its mirror must be allocated from separate disks. The layout type of the mirror can be concatenated or striped.

Volume Manager and RAID-5

This section describes the Volume Manager's implementation of RAID-5. For general information on RAID-5, refer to "RAID-5."

Although both mirroring (RAID-1) and RAID-5 provide redundancy of data, their approaches differ. Mirroring provides data redundancy by maintaining multiple complete copies of a volume's data. Data being written to a mirrored volume is reflected in all copies. If a portion of a mirrored volume fails, the system will continue to utilize the other copies of the data.

RAID-5 provides data redundancy through the use of parity (a calculated value that can be used to reconstruct data after a failure). While data is being written to a RAID-5 volume, parity is also calculated by performing an exclusive OR (XOR) procedure on data. The resulting parity is then written to the volume. If a portion of a RAID-5 volume fails, the data that was on that portion of the failed volume can be recreated from the remaining data and the parity.

Traditional RAID-5 Arrays

A traditional RAID-5 array is made up of several disks organized in rows and columns, where a column is a number of disks located in the same ordinal position in the array and a row is the minimal number of disks necessary to support the full width of a parity stripe. Figure 16 shows the row and column arrangement of a traditional RAID-5 array.

Figure 16 Traditional RAID-5 Array

This traditional array structure was developed to support growth by adding more rows per column. Striping is accomplished by applying the first stripe across the disks in Row 0, then the second stripe across the disks in Row 1, then the third stripe across Row 0's disks, and so on. This type of array requires all disks (partitions), columns, and rows to be of equal size.

VxVM RAID-5 Arrays

The Volume Manager's RAID-5 array structure differs from the traditional structure. Due to the virtual nature of its disks and other objects, the Volume Manager does not need to use rows. Instead, the Volume Manager uses columns consisting of variable length subdisks (as illustrated in Figure 17). Each subdisk represents a specific area of a disk.

Figure 17 VxVM RAID-5 Array

With the Volume Manager RAID-5 array structure, each column can consist of a different number of subdisks and the subdisks in a given column can be derived from different physical disks. Additional subdisks can be added to the columns, as necessary. Striping (described in "Striping (RAID-0)") is accomplished by applying the first stripe across each subdisk at the top of each column, then another stripe below that, and so on for the entire length of the columns. For each stripe, an equal-sized stripe unit is placed in each column. With RAID-5, the default stripe unit size is 16 kilobytes.

Note: Mirroring of RAID-5 volumes is not currently supported.

Left-Symmetric Layout

There are several layouts for data and parity that can be used in the setup of a RAID-5 array. The layout selected for the Volume Manager's implementation of RAID-5 is the left-symmetric layout. The left-symmetric parity layout provides optimal performance for both random I/Os and large sequential I/Os. In terms of performance, the layout selection is not as critical as the number of columns and the stripe unit size selection.

The left-symmetric layout stripes both data and parity across columns, placing the parity in a different column for every stripe of data. The first parity stripe unit is located in the rightmost column of the first stripe. Each successive parity stripe unit is located in the next stripe, left-shifted one column from the previous parity stripe unit location. If there are more stripes than columns, the parity stripe unit placement begins in the rightmost column again.

Figure 18 illustrates a left-symmetric parity layout consisting of five disks (one per column).

Figure 18 Left-Symmetric Layout

For each stripe, data is organized starting to the right of the parity stripe unit. In Figure 18, data organization for the first stripe begins at P0 and continues to stripe units 0-3. Data organization for the second stripe begins at P1, then continues to stripe unit 4, and on to stripe units 5-7. Data organization proceeds in this manner for the remaining stripes.
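
The placement rules for the left-symmetric layout can be sketched as follows. This Python function is an illustration derived from the description above, not the Volume Manager's code; the function name and the zero-based column numbering are assumptions.

```python
def left_symmetric_row(stripe, columns):
    """Return one stripe as a list of labels, indexed by column, left to right."""
    # Parity starts in the rightmost column and shifts left by one column
    # per stripe, wrapping back to the rightmost column.
    parity_col = (columns - 1) - (stripe % columns)
    row = [None] * columns
    row[parity_col] = f"P{stripe}"
    # Data stripe units begin to the right of the parity unit and wrap around.
    first_unit = stripe * (columns - 1)
    for k in range(columns - 1):
        row[(parity_col + 1 + k) % columns] = str(first_unit + k)
    return row

# Five columns, as in Figure 18.
for s in range(5):
    print(" ".join(f"{label:>3}" for label in left_symmetric_row(s, 5)))
```

For five columns, the first row places stripe units 0-3 in the four leftmost columns with P0 on the right, and the second row places stripe unit 4 in the rightmost column with stripe units 5-7 wrapping around to the left of P1.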

Each parity stripe unit contains the result of an exclusive OR (XOR) procedure performed on the data in the data stripe units within the same stripe. If data on a disk corresponding to one column is inaccessible due to hardware or software failure, data can be restored by XORing the contents of the remaining columns' data stripe units against their respective parity stripe units (for each stripe). For example, if the disk corresponding to the leftmost column in Figure 18 were to fail, the volume would be placed in a degraded mode. While in degraded mode, the data from the failed column could be recreated by XORing stripe units 1-3 against parity stripe unit P0 to recreate stripe unit 0, then XORing stripe units 4, 6, and 7 against parity stripe unit P1 to recreate stripe unit 5, and so on.
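
The XOR relationship between data and parity can be demonstrated with a small Python example. The byte values below are made up, and the helper xor_blocks is a hypothetical name; the example only shows why XORing the surviving stripe units against the parity recreates the missing unit.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Made-up contents for data stripe units 0-3 of the first stripe.
units = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]
p0 = xor_blocks(*units)  # parity stripe unit P0

# The leftmost column fails: rebuild stripe unit 0 from units 1-3 and P0.
rebuilt = xor_blocks(units[1], units[2], units[3], p0)
print(rebuilt == units[0])
```

The same calculation applies to any single missing stripe unit, because XORing a value into the parity twice cancels it out.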

Note: Failure of multiple columns in a plex with a RAID-5 layout will detach the volume. This means that the volume will no longer be allowed to satisfy read or write requests. Once the failed columns have been recovered, it might be necessary to recover the user data from backups.

Logging

Without logging, it is possible for data not involved in any active writes to be lost or silently corrupted if a disk fails and the system also fails. If this double failure occurs, there is no way of knowing whether the data being written to the data portions of the disks or the parity being written to the parity portions was actually written. Therefore, the data recovered for the failed disk may itself be corrupted.

Logging is used to prevent this corruption of recovery data. A log of the new data and parity is made on a persistent device (such as a disk-resident volume or non-volatile RAM). The new data and parity are then written to the disks.

In Figure 19, the recovery of Disk B is dependent on the data on Disk A and the parity on Disk C having both completed. The diagram shows a completed data write and an incomplete parity write causing an incorrect data reconstruction for the data on Disk B.

Figure 19 Incomplete Write

This failure case can be handled by logging all data writes before committing them to the array. In this way, the log can be replayed, causing the data and parity updates to be completed before the reconstruction of the failed drive takes place.
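
The incomplete-write case and its repair by log replay can be reproduced with a toy model in Python. The three "disks" are single-byte values in a dictionary and the log is a plain dictionary of pending writes; all names and values are assumptions made for the example and say nothing about the actual log format.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Made-up stripe: data on disks A and B, parity on disk C.
disks = {"A": b"\x0a", "B": b"\x0b"}
disks["C"] = xor(disks["A"], disks["B"])
original_b = disks["B"]

# A write to disk A: the new data and new parity are logged first.
new_a = b"\xa0"
log = {"A": new_a, "C": xor(new_a, disks["B"])}

# The data write completes, but the system fails before the parity write,
# and disk B fails as well.
disks["A"] = new_a
del disks["B"]

# Without the log, reconstruction mixes new data with stale parity.
wrong_b = xor(disks["A"], disks["C"])

# Replaying the log completes both updates before reconstruction.
disks.update(log)
right_b = xor(disks["A"], disks["C"])

print(wrong_b == original_b, right_b == original_b)
```

The first reconstruction yields data that was never stored on Disk B, while the reconstruction after replay returns the original contents.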

Logs are associated with a RAID-5 volume by being attached as additional, non-RAID-5 layout plexes. More than one log plex can exist per RAID-5 volume, in which case the log areas are mirrored.

Read-Modify-Write

When you write to a RAID-5 array, the following steps may be followed for each stripe involved in the I/O:

1. The data stripe units to be updated with new write data are accessed and read into internal buffers. The parity stripe unit is read into internal buffers.
2. The parity is updated to reflect the contents of the new data region. First, the contents of the old data undergo an exclusive OR (XOR) with the parity (logically removing the old data). The new data is then XORed into the parity (logically adding the new data). The new data and new parity are written to a log.
3. The new parity is written to the parity stripe unit. The new data is written to the data stripe units. All stripe units are written in a single write.

This process is known as a read-modify-write cycle, which is the default type of write for RAID-5. If a disk fails, both data and parity stripe units on that disk become unavailable. The disk array is then said to be operating in a degraded mode.

The read-modify-write sequence is illustrated in Figure 20.

Figure 20 Read-Modify-Write
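
The parity arithmetic of a read-modify-write cycle can be sketched in Python as follows. The sketch covers only the XOR steps; the logging step and the actual disk I/O are omitted, and the function name read_modify_write and the sample byte values are assumptions.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(stripe, parity, updates):
    """stripe: list of data stripe units; updates: {index: new data}.
    Returns (new stripe, new parity) without reading unaffected units."""
    new_stripe = list(stripe)
    new_parity = parity
    for index, new_data in updates.items():
        new_parity = xor(new_parity, stripe[index])  # logically remove old data
        new_parity = xor(new_parity, new_data)       # logically add new data
        new_stripe[index] = new_data
    return new_stripe, new_parity

# Made-up stripe of four data stripe units and its parity.
stripe = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]
parity = b"\x00\x00"
for unit in stripe:
    parity = xor(parity, unit)

new_stripe, new_parity = read_modify_write(stripe, parity, {2: b"\xaa\xbb"})

# Cross-check against parity recomputed from the whole updated stripe.
check = b"\x00\x00"
for unit in new_stripe:
    check = xor(check, unit)
print(new_parity == check)
```

Only the stripe unit being replaced and the parity are consulted, which is why this cycle suits small writes.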

Full-Stripe Writes

When large writes (writes that cover an entire data stripe) are issued, the read-modify-write procedure can be bypassed in favor of a full-stripe write. A full-stripe write is faster than a read-modify-write because it does not require the read process to take place. Eliminating the read cycle reduces the I/O time necessary to write to the disk. A full-stripe write procedure consists of the following steps:

1. All the new data stripe units are XORed together, generating a new parity value. The new data and new parity are written to a log.
2. The new parity is written to the parity stripe unit. The new data is written to the data stripe units. The entire stripe is written in a single write.

Figure 21 shows a full-stripe write.

Figure 21 Full-Stripe Write

Reconstruct-Writes

When 50 percent or more of the data disks are undergoing writes in a single I/O, a reconstruct-write can be used. A reconstruct-write saves I/O time because it does not require a read of the parity region; it requires only a read of the unaffected data (which amounts to less than 50 percent of the stripe units in the stripe), which is then XORed with the new data to generate the new parity.

A reconstruct-write procedure consists of the following steps:

1. Unaffected data is read from the unchanged data stripe unit(s).
2. The new data is XORed with the old, unaffected data to generate a new parity stripe unit. The new data and resulting parity are logged.
3. The new parity is written to the parity stripe unit. The new data is written to the data stripe units. All stripe units are written in a single write.

Figure 22 illustrates a reconstruct-write. A reconstruct-write is preferable to a read-modify-write in this situation because it reads only the unaffected data disks, rather than reading the data disks being updated and the parity disk.

Figure 22 Reconstruct-Write
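
The conditions given for the three RAID-5 write types, together with the parity calculation of a reconstruct-write, can be sketched in Python. The selection function is a simplified reading of the thresholds stated in the text (a write covering the whole stripe, a write covering 50 percent or more of the data disks, and everything else); it is an assumption that the choice is made exactly this way. All function names and byte values are hypothetical, and logging is omitted.

```python
def xor_all(blocks):
    """XOR a non-empty list of equal-length byte strings together."""
    result = bytes(len(blocks[0]))  # all-zero starting value
    for block in blocks:
        result = bytes(x ^ y for x, y in zip(result, block))
    return result

def choose_write_type(units_written, data_columns):
    if units_written == data_columns:
        return "full-stripe write"      # no reads needed
    if 2 * units_written >= data_columns:
        return "reconstruct-write"      # read only the unaffected data
    return "read-modify-write"          # read old data and old parity

def reconstruct_write(stripe, updates):
    """stripe: list of data stripe units; updates: {index: new data}."""
    unaffected = [unit for i, unit in enumerate(stripe) if i not in updates]
    new_stripe = [updates.get(i, unit) for i, unit in enumerate(stripe)]
    new_parity = xor_all(unaffected + list(updates.values()))
    return new_stripe, new_parity

# Made-up stripe of four data stripe units; three of them are rewritten.
stripe = [b"\x01", b"\x02", b"\x04", b"\x08"]
updates = {0: b"\x10", 1: b"\x20", 2: b"\x40"}
print(choose_write_type(len(updates), len(stripe)))
new_stripe, new_parity = reconstruct_write(stripe, updates)
print(new_parity == xor_all(new_stripe))
```

When every data stripe unit is rewritten, the list of unaffected units is empty and the same calculation reduces to the full-stripe case, where the parity is the XOR of the new data alone.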

Hot-Relocation

Hot-relocation is the ability of a system to automatically react to I/O failures on redundant (mirrored or RAID-5) VxVM objects and restore redundancy and access to those objects. The Volume Manager detects I/O failures on VxVM objects and relocates the affected subdisks to disks designated as spare disks and/or free space within the disk group. The Volume Manager then reconstructs the VxVM objects that existed before the failure and makes them redundant and accessible again.

When a partial disk failure occurs (that is, a failure affecting only some subdisks on a disk), redundant data on the failed portion of the disk is relocated and the existing volumes composed of the unaffected portions of the disk remain accessible.

Note: Hot-relocation is only performed for redundant (mirrored or RAID-5) subdisks on a failed disk. Non-redundant subdisks on a failed disk are not relocated, but the system administrator is notified of their failure.

How Hot-Relocation Works

The hot-relocation feature is enabled by default. No system administrator intervention is needed to start hot-relocation when a failure occurs.

The hot-relocation daemon, vxrelocd, is responsible for monitoring VxVM for events that affect redundancy and performing hot-relocation to restore redundancy. vxrelocd also notifies the system administrator (via electronic mail) of failures and any relocation and recovery actions. See the vxrelocd(1M) manual page for more information on vxrelocd.

The vxrelocd daemon starts during system startup and monitors the Volume Manager for failures involving disks, plexes, or RAID-5 subdisks. When such a failure occurs, it triggers a hot-relocation attempt.

A successful hot-relocation process involves:

1. Detecting VxVM events resulting from the failure of a disk, plex, or RAID-5 subdisk.

2. Notifying the system administrator (and other users designated for notification) of the failure and identifying the affected VxVM objects. This is done through electronic mail.

3. Determining which subdisks can be relocated, finding space for those subdisks in the disk group, and relocating the subdisks. Notifying the system administrator of these actions and their success or failure.

4. Initiating any recovery procedures necessary to restore the volumes and data. Notifying the system administrator of the recovery's outcome.

Note: Hot-relocation does not guarantee the same layout of data or the same performance after relocation. The system administrator may therefore wish to make some configuration changes after hot-relocation occurs.

How Space is Chosen for Relocation

A spare disk must be initialized and placed in a disk group as a spare before it can be used for replacement purposes. If no disks have been designated as spares when a failure occurs, VxVM automatically uses any available free space in the disk group in which the failure occurs. If there is not enough spare disk space, a combination of spare space and free space is used. The system administrator can designate one or more disks as hot-relocation spares within each disk group. Disks can be designated as spares using the Visual Administrator, vxdiskadm, or vxedit (as described in ). Disks designated as spares do not participate in the free space model and should not have storage space allocated on them.

When selecting space for relocation, hot-relocation preserves the redundancy characteristics of the VxVM object that the relocated subdisk belongs to. For example, hot-relocation ensures that subdisks from a failed plex are not relocated to a disk containing a mirror of the failed plex. If redundancy cannot be preserved using any available spare disks and/or free space, hot-relocation does not take place. If relocation is not possible, the system administrator is notified and no further action is taken.
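
The selection rules described above can be approximated with a short Python function. This is a deliberately simplified sketch, not the logic used by vxrelocd: it relocates one subdisk to a single disk, prefers designated spares over ordinary free space, and refuses any disk that holds a mirror of the failed plex. The function name, the dictionary keys, and the sample disks are all assumptions.

```python
def pick_relocation_disk(disks, needed, excluded):
    """disks: list of dicts with 'name', 'spare' (bool) and 'free' (blocks).
    excluded: names of disks that hold a mirror of the failed plex.
    Returns the chosen disk name, or None when relocation is not possible."""
    candidates = [d for d in disks
                  if d["name"] not in excluded and d["free"] >= needed]
    # Prefer designated spares; fall back to ordinary free space.
    # The sort is stable, so ties keep their original order.
    candidates.sort(key=lambda d: not d["spare"])
    return candidates[0]["name"] if candidates else None

# Made-up disk group contents after a failure.
disks = [
    {"name": "disk02", "spare": False, "free": 500},
    {"name": "disk03", "spare": True, "free": 500},
    {"name": "disk04", "spare": False, "free": 900},
]

# disk02 holds the surviving mirror, so it must not receive the subdisk.
print(pick_relocation_disk(disks, needed=400, excluded={"disk02"}))
print(pick_relocation_disk(disks, needed=800, excluded={"disk02"}))
print(pick_relocation_disk(disks, needed=1000, excluded=set()))
```

A result of None corresponds to the case in which redundancy cannot be preserved and hot-relocation does not take place.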

When hot-relocation takes place, the failed subdisk is removed from the configuration database and VxVM takes precautions to ensure that the disk space used by the failed subdisk is not recycled as free space.

For information on how to take advantage of hot-relocation, refer to Chapter 2 and Chapter 3 of .

Volume Resynchronization

When storing data redundantly, using mirrored or RAID-5 volumes, the Volume Manager takes necessary measures to ensure that all copies of the data match exactly. However, under certain conditions (usually due to complete system failures), small amounts of the redundant data on a volume can become inconsistent or unsynchronized. Aside from normal configuration changes (such as detaching and reattaching a plex), this can only occur when a system crashes while data is being written to a volume.

Data is written to the mirrors of a volume in parallel, as is the data and parity in a RAID-5 volume. If a system crash occurs before all the individual writes complete, it is possible for some writes to complete while others do not, resulting in the data becoming unsynchronized. This is very undesirable. For mirrored volumes, it can cause two reads from the same region of the volume to return different results if different mirrors are used to satisfy the read request. In the case of RAID-5 volumes, it can lead to parity corruption and incorrect data reconstruction.

When the Volume Manager recognizes this situation, it needs to make sure that all mirrors contain exactly the same data and that the data and parity in RAID-5 volumes agree. This process is called volume resynchronization. For volumes that are part of disk groups that are automatically imported at boot time (such as rootdg), the resynchronization process takes place when the system reboots.

Not all volumes may require resynchronization after a system failure. Volumes that were never written or that were quiescent (i.e., had no active I/O) when the system failure occurred could not have had any outstanding writes and thus do not require resynchronization. The Volume Manager notices when a volume is first written and marks it as dirty. When a volume is closed by all processes or stopped cleanly by the administrator, all writes will have completed and the Volume Manager removes the dirty flag for the volume. Only volumes that are marked dirty when the system reboots require resynchronization.

The exact process of resynchronization depends on the type of volume. RAID-5 volumes that contain RAID-5 logs can simply "replay" those logs. If no logs are available, the volume is placed in reconstruct-recovery mode and all parity is regenerated. For mirrored volumes, resynchronization is achieved by placing the volume in recovery mode (also called read-writeback recovery mode) and resynchronizing all data in the volume in the background. This allows the volume to be available for use while recovery is ongoing.
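
The choice of recovery method can be summarized as a small decision function. This is a sketch of the rules stated above, not the Volume Manager's implementation; the function name, its parameters, and the returned strings are hypothetical.

```python
def recovery_method(volume_type, dirty, has_valid_raid5_log=False):
    """Return the resynchronization method for a volume after a system failure.

    volume_type is "mirror" or "raid5" (labels chosen for this sketch).
    """
    if not dirty:
        # Volumes that are not marked dirty need no resynchronization.
        return "none"
    if volume_type == "raid5":
        if has_valid_raid5_log:
            return "replay RAID-5 logs"
        return "reconstruct-recovery (regenerate all parity)"
    if volume_type == "mirror":
        return "read-writeback recovery (background resynchronization)"
    raise ValueError(f"unknown volume type: {volume_type!r}")


print(recovery_method("raid5", dirty=True, has_valid_raid5_log=True))
print(recovery_method("raid5", dirty=True))
print(recovery_method("mirror", dirty=True))
print(recovery_method("mirror", dirty=False))
```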

The process of resynchronization can be computationally expensive and can have a significant impact on system performance. The recovery process alleviates some of this impact by spreading out recoveries so that no single disk or controller is overloaded. Additionally, for very large volumes or for a very large number of volumes, the resynchronization process can take a long time. These effects can be addressed by using Dirty Region Logging for mirrored volumes, or by making sure that RAID-5 volumes have valid RAID-5 logs.

Dirty Region Logging

Dirty Region Logging (DRL) is an optional property of a volume, used to provide a speedy recovery of mirrored volumes after a system failure. DRL keeps track of the regions that have changed due to I/O writes to a mirrored volume and uses this information to recover only the portions of the volume that need to be recovered.

If DRL is not used and a system failure occurs, all mirrors of the volumes must be restored to a consistent state by copying the full contents of the volume between its mirrors. This process can be lengthy and I/O intensive, and it also recovers areas of the volumes that are already consistent.

DRL logically divides a volume into a set of consecutive regions. It keeps track of volume regions that are being written to. A dirty region log is maintained that contains a status bit representing each region of the volume. For any write operation to the volume, the regions being written are marked dirty in the log before the data is written. If a write causes a log region to become dirty when it was previously clean, the log is synchronously written to disk before the write operation can occur. On system restart, the Volume Manager will recover only those regions of the volume that are marked as dirty in the dirty region log.
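
The region bookkeeping described above can be sketched as follows. This is a simplified model, not the on-disk log format; the class name, the region size, and the use of abstract block units are assumptions made for the example.

```python
class DirtyRegionLog:
    """Toy dirty region log: one status bit per region (hypothetical model)."""

    def __init__(self, volume_size, region_size):
        self.region_size = region_size
        region_count = -(-volume_size // region_size)   # ceiling division
        self.bits = [False] * region_count
        self.log_flushes = 0    # number of synchronous log writes

    def regions_for(self, offset, length):
        """Regions touched by a write of `length` blocks starting at `offset`."""
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        return range(first, last + 1)

    def before_write(self, offset, length):
        """Mark regions dirty before the data is written.

        Returns True if the log had to be written to disk first, which is
        the case only when a previously clean region becomes dirty.
        """
        newly_dirty = [r for r in self.regions_for(offset, length)
                       if not self.bits[r]]
        for r in newly_dirty:
            self.bits[r] = True
        if newly_dirty:
            self.log_flushes += 1
        return bool(newly_dirty)

    def regions_to_recover(self):
        """On restart, only the regions marked dirty are recovered."""
        return [r for r, dirty in enumerate(self.bits) if dirty]


log = DirtyRegionLog(volume_size=1024, region_size=128)   # 8 regions
log.before_write(offset=100, length=50)   # spans regions 0 and 1: log flushed
log.before_write(offset=120, length=4)    # region 0 already dirty: no flush
log.before_write(offset=700, length=10)   # region 5: log flushed
print(log.regions_to_recover())           # [0, 1, 5]
print(log.log_flushes)                    # 2
```

After a simulated failure, three of the eight regions would be recovered instead of the whole volume.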

Log subdisks are used to store the dirty region log of a volume that has DRL enabled. A volume with DRL has at least one log subdisk; multiple log subdisks can be used to mirror the dirty region log. Each log subdisk is associated with one of the volume's plexes. Only one log subdisk can exist per plex. If the plex contains only a log subdisk and no data subdisks, that plex can be referred to as a log plex. The log subdisk can also be associated with a regular plex containing data subdisks, in which case the log subdisk risks becoming unavailable in the event that the plex must be detached due to the failure of one of its data subdisks.

If the vxassist command is used to create a dirty region log, it creates a log plex containing a single log subdisk, by default. A dirty region log can also be created "manually" by creating a log subdisk and associating it with a plex. In the latter case, the plex may contain both a log subdisk and data subdisks.

Only a limited number of bits can be marked dirty in the log at any time. The dirty bit for a region is not cleared immediately after writing the data to the region. Instead, it remains marked as dirty until the corresponding volume region becomes the least recently used. If a bit for a given region is already marked dirty and another write to the same region occurs, it is not necessary to write the log to the disk before the write operation can occur.
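
The least-recently-used behavior can be modeled with an ordered map. This is a sketch only: the class name and the limit of two dirty bits are hypothetical, and the model assumes that all writes to a region have completed by the time its bit is cleared.

```python
from collections import OrderedDict


class BoundedDirtyLog:
    """Toy log that allows at most max_dirty regions to be dirty at once."""

    def __init__(self, max_dirty):
        self.max_dirty = max_dirty
        self.dirty = OrderedDict()   # region -> None, least recently used first
        self.log_writes = 0

    def before_write(self, region):
        """Return True if the log must be written before the data write."""
        if region in self.dirty:
            # Already dirty: refresh its position, no log write is needed.
            self.dirty.move_to_end(region)
            return False
        if len(self.dirty) >= self.max_dirty:
            # Clear the bit of the least recently used region
            # (assumed to have no writes still in progress).
            self.dirty.popitem(last=False)
        self.dirty[region] = None
        self.log_writes += 1
        return True


log = BoundedDirtyLog(max_dirty=2)
for region in [1, 2, 1, 3]:
    log.before_write(region)

print(list(log.dirty))   # [1, 3]
print(log.log_writes)    # 3
```

The second write to region 1 costs no log write, and region 2 is the one cleared when region 3 is marked dirty because it was used least recently.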

Note: DRL adds a small I/O overhead for most write access patterns.

Redo Log Volume Configuration

A redo log is a log of changes to the database data. No logs of the changes to the redo logs are kept by the database, so the database itself cannot provide information about which sections require resilvering. Redo logs are also written sequentially, and since traditional dirty region logs are most useful with randomly-written data, they are of minimal use for reducing recovery time for redo logs. However, VxVM can significantly reduce the number of dirty regions by modifying the behavior of its Dirty Region Logging feature to take advantage of sequential access patterns. This decreases the amount of data needing recovery and significantly reduces recovery time impact on the system.

The enhanced interfaces for redo logs allow the database software to inform VxVM when a volume is to be used as a redo log. This allows VxVM to modify the volume's DRL behavior to take advantage of the access patterns. Since the improved recovery time depends on dirty region logs, redo log volumes should be configured as mirrored volumes with dirty region logs.

Volume Manager Rootability

The Volume Manager provides the capability of placing the root and stand file systems and the initial swap device under Volume Manager control -- this is called rootability. The root disk (that is, the disk containing the root file system) can be put under VxVM control through the process of encapsulation, which converts existing partitions on that disk to volumes. Once under VxVM control, the root and swap devices appear as volumes and provide the same characteristics as other VxVM volumes. A volume that is configured for use as a swap area is referred to as a swap volume; a volume that contains the root file system is referred to as a root volume; and a volume that contains the stand file system is referred to as a stand volume.

It is possible to mirror the rootvol, swapvol, and standvol volumes, which are required for a successful boot of the system. This provides complete redundancy and recovery capability in the event of disk failure. Without Volume Manager rootability, the loss of the root, swap, or stand partition would prevent the system from being booted from surviving disks.

Mirroring disk drives critical to booting ensures that no single disk failure will leave the system unusable. Therefore, a suggested configuration would be to mirror the critical disk onto another available disk (using the vxdiskadm command). If the disk containing the root, stand, and swap partitions fails, the system can be rebooted from the disk containing the root mirror. For more information on mirroring the boot (root) disk and system recovery procedures, see the "Recovery" appendix.

Booting With Root Volumes

Ordinarily, when the operating system is booted, the root file system, stand file system, and swap area need to be available for use very early in the boot procedure (which is long before user processes can be run to load the Volume Manager configuration and start volumes). The root, stand, and swap device configurations must be completed prior to starting the Volume Manager. Starting VxVM's vxconfigd daemon as part of the init process is too late to configure volumes for use as a root or swap device.

To circumvent this restriction, the mirrors of the rootvol, standvol, and swapvol volumes can be accessed by the system during startup. During startup, the system sees the rootvol, standvol, and swapvol volumes as regular partitions and accesses them using standard partition numbering. Therefore, rootvol, standvol, and swapvol volumes must be created from contiguous disk space that is also mapped by a single partition for each. Due to this restriction, it is not possible to stripe or span the primary plex (that is, the plex used for booting) of a rootvol, standvol, or swapvol volume. Similarly, any mirrors of these volumes that might need to be used for booting cannot be striped or spanned.

Boot-time Volume Restrictions

The rootvol, standvol, and swapvol volumes differ from other volumes in that they have very specific restrictions on the configuration of the volumes:

The root volume (rootvol) must exist in the default disk group, rootdg. Although other volumes named rootvol may be created in disk groups other than rootdg, only the rootvol in rootdg can be used to boot the system.
A rootvol volume has a specific minor device number: minor device 0. Similarly, swapvol has minor device number 1.
Restricted mirrors of rootvol, standvol, and swapvol devices will have "overlay" partitions created for them. An "overlay" partition is one that exactly encompasses the disk space occupied by the restricted mirror. During boot (before the rootvol, standvol, and swapvol volumes are fully configured), the default volume configuration uses the overlay partition to access the data on the disk. (See "Booting With Root Volumes.")
Although it is possible to add a striped mirror to a rootvol device for performance reasons, you cannot stripe the primary plex or any mirrors of rootvol that may be needed for system recovery or booting purposes if the primary plex fails.
rootvol, standvol, and swapvol cannot be spanned or contain a primary plex with multiple non-contiguous subdisks.
When mirroring parts of the boot disk, the disk being mirrored to must be large enough to hold the data on the original plex, or mirroring may not work.
rootvol, standvol, and swapvol cannot be Dirty Region Logging volumes.

In addition to these requirements, it is a good idea to have at least one contiguous, cylinder-aligned mirror for each of the volumes for root and swap. This makes it easier to convert these from volumes back to regular disk partitions (during an operating system upgrade, for example).
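
The restrictions listed above can be expressed as a simple checklist. The sketch below is illustrative only: the dictionary keys and the function name are hypothetical and do not correspond to any Volume Manager record format or command output.

```python
def boot_volume_violations(vol):
    """Check a description of rootvol, standvol, or swapvol against the restrictions.

    `vol` is a dict with hypothetical keys: name, disk_group, minor,
    primary_plex_striped, primary_plex_contiguous, drl_enabled.
    """
    problems = []
    name = vol["name"]

    if name == "rootvol" and vol["disk_group"] != "rootdg":
        problems.append("only the rootvol in rootdg can be used to boot")

    required_minor = {"rootvol": 0, "swapvol": 1}.get(name)
    if required_minor is not None and vol["minor"] != required_minor:
        problems.append(f"{name} must have minor device number {required_minor}")

    if vol["primary_plex_striped"]:
        problems.append("the primary plex cannot be striped")
    if not vol["primary_plex_contiguous"]:
        problems.append("the primary plex cannot be spanned or non-contiguous")
    if vol["drl_enabled"]:
        problems.append("boot volumes cannot use Dirty Region Logging")

    return problems


good_root = {
    "name": "rootvol", "disk_group": "rootdg", "minor": 0,
    "primary_plex_striped": False, "primary_plex_contiguous": True,
    "drl_enabled": False,
}
bad_swap = {
    "name": "swapvol", "disk_group": "rootdg", "minor": 5,
    "primary_plex_striped": False, "primary_plex_contiguous": True,
    "drl_enabled": True,
}

print(boot_volume_violations(good_root))   # []
for problem in boot_volume_violations(bad_swap):
    print(problem)
```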

Volume Manager Daemons

Two daemons must be running in order for the Volume Manager to work properly:

vxconfigd
vxiod

The Volume Manager Configuration Daemon

The Volume Manager configuration daemon (vxconfigd) is responsible for maintaining Volume Manager disk and disk group configurations. vxconfigd communicates configuration changes to the kernel and modifies configuration information stored on disk. vxconfigd must be running before Volume Manager operations can be performed.

Starting the Volume Manager Configuration Daemon

vxconfigd is invoked by startup scripts during the boot procedure.

To determine whether vxconfigd is enabled, enter the following command:

vxdctl mode

The following message appears if vxconfigd is both running and enabled:

mode: enabled

The following message appears if vxconfigd is running, but not enabled:

mode: disabled

To enable vxconfigd, enter the following:

vxdctl enable

The following message appears if vxconfigd is not running:

mode: not-running

If the latter message appears, start vxconfigd as follows:

vxconfigd

Once started, vxconfigd automatically becomes a background process.
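
The procedure above can be summarized as a mapping from the output of vxdctl mode to the next command to enter. The sketch below works only on the message text shown in this section and does not run any Volume Manager command; the function name is hypothetical.

```python
def next_command(vxdctl_mode_output):
    """Map the output of `vxdctl mode` to the command to enter next.

    Returns None when vxconfigd is already running and enabled.
    """
    mode = vxdctl_mode_output.strip()
    if mode == "mode: enabled":
        return None
    if mode == "mode: disabled":
        return "vxdctl enable"    # running, but not enabled
    if mode == "mode: not-running":
        return "vxconfigd"        # start the daemon
    raise ValueError(f"unexpected vxdctl output: {mode!r}")


for message in ["mode: enabled", "mode: disabled", "mode: not-running"]:
    print(message, "->", next_command(message))
```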

By default, vxconfigd issues errors to the console. However, vxconfigd can be configured to log errors to a log file.

For more information on the vxconfigd daemon, refer to the vxconfigd(1M) and vxdctl(1M) manual pages.

The Volume I/O Daemon

The volume extended I/O daemon (vxiod) allows for some extended I/O operations without blocking calling processes.

For more detailed information about vxiod, refer to the vxiod(1M) manual page.

Starting the Volume I/O Daemon

vxiod daemons are started at system boot time. There are typically several vxiod daemons running at all times. Rebooting after your initial installation should start vxiod.

Verify that vxiod daemons are running by typing the following command:

vxiod

Since vxiod is a kernel thread and is not visible to users via ps, this is the only way to see if any vxiods are running.

If any vxiod daemons are running, the following should be displayed:

10 volume I/O daemons running

where 10 is the number of vxiod daemons currently running.

If no vxiod daemons are currently running, start some by entering the command:

vxiod set 10

where 10 can be replaced with the desired number of vxiod daemons. It is generally recommended that at least one vxiod daemon exist per CPU.
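
The check against the one-daemon-per-CPU recommendation can be sketched as follows. The code parses only the message format shown above and does not run vxiod itself. The function names are hypothetical, and treating unrecognized output as zero running daemons is an assumption of this sketch.

```python
import re


def running_vxiods(vxiod_output):
    """Extract the daemon count from a line such as '10 volume I/O daemons running'."""
    match = re.search(r"(\d+) volume I/O daemons running", vxiod_output)
    # Assumption: output that does not match means no daemons are running.
    return int(match.group(1)) if match else 0


def suggested_command(vxiod_output, cpu_count):
    """Return a `vxiod set` command if fewer than one daemon per CPU is running."""
    if running_vxiods(vxiod_output) >= cpu_count:
        return None
    return f"vxiod set {cpu_count}"


print(running_vxiods("10 volume I/O daemons running"))                    # 10
print(suggested_command("10 volume I/O daemons running", cpu_count=4))   # None
print(suggested_command("2 volume I/O daemons running", cpu_count=4))    # vxiod set 4
```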

Volume Manager Interfaces

The Volume Manager supports the following user interfaces:

Visual Administrator -- The Visual Administrator is a graphical user interface to the Volume Manager. The Visual Administrator provides visual elements such as icons, menus, and forms to ease the task of manipulating Volume Manager objects. In addition, the Visual Administrator acts as an interface to some common file system operations.
Command Line Interface -- The Volume Manager command set consists of a number of comprehensive commands that range from simple commands requiring minimal user input to complex commands requiring detailed user input. Many of the Volume Manager commands require a thorough understanding of Volume Manager concepts. Most Volume Manager commands require appropriate privileges.
Volume Manager Support Operations -- The Volume Manager Support Operations interface (vxdiskadm) provides a menu-driven interface for performing disk and volume administration functions.

Volume Manager objects created by one interface are fully inter-operable and compatible with those created by the other interfaces.

Disk Array Overview

This section provides an overview of traditional disk arrays.

Performing I/O to disks is a slow process because disks are physical devices that require time to move the heads to the correct position on the disk before reading or writing. If all of the read or write operations are done to individual disks, one at a time, the read-write time can become unmanageable. Performing these operations on multiple disks can help to reduce this problem.

A disk array is a collection of disks that appears to the system as one or more virtual disks (also referred to as volumes). The virtual disks created by the software controlling the disk array look and act (to the system) like physical disks. Applications that interact with physical disks should work exactly the same with the virtual disks created by the array.

Data is spread across several disks within an array, which allows the disks to share I/O operations. The use of multiple disks for I/O improves I/O performance by increasing the data transfer speed and the overall throughput for the array.

Figure 23 illustrates a standard disk array.

Figure 23 Standard Disk Array

Redundant Arrays of Inexpensive Disks (RAID)

A Redundant Array of Inexpensive Disks (RAID) is a disk array set up so that part of the combined storage capacity is used for storing duplicate information about the data stored in the array. The duplicate information allows you to regenerate the data in case of a disk failure.

Several levels of RAID exist. These are introduced in the following sections.

Note: The Volume Manager supports RAID levels 0, 1, and 5 only.

For information on the Volume Manager's implementations of RAID, refer to "Volume Manager RAID Implementations."

RAID-0

Although it does not provide redundancy, striping is often referred to as a form of RAID, known as RAID-0. The Volume Manager's implementation of striping is described in "Striping (RAID-0)." RAID-0 offers a high data transfer rate and high I/O throughput, but suffers lower reliability and availability than a single disk.

RAID-1

Mirroring is a form of RAID, which is known as RAID-1. The Volume Manager's implementation of mirroring is described in "Mirroring (RAID-1)." Mirroring uses equal amounts of disk capacity to store the original plex and its mirror. Everything written to the original plex is also written to any mirrors. RAID-1 provides redundancy of data and offers protection against data loss in the event of physical disk failure.

RAID-2

RAID-2 uses bitwise striping across disks and uses additional disks to hold Hamming code check bits. RAID-2 is described in a University of California at Berkeley research paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)," by David A. Patterson, Garth Gibson, and Randy H. Katz (1987).

RAID-2 deals with error detection, but does not provide error correction. RAID-2 also requires large system block sizes, which limits its use.

RAID-3

RAID-3 uses a parity disk to provide redundancy. RAID-3 distributes the data in stripes across all but one of the disks in the array. It then writes the parity in the corresponding stripe on the remaining disk. This disk is the parity disk.

Figure 24 illustrates a RAID-3 disk array.

Figure 24 RAID-3 Disk Array

The user data is striped across the data disks. Each stripe on the parity disk contains the result of an exclusive OR (XOR) procedure done on the data in the data disks. If the data on one of the disks is inaccessible due to hardware or software failure, data can be restored by XORing the contents of the remaining data disks with the parity disk. The data on the failed disk can be rebuilt from the output of the XOR process.
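
The XOR procedure can be demonstrated with a few bytes of sample data. This is a minimal sketch of the parity arithmetic only; the three data "disks" and their contents are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: a stripe unit on each of three data disks (sample values).
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xff\x00\x0f"]

# The parity disk holds the XOR of the data stripe units.
parity = xor_blocks(data)

# Suppose the second data disk fails. XORing the remaining data disks
# with the parity disk regenerates the lost stripe unit.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])   # True
```

The same arithmetic rebuilds any single missing disk, including the parity disk itself.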

RAID-3 typically uses a very small stripe unit size (also historically known as a stripe width), sometimes as small as one byte per disk (which requires special hardware) or one sector (block) per disk.

Figure 25 illustrates a data write to a RAID-3 array.

Figure 25 Data Writes to RAID-3

The parity disk model uses less disk space than mirroring, which uses equal amounts of storage capacity for the original data and the copy.

The RAID-3 model is often used with synchronized spindles in the disk devices. This synchronizes the disk rotation, providing constant rotational delay. This is useful in large parallel writes.

RAID-3 type performance can be emulated by configuring RAID-5 (described later) with very small stripe units.

RAID-4

RAID-4 introduces the use of independent-access arrays (also used by RAID-5). With this model, the system does not typically access all disks in the array when executing a single I/O procedure. This is achieved by ensuring that the stripe unit size is sufficiently large that the majority of I/Os to the array will only affect a single disk (for reads).

An array attempts to provide the highest rate of data transfer by spreading the I/O load as evenly as possible across all the disks in the array. In RAID-3, the I/O load is spread across the data disks, as shown in Figure 25, and each write is executed on all the disks in the array. The data on the data disks is XORed and the parity is written to the parity disk.

RAID-4 maps data and uses parity in the same manner as RAID-3, by striping the data across all the data disks and XORing the data for the information on the parity disk. The difference between RAID-3 and RAID-4 is that RAID-3 accesses all the disks at one time and RAID-4 accesses each disk independently. This allows the RAID-4 array to execute multiple I/O requests simultaneously (provided they map to different member disks), while RAID-3 can only execute one I/O request at a time.

RAID-4's read performance is much higher than its write performance. It performs well with applications requiring high read I/O rates. RAID-4 performance is not as high in small, write-intensive applications.

The parity disk can cause a bottleneck in the performance of RAID-4. This is because writes that take place simultaneously on the data disks must each wait their turn to write to the parity disk. The transfer rate of the entire RAID-4 array in a write-intensive application is limited to the transfer rate of the parity disk.

Since RAID-4 is limited to parity on one disk only, it is less useful than RAID-5.

RAID-5

RAID-5 is similar to RAID-4, using striping to spread the data over all the disks in the array and using independent access. However, RAID-5 differs from RAID-4 in that the parity is striped across all the disks in the array, rather than being concentrated on a single parity disk. This breaks the write bottleneck caused by the single parity disk write in the RAID-4 model.

Figure 26 illustrates parity locations in a RAID-5 array configuration. Every stripe has a column containing a parity stripe unit and columns containing data. The parity is spread over all of the disks in the array, reducing the write time for large independent writes because the writes do not have to wait until a single parity disk can accept the data.

Figure 26 Parity Locations in a RAID-5 Model
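
The effect of distributing parity can be illustrated by counting which disk receives the parity write for each stripe. The rotation order used for RAID-5 below, and the choice of the last disk as the RAID-4 parity disk, are assumptions made for this sketch and are not necessarily the layout used by the Volume Manager.

```python
from collections import Counter


def raid4_parity_disk(stripe, ndisks):
    """RAID-4: parity always lives on one dedicated disk (assumed to be the last)."""
    return ndisks - 1


def raid5_parity_disk(stripe, ndisks):
    """RAID-5: parity moves to a different disk on each stripe (assumed rotation)."""
    return (ndisks - 1 - stripe) % ndisks


def parity_write_counts(parity_disk, stripes, ndisks):
    """Count parity writes per disk when every stripe is written once."""
    return Counter(parity_disk(s, ndisks) for s in range(stripes))


raid4 = parity_write_counts(raid4_parity_disk, stripes=8, ndisks=4)
raid5 = parity_write_counts(raid5_parity_disk, stripes=8, ndisks=4)

print(sorted(raid4.items()))   # [(3, 8)]
print(sorted(raid5.items()))   # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

With a single parity disk, all eight parity writes queue on one disk; with rotated parity, each disk receives two.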

For additional information on RAID-5 and how it is implemented by the Volume Manager, refer to "Volume Manager and RAID-5."
