VxFS System Administrator's Guide

The VERITAS File System

Chapter 1

Introduction

The VERITAS File System (VxFS or vxfs) is an extent based, journaling file system intended for use with SCO UnixWare(TM). VxFS provides enhancements that increase SCO UnixWare usability and make the UNIX system more viable for use in the commercial marketplace. VxFS is particularly useful in environments that require high performance and availability and deal with large volumes of data.

The VERITAS File System is available in two forms:

base VxFS
the separately licensed VxFS Advanced feature set, which provides additional functionality and better performance

Note: This guide is intended for use with both base VxFS and the VxFS Advanced feature set. Some of the material covered applies only to the VxFS Advanced feature set.

The following topics are covered in this chapter:

VxFS Features

This chapter provides an overview of most of the VERITAS File System features. Some features are described in more detail in later chapters. The basic features include:

extent based allocation
extent attributes
fast file system recovery
access control lists (ACLs)

In addition, the VxFS Advanced feature set offers the following features:

online administration
online backup
enhanced application interface
enhanced mount options
improved synchronous write performance
support for large file systems (up to 1 terabyte)
support for large files (up to 2 terabytes)
enhanced I/O performance
support for BSD style quotas

The VxFS file system supports a maximum file system size of one terabyte, a file size of two terabytes, and file names up to 255 characters long. VxFS does not have inherent limitations on the maximum number of concurrently mounted file systems or concurrently accessed files. Logical block sizes of 1024 bytes (default), 2048 bytes, 4096 bytes, and 8192 bytes are supported.

The VxFS file system supports all s5 and ufs file system features and facilities except for the following:

support for removing or renaming "." and ".." directory entries (these particular operations are disallowed to preserve file system sanity)

VxFS also provides support for the Discretionary Access Control features of the sfs file system.

Disk Layout Options

Three disk layout formats are available with VxFS:

Version 1

The Version 1 disk layout is the original layout used with earlier releases of VxFS.

Version 2

The Version 2 disk layout supports features such as:

filesets
dynamic inode allocation
enhanced security

Version 4

Version 4 is the new and default VxFS disk layout. It adds support for:

files up to 2 terabytes
file systems up to 1 terabyte
quotas

See Chapter 2, "Disk Layout," for a description of the disk layouts.

Note: The Version 3 disk layout is not supported on SCO UnixWare.

File System Performance Enhancements

The s5 file system and the ufs file systems supplied with SCO UnixWare use block based allocation schemes which provide good random access to files and acceptable latency on small files. For larger files, however, this block based architecture limits throughput. This makes the s5 and ufs file systems less than optimal for commercial environments.

VxFS addresses this file system performance issue by using a different allocation scheme and by providing increased user control over allocation, I/O, and caching policies. The following performance enhancing features are provided by VxFS:

extent based allocation
enhanced mount options
data synchronous I/O
direct I/O and discovered direct I/O
caching advisories
enhanced directory features
explicit file alignment, extent size, and preallocation controls
tuneable I/O parameters
tuneable indirect data extent size
integration with VERITAS Volume Manager(TM) (VxVM®)

Details on the use of some of the preceding features can be found in the following sections and also in Chapter 5, "Performance and Tuning," and Chapter 6, "Application Interface."

To provide a conceptual understanding of how the VxFS allocation scheme differs from block based allocation, an overview of this architecture is given in the section entitled "Extent Based Allocation."

Extent Based Allocation

Disk space is allocated by the system in 512 byte sectors, which are grouped to form a logical block. VxFS supports logical block sizes of 1024, 2048, 4096, and 8192 bytes. The default block size is 1K for file systems up to 8 GB, 2K for file systems up to 16 GB, 4K for file systems up to 32 GB, and 8K for larger file systems.
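The default block size rule above can be written as a small lookup. The following is a minimal sketch, assuming that "GB" means 2**30 bytes and that each "up to" bound is inclusive; the function names are hypothetical and are not part of any VxFS interface.

```python
KB = 1024
GB = 1024 ** 3  # assumption: "GB" in the text means 2**30 bytes


def default_block_size(fs_size_bytes: int) -> int:
    """Return the default logical block size (bytes) for a file system size."""
    if fs_size_bytes <= 8 * GB:
        return 1 * KB
    if fs_size_bytes <= 16 * GB:
        return 2 * KB
    if fs_size_bytes <= 32 * GB:
        return 4 * KB
    return 8 * KB


def sectors_per_block(block_size: int, sector_size: int = 512) -> int:
    """Number of 512 byte sectors grouped into one logical block."""
    return block_size // sector_size
```

For example, a 20 GB file system would default to 4K blocks under these assumptions, and each 4K block would span eight sectors.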

An extent is defined as one or more adjacent blocks of data within the file system. An extent is presented as an address-length pair, which identifies the starting block address and the length of the extent (in file system or logical blocks). When storage is added to a file on a VxFS file system, it is grouped in extents, as opposed to being allocated a block at a time (as is done with the s5 and ufs file systems).
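To illustrate the address-length representation, the sketch below collapses a block-at-a-time list of block numbers into extents. It is only an illustration of the idea; the function name is hypothetical and this is not how VxFS itself performs allocation.

```python
from typing import Iterable, List, Tuple


def blocks_to_extents(blocks: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse block numbers (in file order) into (start, length) extents."""
    extents: List[Tuple[int, int]] = []
    for block in blocks:
        # Extend the last extent when this block directly follows it.
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((block, 1))
    return extents
```

Seven block addresses such as 100, 101, 102, 103, 200, 201, and 500 reduce to three address-length pairs, which is why an extent map can describe a large contiguous file with very few entries.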

By allocating disk space to files in extents, disk I/O to and from a file can be done in units of multiple blocks. This type of I/O can occur if storage is allocated in units of consecutive blocks. For sequential I/O, multiple block operations are considerably faster than block-at-a-time operations. Almost all disk drives accept I/O operations of multiple blocks.

Extent allocation makes the interpretation of addressed blocks from the inode structure only slightly different from that of block based inodes. The ufs file system inode structure contains the addresses of 12 direct blocks, one indirect block, and one double indirect block. An indirect block contains the addresses of other blocks. The ufs indirect block size is 8K and each address is 4 bytes long. A ufs inode therefore can address 12 blocks directly and up to 2048 more blocks through one indirect address.

A VxFS inode is similar to the ufs inode. It references 10 direct extents, each of which is a pair consisting of a starting block address and a length in blocks. The VxFS inode also points to two indirect address extents, which contain the addresses of other extents:

The first indirect address extent is used for single indirection, where each entry in the extent indicates the starting block number of an indirect data extent.
The second indirect address extent is used for double indirection, where each entry in the extent indicates the starting block number of a single indirect address extent.

Each indirect address extent is 8K long and contains 2048 entries. All indirect data extents for a file must be the same size: this value is set when the first indirect data extent is allocated and it is stored in the inode. Directory inodes always use an 8K indirect data extent size. By default, regular file inodes also use an 8K indirect data extent size (this can be changed with vxtunefs), but they allocate and use the indirect data extents in clusters to simulate larger extents.
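The figures in this section can be checked with a little arithmetic. The sketch below is a worked calculation only: it assumes the default 8K indirect data extent size, ignores the clustering of indirect data extents, and counts only the space reachable through indirection (not the 10 direct extents). All names are hypothetical.

```python
KB = 1024

# ufs: one 8K indirect block holding 4 byte addresses.
UFS_INDIRECT_BLOCK = 8 * KB
UFS_ADDRESS_SIZE = 4
ufs_addresses_per_indirect = UFS_INDIRECT_BLOCK // UFS_ADDRESS_SIZE  # 2048

# VxFS: each 8K indirect address extent contains 2048 entries.
VXFS_INDIRECT_ADDRESS_EXTENT = 8 * KB
VXFS_ENTRIES = 2048
vxfs_entry_size = VXFS_INDIRECT_ADDRESS_EXTENT // VXFS_ENTRIES  # 4 bytes


def vxfs_indirect_capacity(indirect_data_extent_size: int = 8 * KB):
    """Bytes addressable through single and double indirection."""
    single = VXFS_ENTRIES * indirect_data_extent_size
    double = VXFS_ENTRIES * VXFS_ENTRIES * indirect_data_extent_size
    return single, double


single_bytes, double_bytes = vxfs_indirect_capacity()
```

With 8K indirect data extents, single indirection covers 16 MB and double indirection covers 32 GB, which shows why a larger indirect data extent size matters for very large files.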

Typed Extents

Note: The information in this section applies to the Version 4 disk layout.

In Version 4, VxFS introduced a new type of inode block map organization for indirect extents known as the typed extents. Each entry in the block map consists of a typed descriptor record containing a type, offset, starting block, and number of blocks.

Indirect as well as data extents use this format to identify logical file offsets and physical disk locations of any given extent. The extent descriptor fields are defined as follows:

type: Uniquely identifies and defines an extent descriptor record, and defines the record's length and format.
offset: Represents the logical file offset in blocks for a given descriptor. Used to optimize lookups and to eliminate hole descriptor entries.
starting block: The starting file system block of the extent.
number of blocks: The number of contiguous blocks in the extent.

Some notes about typed extents:

Indirect address blocks are fully typed and may have variable lengths up to a maximum of 8K (this is the optimum size). On a fragmented file system, indirect extents may be smaller than 8K depending on space availability. VxFS always tries to obtain 8K indirect extents, but will use smaller indirects if needed.
Indirect Data extents are variable in size. This allows files which must go to indirects to continue to allocate large, contiguous extents and take full advantage of VxFS's optimized I/O.
Holes in sparse files require no storage. Since a typed record contains the offset and length of a descriptor, holes are eliminated entirely. A hole is determined by adding the offset and length of a descriptor and comparing the result with the offset of the next record.
There are no limits on the levels of indirection. It is expected, however, that fewer levels will be seen with this format given that data extents are of variable lengths.
New types can be added in the future. Since this format uses a type indicator to determine its record format and content, new types can be added to accommodate future requirements and new functionality.
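The hole detection rule in the notes above (add a descriptor's offset and length, then compare with the next record's offset) can be sketched as follows. The record layout here is a simplified stand-in built from the four fields named in this section; it is not the on-disk format, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TypedDescriptor:
    type: str          # record type (illustrative values only)
    offset: int        # logical file offset, in blocks
    start_block: int   # starting file system block of the extent
    num_blocks: int    # number of contiguous blocks in the extent


def find_holes(descriptors: Sequence[TypedDescriptor]) -> List[Tuple[int, int]]:
    """Return (offset, length) in blocks for each gap between records."""
    ordered = sorted(descriptors, key=lambda d: d.offset)
    holes = []
    for current, following in zip(ordered, ordered[1:]):
        end = current.offset + current.num_blocks
        if end < following.offset:
            # No record covers [end, following.offset): this is a hole.
            holes.append((end, following.offset - end))
    return holes
```

Because the gap is implied by the offsets, no descriptor entry and no storage is spent on the hole itself.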

Currently, the typed format is used on regular files only when indirection is needed. Typed records are longer than the previous format and therefore fewer direct entries can be used in the inode. Newly created files start out using the old format, which allows for 10 direct extents in the inode. The inode's block map is converted to the typed format when indirection is needed. This allows VxFS to take full advantage of both formats.

Extent Attributes

The VxFS file system allocates disk space to files in groups of one or more extents. VxFS also allows applications to control some aspects of the extent allocation for a given file. Extent attributes are the extent allocation policies associated with a file.

The setext and getext commands allow the administrator to set or view extent attributes associated with a file, as well as to preallocate space for a file. Refer to Chapter 3, "Extent Attributes," Chapter 6, "Application Interface," and the getext(1) and setext(1) manual pages for discussions on how to use extent attributes.

The vxtunefs command allows the administrator to set or view the default indirect data extent size. Refer to Chapter 5, "Performance and Tuning," and the vxtunefs(1M) manual page for discussions on how to use the indirect data extent size feature.

Fast File System Recovery

The s5 and ufs file systems rely on the full structural verification by the fsck utility as the only means to recover from a system failure. This involves checking the entire structure, verifying that the file system is intact, and correcting any inconsistencies that are found. For large disk configurations, this process can be very time consuming (see the fsck_vxfs(1M) manual page for more information).

The VxFS file system provides recovery only seconds after a system failure by utilizing a tracking feature called intent logging. Intent logging is a scheme that records pending changes to the file system structure. These changes are recorded in a circular intent log.

During system failure recovery, the VxFS fsck utility performs an intent log replay, which scans the intent log, nullifying or completing file system operations that were active when the system failed. The file system can then be mounted without completing a full structural check of the entire file system. Except for the fact that VxFS file system can be recovered in a few seconds, the intent log recovery feature is not readily apparent to either the user or the system administrator.

When the disk has a hardware failure, the intent log replay may not be able to completely recover the damaged file system structure. In such cases, the full structural mode of fsck provided with VxFS must be run.
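The idea behind intent log replay can be sketched with a generic redo-style log. This is a deliberately simplified model and not the VxFS log format: the record tuple, the "update" and "commit" record kinds, and the function name are all assumptions made for illustration. Transactions with a commit record are completed; transactions without one are nullified by being ignored.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: (transaction id, kind, key, value)
LogRecord = Tuple[int, str, str, str]


def replay_intent_log(log: List[LogRecord], metadata: Dict[str, str]) -> Dict[str, str]:
    """Apply logged updates of committed transactions; skip the rest."""
    committed = {txn for txn, kind, _, _ in log if kind == "commit"}
    for txn, kind, key, value in log:
        if kind == "update" and txn in committed:
            metadata[key] = value  # complete the logged operation
        # Updates from uncommitted transactions are nullified (not applied).
    return metadata
```

Replay touches only the records in the log, so its cost depends on the log size rather than on the size of the file system, which is why recovery takes seconds instead of a full structural check.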

Enhanced Security

The VxFS file system supports the Discretionary Access Control (DAC) mechanism to provide control over user access to files. Permission to access a file with DAC is provided in two forms:

Permission Bits: Permission bits control user access to files. These can be set to specify whether the owner, group, and others (i.e., everyone else) have permission to read, write, or execute a file. Specific users other than the owner cannot be allocated file permissions using this approach. Refer to the ls(1) and chmod(1) manual pages for information on viewing and setting file permissions.
Access Control Lists: An Access Control List (ACL) is composed of a series of entries that identify specific users or groups and their access privileges for a particular file. A file may have its own ACL or may share an ACL with other files. ACLs have the advantage of being able to specify detailed access permissions for multiple users and groups. Refer to the getacl(1) and setacl(1) manual pages for information on viewing and setting ACLs.
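The difference between the two forms can be sketched as two checks. This is a simplified model under stated assumptions: the permission bit check uses the standard owner, group, and other classes, while the ACL check uses a hypothetical list of (kind, id, permissions) entries and ignores details of real ACL evaluation. The function names are hypothetical.

```python
import stat
from typing import Iterable, List, Tuple

_CLASS_BITS = {
    "owner": {"r": stat.S_IRUSR, "w": stat.S_IWUSR, "x": stat.S_IXUSR},
    "group": {"r": stat.S_IRGRP, "w": stat.S_IWGRP, "x": stat.S_IXGRP},
    "other": {"r": stat.S_IROTH, "w": stat.S_IWOTH, "x": stat.S_IXOTH},
}


def allowed_by_bits(mode: int, file_uid: int, file_gid: int,
                    uid: int, gids: Iterable[int], want: str) -> bool:
    """Check 'want' (a subset of 'rwx') against the permission bits."""
    if uid == file_uid:
        cls = "owner"
    elif file_gid in set(gids):
        cls = "group"
    else:
        cls = "other"
    return all(mode & _CLASS_BITS[cls][p] for p in want)


# Hypothetical ACL entry: ("user" or "group", numeric id, permission string)
AclEntry = Tuple[str, int, str]


def allowed_by_acl(acl: List[AclEntry], uid: int,
                   gids: Iterable[int], want: str) -> bool:
    """Check 'want' against per-user and per-group ACL entries."""
    gid_set = set(gids)
    for kind, ident, perms in acl:
        if kind == "user" and ident == uid:
            return all(p in perms for p in want)
    return any(kind == "group" and ident in gid_set
               and all(p in want_ok for p in want)
               for kind, ident, want_ok in acl)
```

The permission bits can only distinguish three classes of user, whereas each ACL entry names one specific user or group.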

Online System Administration

A VxFS file system can be defragmented and resized while it remains online and accessible to users. The following sections contain detailed information about these features.

Defragmentation

Free resources are originally aligned in the most efficient order possible and are allocated to files in a way that is considered by the system to provide optimal performance. When a file system is active for extended periods of time, files grow, shrink, are created, and are removed. Over time, the original ordering of free resources is lost. As this process continues, the file system tends to spread further and further along the disk, leaving unused gaps or fragments between areas that are in use. This process is known as fragmentation. Fragmentation leads to degraded performance because the file system has fewer choices when selecting an extent (a group of contiguous data blocks) to assign to a file.
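The effect of fragmentation on extent selection can be seen by listing the free extents in a free-block map. The sketch below uses a hypothetical in-memory list of booleans (True meaning free) rather than any real VxFS structure.

```python
from typing import List, Sequence, Tuple


def free_extents(free_map: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (start, length) for each run of free blocks in the map."""
    extents = []
    start = None
    for i, is_free in enumerate(free_map):
        if is_free and start is None:
            start = i                      # a free run begins
        elif not is_free and start is not None:
            extents.append((start, i - start))
            start = None                   # the free run ends
    if start is not None:
        extents.append((start, len(free_map) - start))
    return extents


compact = [False] * 4 + [True] * 4         # one free extent of 4 blocks
fragmented = [True, False] * 4             # four free extents of 1 block
```

Both maps have four free blocks, but only the first can satisfy a request for a four block extent; the second forces the file to be spread over four separate extents.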

The s5 file system uses the dcopy utility to reorganize a file system and remove fragmentation, but it has two drawbacks:

The file system must be unmounted.
dcopy is time consuming.

The ufs file system uses the concept of cylinder groups to limit fragmentation. Cylinder groups are self contained sections of a file system that are composed of inodes, data blocks, and bitmaps that indicate free inodes and data blocks. Allocation strategies in ufs attempt to place inodes and data blocks in close proximity. This reduces fragmentation but does not eliminate it.

The VxFS file system provides the online administration utility fsadm to resolve the problem of fragmentation. One of the functions of the fsadm utility is to defragment a mounted file system. To defragment, the fsadm utility:

removes unused space from directories
makes all small files contiguous
consolidates free blocks for file system use

The fsadm utility can be run on demand; it should be scheduled regularly as a cron job (see the fsadm_vxfs(1M) manual page for more information).

Resizing

When a file system is created, it is assigned a specific size. Changes in system usage may result in file systems that are too small or too large for the new usage.

With the ufs and s5 file systems, there are traditionally three solutions to the problem of a file system that is too small:

Move some users to a different file system.
Move a subdirectory of the file system to a new file system.
Copy the entire file system to a larger file system.

When a file system is too large, most file systems make reclaiming the unused space a matter of off-loading the contents of the file system and rebuilding it to a new size. The solutions provided by the ufs and s5 file systems are undesirable as they require that the file system be unmounted, and users are unable to access the file system while it is being modified.

The VxFS file system utility fsadm provides a mechanism to solve these problems without unmounting the file system or interrupting users' productivity. fsadm enables the VxFS file system to be resized while it is mounted. A file system can be expanded or shrunk via fsadm. However, since the VxFS file system may only be mounted on one device, expanding the file system means that the underlying device must also be expandable while the file system is mounted.

VxVM allows expandability by providing virtual disks that can be expanded while being accessed. The VxFS and VxVM packages work together to provide online expansion capability. For additional information about the online expansion capabilities of VxVM, refer to the VERITAS Volume Manager System Administrator's Guide.

Online Backup

The VxFS file system provides a method for performing online backup of data using the "snapshot" feature of VxFS. A snapshot image of a mounted file system is created by "snapshot" mounting another file system, which then becomes an exact read-only copy of the first file system. The original file system is said to be snapped, and the copy is called the snapshot. The snapshot is a consistent view of the snapped file system at the point in time when the snapshot was made.

When changes are made to the snapped file system, the old data is first copied to the snapshot so that it is retained. When the snapshot is read, the old data is returned if the data has changed since the snapshot was made; otherwise, the current data from the snapped file system is returned. Backups are made by one of the following methods:

copying selected files from the snapshot file system (using find and cpio)
backing up the entire file system (using volcopy or fscat)
doing a full or incremental backup (using vxdump)

For detailed information about performing online backups, see Chapter 4, "Online Backup" and the fscat(1M), volcopy(1M), and vxdump(1M) manual pages.
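The copy-on-write behavior of a snapshot can be sketched with two small classes. This is a conceptual model only, assuming a file system represented as a list of blocks; the class names and methods are hypothetical and do not correspond to any VxFS interface.

```python
class SnappedFS:
    """A toy file system: a list of blocks plus an optional snapshot."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshot = None

    def write(self, index, data):
        snap = self.snapshot
        # Copy the old data to the snapshot before the first overwrite only.
        if snap is not None and index not in snap.saved:
            snap.saved[index] = self.blocks[index]
        self.blocks[index] = data


class Snapshot:
    """A read-only, point-in-time view of a SnappedFS."""

    def __init__(self, fs):
        self.fs = fs
        self.saved = {}      # block index -> data as of snapshot time
        fs.snapshot = self

    def read(self, index):
        if index in self.saved:
            return self.saved[index]    # block changed: return old data
        return self.fs.blocks[index]    # unchanged: read the snapped fs
```

Only blocks that change after the snapshot is made consume space in the snapshot, so the snapshot can be much smaller than the file system it represents.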

Application Interface

The VxFS file system conforms to the System V Interface Definition (SVID) requirements and supports access using the Network File System (NFS). In addition to supporting these standard interfaces, VxFS provides enhancements that can be taken advantage of by applications that require performance features not provided by other file systems. These enhancements are introduced in this section and covered in detail in Chapter 6, "Application Interface."

Application Transparency

In most cases, any application designed to run on the s5 or ufs file systems should run transparently on the VxFS file system. The only exceptions are applications that depend on the pathname truncation that occurs when using the s5 file system. These applications are not portable to vxfs or ufs, because s5 truncates pathnames to 14 characters.

Applications that run on a ufs file system should function identically on a VxFS file system.

Expanded Application Facilities

The VxFS file system provides some facilities frequently associated with commercial applications. These facilities make it possible to

preallocate space for a file
specify a fixed extent size for a file
bypass the system buffer cache for file I/O
specify the expected access pattern for a file

Since these facilities are provided using VxFS-specific ioctl system calls, most existing UNIX system applications do not use these facilities. The cp, cpio, and mv utilities use these facilities to preserve extent attributes and allocate space more efficiently. The current attributes of a file can be listed using the getext(1) or ls(1) command. Custom applications can use these facilities to receive the benefits of the resulting performance improvement. For portability reasons, these applications should check what file system type they are using before using these interfaces.

Extended `mount` Options

The VxFS file system supports extended mount options to specify:

enhanced data integrity modes
enhanced performance mode
temporary file system modes
improved synchronous writes

Details pertaining to the VxFS mount options can be found in Chapter 5, "Performance and Tuning," and in the mount_vxfs(1M) manual page.

Enhanced Data Integrity Modes

Note: Performance tradeoffs are associated with these mount options.

The s5 and ufs file systems are "buffered" in the sense that resources are allocated to files and data is written asynchronously to files. File systems are buffered in this way to provide better performance. In general, the buffering schemes work well without compromising data integrity.

If a system failure occurs while a process is allocating space to a file, uninitialized data or data from another file may be present in the extended file after reboot. Also, data written shortly before the system failure may be lost.

Using `blkclear` for Data Integrity

In environments where performance is more important than absolute data integrity, the preceding situation is not of great concern. However, for environments where data integrity is critical, the VxFS file system provides a mount -o blkclear option that guarantees that uninitialized data does not appear in a file.

Using `closesync` for Data Integrity

The VxFS file system provides a mount -o mincache=closesync option, which is useful in desktop environments where users are likely to shut off the power on the machine without halting it first. With the closesync mode, only files that are currently being written when the system crashes or is turned off can lose data. In this mode, any changes to the file are flushed to disk when the file is closed.

Enhanced Performance Mode

The s5 and ufs file systems are asynchronous in the sense that structural changes to the file system are not immediately written to disk. File systems are designed this way to provide better performance. However, if a system failure occurs, recent changes to the file system may be lost: attribute changes to files may disappear, and recently created files may be removed.

The default logging mode provided by VxFS (mount -o log) guarantees that all structural changes to the file system have been logged to disk before the system call returns to the application. If a system failure occurs, fsck replays any recent changes so that no metadata is lost. Recently written file data is lost unless a request was made to sync it to disk.

Using `delaylog` for Enhanced Performance

The VxFS file system provides a mount -o delaylog option, which can be used to increase performance. With the delaylog option, the logging of some structural changes is delayed. If a system failure occurs, recent changes may be lost. This option provides at least the same correctness guarantees that traditional UNIX file systems provide for system failure, along with fast file system recovery.

Temporary File System Modes

On most UNIX systems, temporary file system directories (such as /tmp and /usr/tmp) are commonly used to hold files that do not need to be retained when the system reboots. Since such file systems are temporary, there is no need for the underlying file system to maintain a high degree of structural integrity for these directories.

Using `tmplog` For Temporary File Systems

The VxFS file system provides a mount -o tmplog option that allows the user to get higher performance on temporary file systems. With this option enabled, the logging of practically all operations is delayed for improved performance.

Using `nolog` For Temporary File Systems

The VxFS file system provides a nolog option to mount that disables intent logging. With this option enabled, system performance is considerably improved. However, in the event of a system failure, it is likely that any recently changed files on a nolog file system will contain random data.

Note: The nolog option is only recommended for file systems that will be remade with mkfs after every system failure.

Since the intent log is disabled, fast file system recovery does not work with this option: a full structural check must be run instead.

Improved Synchronous Writes

VxFS provides superior performance for synchronous write applications.

The datainlog option to mount greatly improves the performance of small synchronous writes (typically used by Network File System servers). datainlog is a default option to mount.

The use of the convosync=dsync option to mount improves the performance of applications that require synchronous data writes but not synchronous inode time updates.

Note: Use of the convosync=dsync option violates POSIX semantics.

Enhanced I/O Performance

VxFS provides enhanced I/O performance by applying an aggressive I/O clustering policy, integrating with VxVM, and allowing the system administrator to set application specific parameters on a per-file system basis.

Enhanced I/O Clustering

I/O clustering is a technique of grouping multiple I/O operations together for improved performance. The VxFS I/O clustering policies provide more aggressive I/O clustering than other file systems and offer higher I/O throughput when using large files. When accessing large files, performance is comparable to that provided by raw disk.

VxVM Integration

VxFS interfaces with VxVM to determine the I/O characteristics of the underlying volume and perform I/O accordingly. It also uses this information at mkfs time to perform proper allocation unit alignments to prepare for efficient I/O operations from the kernel.

As part of the VxFS/VxVM integration, VxVM exports a set of I/O parameters to achieve better I/O performance. This interface can be used to achieve enhanced performance for different volume configurations (such as RAID-5, striped, and mirrored volumes). For a RAID-5 volume, full stripe writes are important for good I/O performance. VxFS uses these parameters to issue appropriate I/O requests to VxVM to get better performance from the system.
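To show why knowledge of the volume geometry helps, the sketch below splits a write request at stripe boundaries and marks which pieces cover a full stripe. It is an illustration only: the function name and the idea of passing a single stripe width in bytes are assumptions, and this is not the actual VxFS/VxVM interface.

```python
from typing import List, Tuple


def split_by_stripe(offset: int, length: int,
                    stripe_width: int) -> List[Tuple[int, int, bool]]:
    """Split [offset, offset + length) into (offset, length, is_full_stripe)."""
    if stripe_width <= 0:
        raise ValueError("stripe_width must be positive")
    pieces = []
    end = offset + length
    pos = offset
    while pos < end:
        # End of the stripe that contains 'pos'.
        stripe_end = (pos // stripe_width + 1) * stripe_width
        nxt = min(stripe_end, end)
        is_full = pos % stripe_width == 0 and nxt - pos == stripe_width
        pieces.append((pos, nxt - pos, is_full))
        pos = nxt
    return pieces
```

An aligned request produces only full stripe pieces, while the same amount of data written at an unaligned offset adds partial pieces at the head and tail, which are the expensive case on a RAID-5 volume.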

System administrators can also set application specific parameters on a per-file system basis to improve I/O performance.

Default Indirect Extent Size: On disk layout versions 1 and 2, this value sets the indirect data extent size. All indirect data extents are allocated in this size, provided that a fixed extent size is not set and the file does not already have indirect extents. The Version 4 disk layout uses typed extents, which have variable sized indirects.
Discovered Direct I/O: This value defines the I/O size above which requests are performed as direct I/O.
Maximum Direct I/O Size: This value defines the maximum size of a single direct I/O.

For a discussion on the usage of VxVM integration and performance benefits, refer to Chapter 5, "Performance and Tuning," Chapter 6, "Application Interface," and the vxtunefs(1M) and tunefstab(4) manual pages.
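The interaction between the discovered direct I/O value and the maximum direct I/O size described above can be sketched as a small planning function. This is a hedged illustration: the parameter names are hypothetical, the values in the example are not real defaults, and treating "above" as strictly greater than the threshold is an assumption.

```python
from typing import List, Tuple


def plan_io(request_size: int, discovered_direct_threshold: int,
            max_direct_size: int) -> List[Tuple[str, int]]:
    """Return the (kind, size) pieces used to service one request."""
    if max_direct_size <= 0:
        raise ValueError("max_direct_size must be positive")
    # Requests at or below the threshold go through the buffer cache.
    if request_size <= discovered_direct_threshold:
        return [("buffered", request_size)]
    # Larger requests are done as direct I/O, capped per operation.
    pieces = []
    remaining = request_size
    while remaining > 0:
        size = min(remaining, max_direct_size)
        pieces.append(("direct", size))
        remaining -= size
    return pieces
```

With an illustrative threshold of 256K and a maximum of 1 MB, a 64K request stays buffered, while a 2.5 MB request is issued as three direct I/O operations.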

Quotas

The VxFS file system supports the Berkeley Software Distribution (BSD) style user quotas, which can be used to allocate per-user quotas on VxFS file systems. The quota system limits the use of two principal resources of a file system: files and data blocks. The system administrator can assign users quotas for each of these resources. A quota consists of two limits for each resource:

The hard limit represents an absolute limit on data blocks or files. The user may never exceed the hard limit under any circumstances.
The soft limit is lower than the hard limit and may be exceeded for a limited amount of time. This allows users to temporarily exceed limits if needed, as long as they are back under those limits before the allowed time limit expires.

The system administrator is responsible for assigning hard and soft limits to users.
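The relationship between the two limits can be sketched as a single check made when a user requests more of a resource. This is a simplified model of the behavior described above; the function name, the parameters, and the returned strings are hypothetical, and the same check applies equally to files and to data blocks.

```python
def check_quota(usage: int, request: int, soft: int, hard: int,
                seconds_over_soft: int, grace_seconds: int) -> str:
    """Decide whether a request for 'request' more units is allowed."""
    new_usage = usage + request
    if new_usage > hard:
        return "denied: hard limit"          # never exceeded
    if new_usage > soft:
        if seconds_over_soft >= grace_seconds:
            return "denied: time limit expired"
        return "allowed: over soft limit"    # temporary excess
    return "allowed"
```

A user with a soft limit of 1000 blocks and a hard limit of 1200 can grow to 1100 blocks for a while, but is refused once the allowed time has passed or if the request would exceed 1200 blocks.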

For additional information on quotas, refer to Chapter 7, "Quotas."

Support for Large Files

The changes implemented with the Version 4 disk layout have greatly expanded file system scalability. Because file system structures are no longer in fixed locations, VxFS can now support files up to two terabytes in size (see Chapter 2, "Disk Layout").

File systems can be created or mounted with or without large files by specifying the largefiles or nolargefiles option of the mkfs or mount commands.

If nolargefiles is specified, a file system will not contain any files 2 gigabytes or larger, and large files cannot be created. If largefiles is specified, the file system allows files 2 gigabytes or larger (see the mount_vxfs(1M) and mkfs_vxfs(1M) manual pages).

Note: Be careful when enabling large file system capability. System administration utilities may experience problems if they are not large file aware.