Specifying the locale

Regular expressions and locales

Regular expressions are interpreted differently in different locales. Locales can define different collating orders for a character set, so a regular expression that contains collating elements or ranges can match different characters from one locale to the next. If a locale defines several letters as equivalent in collating order, the set of characters that a range or equivalence class matches can change as well. Character classes also vary between locales. For example, the extended regular expression

[A-Za-z]

is intended to match all uppercase and lowercase letters in English. However, it does not match the accented letters of the ISO 8859-1 character set, which occupy most of the code points from 0xC0 to 0xFF. To match all uppercase and lowercase letters in the current locale, use:

[[:alpha:]]

This expression matches every character that belongs to the alpha character class as defined by the current locale. In the POSIX locale, alpha includes the characters of the upper and lower classes. In other locales, it should include all the letters of the alphabet.
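The gap between a hard-coded range and a class-based test can be illustrated with a minimal Python sketch. Python's re module does not support the [[:alpha:]] bracket syntax, so str.isalpha() stands in for it here; note that it relies on Unicode character properties rather than on the current locale, so it only approximates what a locale's alpha class would contain. The function names are hypothetical and chosen for this example.

```python
import re

# ASCII-only range, as a script author might hard-code it.
ASCII_LETTERS = re.compile(r"[A-Za-z]")


def latin1_high_chars():
    """Decode the bytes 0xC0..0xFF as ISO 8859-1 text."""
    return bytes(range(0xC0, 0x100)).decode("iso-8859-1")


def missed_by_ascii_range(text):
    """Return the alphabetic characters that the ASCII range does not match.

    str.isalpha() is a stand-in for [[:alpha:]]; it uses Unicode
    properties, not the current locale (an assumption of this sketch).
    """
    return [ch for ch in text
            if ch.isalpha() and not ASCII_LETTERS.fullmatch(ch)]


if __name__ == "__main__":
    missed = missed_by_ascii_range(latin1_high_chars())
    print(len(missed), "letters missed:", "".join(missed))
```

Every letter in the 0xC0 to 0xFF range is reported as missed, while the two non-letters in that range (the multiplication and division signs at 0xD7 and 0xF7) are correctly left out.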

Because the interpretation of regular expressions depends on the locale, take care when using regular expressions in shell scripts that might run in more than one locale. Also, when constructing a new locale definition, ensure that the character classes you define produce the matches you expect from regular expressions.
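One common way to make a script behave the same everywhere is to pin the locale for the commands that evaluate regular expressions. The following Python sketch assumes a grep command is available on the PATH; the helper names pinned_locale_env and grep_in_locale are hypothetical. It relies on the fact that LC_ALL, when set, overrides LANG and the individual LC_* categories such as LC_COLLATE and LC_CTYPE.

```python
import os
import subprocess


def pinned_locale_env(locale_name="C", base=None):
    """Return a copy of the environment with LC_ALL forced to locale_name."""
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = locale_name  # overrides LANG and every LC_* category
    return env


def grep_in_locale(pattern, text, locale_name="C"):
    """Run 'grep -E pattern' over text under a fixed locale.

    Assumes grep is installed; returns the matching lines as a string.
    """
    result = subprocess.run(
        ["grep", "-E", pattern],
        input=text,
        env=pinned_locale_env(locale_name),
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # With the locale pinned, [[:alpha:]] is interpreted the same way
    # no matter which locale the caller happens to be using.
    print(grep_in_locale("^[[:alpha:]]+$", "abc\n123\n"))
```

Pinning to the C (POSIX) locale trades locale awareness for predictability, so it suits scripts that process ASCII data; a script that must handle local-language text would pin a specific named locale instead, provided that locale is installed.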

See regexp(5) for rules on constructing regular expressions.