Tuesday, August 5, 2008

grep: locale

I spent almost an hour with "grep -RIE 'class [a-z]+[A-Z]+' *h", trying to find classes whose names begin with a lowercase letter and contain capitals.
The command behaved as if it ignored case, so I also got classes with all-lowercase names.
I dug into the man pages and found the following paragraph:

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.
Yes, "LC_ALL=C grep -RIE 'class [a-z]+[A-Z]+' *h" worked for me, but I didn't expect such behavior with a UTF-8 locale.
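
Here is a minimal sketch of the fix on a throwaway file. The file name sample.h, the three class names, and the variable names are made up for the demonstration. I dropped -R and -I because only one file is searched.

```shell
# Hypothetical sample header, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'class fooBar {};\nclass foobar {};\nclass FooBar {};\n' > "$tmpdir/sample.h"

# Setting LC_ALL=C for this one command makes [a-z] and [A-Z]
# plain ASCII ranges, so only a lowercase-first name that
# contains a capital should match.
matches=$(LC_ALL=C grep -E 'class [a-z]+[A-Z]+' "$tmpdir/sample.h")
echo "$matches"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Putting LC_ALL=C in front of the command changes the locale for that single grep call only, so the rest of the shell session keeps its UTF-8 settings.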

Googling a bit, I found some pages that say:
  • Collating symbols. These look like [.element.], where element is a collating element (i.e. a symbolic name for a multi-character string), and match the value of the collating element in the current locale. This doesn't seem to work in GNU grep.
  • On some locales it might include both the uppercase and lowercase of a given character. In the POSIX locale, this always expands to only the character given. 
So '[A-Z]' is guaranteed to mean exactly A,B,C,...,Z only in the POSIX/C locale.
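
An alternative I did not try in the original search is to use the POSIX character classes [[:lower:]] and [[:upper:]] instead of ranges. They depend on the locale's character classification rather than on its collating order, so they should not mix cases. This is a sketch under that assumption, with the same made-up sample.h as a test file.

```shell
# Hypothetical sample header, created only for this demonstration.
workdir=$(mktemp -d)
printf 'class fooBar {};\nclass foobar {};\nclass FooBar {};\n' > "$workdir/sample.h"

# [[:lower:]] and [[:upper:]] are character classes, not ranges,
# so the collating sequence of the current locale is not consulted.
class_matches=$(grep -E 'class [[:lower:]]+[[:upper:]]+' "$workdir/sample.h")
echo "$class_matches"

# Clean up the temporary directory.
rm -r "$workdir"
```

Unlike the LC_ALL=C version, this one may also match non-ASCII lowercase and uppercase letters when run in a UTF-8 locale.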
