Info: (grep) Matching Non-ASCII

 
 3.8 Matching Non-ASCII and Non-printable Characters
 ===================================================
 
 In a regular expression, non-ASCII and non-printable characters other
 than newline are not special, and represent themselves.  For example, in
 a locale using UTF-8 the command ‘grep 'Λ ω'’ (where the white space
 between ‘Λ’ and the ‘ω’ is a tab character) searches for ‘Λ’ (Unicode
 character U+039B GREEK CAPITAL LETTER LAMBDA), followed by a tab (U+0009
 TAB), followed by ‘ω’ (U+03C9 GREEK SMALL LETTER OMEGA).
 
    Suppose you want to limit your pattern to only printable characters
 (or even only printable ASCII characters) to keep your script readable
 or portable, but you also want to match specific non-ASCII characters or
 non-printable characters other than null.  If you are using the ‘-P’
 (‘--perl-regexp’)
 option, PCREs give you several ways to do this.  Otherwise, if you are
 using Bash, the GNU project’s shell, you can represent these characters
 via ANSI-C quoting.  For example, the Bash commands ‘grep $'Λ\tω'’ and
 ‘grep $'\u039B\t\u03C9'’ both search for the same three-character string
 ‘Λ ω’ mentioned earlier.  However, because Bash translates ANSI-C
 quoting before ‘grep’ sees the pattern, this technique should not be
 used to match printable ASCII characters; for example, ‘grep $'\u005E'’
 is equivalent to ‘grep '^'’ and matches any line, not just lines
 containing the character ‘^’ (U+005E CIRCUMFLEX ACCENT).
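 
    The pitfall described above can be sketched as follows.  This is
 only an illustration, assuming Bash (for the ‘$'...'’ quoting) and a
 ‘grep’ on the search path; the sample text and the variable names are
 made up for the example.

```shell
#!/usr/bin/env bash
# Sketch only: assumes Bash for $'...' quoting. The sample text is invented.
sample=$'plain line\nhas ^ caret\ntab\there'

# Bash turns $'\u005E' into a bare ^ before grep runs, so grep sees the
# start-of-line anchor and every line matches.
expanded_count=$(printf '%s\n' "$sample" | grep -c $'\u005E')

# Escaping the caret inside the regular expression matches it literally.
literal_count=$(printf '%s\n' "$sample" | grep -c '\^')

# ANSI-C quoting is a good fit for a non-printable character such as tab.
tab_count=$(printf '%s\n' "$sample" | grep -c $'\t')

echo "expanded: $expanded_count, literal: $literal_count, tab: $tab_count"
```

    With the three-line sample, the first count should cover every
 line, while the other two should each count a single line.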
 
    Since PCREs and ANSI-C quoting are GNU extensions to POSIX, portable
 shell scripts written in ASCII should use other methods to match
 specific non-ASCII characters.  For example, in a UTF-8 locale the
 command ‘grep "$(printf '\316\233\t\317\211\n')"’ is a portable albeit
 hard-to-read alternative to Bash’s ‘grep $'Λ\tω'’.  However, none of
 these techniques will let you put a null character directly into a
 command-line pattern; null characters can appear only in a pattern
 specified via the ‘-f’ (‘--file’) option.
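 
    As a rough sketch of the two techniques in this paragraph, the
 script below builds the pattern from octal escapes and then places a
 null character in a pattern file.  The sample input, the variable
 names, and the use of ‘mktemp’ are assumptions made for illustration;
 the null-character part further assumes GNU ‘grep’, with ‘-a’ added so
 that the input is treated as text.

```shell
# Sketch only: the sample input is invented for illustration.
# Octal escapes spell the UTF-8 bytes of Λ (316 233), tab, and ω (317 211).
pattern=$(printf '\316\233\t\317\211')

# Input built the same way keeps the whole script in ASCII.
if printf 'x \316\233\t\317\211 y\n' | grep -q "$pattern"; then
  tab_match=yes
else
  tab_match=no
fi

# The same letters separated by a space instead of a tab should not match.
if printf 'x \316\233 \317\211 y\n' | grep -q "$pattern"; then
  space_match=yes
else
  space_match=no
fi
echo "tab-separated: $tab_match, space-separated: $space_match"

# A null byte cannot be passed in a command-line argument, so the pattern
# goes into a file read with -f.  Assumes GNU grep and mktemp.
patfile=$(mktemp)
printf 'a\000b\n' > "$patfile"
printf 'a\000b\nab\n' | grep -a -c -f "$patfile"
rm -f "$patfile"
```

    The first two checks rely only on ‘printf’ octal escapes and
 ‘grep -q’, so they do not depend on Bash.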