Use of this document

This is a study note on using the grep() family in the \(base\) package and the \(stringr\) package for regular expressions. For more details on the study material see:

Prerequisites

# essential
library(stringr)

1. Overview

\(stringr\) is built on top of the \(stringi\) package. \(stringr\) is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks. \(stringi\), on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: \(stringi\) has 234 functions to \(stringr\)'s 46.

Reference: R for Data Science (book)

1.1 Syntax standards and Modes

There are different syntax standards for regular expressions, and R offers two:

  • POSIX extended regular expressions (the default in \(base\); \(stringr\) uses the ICU regular expression engine through \(stringi\), whose syntax largely overlaps).
  • Perl-like regular expressions (in \(base\), enabled with perl = TRUE).

The grep() family of functions operates in one of three modes:

| | fixed = TRUE | fixed = FALSE (default) |
| --- | --- | --- |
| perl = TRUE | N.A. | Perl-style regular expressions |
| perl = FALSE (default) | exact (literal) matching | POSIX 1003.2 extended regular expressions (default) |
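The three modes can be compared on a small input. This is a minimal sketch in base R; the vector `x` and the result names are made up for illustration.

```r
x <- c("a.b", "axb", "a+b")

# Default mode (fixed = FALSE, perl = FALSE): "." is a metacharacter
# that matches any single character, so all three strings match.
default_mode <- grepl("a.b", x)
default_mode
# [1] TRUE TRUE TRUE

# fixed = TRUE: the pattern is taken literally, so only "a.b" matches.
fixed_mode <- grepl("a.b", x, fixed = TRUE)
fixed_mode
# [1]  TRUE FALSE FALSE

# perl = TRUE: Perl-style syntax such as lookahead becomes available.
# "a(?=\\+)" matches an "a" only when it is followed by a literal "+".
perl_mode <- grepl("a(?=\\+)", x, perl = TRUE)
perl_mode
# [1] FALSE FALSE  TRUE
```

The same `fixed` and `perl` arguments are accepted by the other members of the family, such as sub(), gsub(), regexpr() and gregexpr().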

1.2 Common Functions

| Functions | \(base\) | \(stringr\) |
| --- | --- | --- |
| Detect, identify | grep(..., value = FALSE) | stringr::str_detect(), stringr::str_which(), stringr::str_count() |
| Locate | regexpr(), gregexpr() | stringr::str_locate(), stringr::str_locate_all() |
| Subset | grep(..., value = TRUE) | stringr::str_sub(), stringr::str_subset(), stringr::str_extract(), stringr::str_extract_all(), stringr::str_match() |
| Mutate, replace | gsub() | stringr::str_sub(), stringr::str_replace(), stringr::str_replace_all(), stringr::str_to_lower(), stringr::str_to_upper(), stringr::str_to_title() |
| Join and split | strsplit() | stringr::str_c(), stringr::str_dup(), stringr::str_split_fixed(), stringr::str_split(), stringr::str_glue(), stringr::str_glue_data() |
| Order | - | stringr::str_order(), stringr::str_sort() |
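A few of the pairs in the table can be tried side by side. The sketch below assumes the \(stringr\) package is installed; the `fruit` vector is a made-up input.

```r
library(stringr)

fruit <- c("apple", "banana", "cherry")

# Detect: base returns the matching indices, stringr a logical vector.
base_idx   <- grep("an", fruit)          # 2
str_hits   <- str_detect(fruit, "an")    # FALSE TRUE FALSE

# Subset: keep only the elements that match.
base_sub   <- grep("an", fruit, value = TRUE)
str_sub_el <- str_subset(fruit, "an")

# Locate: position of the first match in each element.
base_pos   <- regexpr("an", fruit)       # -1 where there is no match
str_pos    <- str_locate(fruit, "an")    # matrix with start and end columns

# Replace every match.
base_rep   <- gsub("a", "A", fruit)
str_rep    <- str_replace_all(fruit, "a", "A")

# Split on a separator.
base_split <- strsplit("a,b,c", ",")
str_split_ <- str_split("a,b,c", ",")
```

Note the argument order: the \(base\) functions take the pattern first, while the \(stringr\) functions take the string first, which makes them easier to use in a pipe.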

1.3 Key Construction

A key is a string, constructed from regular expressions under a given syntax standard, that matches a specific pattern.

Example keys using the POSIX standard in R:

  • phone number: "\\D?\\d{3}\\D?\\d{3}\\D?\\d{4}" for ddd_ddd_dddd
  • email: "\\w{1,}@\\w{1,}\\.[A-Za-z]{1,}" for __@__.__
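The two keys can be checked against a few sample strings. This is a sketch only: the sample inputs are invented, the backslashes are doubled as R string literals require, and the email key spells the letter range as [A-Za-z].

```r
phone_key <- "\\D?\\d{3}\\D?\\d{3}\\D?\\d{4}"
email_key <- "\\w{1,}@\\w{1,}\\.[A-Za-z]{1,}"

# Hypothetical inputs, not taken from real data.
phones <- c("555-123-4567", "(555)123 4567", "12345")
emails <- c("jane@example.com", "not an email")

phone_ok <- grepl(phone_key, phones)
phone_ok
# [1]  TRUE  TRUE FALSE

email_ok <- grepl(email_key, emails)
email_ok
# [1]  TRUE FALSE
```

Both keys are deliberately loose. They are useful for spotting candidates in text, not for validating phone numbers or email addresses.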

The composition of a key:

  • Grouping
    • ...: Exact (AND) matching of the fixed pattern in ....
    • ...|...: Exact (OR) matching of either fixed pattern in ....
    • [...]{...} / [...]: Fuzzy matching. Matches a pattern defined by the character classes or character list in [...], followed by a quantifier. When no quantifier is given, the default is {1}, meaning exactly one occurrence.
  • Positioning
    • "^...$": anchors the pattern to the start and the end of the string.
    • "\\b..." / "\\B...": matches the pattern provided it is / is not at an edge of a word.
  • Retrieve
    • (...): captures the expression matched inside (...) so it can be returned or reused in a replacement.
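The building blocks above can be combined into small keys. The following base R sketch uses invented inputs and is meant only to illustrate grouping, anchoring and retrieval together.

```r
x <- c("cat", "dog", "cow", "category")

# Grouping with OR, anchored: the whole string must be "cat" or "dog".
or_match <- grepl("^(cat|dog)$", x)
or_match
# [1]  TRUE  TRUE FALSE FALSE

# Fuzzy matching, anchored: exactly three lower-case letters.
fuzzy_match <- grepl("^[a-z]{3}$", x)
fuzzy_match
# [1]  TRUE  TRUE  TRUE FALSE

# A character list without a quantifier behaves like {1}.
no_quant  <- grepl("^[cd]og$", x)
one_quant <- grepl("^[cd]{1}og$", x)

# Retrieve: capture two words and swap them in the replacement.
swapped <- sub("^(\\w+) (\\w+)$", "\\2 \\1", "hello world")
swapped
# [1] "world hello"
```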

2. Regular expression syntax

| Category | Description |
| --- | --- |
| Metacharacters | reserved characters that carry information about repeats and location |
| Operators | define the matching operation |
| Character classes | define the pattern type, often combined with a character list [...] |
| Quantifiers | specify how many repetitions of the pattern to match |
| Position of pattern within the string | defines where in the string the match is attempted |
| Escape sequences | backslash notation for characters that are hard to write directly |

2.1 Metacharacters

Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string. This is accomplished with the help of metacharacters that have a specific meaning, i.e. $ * + . ? [ ] ^ { } | ( ) \. Moreover, \ suppresses the special meaning of a metacharacter in a regular expression, similar to its usage in escape sequences. Normally we would escape a metacharacter with a single \, but in an R string literal \ is itself a special character, so the backslash has to be escaped too: "\\." and "\\?" match a literal . and ? respectively.
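The effect of the double backslash can be seen on a small example. This is a base R sketch with made-up inputs.

```r
x <- c("1.5", "125", "what?")

# Unescaped "." is a metacharacter: it matches any character,
# so "125" matches as well as "1.5".
unescaped <- grepl("1.5", x)
unescaped
# [1]  TRUE  TRUE FALSE

# Escaped "\\." matches only a literal dot.
escaped_dot <- grepl("1\\.5", x)
escaped_dot
# [1]  TRUE FALSE FALSE

# Escaped "\\?" matches a literal question mark.
escaped_q <- grepl("\\?", x)
escaped_q
# [1] FALSE FALSE  TRUE

# The R string "\\." holds two characters: a backslash and a dot.
n_chars <- nchar("\\.")
n_chars
# [1] 2
```

When the pattern contains no regular expression syntax at all, fixed = TRUE avoids the escaping entirely, e.g. grepl("?", x, fixed = TRUE).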

2.2 Operators

Essential Operator
Operators Description
[...]: Character list. Matches any one of the characters inside the square brackets. We can also use - inside the brackets to specify a range of characters.
|: OR operator. Matches the pattern on either side of the |.
(...): Capture operator. Groups part of a regular expression. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string. Each group can then be referred to using \\N, where N is the position of the (...) group counting from the left. This is called a backreference.
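Each of the three operators can be illustrated with a short base R call. The inputs below are invented for the example.

```r
# Character list: "a" or "e" in the third position.
list_match <- grepl("gr[ae]y", c("gray", "grey", "groy"))
list_match
# [1]  TRUE  TRUE FALSE

# Range inside a character list: any digit.
range_match <- grepl("[0-9]", c("a1", "ab"))
range_match
# [1]  TRUE FALSE

# OR operator: either alternative may match.
or_match <- grepl("^(gray|grey)$", c("gray", "grey", "groy"))
or_match
# [1]  TRUE  TRUE FALSE

# Backreference in a replacement: reorder year, month and day.
date_swapped <- sub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", "2024-01-31")
date_swapped
# [1] "31/01/2024"

# Backreference inside the pattern: any character repeated twice in a row.
doubled <- grepl("(.)\\1", c("hello", "world"))
doubled
# [1]  TRUE FALSE
```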

2.3 Character classes

  • [:...:]: POSIX character classes have to be used inside square brackets, e.g. "[[:digit:]]".
Character classes Description
[:digit:] or \d digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9]
\D non-digits, equivalent to [^0-9]
[:lower:] lower-case letters, equivalent to [a-z]
[:upper:] upper-case letters, equivalent to [A-Z]
[:alpha:] alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z]
[:alnum:] alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9]
\w word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_]
\W non-word characters, equivalent to [^A-Za-z0-9_]
[:xdigit:] hexadecimal digits (base 16), 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f, equivalent to [0-9A-Fa-f]
[:blank:] blank characters, i.e. space and tab.
[:space:] space characters: tab, newline, vertical tab, form feed, carriage return, space.
\s whitespace characters, equivalent to [[:space:]]
\S non-whitespace characters, equivalent to [^[:space:]]
[:punct:] punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:graph:] graphical (human readable) characters, equivalent to [[:alnum:][:punct:]]
[:print:] printable characters, equivalent to [[:alnum:][:punct:]] plus the space character
[:cntrl:] control characters, like \n or \r, equivalent to [\x00-\x1F\x7F]
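A few of the classes can be tried on a small vector. This is a base R sketch with made-up inputs; the last two lines use perl = TRUE so that the range is interpreted by character code rather than by locale.

```r
x <- c("abc", "ABC", "123", "a_1", "a b", "!?")

# Only digits from start to end.
all_digit <- grepl("^[[:digit:]]+$", x)
all_digit
# [1] FALSE FALSE  TRUE FALSE FALSE FALSE

# Only letters from start to end.
all_alpha <- grepl("^[[:alpha:]]+$", x)
all_alpha
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

# Only word characters (letters, digits, underscore).
all_word <- grepl("^\\w+$", x)
all_word
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# Contains at least one whitespace character.
has_space <- grepl("[[:space:]]", x)
has_space
# [1] FALSE FALSE FALSE FALSE  TRUE FALSE

# Only punctuation from start to end.
all_punct <- grepl("^[[:punct:]]+$", x)
all_punct
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE

# A range such as [A-z] also covers the ASCII punctuation that sits
# between "Z" and "a", so [A-Za-z] is the safer spelling for letters.
wide_range   <- grepl("[A-z]", "_", perl = TRUE)      # TRUE
letter_range <- grepl("[A-Za-z]", "_", perl = TRUE)   # FALSE
```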

2.4 Quantifiers

Quantifiers specify how many repetitions of the pattern.

Quantifiers Description
. matches any single character (strictly a wildcard rather than a quantifier; it is often combined with one, as in .*)
* matches pattern at least 0 times.
+ matches pattern at least 1 time.
? matches pattern at most 1 time.
{n} matches pattern exactly n times.
{n,} matches pattern at least n times.
{n,m} matches pattern between n and m times (inclusive).
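The quantifiers can be compared by applying each one to the same inputs. In this base R sketch the patterns are anchored with ^ and $ so that the count of "a" is the only thing that varies; the vector `x` is made up.

```r
x <- c("b", "ab", "aab", "aaab")

star     <- grepl("^a*b$", x)       # zero or more "a"
plus     <- grepl("^a+b$", x)       # one or more "a"
optional <- grepl("^a?b$", x)       # zero or one "a"
exactly2 <- grepl("^a{2}b$", x)     # exactly two "a"
atleast2 <- grepl("^a{2,}b$", x)    # two or more "a"
one_two  <- grepl("^a{1,2}b$", x)   # one or two "a"
any_one  <- grepl("^.b$", x)        # any single character, then "b"

rbind(star, plus, optional, exactly2, atleast2, one_two, any_one)
#           [,1]  [,2]  [,3]  [,4]
# star      TRUE  TRUE  TRUE  TRUE
# plus     FALSE  TRUE  TRUE  TRUE
# optional  TRUE  TRUE FALSE FALSE
# exactly2 FALSE FALSE  TRUE FALSE
# atleast2 FALSE FALSE  TRUE  TRUE
# one_two  FALSE  TRUE  TRUE FALSE
# any_one  FALSE  TRUE FALSE FALSE
```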

2.5 Position of pattern within the string

Syntax Description
^ matches the start of the string.
$ matches the end of the string.
\b matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of the string.
\B matches the empty string provided it is not at an edge of a word.
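The difference between string edges and word edges shows up clearly when the same word appears in different places. This is a base R sketch with invented inputs.

```r
x <- c("the cat", "cat nap", "concatenate")

# ^ : "cat" at the start of the string.
at_start <- grepl("^cat", x)
at_start
# [1] FALSE  TRUE FALSE

# $ : "cat" at the end of the string.
at_end <- grepl("cat$", x)
at_end
# [1]  TRUE FALSE FALSE

# \b : "cat" as a whole word, anywhere in the string.
whole_word <- grepl("\\bcat\\b", x)
whole_word
# [1]  TRUE  TRUE FALSE

# \B : "cat" strictly inside a longer word.
inside_word <- grepl("\\Bcat\\B", x)
inside_word
# [1] FALSE FALSE  TRUE
```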

2.6 Escape sequences

An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly. All escape sequences consist of two or more characters:

  • the first of which is the backslash, \ (called the “Escape character”);
  • the remaining characters determine the interpretation of the escape sequence.
Sequence Description
\n newline
\r carriage return
\t tab
\b backspace
\a alert (bell)
\f form feed
\v vertical tab
\\ backslash \
\' ASCII apostrophe '
\" ASCII quotation mark "
\nnn character with given octal code (1, 2 or 3 digits)
\xnn character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1–4 hex digits)
\Unnnnnnnn Unicode character with given code (1–8 hex digits)
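A short base R sketch shows that each escape sequence is stored as a single character, however many characters it takes to type. The variable names are made up for illustration.

```r
# "\t" and "\n" are one character each, so this string has length 5.
s <- "a\tb\nc"
len_s <- nchar(s)
len_s
# [1] 5

# cat() prints the translated characters (a tab and a line break),
# whereas print() would show the escaped form.
cat(s, "\n")

# "\\" is a single backslash.
len_backslash <- nchar("\\")
len_backslash
# [1] 1

# Hex, octal and Unicode escapes all name a character by its code.
hex_A   <- "\x41"               # hex 41 is "A"
oct_A   <- "\101"               # octal 101 is "A"
e_acute <- utf8ToInt("\u00e9")  # code point 233
```

These escapes are resolved when R parses the string literal, before the regular expression engine sees the pattern. That is why a regex backslash has to be written as "\\" in R code.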

Reference: Escape sequences in C