This is a study note on using the grep() family in the \(base\) package and the \(stringr\) package for regular expressions. For more details on the study material see:
# essential
library(stringr)
\(stringr\) is built on top of the \(stringi\) package. \(stringr\) is useful when you're learning because it exposes a minimal set of functions, carefully picked to handle the most common string manipulation tasks. \(stringi\), on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: \(stringi\) has 234 functions to \(stringr\)'s 46.
Reference: Book: R for data science
There are different syntax standards for regular expressions, and R offers two:
| | fixed = TRUE | fixed = FALSE (default) |
|---|---|---|
| perl = TRUE | N.A. | use Perl-style regular expressions |
| perl = FALSE (default) | use exact matching | use POSIX 1003.2 extended regular expressions (default) |
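The three matching modes in the table can be compared side by side. The following is a minimal sketch using only base R; the vector `x` and the patterns are made up for illustration.

```r
x <- c("a.c", "abc")

# fixed = TRUE: the pattern is taken literally, so "." only matches a dot
fixed_hits <- grepl("a.c", x, fixed = TRUE)   # TRUE FALSE

# default (POSIX extended): "." is a metacharacter matching any character
ere_hits <- grepl("a.c", x)                   # TRUE TRUE

# perl = TRUE: Perl-style features such as lookahead become available
perl_hits <- grepl("a(?=b)", x, perl = TRUE)  # FALSE TRUE
```

Combining `fixed = TRUE` with `perl = TRUE` is the N.A. cell of the table, since exact matching leaves no regular expression syntax to interpret.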
| Functions | \(base\) | \(stringr\) |
|---|---|---|
| Detect, identify | grep(..., value = FALSE) | stringr::str_detect(), stringr::str_which(), stringr::str_count() |
| Locate | regexpr(), gregexpr() | stringr::str_locate(), stringr::str_locate_all() |
| Subset | grep(..., value = TRUE) | stringr::str_sub(), stringr::str_subset(), stringr::str_extract(), stringr::str_extract_all(), stringr::str_match() |
| Mutate, replace | gsub() | stringr::str_sub(), stringr::str_replace(), stringr::str_replace_all(), stringr::str_to_lower(), stringr::str_to_upper(), stringr::str_to_title() |
| Join and split | strsplit() | stringr::str_c(), stringr::str_dup(), stringr::str_split_fixed(), stringr::str_split(), stringr::str_glue(), stringr::str_glue_data() |
| Order | - | stringr::str_order(), stringr::str_sort() |
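To make the mapping between the two columns concrete, here is a short sketch that runs one \(base\) call and one \(stringr\) call per task. It assumes \(stringr\) is installed; the `fruits` vector is hypothetical, and `grepl()` is used alongside `grep()` because it returns the same logical vector as `str_detect()`.

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

# Detect: logical vector, or indices of the matching elements
base_detect    <- grepl("an", fruits)        # FALSE TRUE FALSE
stringr_detect <- str_detect(fruits, "an")   # FALSE TRUE FALSE
base_which     <- grep("an", fruits)         # 2
stringr_which  <- str_which(fruits, "an")    # 2

# Subset: return the matching elements themselves
base_subset    <- grep("an", fruits, value = TRUE)  # "banana"
stringr_subset <- str_subset(fruits, "an")          # "banana"

# Locate: start of the first match per element (-1 or NA when absent)
base_locate    <- as.integer(regexpr("an", fruits)) # -1 2 -1
stringr_locate <- str_locate(fruits, "an")          # matrix with start/end columns

# Replace all matches
base_replace    <- gsub("a", "A", fruits)            # "Apple" "bAnAnA" "cherry"
stringr_replace <- str_replace_all(fruits, "a", "A") # "Apple" "bAnAnA" "cherry"

# Split on a separator
base_split    <- strsplit("a,b,c", ",")[[1]]   # "a" "b" "c"
stringr_split <- str_split("a,b,c", ",")[[1]]  # "a" "b" "c"
```

Note the argument order: \(base\) functions take the pattern first, while \(stringr\) functions take the string first, which makes them easier to use in pipelines.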
A key is a string, constructed from regular expression syntax under a given syntax standard, that matches a specific pattern.
Example keys using the POSIX standard in R (every backslash is doubled because the key is written as an R string):
- "\\D?\\d{3}\\D?\\d{3}\\D?\\d{4}" for ddd_ddd_dddd
- "\\w{1,}@\\w{1,}\\.[A-z]{1,}" for __@__.__
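The two example keys can be tried out with `grepl()`. This is a small sketch: the keys are written with doubled backslashes, as R string literals require, and the test strings are invented for illustration.

```r
phone_key <- "\\D?\\d{3}\\D?\\d{3}\\D?\\d{4}"
email_key <- "\\w{1,}@\\w{1,}\\.[A-z]{1,}"

phones <- c("123-456-7890", "(123)456-7890", "12-3456")
emails <- c("user@example.com", "no-at-sign.com")

# Each \D? allows at most one non-digit separator before a digit group
phone_hits <- grepl(phone_key, phones)  # TRUE TRUE FALSE

# Word characters, an @, word characters, a literal dot, then letters
email_hits <- grepl(email_key, emails)  # TRUE FALSE
```

Neither key is anchored, so a match anywhere inside a longer string also counts.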
The composition of a key:
- ... : exact (AND) matching of the fixed pattern in ...
- ...|... : exact (OR) matching of either fixed pattern in ...
- . / [...]{...} / [...] : fuzzy matching. Match a pattern defined by a character class or a character list in [...], followed by a quantifier. The default for an unspecified quantifier is {1}. A . means any character.
- "^...$" : matches the start and the end of the string.
- "\\b..." / "\\B..." : matches the string provided it is / is not at an edge of a word.
- (...) : retrieves the expression matched in (...) for replacement or return.

| Category | Description |
|---|---|
| Metacharacters | reserved characters that carry information about repeats and location |
| Operators | define the matching operation |
| Character classes | define the pattern type, often combined with a character list, [...] |
| Quantifiers | specify how many repetitions of the pattern to match |
| Position of pattern within the string | define where in the string the match takes place |
| Escape sequences | alternative notation for the syntax above |
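The building blocks above can be assembled into a key piece by piece. The sketch below rebuilds the phone key with anchors and capture groups; the helper names `sep`, `three` and `four` are hypothetical and only serve the illustration.

```r
sep   <- "\\D?"     # fuzzy: at most one non-digit separator
three <- "\\d{3}"   # fuzzy: exactly three digits
four  <- "\\d{4}"   # fuzzy: exactly four digits

# ^ and $ anchor the key to the whole string; (...) captures each digit group
key <- paste0("^", sep, "(", three, ")", sep, "(", three, ")", sep, "(", four, ")", "$")

x <- c("123-456-7890", "call 123-456-7890")
hits <- grepl(key, x)                        # TRUE FALSE

# The captured groups are reused in the replacement via \\1, \\2, \\3
formatted <- sub(key, "\\1.\\2.\\3", x[1])   # "123.456.7890"

# OR matching of two fixed patterns
either <- grepl("cat|dog", c("cat", "bird")) # TRUE FALSE
```

Because of the anchors, the second string no longer matches, even though it contains a valid number.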
Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string. This is accomplished with the help of metacharacters that have a specific meaning, i.e. $ * + . ? [ ] ^ { } | ( ) \. Moreover, \ suppresses the special meaning of a metacharacter in a regular expression, similar to its usage in escape sequences. Normally, we would escape things using \, but in R that is a special character too, so the \ in \? also has to be escaped. That is, \\. and \\? will match a literal . and ? respectively.
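A quick way to see the double escaping at work is to compare an unescaped and an escaped metacharacter. This is a minimal sketch with made-up input strings.

```r
x <- c("a.c", "abc", "why?")

# Unescaped "." is a metacharacter and matches any character
any_char <- grepl(".", x)      # TRUE TRUE TRUE

# "\\." is the two-character regex \. and matches only a literal dot
dot_hits <- grepl("\\.", x)    # TRUE FALSE FALSE
qm_hits  <- grepl("\\?", x)    # FALSE FALSE TRUE

# The R string "\\." holds two characters: a backslash and a dot
n_chars <- nchar("\\.")        # 2

# Inside a character list the dot loses its special meaning
dot_in_list <- grepl("[.]", x) # TRUE FALSE FALSE
```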
| Operators | Description |
|---|---|
| [...] : character list operator | a character list matches any one of the characters inside the square brackets. We can also use - inside the brackets to specify a range of characters. |
| \| : OR operator | matches the pattern on either side of the \|. |
| (...) : capture operator | grouping in regular expressions. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them to build a new string. Each group can then be referred to using \\N, where N is the position of the (...) group in the expression. This is called a backreference. |
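The three operators can be exercised as follows. This is a small sketch; the `words` vector and the address string are invented for the example.

```r
words <- c("gray", "grey", "groy", "cat", "dog")

# Character list: exactly one of the listed characters
list_hits  <- grepl("gr[ae]y", words)   # TRUE TRUE FALSE FALSE FALSE

# Range inside a character list: first letter is a, b or c
range_hits <- grepl("^[a-c]", words)    # FALSE FALSE FALSE TRUE FALSE

# OR operator: either pattern may match
or_hits    <- grepl("cat|dog", words)   # FALSE FALSE FALSE TRUE TRUE

# Capture groups with backreferences \\1 and \\2 in the replacement
swapped <- sub("(\\w+)@(\\w+)", "\\2 at \\1", "user@example")  # "example at user"
```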
[:...:] : a POSIX character class has to be used inside square brackets, e.g. "[[:digit:]]".

| Character classes | Description |
|---|---|
| [:digit:] or \d | digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9] |
| \D | non-digits, equivalent to [^0-9] |
| [:lower:] | lower-case letters, equivalent to [a-z] |
| [:upper:] | upper-case letters, equivalent to [A-Z] |
| [:alpha:] | alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z] |
| [:alnum:] | alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9] |
| \w | word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_] |
| \W | non-word characters, equivalent to [^A-Za-z0-9_] |
| [:xdigit:] | hexadecimal digits (base 16), 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f, equivalent to [0-9A-Fa-f] |
| [:blank:] | blank characters, i.e. space and tab |
| [:space:] | space characters: tab, newline, vertical tab, form feed, carriage return, space |
| \s | space characters, equivalent to [[:space:]] |
| \S | non-space characters |
| [:punct:] | punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { \| } ~ |
| [:graph:] | graphical (human readable) characters, equivalent to [[:alnum:][:punct:]] |
| [:print:] | printable characters: [:alnum:], [:punct:] and space |
| [:cntrl:] | control characters, like \n or \r, [\x00-\x1F\x7F] |

Note that [A-z] is not a safe shorthand for letters: in ASCII the range from A to z also contains the punctuation characters [ \ ] ^ _ and `.
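Character classes are often used with `gsub()` to strip or keep one kind of character. The sketch below uses a made-up input string; it also shows the difference between `[A-Za-z]` and the wider range `[A-z]`, which in ASCII order spans several punctuation characters, including the underscore, although range behaviour can depend on the locale.

```r
x <- "Room 101, Floor_2!"

# Keep digits only by deleting everything that is not a digit
digits_only <- gsub("[^[:digit:]]", "", x)   # "1012"

# Delete punctuation; note that "_" counts as punctuation
no_punct <- gsub("[[:punct:]]", "", x)       # "Room 101 Floor2"

# Delete non-word characters; "_" is a word character and survives
word_chars <- gsub("\\W", "", x)             # "Room101Floor_2"

# Delete whitespace
no_space <- gsub("[[:space:]]", "", x)       # "Room101,Floor_2!"

# An explicit letter range does not match an underscore
letters_only <- grepl("[A-Za-z]", "_")       # FALSE

# The wider range A-z covers more than letters; result left unannotated
# because range handling is locale-dependent
wide_range <- grepl("[A-z]", "_")
```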
Quantifiers specify how many repetitions of the pattern.
| Quantifiers | Description |
|---|---|
| . | matches any single character |
| * | matches the pattern at least 0 times |
| + | matches the pattern at least 1 time |
| ? | matches the pattern at most 1 time |
| {n} | matches the pattern exactly n times |
| {n,} | matches the pattern at least n times |
| {n,m} | matches the pattern between n and m times |
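The quantifiers are easiest to compare on a family of similar strings. In this sketch the patterns are anchored with ^ and $ so that the whole string has to satisfy the repetition count; the input vectors are made up.

```r
x <- c("a", "aa", "aaa", "aaaa")

exactly_two  <- grepl("^a{2}$", x)     # FALSE TRUE FALSE FALSE
at_least_two <- grepl("^a{2,}$", x)    # FALSE TRUE TRUE TRUE
two_to_three <- grepl("^a{2,3}$", x)   # FALSE TRUE TRUE FALSE

y <- c("color", "colour", "colouur")

zero_or_one  <- grepl("^colou?r$", y)  # TRUE TRUE FALSE
zero_or_more <- grepl("^colou*r$", y)  # TRUE TRUE TRUE
one_or_more  <- grepl("^colou+r$", y)  # FALSE TRUE TRUE
```

Without the anchors, `a{2}` would also match "aaa" and "aaaa", because a run of two a's occurs inside them.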
| Syntax | Description |
|---|---|
| ^ | matches the start of the string. |
| $ | matches the end of the string. |
| \b | matches the empty string at either edge of a word. Don’t confuse it with ^ $ which marks the edge of a string. |
| \B | matches the empty string provided it is not at an edge of a word. |
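The difference between string edges and word edges shows up when the same fixed pattern is combined with each position marker. A minimal sketch with an invented input vector:

```r
x <- c("cat", "concatenate", "the cat sat", "cats")

# String edges
starts_with <- grepl("^cat", x)       # TRUE FALSE FALSE TRUE
ends_with   <- grepl("cat$", x)       # TRUE FALSE FALSE FALSE

# Word edges: "cat" as a whole word, anywhere in the string
whole_word  <- grepl("\\bcat\\b", x)  # TRUE FALSE TRUE FALSE

# Not at a word edge: "cat" strictly inside a longer word
inside_word <- grepl("\\Bcat\\B", x)  # FALSE TRUE FALSE FALSE
```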
An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly. All escape sequences consist of two or more characters, the first of which is the backslash \ (called the “Escape character”).

| Quote | Description |
|---|---|
| \n | newline |
| \r | carriage return |
| \t | tab |
| \b | backspace |
| \a | alert (bell) |
| \f | form feed |
| \v | vertical tab |
| \\ | backslash \ |
| \' | ASCII apostrophe ' |
| \" | ASCII quotation mark " |
| \nnn | character with given octal code (1, 2 or 3 digits) |
| \xnn | character with given hex code (1 or 2 hex digits) |
| \unnnn | Unicode character with given code (1–4 hex digits) |
| \Unnnnnnnn | Unicode character with given code (1–8 hex digits) |
Reference: Escape sequences in C
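Escape sequences are resolved when R parses the string literal, before any regular expression engine sees it. The following sketch checks a few of the sequences from the table; the variable names are chosen only for illustration.

```r
# Each escape sequence becomes a single character in the parsed string
newline_len   <- nchar("\n")    # 1
backslash_len <- nchar("\\")    # 1

# An escaped backslash followed by n is two characters, not a newline
literal_len   <- nchar("\\n")   # 2

# Octal, hex and Unicode escapes for the letter A (code 65)
octal_a   <- "\101"             # "A"
hex_a     <- "\x41"             # "A"
unicode_a <- "\u0041"           # "A"

# An escaped double quote inside a double-quoted string
same_quote <- identical("\"", '"')  # TRUE

# cat() prints the translated characters: a tab, then a line break
cat("col1\tcol2\n")
```

This is why a regex such as \d has to be typed as "\\d": the string parser consumes one backslash, and the regex engine receives the other.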