Applies a set of regular-expression-based text normalization rules to one or more strings. All performed replacements are displayed on the console by default (verbose = TRUE).
Usage
str_normalize(
string,
rules = yay::regex_text_normalization,
n_context_chrs = 20L,
verbose = TRUE
)
Arguments
- string
Input vector. Either a character vector, or something coercible to one.
- rules
A tibble of regular expression patterns and replacements. It must have the columns pattern and replacement. pattern can optionally be a list column that condenses multiple patterns into the same replacement rule. Patterns are interpreted as regular expressions as described in stringi::stringi-search-regex(). Replacements are interpreted as-is, except that references of the form \1, \2, etc. are replaced with the contents of the respective matched group (created in patterns using ()). Pattern-replacement pairs are processed in the order given, so pairs listed first are applied before those listed later.
- n_context_chrs
The maximum number of context characters displayed around the actual string and its replacement. The number applies to each side of string/replacement, so the total number of context characters is at most 2 * n_context_chrs. Only relevant if verbose = TRUE.
- verbose
Whether or not to display replacements on the console.
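The ordering behaviour described for rules can be illustrated with a small base-R sketch. It is not yay's implementation: apply_rules_in_order() is a hypothetical helper, it uses gsub(perl = TRUE) instead of the stringi (ICU) engine named above, and it takes a plain data frame in place of a tibble. It only shows how in-order processing, list-column patterns, and \1-style group references interact.

```r
# Hypothetical helper (not part of yay): apply pattern/replacement pairs in order.
apply_rules_in_order <- function(string, rules) {
  for (i in seq_len(nrow(rules))) {
    # `pattern` may be a list column holding several patterns per rule
    for (p in unlist(rules$pattern[i])) {
      string <- gsub(p, rules$replacement[i], string, perl = TRUE)
    }
  }
  string
}

# Order matters: the result of the first rule feeds into the second one.
ordered <- data.frame(
  pattern = c("a", "b"),
  replacement = c("b", "c"),
  stringsAsFactors = FALSE
)
apply_rules_in_order("a", ordered)         # "a" -> "b" -> "c"
apply_rules_in_order("a", ordered[2:1, ])  # "b" rule runs first, so the result is "b"

# A list column maps several patterns to one replacement; \1 and \2 refer to groups.
rules <- data.frame(replacement = c("\"", "\\1-\\2"), stringsAsFactors = FALSE)
rules$pattern <- list(c("\u201C", "\u201D"), "(\\d+)\\s*\u2013\\s*(\\d+)")
apply_rules_in_order("pages 10 \u2013 12 of \u201CR\u201D", rules)
```

Because later rules see the output of earlier ones, a rule set that rewrites typographic quotes should be listed before any rule whose pattern matches the plain quotes it produces.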
See also
Regular expression rules: regex_text_normalization, regex_file_normalization
Other string functions: str_normalize_file(), str_replace_file(), str_replace_verbose()
Examples
"This kind of “text normalization” is e.g. useful to apply before feeding stuff to ‘Pandoc’" |>
yay::str_normalize()
#> 1× - This kind of “text normalization” …
#> + This kind of "text normalization” …
#> 1× - … “text normalization” is e.g. useful to a…
#> + … “text normalization" is e.g. useful to a…
#> 1× - …re feeding stuff to ‘Pandoc’
#> + …re feeding stuff to 'Pandoc’
#> 1× - …ing stuff to ‘Pandoc’
#> + …ing stuff to ‘Pandoc'
#> [1] "This kind of \"text normalization\" is e.g. useful to apply before feeding stuff to 'Pandoc'"