Session 02: Text-As-Data: Preparation
The grep() function searches for patterns in strings using regular expressions
grep(): Returns the indices of matching strings
grepl(): Returns TRUE/FALSE for each string
sub(): Replaces the first match
gsub(): Replaces all matches
Literal characters match themselves (cat matches “cat”)
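The four functions can be compared side by side on a small input. This is a minimal sketch: the vector animals and the string "banana" are illustrative and not taken from the slides.

```r
# Illustrative input (an assumption, not from the slides)
animals <- c("cat", "dog", "catalog", "bobcat")

idx  <- grep("cat", animals)    # positions of the elements that contain "cat"
hits <- grepl("cat", animals)   # one TRUE/FALSE per element

first_only <- sub("a", "A", "banana")    # replaces only the first "a"
all_hits   <- gsub("a", "A", "banana")   # replaces every "a"

print(idx)         # 1 3 4
print(hits)        # TRUE FALSE TRUE TRUE
print(first_only)  # "bAnana"
print(all_hits)    # "bAnAnA"
```

Note that a literal pattern matches anywhere inside a string, which is why "catalog" and "bobcat" count as matches.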
grep("cat", "cat") returns 1
What does grep("cat", c("cat", "dog", "bird", "cat")) return?
[1] 1 4
grepl("cat", c("cat", "dog", "bird", "cat")) runs the same search but returns a logical vector (TRUE or FALSE for each element)
. matches any single character
^ matches the start of a string (outside of square brackets!)
$ matches the end of a string
* matches zero or more occurrences of the preceding character
. matches any single character
What does grepl("c.t", c("cat", "dog", "bird", "cit", "caat")) return?
^
^ matches the start of a string; $ matches the end of a string
What does grepl("^b", c("cat", "dog", "bird", "cat")) return?
^ & $
^apples matches “apples” only if it appears at the start of the string
pie$ matches “pie” only if it appears at the end
* (0 or more), + (1 or more), ? (0 or 1), {n} (n times)
a* matches zero or more “a”s
a{2} matches exactly two “a”s
{n} (n times), {n,m} (n to m times)
What does grepl("a{2,3}", c("cat", "caat", "caaat", "caaaat")) return?
Use [] to define a set of characters
[abc] matches “a”, “b”, or “c”
Example:
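A minimal sketch of a character set in use. The input vectors are illustrative assumptions, not taken from the slides.

```r
# Illustrative inputs (assumptions)
words <- c("cat", "cot", "cut", "cit")

# The middle letter must be "a" or "u"
set_hits <- grepl("c[au]t", words)

# A range inside [] matches any one character from that range
digit_hits <- grepl("[0-9]", c("room 12", "lobby"))

print(set_hits)    # TRUE FALSE TRUE FALSE
print(digit_hits)  # TRUE FALSE
```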
^ inside of square brackets negates the characters inside the square brackets
Example:
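A minimal sketch of negation inside square brackets, contrasted with ^ as a start-of-string anchor. The inputs are illustrative assumptions, not taken from the slides.

```r
# Illustrative inputs (assumptions)
words <- c("cat", "cot", "cut", "cit")

# [^au] matches any single character except "a" and "u"
neg_hits <- grepl("c[^au]t", words)

# Remove every character that is not a lowercase letter
cleaned <- gsub("[^a-z]", "", "r2-d2 rocks!")

# The first ^ anchors the start of the string, the second negates the set
starts_not_c <- grepl("^[^c]", c("cat", "dog"))

print(neg_hits)      # FALSE TRUE FALSE TRUE
print(cleaned)       # "rdrocks"
print(starts_not_c)  # FALSE TRUE
```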
"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"[a-zA-Z0-9._%+-]+
[a-zA-Z0-9._%+-] matches any lowercase letter (a-z), uppercase letter (A-Z), digit (0-9), ., _ (underscore), %, +, and -
+: one or more of these characters must appear
@ matches the literal “@”
[a-zA-Z0-9.-]+ matches the domain name (the part after the “@” symbol but before the final dot)
+ at the end means one or more of these characters must appear
\\. matches a literal dot
[a-zA-Z]{2,} matches the top-level domain (e.g., “de”, “com”, “org”)
[a-zA-Z] matches any letter
{2,} means there must be at least two letters for the top-level domain (no upper limit)
stringr package
The aim of this Regulation is to contribute to the proper functioning
of the internal market for intermediary services by setting out harmonised rules
for a safe, predictable and trusted online environment that facilitates innovation
and in which fundamental rights enshrined in the Charter, including the principle
consumer protection, are effectively protected.
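A minimal sketch of how the stringr package covers the same ground as the base R functions, applied to the passage above and to the e-mail pattern from earlier in the session. str_detect(), str_replace_all() and str_extract() are stringr functions; the object names and the address jane.doe@example.org are made up for illustration.

```r
library(stringr)

# The passage, one element per line (object name is an assumption)
passage <- c(
  "The aim of this Regulation is to contribute to the proper functioning",
  "of the internal market for intermediary services by setting out harmonised rules",
  "for a safe, predictable and trusted online environment that facilitates innovation",
  "and in which fundamental rights enshrined in the Charter, including the principle",
  "consumer protection, are effectively protected."
)

# str_detect() plays the role of grepl(): which lines start with "of"?
starts_with_of <- str_detect(passage, "^of")

# str_replace_all() plays the role of gsub(): strip commas and full stops
no_punct <- str_replace_all(passage[5], "[.,]", "")

# str_extract() returns the first match of the e-mail pattern
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
found <- str_extract("Contact: jane.doe@example.org, thanks", email_pattern)

print(starts_with_of)  # FALSE TRUE FALSE FALSE FALSE
print(no_punct)        # "consumer protection are effectively protected"
print(found)           # "jane.doe@example.org"
```

stringr puts the string first and the pattern second, the reverse of the base R argument order.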
First of all, the texts have to be broken down into their components (words, punctuation marks, etc.)
Let’s assume that each line is a single document
Corpus (text collection) with five documents
dok1 <- "The aim of this Regulation is to contribute to the proper functioning"
dok2 <- "of the internal market for intermediary services by setting out harmonised rules"
dok3 <- "for a safe, predictable and trusted online environment that facilitates innovation"
dok4 <- "and in which fundamental rights enshrined in the Charter, including the principle"
dok5 <- "consumer protection, are effectively protected."
# c() builds a plain character vector with one element per document
DSA <- c(dok1, dok2, dok3, dok4, dok5)
library(quanteda)
corp_DSA <- corpus(DSA)
toks_DSA <- tokens(corp_DSA)
toks_DSA
Tokens consisting of 5 documents.
text1 :
[1] "The" "aim" "of" "this" "Regulation"
[6] "is" "to" "contribute" "to" "the"
[11] "proper" "functioning"
text2 :
[1] "of" "the" "internal" "market" "for"
[6] "intermediary" "services" "by" "setting" "out"
[11] "harmonised" "rules"
text3 :
[1] "for" "a" "safe" "," "predictable"
[6] "and" "trusted" "online" "environment" "that"
[11] "facilitates" "innovation"
text4 :
[1] "and" "in" "which" "fundamental" "rights"
[6] "enshrined" "in" "the" "Charter" ","
[11] "including" "the"
[ ... and 1 more ]
text5 :
[1] "consumer" "protection" "," "are" "effectively"
[6] "protected" "."
Document-feature matrix of: 5 documents, 45 features (76.89% sparse) and 0 docvars.
       features
docs    the aim of this regulation is to contribute proper functioning
  text1   2   1  1    1          1  1  2          1      1           1
  text2   1   0  1    0          0  0  0          0      0           0
  text3   0   0  0    0          0  0  0          0      0           0
  text4   2   0  0    0          0  0  0          0      0           0
  text5   0   0  0    0          0  0  0          0      0           0
[ reached max_nfeat ... 35 more features ]
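The matrix above has the shape that quanteda's dfm() produces from a tokens object. The following is a minimal sketch, assuming the matrix was built from the unprocessed tokens; the object names docs, toks and dfm_raw are chosen here for illustration.

```r
library(quanteda)

# The five one-line documents from the example
docs <- c(
  "The aim of this Regulation is to contribute to the proper functioning",
  "of the internal market for intermediary services by setting out harmonised rules",
  "for a safe, predictable and trusted online environment that facilitates innovation",
  "and in which fundamental rights enshrined in the Charter, including the principle",
  "consumer protection, are effectively protected."
)

toks <- tokens(corpus(docs))

# dfm() counts each token type per document and lowercases by default
dfm_raw <- dfm(toks)

print(ndoc(dfm_raw))             # number of documents (rows)
print(nfeat(dfm_raw))            # number of features (columns)
print(topfeatures(dfm_raw, 3))   # the most frequent features overall
```

Because punctuation and function words are still included, most of the frequent features carry little content, which motivates the preprocessing steps that follow.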
quanteda’s tokens_wordstem() function
Tokens consisting of 5 documents.
text1 :
[1] "aim" "Regul" "contribut" "proper" "function"
text2 :
[1] "intern" "market" "intermediari" "servic" "set"
[6] "harmonis" "rule"
text3 :
[1] "safe" "predict" "trust" "onlin" "environ" "facilit" "innov"
text4 :
[1] "fundament" "right" "enshrin" "Charter" "includ" "principl"
text5 :
[1] "consum" "protect" "effect" "protect"
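The stemmed tokens above are consistent with removing punctuation and English stopwords before stemming. That preprocessing order is an assumption, since the code behind the output is not shown here; the object names are illustrative.

```r
library(quanteda)

docs <- c(
  "The aim of this Regulation is to contribute to the proper functioning",
  "of the internal market for intermediary services by setting out harmonised rules",
  "for a safe, predictable and trusted online environment that facilitates innovation",
  "and in which fundamental rights enshrined in the Charter, including the principle",
  "consumer protection, are effectively protected."
)

# 1. Tokenise and drop punctuation
toks <- tokens(corpus(docs), remove_punct = TRUE)

# 2. Drop English stopwords (matching is case-insensitive by default)
toks <- tokens_remove(toks, stopwords("en"))

# 3. Reduce each remaining token to its stem
toks_stem <- tokens_wordstem(toks)

print(toks_stem)
```

Stemming keeps the original capitalisation ("Regul", "Charter"); lowercasing only happens later, when dfm() is applied.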
lemmatize_words() function of the textstem package
text11 text12 text13 text14 text15
"aim" "Regulation" "contribute" "proper" "function"
text21 text22 text23 text24 text25
"internal" "market" "intermediary" "service" "set"
text26 text27 text31 text32 text33
"harmonise" "rule" "safe" "predictable" "trust"
text34 text35 text36 text37 text41
"online" "environment" "facilitate" "innovation" "fundamental"
text42 text43 text44 text45 text46
"right" "enshrine" "Charter" "include" "principle"
text51 text52 text53 text54
"consumer" "protection" "effectively" "protect"
Document-feature matrix of: 5 documents, 28 features (80.00% sparse) and 0 docvars.
       features
docs    aim regul contribut proper function intern market intermediari servic set
  text1   1     1         1      1        1      0      0            0      0   0
  text2   0     0         0      0        0      1      1            1      1   1
  text3   0     0         0      0        0      0      0            0      0   0
  text4   0     0         0      0        0      0      0            0      0   0
  text5   0     0         0      0        0      0      0            0      0   0
[ reached max_nfeat ... 18 more features ]
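The effect of preprocessing on the size of the matrix can be checked directly: the outputs above report 45 features for the raw tokens and 28 after stopword removal and stemming. A minimal sketch, assuming the same preprocessing order as in the stemming step; the object names are illustrative.

```r
library(quanteda)

docs <- c(
  "The aim of this Regulation is to contribute to the proper functioning",
  "of the internal market for intermediary services by setting out harmonised rules",
  "for a safe, predictable and trusted online environment that facilitates innovation",
  "and in which fundamental rights enshrined in the Charter, including the principle",
  "consumer protection, are effectively protected."
)

# Raw tokens: punctuation and stopwords are kept
toks_raw <- tokens(corpus(docs))

# Preprocessed tokens: no punctuation, no stopwords, stemmed
toks_prep <- tokens(corpus(docs), remove_punct = TRUE)
toks_prep <- tokens_remove(toks_prep, stopwords("en"))
toks_prep <- tokens_wordstem(toks_prep)

dfm_raw  <- dfm(toks_raw)
dfm_prep <- dfm(toks_prep)

# Compare the number of columns before and after preprocessing
n_raw  <- nfeat(dfm_raw)
n_prep <- nfeat(dfm_prep)
print(c(raw = n_raw, preprocessed = n_prep))
```

Fewer features means a smaller matrix, but in a corpus this small the result is also sparser, because almost every remaining stem occurs in only one document.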