Forschungspraktikum 1+2: Computational Social Science

Session 02: Text-As-Data: Preparation

Dr. Christian Czymara

Agenda

  • Regular Expressions
  • The structure of text-as-data
  • Preprocessing steps
  • Document-Feature-Matrix
  • Tutorial: Preparing a scientific paper

Change in the syllabus structure

Regular Expressions

Regular Expressions

  • Regular Expressions (RegEx) are patterns used to match character combinations in strings
  • In R, RegEx are commonly used for text manipulation and pattern matching
  • For example, we can identify or count the occurrence of certain words, word parts, or word combinations
  • The grep() function searches for patterns in strings using regular expressions
  • It returns the indices of the elements that match the pattern

Common Functions in R for RegEx

  • grep(): Returns indices of matching strings
  • grepl(): Returns TRUE/FALSE for each match
  • sub(): Replaces the first match
  • gsub(): Replaces all matches
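A quick illustration of the difference between sub() and gsub(), using a made-up example vector:

```r
x <- c("cats and cats", "dogs")

# sub() replaces only the first match per element
sub("cat", "dog", x)   # "dogs and cats" "dogs"

# gsub() replaces every match per element
gsub("cat", "dog", x)  # "dogs and dogs" "dogs"
```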

Literal Characters

  • Literal Characters: Match exactly what you type (e.g., cat matches “cat”)
    • For example, grep("cat", "cat") returns 1
    • What will grep("cat", c("cat", "dog", "bird", "cat")) return?
    grep("cat", c("cat", "dog", "bird", "cat"))
    [1] 1 4
    • grepl("cat", c("cat", "dog", "bird", "cat")) does the same but returns a logical vector of TRUE/FALSE values
    grepl("cat", c("cat", "dog", "bird", "cat"))
    [1]  TRUE FALSE FALSE  TRUE

Metacharacters

  • Metacharacters: Special characters with special meanings (they do not match themselves literally)
  • Examples:
    • . matches any character
    • ^ matches the start of a string (outside of square brackets!)
    • $ matches the end of a string
    • * matches zero or more occurrences of the preceding character

Example .

  • . matches any character
    • What will grepl("c.t", c("cat", "dog", "bird", "cit", "caat")) return?
    text <- c("cat", "dog", "bird", "cit", "caat")
    
    grepl("c.t", text)
    [1]  TRUE FALSE FALSE  TRUE FALSE

Example ^

  • ^ matches the start of a string; $ matches the end of a string
    • What will grepl("^b", c("cat", "dog", "bird", "cat")) return?
    text <- c("cat", "dog", "bird", "cat")
    
    grepl("^b", text)
    [1] FALSE FALSE  TRUE FALSE

Examples ^ & $

  • ^apples matches “apples” only if it appears at the start of the string
text <- c("I love apples", "apples are awesome")

grepl("^apples", text)
[1] FALSE  TRUE
  • pie$ matches “pie” only if it appears at the end
text <- c("I love apple pie", "apple pie is awesome")

grepl("pie$", text)
[1]  TRUE FALSE

Quantifiers

  • Quantifiers: Control how many times a pattern occurs
  • * (0 or more), + (1 or more), ? (0 or 1), {n} (exactly n times), {n,m} (between n and m times)
  • Examples:
    • a* matches zero or more “a”s
    • a{2} matches exactly two “a”s
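The remaining quantifiers can be illustrated with a small made-up example:

```r
text <- c("ct", "cat", "caat")

grepl("ca*t", text)  # zero or more "a": TRUE TRUE TRUE
grepl("ca+t", text)  # one or more "a": FALSE TRUE TRUE
grepl("ca?t", text)  # zero or one "a": TRUE TRUE FALSE
```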

Example with Quantifiers

  • {n,m} (between n and m times)
  • What will grepl("a{2,3}", c("cat", "caat", "caaat", "caaaat")) return?
text <- c("cat", "caat", "caaat", "caaaat")

grepl("a{2,3}", text)
[1] FALSE  TRUE  TRUE  TRUE
  • All elements containing at least two consecutive “a”s match

Character Sets

  • Use [] to define a set of characters

  • [abc] matches “a”, “b”, or “c”

  • Example:

    text <- c("cat", "dog", "bird")
    
    grepl("[ao]", text)
    [1]  TRUE  TRUE FALSE

Negating Characters

  • ^ inside square brackets negates the set: [^…] matches any character not listed

  • Example:

    text <- c("cat", "dog", "bird")
    
    grepl("[^ao]", text) # all three strings contain characters that are not "a" or "o"
    [1] TRUE TRUE TRUE
    text <- c("cat", "dog", "bird")
    
    grepl("[^cat]", text) # "cat" contains no character other than "c", "a", or "t"
    [1] FALSE  TRUE  TRUE

Practical Example: Email Addresses

  • Email addresses consist of three parts joined by “@” and a dot: name + “@” + domain + “.” + top-level domain
  • E.g.: cc@soz.uni-frankfurt.de
  • RegEx pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

Email Addresses: Name

  • The name part may consist of various characters: [a-zA-Z0-9._%+-]+
    • [a-zA-Z0-9._%+-] matches any
      • lowercase letter (a-z)
      • uppercase letter (A-Z)
      • digit (0-9)
      • or the characters ., _ (underscore), %, +, and -
      • +: one or more of these characters must appear

Email Addresses: @ and domain

  • @ matches the literal “@” character
  • [a-zA-Z0-9.-]+ matches the domain name (the part after the “@” symbol but before the dot)
    • Any letter, digit, dot, or dash
    • + at the end means one or more of these characters must appear

Email Addresses: Dot

  • \\. matches a literal dot
  • Backslashes are needed to escape the dot since a dot (.) normally matches any character (see prior slides)

Email Addresses: Top-level domain

  • [a-zA-Z]{2,} matches the top-level domain (e.g., “de”, “com”, “org”)
    • [a-zA-Z] matches any letter
    • {2,} means there must be at least two letters for the top-level domain (no upper limit)

Email Addresses

  • Identify emails in text
scraped_text <- c("E-Mail: cc@soz.uni-frankfurt.de", "Contact: jane.doe@example.com", "This is not an email")
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
grepl(email_pattern, scraped_text)
[1]  TRUE  TRUE FALSE
  • To extract the email addresses, use the stringr package
library(stringr)
str_extract_all(scraped_text, email_pattern)
[[1]]
[1] "cc@soz.uni-frankfurt.de"

[[2]]
[1] "jane.doe@example.com"

[[3]]
character(0)
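If you prefer to avoid the extra dependency, base R can extract the same matches with the same pattern, just a different interface:

```r
# gregexpr() finds all match positions; regmatches() pulls out the matched text
regmatches(scraped_text, gregexpr(email_pattern, scraped_text))
```

The result is a list with one character vector per input element, analogous to str_extract_all().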

Tutorial 02: Exercises 1. to 2.

Preparation of Text-as-Data

The Digital Services Act

The aim of this Regulation is to contribute to the proper functioning
of the internal market for intermediary services by setting out harmonised rules
for a safe, predictable and trusted online environment that facilitates innovation
and in which fundamental rights enshrined in the Charter, including the principle 
consumer protection, are effectively protected.
  • First of all, the texts have to be broken down into their components (words, punctuation marks, etc.)

  • Let’s assume that each line is a single document

  • Corpus (text collection) with five documents

Vocabulary

  • The texts are split into their components (tokenized) using quanteda
dok1 <- "The aim of this Regulation is to contribute to the proper functioning"

dok2 <- "of the internal market for intermediary services by setting out harmonised rules"

dok3 <- "for a safe, predictable and trusted online environment that facilitates innovation"

dok4 <- "and in which fundamental rights enshrined in the Charter, including the principle"

dok5 <- "consumer protection, are effectively protected."

DSA <- c(dok1, dok2, dok3, dok4, dok5)

library(quanteda)

corp_DSA <- corpus(DSA)

toks_DSA <- tokens(corp_DSA)

Vocabulary

toks_DSA
Tokens consisting of 5 documents.
text1 :
 [1] "The"         "aim"         "of"          "this"        "Regulation" 
 [6] "is"          "to"          "contribute"  "to"          "the"        
[11] "proper"      "functioning"

text2 :
 [1] "of"           "the"          "internal"     "market"       "for"         
 [6] "intermediary" "services"     "by"           "setting"      "out"         
[11] "harmonised"   "rules"       

text3 :
 [1] "for"         "a"           "safe"        ","           "predictable"
 [6] "and"         "trusted"     "online"      "environment" "that"       
[11] "facilitates" "innovation" 

text4 :
 [1] "and"         "in"          "which"       "fundamental" "rights"     
 [6] "enshrined"   "in"          "the"         "Charter"     ","          
[11] "including"   "the"        
[ ... and 1 more ]

text5 :
[1] "consumer"    "protection"  ","           "are"         "effectively"
[6] "protected"   "."          

Document-Feature-Matrix

Document-Feature-Matrix

  • What we need is a table (matrix)
  • … in which the texts (documents) are in the rows
  • … the words, characters, etc. (features) are in the columns
  • … and in the cells counts how often a feature occurs in the respective text

Document-Feature-Matrix

dfm_DSA <- dfm(toks_DSA)

dfm_DSA
Document-feature matrix of: 5 documents, 45 features (76.89% sparse) and 0 docvars.
       features
docs    the aim of this regulation is to contribute proper functioning
  text1   2   1  1    1          1  1  2          1      1           1
  text2   1   0  1    0          0  0  0          0      0           0
  text3   0   0  0    0          0  0  0          0      0           0
  text4   2   0  0    0          0  0  0          0      0           0
  text5   0   0  0    0          0  0  0          0      0           0
[ reached max_nfeat ... 35 more features ]

Almost done

  • Better: Keep only actual words as features (no numbers, punctuation, etc.)
toks_DSA_2 <- tokens(corp_DSA,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     remove_separators = TRUE)
  • Remove stop words (common words without real meaning)
toks_DSA_2 <- tokens_remove(toks_DSA_2, stopwords(), case_insensitive = TRUE)
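It can be worth inspecting which words are actually removed; stopwords() with no arguments returns quanteda’s default English (Snowball) list:

```r
head(stopwords("en"))    # first few English stop words, e.g. "i", "me", "my"
length(stopwords("en"))  # size of the default list
```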
  • Moreover, words occurring in different forms (e.g., singular and plural) are still counted as different terms

Solution 1: Stemming

  • Stemming means reducing words to their stem, i.e., removing their suffixes
  • A heuristic-based process that often results in non-words
  • For example, “programming”, “programs”, and “programmed” all become “program”
  • But, e.g., “happily” becomes “happi”; “better” becomes “bett”

Stemming in R

  • Easily implemented with quanteda’s tokens_wordstem() function
toks_DSA_2_backup <- toks_DSA_2
toks_DSA_2 <- tokens_wordstem(toks_DSA_2)
toks_DSA_2
Tokens consisting of 5 documents.
text1 :
[1] "aim"       "Regul"     "contribut" "proper"    "function" 

text2 :
[1] "intern"       "market"       "intermediari" "servic"       "set"         
[6] "harmonis"     "rule"        

text3 :
[1] "safe"    "predict" "trust"   "onlin"   "environ" "facilit" "innov"  

text4 :
[1] "fundament" "right"     "enshrin"   "Charter"   "includ"    "principl" 

text5 :
[1] "consum"  "protect" "effect"  "protect"

Solution 2: Lemmatization

  • Alternatively, lemmatization refers to the process of reducing words to their lemma
  • The lemma is the valid base form found in the dictionary (also called the canonical, citation, or dictionary form)
  • However, it requires more linguistic knowledge and is significantly slower than stemming
  • “happily” becomes “happy”; “better” becomes “good”

Lemmatization in R

  • Use the lemmatize_words() function from the textstem package
library(textstem)

toks_DSA_2_lemm <- lemmatize_words(unlist(toks_DSA_2_backup))

toks_DSA_2_lemm
        text11         text12         text13         text14         text15 
         "aim"   "Regulation"   "contribute"       "proper"     "function" 
        text21         text22         text23         text24         text25 
    "internal"       "market" "intermediary"      "service"          "set" 
        text26         text27         text31         text32         text33 
   "harmonise"         "rule"         "safe"  "predictable"        "trust" 
        text34         text35         text36         text37         text41 
      "online"  "environment"   "facilitate"   "innovation"  "fundamental" 
        text42         text43         text44         text45         text46 
       "right"     "enshrine"      "Charter"      "include"    "principle" 
        text51         text52         text53         text54 
    "consumer"   "protection"  "effectively"      "protect" 

Stemming vs. Lemmatization

  • Let us compare the third document of the Digital Services Act paragraph in its stemmed and its lemmatized version:
dok3 # original text
[1] "for a safe, predictable and trusted online environment that facilitates innovation"
toks_DSA_2[3] # stemmed terms
Tokens consisting of 1 document.
text3 :
[1] "safe"    "predict" "trust"   "onlin"   "environ" "facilit" "innov"  
toks_DSA_2_lemm[13:19] # lemmatized terms
       text31        text32        text33        text34        text35 
       "safe" "predictable"       "trust"      "online" "environment" 
       text36        text37 
 "facilitate"  "innovation" 

Cleaned Document-Feature-Matrix

  • Create DFM
dfm_DSA_2 <- dfm(toks_DSA_2)

dfm_DSA_2
Document-feature matrix of: 5 documents, 28 features (80.00% sparse) and 0 docvars.
       features
docs    aim regul contribut proper function intern market intermediari servic
  text1   1     1         1      1        1      0      0            0      0
  text2   0     0         0      0        0      1      1            1      1
  text3   0     0         0      0        0      0      0            0      0
  text4   0     0         0      0        0      0      0            0      0
  text5   0     0         0      0        0      0      0            0      0
       features
docs    set
  text1   0
  text2   1
  text3   0
  text4   0
  text5   0
[ reached max_nfeat ... 18 more features ]

Tutorial 02: Exercises 3. to 6.