Forschungspraktikum 1+2: Computational Social Science

Session 02: Text-As-Data: Preparation

Dr. Christian Czymara

Agenda

  • Regular Expressions
  • The structure of text-as-data
  • Preprocessing steps
  • Document-Feature-Matrix
  • Tutorial: Preparing a scientific paper

Change in the syllabus structure

Regular Expressions

Regular Expressions

  • Regular Expressions (RegEx) are patterns used to match character combinations in strings
  • In R, RegEx are commonly used for text manipulation and pattern matching
  • For example, we can identify or count the occurrence of certain words, word parts, or word combinations
  • The grep() function searches for patterns in strings using regular expressions
  • It returns the indices of the elements that match the pattern

Common Functions in R for RegEx

  • grep(): Returns indices of matching strings
  • grepl(): Returns TRUE/FALSE for each match
  • sub(): Replaces the first match
  • gsub(): Replaces all matches
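A quick illustration of the difference between sub() and gsub(), using a made-up example vector:

```r
x <- c("cats and cats", "dogs")

# sub() replaces only the first match per element
sub("cat", "dog", x)   # "dogs and cats" "dogs"

# gsub() replaces every match per element
gsub("cat", "dog", x)  # "dogs and dogs" "dogs"
```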

Literal Characters

  • Literal Characters: Match exactly what you type (e.g., cat matches “cat”)
    • For example, grep("cat", "cat") returns 1
    • What will grep("cat", c("cat", "dog", "bird", "cat")) return?
    grep("cat", c("cat", "dog", "bird", "cat"))
    [1] 1 4
    • grepl("cat", c("cat", "dog", "bird", "cat")) does the same but returns a logical vector of TRUE/FALSE values
    grepl("cat", c("cat", "dog", "bird", "cat"))
    [1]  TRUE FALSE FALSE  TRUE

Metacharacters

  • Metacharacters: Special characters with special meanings (they do not match themselves literally)
  • Examples:
    • . matches any character
    • ^ matches the start of a string (outside of square brackets!)
    • $ matches the end of a string
    • * matches zero or more occurrences of the preceding character

Example .

  • . matches any character
    • What will grepl("c.t", c("cat", "dog", "bird", "cit", "caat")) return?
    text <- c("cat", "dog", "bird", "cit", "caat")
    
    grepl("c.t", text)
    [1]  TRUE FALSE FALSE  TRUE FALSE

Example ^

  • ^ matches the start of a string; $ matches the end of a string
    • What will grepl("^b", c("cat", "dog", "bird", "cat")) return?
    text <- c("cat", "dog", "bird", "cat")
    
    grepl("^b", text)
    [1] FALSE FALSE  TRUE FALSE

Examples ^ & $

  • ^apples matches “apples” only if it appears at the start of the string
text <- c("I love apples", "apples are awesome")

grepl("^apples", text)
[1] FALSE  TRUE
  • pie$ matches “pie” only if it appears at the end
text <- c("I love apple pie", "apple pie is awesome")

grepl("pie$", text)
[1]  TRUE FALSE

Quantifiers

  • Quantifiers: Control how many times a pattern occurs
  • * (0 or more), + (1 or more), ? (0 or 1), {n} (exactly n times), {n,m} (between n and m times)
  • Examples:
    • a* matches zero or more “a”s
    • a{2} matches exactly two “a”s
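The remaining quantifiers can be illustrated with a small made-up example:

```r
text <- c("ct", "cat", "caat")

grepl("ca*t", text)  # zero or more "a": TRUE TRUE TRUE
grepl("ca+t", text)  # one or more "a": FALSE TRUE TRUE
grepl("ca?t", text)  # zero or one "a": TRUE TRUE FALSE
```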

Example with Quantifiers

  • {n,m} (between n and m times)
  • What will grepl("a{2,3}", c("cat", "caat", "caaat", "caaaat")) return?
text <- c("cat", "caat", "caaat", "caaaat")

grepl("a{2,3}", text)
[1] FALSE  TRUE  TRUE  TRUE
  • All elements containing at least two consecutive “a”s match

Character Sets

  • Use [] to define a set of characters

  • [abc] matches “a”, “b”, or “c”

  • Example:

    text <- c("cat", "dog", "bird")
    
    grepl("[ao]", text)
    [1]  TRUE  TRUE FALSE

Negating Characters

  • ^ inside square brackets negates the set: [^…] matches any character not listed

  • Example:

    text <- c("cat", "dog", "bird")
    
    grepl("[^ao]", text) # all three strings contain characters that are not "a" or "o"
    [1] TRUE TRUE TRUE
    text <- c("cat", "dog", "bird")
    
    grepl("[^cat]", text) # "cat" contains no character other than "c", "a", or "t"
    [1] FALSE  TRUE  TRUE

Practical Example: Email Addresses

  • Email addresses consist of three parts joined by “@” and a dot: name + “@” + domain + “.” + top-level domain
  • E.g.: cc@soz.uni-frankfurt.de
  • RegEx pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

Email Addresses: Name

  • The name part may consist of various characters: [a-zA-Z0-9._%+-]+
    • [a-zA-Z0-9._%+-] matches any
      • lowercase letter (a-z)
      • uppercase letter (A-Z)
      • digit (0-9)
      • or the characters ., _ (underscore), %, +, and -
      • +: one or more of these characters must appear

Email Addresses: @ and domain

  • @ matches the literal “@” character
  • [a-zA-Z0-9.-]+ matches the domain name (the part after the “@” symbol but before the dot)
    • Any letter, digit, dot, or dash
    • + at the end means one or more of these characters must appear

Email Addresses: Dot

  • \\. matches a literal dot
  • Backslashes are needed to escape the dot since a dot (.) normally matches any character (see prior slides)

Email Addresses: Top-level domain

  • [a-zA-Z]{2,} matches the top-level domain (e.g., “de”, “com”, “org”)
    • [a-zA-Z] matches any letter
    • {2,} means there must be at least two letters for the top-level domain (no upper limit)

Email Addresses

  • Identify emails in text
scraped_text <- c("E-Mail: cc@soz.uni-frankfurt.de", "Contact: jane.doe@example.com", "This is not an email")
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
grepl(email_pattern, scraped_text)
[1]  TRUE  TRUE FALSE
  • To extract the email addresses, use the stringr package
library(stringr)
str_extract_all(scraped_text, email_pattern)
[[1]]
[1] "cc@soz.uni-frankfurt.de"

[[2]]
[1] "jane.doe@example.com"

[[3]]
character(0)
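If you prefer to avoid the extra dependency, base R can extract the same matches with the same pattern, just a different interface:

```r
# gregexpr() finds all match positions; regmatches() pulls out the matched text
regmatches(scraped_text, gregexpr(email_pattern, scraped_text))
```

The result is a list with one character vector per input element, analogous to str_extract_all().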

Tutorial 02: Exercises 1. to 2.

Preparation of Text-as-Data

The Digital Services Act

The aim of this Regulation is to contribute to the proper functioning
of the internal market for intermediary services by setting out harmonised rules
for a safe, predictable and trusted online environment that facilitates innovation
and in which fundamental rights enshrined in the Charter, including the principle 
consumer protection, are effectively protected.
  • First of all, the texts have to be broken down into their components (words, punctuation marks, etc.)

  • Let’s assume that each line is a single document

  • Corpus (text collection) with five documents

Vocabulary

  • The texts are split into their components (tokenized) using quanteda
dok1 <- "The aim of this Regulation is to contribute to the proper functioning"

dok2 <- "of the internal market for intermediary services by setting out harmonised rules"

dok3 <- "for a safe, predictable and trusted online environment that facilitates innovation"

dok4 <- "and in which fundamental rights enshrined in the Charter, including the principle"

dok5 <- "consumer protection, are effectively protected."

DSA <- c(dok1, dok2, dok3, dok4, dok5)

library(quanteda)

corp_DSA <- corpus(DSA)

toks_DSA <- tokens(corp_DSA)

Vocabulary

toks_DSA
Tokens consisting of 5 documents.
text1 :
 [1] "The"         "aim"         "of"          "this"        "Regulation" 
 [6] "is"          "to"          "contribute"  "to"          "the"        
[11] "proper"      "functioning"

text2 :
 [1] "of"           "the"          "internal"     "market"       "for"         
 [6] "intermediary" "services"     "by"           "setting"      "out"         
[11] "harmonised"   "rules"       

text3 :
 [1] "for"         "a"           "safe"        ","           "predictable"
 [6] "and"         "trusted"     "online"      "environment" "that"       
[11] "facilitates" "innovation" 

text4 :
 [1] "and"         "in"          "which"       "fundamental" "rights"     
 [6] "enshrined"   "in"          "the"         "Charter"     ","          
[11] "including"   "the"        
[ ... and 1 more ]

text5 :
[1] "consumer"    "protection"  ","           "are"         "effectively"
[6] "protected"   "."          

Document-Feature-Matrix

Document-Feature-Matrix

  • What we need is a table (matrix)
  • … in which the texts (documents) are in the rows
  • … the words, characters, etc. (features) are in the columns
  • … and in the cells counts how often a feature occurs in the respective text

Document-Feature-Matrix

dfm_DSA <- dfm(toks_DSA)

dfm_DSA
Document-feature matrix of: 5 documents, 45 features (76.89% sparse) and 0 docvars.
       features
docs    the aim of this regulation is to contribute proper functioning
  text1   2   1  1    1          1  1  2          1      1           1
  text2   1   0  1    0          0  0  0          0      0           0
  text3   0   0  0    0          0  0  0          0      0           0
  text4   2   0  0    0          0  0  0          0      0           0
  text5   0   0  0    0          0  0  0          0      0           0
[ reached max_nfeat ... 35 more features ]

Almost done

  • Better: Keep only actual words as features (no numbers, punctuation, etc.)
toks_DSA_2 <- tokens(corp_DSA,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     remove_separators = TRUE)
  • Remove stop words (common words without real meaning)
toks_DSA_2 <- tokens_remove(toks_DSA_2, stopwords(), case_insensitive = TRUE)
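It can be worth inspecting which words are actually removed; stopwords() with no arguments returns quanteda’s default English (Snowball) list:

```r
head(stopwords("en"))    # first few English stop words, e.g. "i", "me", "my"
length(stopwords("en"))  # size of the default list
```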
  • Moreover, words occurring in different forms (e.g., singular and plural) are still counted as different terms

Solution 1: Stemming

  • Stemming means reducing words to their stem, i.e., removing their suffixes
  • A heuristic-based process that often results in non-words
  • For example, “programming”, “programs”, and “programmed” all become “program”
  • But, e.g., “happily” becomes “happi”; “better” becomes “bett”

Stemming in R

  • Easily implemented with quanteda’s tokens_wordstem() function
toks_DSA_2_backup <- toks_DSA_2
toks_DSA_2 <- tokens_wordstem(toks_DSA_2)
toks_DSA_2
Tokens consisting of 5 documents.
text1 :
[1] "aim"       "Regul"     "contribut" "proper"    "function" 

text2 :
[1] "intern"       "market"       "intermediari" "servic"       "set"         
[6] "harmonis"     "rule"        

text3 :
[1] "safe"    "predict" "trust"   "onlin"   "environ" "facilit" "innov"  

text4 :
[1] "fundament" "right"     "enshrin"   "Charter"   "includ"    "principl" 

text5 :
[1] "consum"  "protect" "effect"  "protect"

Solution 2: Lemmatization

  • Alternatively, lemmatization refers to the process of reducing words to their lemma
  • The lemma is the valid base form found in the dictionary (also called the canonical, citation, or dictionary form)
  • However, it requires more linguistic knowledge and is significantly slower than stemming
  • “happily” becomes “happy”; “better” becomes “good”

Lemmatization in R

  • Use the lemmatize_words() function from the textstem package
library(textstem)

toks_DSA_2_lemm <- lemmatize_words(unlist(toks_DSA_2_backup))

toks_DSA_2_lemm
        text11         text12         text13         text14         text15 
         "aim"   "Regulation"   "contribute"       "proper"     "function" 
        text21         text22         text23         text24         text25 
    "internal"       "market" "intermediary"      "service"          "set" 
        text26         text27         text31         text32         text33 
   "harmonise"         "rule"         "safe"  "predictable"        "trust" 
        text34         text35         text36         text37         text41 
      "online"  "environment"   "facilitate"   "innovation"  "fundamental" 
        text42         text43         text44         text45         text46 
       "right"     "enshrine"      "Charter"      "include"    "principle" 
        text51         text52         text53         text54 
    "consumer"   "protection"  "effectively"      "protect" 

Stemming vs. Lemmatization

  • Let us compare the third document of the Digital Services Act paragraph in its stemmed and its lemmatized version:
dok3 # original text
[1] "for a safe, predictable and trusted online environment that facilitates innovation"
toks_DSA_2[3] # stemmed terms
Tokens consisting of 1 document.
text3 :
[1] "safe"    "predict" "trust"   "onlin"   "environ" "facilit" "innov"  
toks_DSA_2_lemm[13:19] # lemmatized terms
       text31        text32        text33        text34        text35 
       "safe" "predictable"       "trust"      "online" "environment" 
       text36        text37 
 "facilitate"  "innovation" 

Cleaned Document-Feature-Matrix

  • Create DFM
dfm_DSA_2 <- dfm(toks_DSA_2)

dfm_DSA_2
Document-feature matrix of: 5 documents, 28 features (80.00% sparse) and 0 docvars.
       features
docs    aim regul contribut proper function intern market intermediari servic
  text1   1     1         1      1        1      0      0            0      0
  text2   0     0         0      0        0      1      1            1      1
  text3   0     0         0      0        0      0      0            0      0
  text4   0     0         0      0        0      0      0            0      0
  text5   0     0         0      0        0      0      0            0      0
       features
docs    set
  text1   0
  text2   1
  text3   0
  text4   0
  text5   0
[ reached max_nfeat ... 18 more features ]

Tutorial 02: Exercises 3. to 6.