Class balance of the combined tweet corpus:

not trump     trump
      519       519
Session 06: Machine Learning with Text-as-Data
quanteda for Text Classification to Predict Trump’s Tweets

Sample of Trump’s tweets:

[1] "@JUrciuoli19 I will."
[2] "\"\"\"@Jenism101: @realDonaldTrump @97Musick Trump is the only person we believe will actually set things right.\"\"\""
[3] "Unsolicited Ballots are uncontrollable, totally open to ELECTION INTERFERENCE by foreign countries, and will lead to massive chaos and confusion!"
[4] "RT @markknoller: Pres signed two broadband bills today:\n-requires Pres to ensure security of 5G and subsequent generation cell phone techno…"
[5] "\"\"\"@rebelgirl1213: @realDonaldTrump @Carrienguns America cannot survive another Bush! #NoMoreBush\"\"\""
[6] "I feel so badly for Mark Cuban-the Dallas Mavericks were just eliminated from the playoffs and his partners are pissed. Very sad!"
Sample of non-Trump tweets:

[1] "@Kat_La VOMIT! Ugh."
[2] "Take comfort progressive Australia. Abbott's Victory is the sickness that will force us to vomit out the poison. http://t.co/A5AIw3DdFO"
[3] "@mynameiselysee it looked like this. I want to vomit rn http://t.co/c5pD43TMuA"
[4] "About to vomit"
[5] "I'm not saying that if you voted Liberal you're a horrible vomit-flavoured garbage heap of rancid bigotry but um yeah I am saying that."
[6] "Oh fuck! I think I'm going to vomit.... \nThis is not romance. #TheBachelorAU is getting too tacky. #Sickly #VomitMaterial"
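The code below assumes a combined data frame `tweets_combined` with a `text` column and an `is_trump` label, but its construction does not survive in the extracted slides. A minimal sketch of how it could be built (the input object names `trump_tweets` and `other_tweets`, and the toy texts, are hypothetical):

```r
# Hypothetical input data frames, each with a `text` column
trump_tweets <- data.frame(text = c("@JUrciuoli19 I will.",
                                    "I feel so badly for Mark Cuban..."))
other_tweets <- data.frame(text = c("@Kat_La VOMIT! Ugh.",
                                    "About to vomit"))

# Label each source, then stack them into one data frame
trump_tweets$is_trump <- "trump"
other_tweets$is_trump <- "not trump"

tweets_combined <- rbind(trump_tweets, other_tweets)
table(tweets_combined$is_trump)
```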
library(quanteda)

tweets_corpus <- corpus(tolower(tweets_combined$text),
                        docvars = tweets_combined)

toks_tweets <- tokens(tweets_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      remove_symbols = TRUE,
                      remove_separators = TRUE,
                      include_docvars = TRUE)

# toks_tweets <- tokens_group(toks_tweets, groups = tweets_combined$is_trump)

toks_tweets <- tokens_remove(toks_tweets, stopwords())

dfm_tweets <- dfm(toks_tweets)

Document-feature matrix of: 1,038 documents, 4,487 features (99.76% sparse) and 2 docvars.
           features
docs        bleh feeling vomit gonna bangerz close can just feel miley
  text1        1       1     1     0       0     0   0    0    0     0
  text2        0       0     1     1       1     1   1    1    1     1
  text3        0       0     1     0       0     0   0    0    1     0
  text4        0       0     1     0       0     0   0    0    0     0
  text5        0       0     1     0       0     0   0    0    0     0
  text6        0       0     1     0       0     0   0    0    1     0
[ reached max_ndoc ... 1,032 more documents, reached max_nfeat ... 4,477 more features ]
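Before fitting, the dfm must be split into training and test sets, but that step is missing from the extracted slides. A sketch of one way to produce the `dfm_train` and `trump_observed_train` objects used below (the 70/30 proportion is inferred from the 726 training documents reported in the model output; the seed is an assumption):

```r
library(quanteda)

set.seed(1337)  # seed value is an assumption

# Random 70/30 split at the document level (floor(0.7 * 1,038) = 726)
train_ids <- sample(ndoc(dfm_tweets), size = floor(0.7 * ndoc(dfm_tweets)))

dfm_train <- dfm_tweets[train_ids, ]
dfm_test  <- dfm_tweets[-train_ids, ]

# Observed labels come from the docvars attached to the corpus
trump_observed_train <- docvars(dfm_train, "is_trump")
trump_observed_test  <- docvars(dfm_test, "is_trump")
```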
library(quanteda.textmodels)
# Train Support Vector Machine
trump_svm_model <- textmodel_svm(dfm_train, y = trump_observed_train)
trump_svm_model
Call:
textmodel_svm.dfm(x = dfm_train, y = trump_observed_train)
726 training documents; 3,488 fitted features.
Method: L2-regularized L2-loss support vector classification dual (L2R_L2LOSS_SVC_DUAL)
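The precision/recall code below uses a `confusion_matrix_trump` object whose construction is missing from the extracted slides. A sketch of how it could be produced, assuming a held-out `dfm_test` with observed labels `trump_observed_test` (both names are assumptions):

```r
library(quanteda)
library(quanteda.textmodels)

# Align test features with those seen during training
dfm_test_matched <- dfm_match(dfm_test, features = featnames(dfm_train))

# Predicted classes for the held-out tweets
trump_predicted <- predict(trump_svm_model, newdata = dfm_test_matched)

# Rows = predictions, columns = observed ("true") labels, matching the
# ["trump", "not trump"]-style indexing used below
confusion_matrix_trump <- table(Prediction = trump_predicted,
                                Truth = trump_observed_test)
```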
# Calculate precision, recall, and F1 from the confusion matrix
tp_nb <- confusion_matrix_trump["trump", "trump"]        # true positives
fp_nb <- confusion_matrix_trump["trump", "not trump"]    # false positives
fn_nb <- confusion_matrix_trump["not trump", "trump"]    # false negatives

precision_nb <- tp_nb / (tp_nb + fp_nb)
recall_nb <- tp_nb / (tp_nb + fn_nb)
f1_score_nb <- 2 * (precision_nb * recall_nb) / (precision_nb + recall_nb)

precision_nb
[1] 0.9870968
recall_nb
[1] 1
f1_score_nb
[1] 0.9935065
quanteda.textmodels provides several text classification and scaling models that plug directly into the quanteda workflow:

| Model | Use Case |
|---|---|
| textmodel_lm() | Predicting continuous outcomes |
| textmodel_nb() | Text classification (naive Bayes) |
| textmodel_svm() | Complex classification (linear SVM via LiblineaR) |
| textmodel_wordfish() | Latent scaling (unsupervised) |
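As an illustration, `textmodel_nb()` follows the same fitting interface as `textmodel_svm()`. A minimal self-contained sketch with toy data (texts and labels invented for illustration):

```r
library(quanteda)
library(quanteda.textmodels)

# Tiny toy corpus: two "trump" and two "not trump" documents
toy_dfm <- dfm(tokens(c("make america great again",
                        "america will win again",
                        "about to vomit",
                        "i want to vomit")))
toy_labels <- c("trump", "trump", "not trump", "not trump")

# Fit a (multinomial) naive Bayes classifier and predict in-sample
nb_toy <- textmodel_nb(toy_dfm, y = toy_labels)
predict(nb_toy, newdata = toy_dfm)
```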
tidymodels with Text-as-Data (under construction)

The tidymodels workflow we learned last week can also be applied to text-as-data.
# Convert Document-Feature Matrix to data.frame
trump_dfm_df <- quanteda::convert(dfm_tweets, to = "data.frame")
trump_dfm_df$doc_id <- NULL
# Add is_trump
tidy_tweets <- cbind(trump_dfm_df, is_trump = as.factor(tweets_combined$is_trump))
library(tidymodels)
set.seed(1337)
data_split <- initial_split(tidy_tweets, prop = 0.8, strata = is_trump)
train_data <- training(data_split) # Training set
test_data <- testing(data_split) # Test set

library(naivebayes)
library(discrim)
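The workflow below references a `tweets_recipe` and an `nb_model` that are not defined in the extracted slides. A plausible reconstruction (an assumption, following the standard tidymodels pattern) is:

```r
library(tidymodels)
library(discrim)   # provides naive_Bayes()

# Recipe: predict is_trump from all token-count columns
tweets_recipe <- recipe(is_trump ~ ., data = train_data)

# Model spec: naive Bayes classification via the naivebayes engine
nb_model <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")
```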
# Combine recipe and model in a workflow
nb_workflow <- workflow() %>%
add_recipe(tweets_recipe) %>%
add_model(nb_model)
# Train Model
set.seed(1337)
nb_fit <- fit(nb_workflow,
data = train_data)
# Predict
trump_predictions <- predict(nb_fit, new_data = test_data)

           Truth
Prediction  not trump trump
  not trump         0     0
  trump           104   104

Note that this model labels every one of the 208 test tweets as "trump" — one reason this tidymodels version is still marked "under construction".
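The same predictions can also be scored with yardstick. A sketch, assuming `trump_predictions` holds a `.pred_class` column (the default output of `predict()` on a classification workflow):

```r
library(tidymodels)

# Bind predicted classes to the observed labels, then score
nb_results <- bind_cols(trump_predictions, test_data["is_trump"])

conf_mat(nb_results, truth = is_trump, estimate = .pred_class)
accuracy(nb_results, truth = is_trump, estimate = .pred_class)
```

Given the confusion matrix above, accuracy here is 104/208 = 0.5, i.e. no better than chance for this balanced test set.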