---
title: "IDS Workshop - Quanteda"
subtitle: "How we met Quanteda - Analyzing the TV show ‘How I Met Your Mother’ with quanteda"
author: "Alexander Kraess, Augusto Fonseca, Jorge Roa"
date: "21/11/2022"
output: 
  html_document:
    toc: TRUE
    df_print: paged
    number_sections: FALSE
    highlight: tango
    theme: lumen
    toc_depth: 3
    toc_float: true
    self_contained: false
editor_options: 
  markdown: 
    wrap: sentence
---

```{r setup, include=FALSE}
rm(list = ls()) # to clean the workspace
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

obj_img <- magick::image_read(path = "https://bit.ly/3twmH2Y") # magick is only attached later, so we call it with its namespace
```

------------------------------------------------------------------------
```{r, fig.align='left', echo=F, out.width = "30%"}
knitr::include_graphics("images/hertie_logo.png")


```


```{r, fig.align='center', echo=F, out.width = "60%"}
knitr::include_graphics("images/himym.png")


```


# 1. Welcome to our tutorial about the Quanteda package!

Quanteda is a brilliant package, and we hope to show you how powerful it is!
In this session you will practice the basic concepts, some filters, and some visualization features.

Feel free to visit our GitHub page (https://github.com/jurjoroa/text_analysis_quanteda) to check our code and other details.


**Step 1** Load the packages which we will use (don't forget to install them before!).
```{r, message=F}

library(tidyverse)
library(tidytext)
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(stringr)
library(spacyr)
library(ggsci)
library(ggrepel)
library(RColorBrewer)
library(cowplot)
library(magick)
library(gghighlight)
library(readtext)
library(rvest)
library(xml2)
library(polite)
library(httr) #Package for working with HTTP organised by HTTP verbs 

```
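
If you are not sure whether everything is installed, a quick check like the one below can help.
It is a minimal sketch in base R; the vector `v_packages` is our own illustrative name and lists only a few of the packages from Step 1, so extend it as needed.

```r
# Hypothetical helper vector: a few of the packages loaded in Step 1
v_packages <- c("tidyverse", "quanteda", "quanteda.textstats", "quanteda.textplots")

# requireNamespace() returns TRUE when a package is installed, without attaching it
v_installed <- vapply(v_packages, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
v_missing <- names(v_installed)[!v_installed]
v_missing

# Uncomment to install whatever is missing
# if (length(v_missing) > 0) install.packages(v_missing)
```
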
# 2. Scraping the web to get the data

If you don't want to go through this, you can skip to section "3. Alternatively... load dataframes files!", which loads the same data from files available in our GitHub repository (https://github.com/jurjoroa/text_analysis_quanteda).

**Step 2.1** Scrape the website "Springfield! Springfield!" to get the episode transcripts.

Springfield! Springfield! hosts a database containing thousands of TV show episode scripts and movie scripts.
Read more: https://www.springfieldspringfield.co.uk/

```{r, message=F}

# 02.- Web scrap TV shows scripts ----------------------------------------------


## 02.01- Define URLS and read HTML---------------------------------------------

v_tv_show <- "how-i-met-your-mother"

v_url_web <- "http://www.springfieldspringfield.co.uk/"

session_information <- bow(v_url_web) #Do a bow with the polite package
session_information

v_url <- paste(v_url_web,"episode_scripts.php?tv-show=", v_tv_show, sep="")

rvest_himym <- session(v_url, 
                       add_headers(`From` = "jurjoo@gmail.com", 
                                   `User-Agent` = R.Version()$version.string)) #Identify ourselves to the server

html_url_scrape <- rvest_himym %>% read_html() #Parse the page held by the session

node_selector <- ".season-episode-title"

directory_path <- paste("texts/how-i-met-your-mother/", v_tv_show, sep="")

## 02.02.-Loop for download TV scripts-------------------------------------------

### 02.02.01.-scrape href nodes in .season-episode-title-------------------------

html_url_all_seasons <- html_nodes(html_url_scrape, node_selector) %>%
  html_attr("href")

### 02.02.02.-One loop for all our URL's----------------------------------------

for (x in html_url_all_seasons) {
  read_ur <- read_html(paste(v_url_web, x, sep="/"))
  
  Sys.sleep(runif(1, 0, 1)) #Be polite
  
  # Element node that was checked and that contain the place of the scripts.
  selector <- ".scrolling-script-container"
  # Scrape the text
  text_html <- html_nodes(read_ur, selector) %>% 
    html_text()
  
  # Last five characters of html_url_all_seasons for saving this to separate text files (This is our pattern).
  sub_data <- function(x, n) {
    substr(x, nchar(x) - n + 1, nchar(x))
  }
  seasons_final <- sub_data(x, 5)
  # Write each text file
  write.csv(text_html, file = paste(directory_path, "_", seasons_final, ".txt", sep=""), row.names = FALSE)
}


```


**Step 2.2** Scrape Wikipedia to get the episode and character metadata, and clean it!

```{r, message=F}

# 03.- Web scrape TV show tables-------------------------------------------------

## 03.01.- Information about tv episodes-----------------------------------------

url_himym <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_episodes"

rvest_himym_table <- session(url_himym, 
                             add_headers(`From` = "jurjoo@gmail.com", 
                                         `User-Agent` = R.Version()$version.string))

l_tables_himym <- rvest_himym_table %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE)

#This generates a list with all the tables that the page contains. In our case, 
#we want the tables from the second element to the 10th. 
l_tables_himym <- l_tables_himym[c(2:10)]


### 03.01.01- Data cleaning to obtain clean tables---------------------------------

#Reduce the list in one data frame since all of the tables share the same structure 
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym)) 


#We do the same for the characters of HIMYM
url_himym_characters <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_characters"

rvest_himym_table_2 <- session(url_himym_characters, 
                               add_headers(`From` = "jurjoo@gmail.com", 
                                           `User-Agent` = R.Version()$version.string))

l_tables_himym_characters <- rvest_himym_table_2 %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = TRUE)

df_characters <- as.data.frame(l_tables_himym_characters[[1]]) %>% 
  select(Character)

df_characters_w <- df_characters %>% 
  filter(!stringr::str_starts(Character, "Futu"),
         !(Character %in% c("Character", "Main Characters", 
                            "Supporting Characters"))) %>% 
  mutate(name = str_extract(Character,"([^ ]+)"),
         name = replace(name, name == "Dr.", "Sonya"))

### 03.01.02- Data cleaning to wrangle html tables------------------------------

df_himym <- data.frame(Reduce(bind_rows, l_tables_himym)) 

df_himym_filt <- df_himym %>% filter(str_length(No.overall) < 4)

df_himym_filt_dupl <- df_himym %>% filter(str_length(No.overall) > 4)

df_himym_filt_dupl_1 <- df_himym_filt_dupl %>% 
  mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 1, 3))),
         No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 1, 2))),
         Prod.code = replace (Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 1, 6)))

df_himym_filt_dupl_2 <- df_himym_filt_dupl %>% 
  mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 4, 6))),
         No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 3, 4))),
         Title = replace(Title, Title == "\"The Magician's Code\"", "\"The Magician's Code Part 2\""),
         Title = replace(Title, Title == "\"The Final Page\"", "\"The Final Page Part 2\""),
         Title = replace(Title, Title == "\"Last Forever\"" , "\"Last Forever Part 2\"" ),
         Prod.code = replace(Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 7, 12)))

df_himym_final <- bind_rows(df_himym_filt, 
                            df_himym_filt_dupl_1, 
                            df_himym_filt_dupl_2) %>% 
  arrange(No.overall, No..inseason) %>% 
  mutate(year = str_extract(Original.air.date, '[0-9]{4}+'),
         Season = as.numeric(stringr::str_extract(Prod.code, "^.{1}"))) %>% 
  rename(Chapter = No..inseason)

df_himym_final$US.viewers.millions. <- as.numeric(str_replace_all(df_himym_final$US.viewers.millions., "\\[[0-9]+\\]", ""))


```
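
Wikipedia lists double episodes in a single table row, so their numbers arrive glued together, and the viewer figures carry footnote markers.
The sketch below replays the cleaning logic of the chunk above in base R; all values are made up for illustration.

```r
# Made-up example of a double-episode row as it comes out of html_table()
v_no_overall  <- "135136"      # episodes 135 and 136 glued together
v_no_inseason <- "2324"        # episodes 23 and 24 of the season
v_viewers     <- "10.94[12]"   # viewers in millions with a footnote marker

# First episode of the pair: leading characters
first_overall  <- as.numeric(substr(v_no_overall, 1, 3))
first_inseason <- as.numeric(substr(v_no_inseason, 1, 2))

# Second episode of the pair: trailing characters
second_overall  <- as.numeric(substr(v_no_overall, 4, 6))
second_inseason <- as.numeric(substr(v_no_inseason, 3, 4))

# Drop the footnote marker before converting to a number
viewers <- as.numeric(gsub("\\[[0-9]+\\]", "", v_viewers))

c(first_overall, second_overall, first_inseason, second_inseason, viewers)
```

Note that the fixed positions only work when both overall numbers have three digits and both in-season numbers have two digits.
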


**Step 2.3** Load TV scripts (txt saved files) and merge data

```{r, message=F}

# 04.- Load TV scripts and merge data-------------------------------------------
df_texts_himym <- readtext::readtext("texts/how-i-met-your-mother/*.txt")

v_season <- as.numeric(stringr::str_extract(df_texts_himym$doc_id, "\\d+"))

v_chapter <- as.numeric(stringi::stri_extract_last_regex(df_texts_himym$doc_id, "[0-9]+"))

df_texts_himym_w <- df_texts_himym %>% mutate(Season = v_season, Chapter = v_chapter)

df_himym_final_doc <- full_join(as.data.frame(df_texts_himym_w), df_himym_final, by = c("Season", "Chapter")) %>% 
  mutate(Season_w = paste("Season", Season),
         Title_season = paste0(Title, " S", Season, " EP", Chapter))


```
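
The season and chapter used for the join come from the digits in each file name.
Here is a minimal base R sketch of that extraction; the two file names are hypothetical and simply follow the pattern written by the scraping loop.

```r
# Hypothetical file names following the pattern written by the scraping loop
v_doc_id <- c("how-i-met-your-mother_01e01.txt",
              "how-i-met-your-mother_09e24.txt")

# All runs of digits in each file name
l_digits <- regmatches(v_doc_id, gregexpr("[0-9]+", v_doc_id))

# First run of digits = season, last run of digits = chapter
v_season  <- as.numeric(vapply(l_digits, function(d) d[1], character(1)))
v_chapter <- as.numeric(vapply(l_digits, function(d) d[length(d)], character(1)))

data.frame(doc_id = v_doc_id, Season = v_season, Chapter = v_chapter)
```
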

**Step 2.4** OPTIONAL: Save the final dataframe to be used as a corpus text

```{r, message=F}

# 05.- OPTIONAL: Save the final dataframe to be used as a corpus text -----------------------------


#save(df_himym_final_doc, file = "data/df_himym_final_doc.Rdata")
#save(df_characters_w, file = "data/df_characters_w.Rdata")


```


# 3. Alternatively... load dataframes files!

If you went through section 2, you already have the raw data and can jump to section "4. Uploading data into Quanteda corpus".

**Step 3.1** Load the files

```{r, message=F}
## 03.1- Load data- -----------------------------------------------------------

#If you want to know how we generated this data, go to section 2 (or the script 02_web_scrap)
load("data/df_himym_final_doc.Rdata")
load("data/df_characters_w.Rdata")


```

# 4. Uploading data into Quanteda corpus

OK! It's showtime! Let's load all our data into a Quanteda corpus object.

**Step 4.1** First Step: Define a corpus

```{r, message=F}

# 01.- Quanteda analysis -------------------------------------------------------

# 02.- First step: Define a corpus ---------------------------------------------

corp_himym <- corpus(df_himym_final_doc)  #Build a new corpus from the texts

docnames(corp_himym) <- df_himym_final_doc$Title

summary(corp_himym, n = 15)

```
**Step 4.2** Convert corpus into tokens and wrangle it

```{r}

# 03.- Second step: Convert corpus into tokens and wrangle it ------------------

corp_himym_stat <- corp_himym

docnames(corp_himym_stat) <- df_himym_final_doc$Title_season


corp_himym_s1_simil <- corpus_subset(corp_himym_stat, Season == 1) #We want to analyze just the first season


toks_himym_s1 <- tokens(corp_himym_s1_simil, #corpus from all the episodes from the first season
                        remove_punct = TRUE, #Remove punctuation of our texts
                        remove_separators = TRUE, #Remove separators of our texts
                        remove_numbers = TRUE, #Remove numbers of our texts
                        remove_symbols = TRUE) %>% #Remove symbols of our texts
  tokens_remove(stopwords("english")) #Remove stop words of our texts


```

**Step 4.3** Create a DFM.

```{r}

# 03.- Third step: Convert our tokens into a Document Feature Matrix -----------

toks_himym_dm_s1 <- toks_himym_s1 %>% 
                    dfm()

```
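
To see the three steps (corpus, tokens, DFM) in one place, here is a minimal sketch on a made-up data set of two short "episodes".
The object names with the `_toy` suffix and the texts are our own illustration and are not part of the workshop data.

```r
library(quanteda)

# Made-up mini data set with the columns corpus() looks for by default
df_toy <- data.frame(doc_id = c("ep1", "ep2"),
                     text = c("Ted meets Robin at the bar.",
                              "Barney says it is going to be legendary!"),
                     Season = c(1, 1),
                     stringsAsFactors = FALSE)

corp_toy <- corpus(df_toy)                                 # First step: corpus
toks_toy <- tokens(corp_toy, remove_punct = TRUE)          # Second step: tokens
toks_toy <- tokens_remove(toks_toy, stopwords("english"))  # Drop stop words
dfm_toy  <- dfm(toks_toy)                                  # Third step: DFM

ndoc(dfm_toy)       # Number of documents
featnames(dfm_toy)  # Remaining features, lower-cased by dfm()
```
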


Let's check the new subsetted DFM


```{r, eval=F}

toks_himym_dm_s1

```

# 5. Now... let's have fun!

Now that we have loaded the data and created the Quanteda objects, we can try some basic analyses.


**Step 5.1** Have you ever thought that you had seen an episode before? 
Let's check the similarity between episodes!


```{r, eval=F}
# 05.- Similarity between episodes --------------------------------------------

## 05.01.- textstat_simil function- --------------------------------------------

tstat_simil <- textstat_simil(toks_himym_dm_s1) #Check similarity between episodes of the first season (default method: correlation)

#hclust() expects dissimilarities, so we turn the similarity into a distance (1 - correlation)
clust <- hclust(as.dist(1 - as.matrix(tstat_simil))) #Convert our object into a cluster (For visualization purposes)

dclust <- as.dendrogram(clust)  #Convert our cluster into a dendrogram (For visualization purposes)

dclust <- reorder(dclust, 1:22) #Order our visualization

#Set colors
nodePar <- list(lab.cex = 1, pch = c(NA, 19), 
                cex.axis = 1.5,
                cex = 2, col = "#0080ff")


## 05.02.- Plot Similarity between episodes--------------------------------------------


#Other methods than correlation are available through the method argument of textstat_simil()
par(mar = c(15, 7, 2, 1))

#Plot dendrogram
plot(dclust, nodePar = nodePar,
     las = 1,
     cex.lab = 2, cex.axis = 2, cex.main = 2, cex.sub = 2,
     main = "How I Met Your Mother Season 1",
     type = "triangle",
     ylim = c(0,1),
     ylab = "Dissimilarity between episodes (1 - correlation)",
     edgePar = list(col = 4:7, lwd = 7:7),
     panel.first = abline(h = c(seq(.10, 1, .10)), col = "grey80"))

rect.hclust(clust, k = 5, border = "red")

```
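
Hierarchical clustering with `hclust()` works on dissimilarities: small values mean "close together".
The sketch below illustrates this with base R only; the toy count matrix is made up, and `cor()` stands in for the correlation similarity as an assumption for illustration.

```r
# Made-up word counts: rows are episodes, columns are four features
m_toy <- rbind(ep1 = c(3, 0, 1, 2),
               ep2 = c(2, 0, 1, 3),
               ep3 = c(0, 4, 2, 0))

# Correlation between episodes (cor() works on columns, so we transpose)
sim_toy <- cor(t(m_toy))

# Turn the similarity into a dissimilarity before clustering
dist_toy <- as.dist(1 - sim_toy)
clust_toy <- hclust(dist_toy)

# The two most similar episodes are merged first
clust_toy$merge

# Cut the tree into two groups
cutree(clust_toy, k = 2)
```

Here `ep1` and `ep2` use similar words and end up in the same group, while `ep3` stays on its own.
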

**Step 5.3** How distant are the episodes from each other?


```{r, eval=F}
# 06.- Distance between episodes (Euclidean) -----------------------------------

## 06.01.- textstat_dist function- ---------------------------------------------

tstat_dist <- textstat_dist(toks_himym_dm_s1) #Default method: Euclidean distance
clust_dist <- hclust(as.dist(tstat_dist))
dclust_dist <- as.dendrogram(clust_dist)

dclust_dist <- reorder(dclust_dist, 22:1)

nodePar_2 <- list(lab.cex = 1.2, pch = c(NA, 19), 
                  cex = 1.8, col = 11)

## 06.02.- Plot Distance between episodes (Euclidean)---------------------------

par(mar = c(15,7,2,1))

plot(dclust_dist, nodePar = nodePar_2,
     cex.lab = 2, cex.axis = 2, cex.main = 2, cex.sub = 2,
     main = "How I Met Your Mother Season 1",
     type = "triangle", ylim = c(0, 120),
     ylab = "Distance between episodes (Euclidean)",
     edgePar = list(col = 11:19, lwd = 7:7),
     panel.first = abline(h = c(seq(10, 120, 10)), col = "grey80"))

rect.hclust(clust_dist, k = 5, border = "red")


```


**Step 5.4** Who is the main character for you? Does it depend on the season?


```{r, eval=F}
# 07.- Appearances of actors by season------------------------------------------

## 07.01.- Characters by season--------------------------------------------------

#Remember our second step: tokenize our corpus. 

toks_himym <- tokens(corp_himym, #corpus from all the episodes from all seasons
                     remove_punct = TRUE, #Remove punctuation of our texts
                     remove_separators = TRUE,  #Remove separators of our texts
                     remove_numbers = TRUE, #Remove numbers of our texts
                     remove_symbols = TRUE) %>% #Remove symbols of our texts
  tokens_remove(stopwords("english")) #Remove stop words of our texts

#Remember our third step: DFM object

dfm_actors <- toks_himym %>% 
  tokens_select(c("Ted", "Marshall", "Lily", "Robin", "Barney", "Mother")) %>% #We just want to analyze these characters
  tokens_group(groups = Season) %>% #We group our tokens (scripts) by season
  dfm() #Transform the token into a DFM object

## 07.02.- textstat_frequency function------------------------------------------

df_final_actors <-  as.data.frame(textstat_frequency(dfm_actors, groups = c(1:9))) %>% 
                    mutate(Season = paste("Season", group),
                           `Principal Characters` = replace(feature, is.character(feature), str_to_title(feature))) %>% 
                    select(-feature)

## 07.03.- Plot frequency of actors--------------------------------------------

ggplot1 <- ggplot(df_final_actors, aes(x = group, y = frequency, group = `Principal Characters`, color = `Principal Characters`)) +
  geom_line(size = 1.5) +
  scale_color_manual(values = brewer.pal(n = 6, name = "Dark2")) +
  geom_point(size = 3.2) +
  scale_y_continuous(breaks = seq(0, 560, by = 50), limits = c(0,560))+
  theme_minimal(base_size = 14) +
  labs(x = "Number of Season",
       y = "Frequency of appearances",
       title = "Appearances of principal characters by Season",
       caption="Description: This plot shows the number of times that the \n principal characters appear in HIMYM per season.")+
       theme(panel.grid.major=element_line(colour="#cfe7f3"),
             panel.grid.minor=element_line(colour="#cfe7f3"),
             plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
             #axis.text.x=element_text(size=15),
             #axis.text.y=element_text(size=15),
             plot.caption=element_text(size=12, hjust=.1, color="#939393"),
             legend.position="bottom",
             plot.margin = margin(t = 20,  # Top margin
                                  r = 50,  # Right margin
                                  b = 40,  # Bottom margin
                                  l = 10), # Left margin
             text=element_text(family="sans")) + 
#geom_segment(aes(x = 8.5, y = 75, xend = 8.8, yend = 70),
#             arrow = arrow(length = unit(0.1, "cm")))+
  guides(colour = guide_legend(ncol = 6))

ggdraw(ggplot1) + draw_image(obj_img, x = .97, y = .97, 
                               hjust = 1.1, vjust = .7, 
                               width = 0.11, height = 0.1)

RColorBrewer::brewer.pal(n = 7, name = "Set1")

```
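
If the grouping step feels abstract, this minimal sketch shows what `textstat_frequency()` returns when grouped by a document variable.
The three mini "scripts" and the `_toy` object names are made up for illustration.

```r
library(quanteda)
library(quanteda.textstats)

# Made-up scripts: two episodes in season 1, one in season 2
df_toy <- data.frame(doc_id = c("s1e1", "s1e2", "s2e1"),
                     text = c("Ted Ted Robin", "Barney Ted", "Robin Robin Barney"),
                     Season = c(1, 1, 2),
                     stringsAsFactors = FALSE)

# Corpus -> tokens -> DFM, as in the previous steps
dfm_toy <- dfm(tokens(corpus(df_toy)))

# Frequency of each name, counted separately for each season
df_freq_toy <- textstat_frequency(dfm_toy, groups = Season)
df_freq_toy[, c("feature", "frequency", "group")]
```

Each row of the result is one feature within one group, which is the long format that `ggplot()` expects.
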

**Step 5.5** Which characters appear most frequently in the TV show?


```{r, eval=F}
# 08.- Wordcloud of PRINCIPAL characters that appear in HIMYM-------------------

## 08.01.- Wordcloud steps------------------------------------------------------

### 08.01.01.- Second step: Tokens----------------------------------------------

toks_himym_characters <- tokens(corp_himym, #corpus from all the episodes from all season
                                remove_punct = TRUE, #Remove punctuation of our texts
                                remove_separators = TRUE, #Remove separators of our texts
                                remove_numbers = TRUE, #Remove numbers of our texts
                                remove_symbols = TRUE) %>% #Remove symbols of our texts
  tokens_keep(c(unique(df_characters_w$name))) #This function allows us to keep just the tokens that we want. 

#In this case, we just want the characters.

### 08.01.02.- Third step: DFM object-------------------------------------------

dfm_general_characters <- toks_himym_characters %>%
                          dfm()

## 08.02.- Generate Wordcloud --------------------------------------------------

textplot_wordcloud(dfm_general_characters, 
                   rotation = 0.25,
                   font = "sans",
                   min_count = 1, #Minimum frequency
                   color = brewer.pal(11, "RdBu"))
#RColorBrewer::display.brewer.all()


```

**Step 5.6** What about the SECONDARY characters?


```{r, eval=F}
# 09.- Wordcloud of SECONDARY characters that appear in HIMYM-------------------

## 09.01.- Wordcloud steps------------------------------------------------------

### 09.01.01.- Second step: Tokens----------------------------------------------

toks_himym_sec_characters <- tokens(corp_himym, #corpus from all the episodes from all season
                                    remove_punct = TRUE, #Remove punctuation of our texts
                                    remove_separators = TRUE, #Remove separators of our texts
                                    remove_numbers = TRUE, #Remove numbers of our texts
                                    remove_symbols = TRUE) %>% #Remove symbols of our texts
  tokens_keep(c(unique(df_characters_w$name))) %>% #We want to keep all the characters
  tokens_remove(c("Ted", "Barney", "Lily", "Robin", "Marshall")) #But we remove the principal characters

### 09.01.02.- Third step: DFM object-------------------------------------------

dfm_general_sec_characters <- toks_himym_sec_characters %>%
                              dfm()

## 09.02.- Generate Wordcloud --------------------------------------------------

textplot_wordcloud(dfm_general_sec_characters, 
                   random_order = FALSE, 
                   rotation = 0.25,
                   #comparison = TRUE,
                   labelsize = 1.5,
                   min_count = 1, #Minimum frequency
                   color = RColorBrewer::brewer.pal(8, "Spectral"))


```

# 6. spaCy and spaCyr

spacyr is an R package that gives us access to the functions of spaCy, a Python library for natural language processing, for the analysis of text. 


**Step 6.1** Explanation


```{r, eval=F}
# 10.- spaCy and spaCyr ------------------

#Explain what is spaCy and spaCyr

#Remember that spaCyr is a package where we can use the amazing functions of spaCy for analysis of text. 

## Using spaCyr for our TV show

#library(spacyr)
#
#spacy_install()
#
#spacy_initialize(model = "en_core_web_sm")

#We will not run this part of the chunk because it takes about 5 minutes. 
#It only installs spaCy and the language model in a Python environment.

## 10.01.- Load data------------------------------------------------------------

load("data/df_spaCyr_himym.Rdata")


## 10.02.- Review structure-----------------------------------------------------

#Look how the spacyr package separates our sentences into words and classifies them as 
#verbs, prepositions, adverbs, adjectives, etc. 
#head(sp_parse_doc)

## 10.03.- Filter data by type of word------------------------------------------

sp_parse_var <- full_join(sp_parse_doc, df_himym_final_doc, by = c("doc_id"))

#In this case, we will just look at the proper nouns (names of people) and the adjectives.

sp_parse_var_PROPN <- sp_parse_var %>% filter(pos=="PROPN" & stringr::str_starts(entity, "PERSON_B"))

sp_parse_var_ADJ <- sp_parse_var %>% filter(pos=="ADJ")


## 10.04.- Get wordcloud using an spaCyr output---------------------------------

### 10.04.01.- Second step: Tokens----------------------------------------------

toks_himym_ADJ <- tokens(corp_himym, #corpus from all the episodes from all season
                         remove_punct = TRUE, #Remove punctuation of our texts
                         remove_separators = TRUE,  #Remove separators of our texts
                         remove_numbers = TRUE, #Remove numbers of our texts
                         remove_symbols = TRUE) %>%  #Remove symbols of our texts
  tokens_keep(c(unique(sp_parse_var_ADJ$lemma))) %>% #We want to keep all the adjectives
  tokens_remove(c(stopwords("english"), "oh", "yeah", "okay", "like", 
                  "get", "got", "can", "one", "hey", "go",
                  "Ted", "Marshall", "Lily", "Robin", "Barney", "just", 
                  "know", "well", "right", "even", "see", 
                  "sure", "back", "first", "said", "maybe", "wedding", 
                  "whole", "wait")) #But we remove stopwords and other words that spaCy did not classify correctly. 

### 10.04.02.- Third step: DFM object-------------------------------------------

df_general_ADJ <- toks_himym_ADJ %>%
  tokens_group(groups = Season_w) %>% #group by season
  dfm() %>% dfm_subset(Season < 9)

### 10.04.03.- Wordcloud of adjectives -----------------------------------------

#Because of a function limitation, we can compare at most 8 groups

textplot_wordcloud(df_general_ADJ, 
                   random_order = FALSE, 
                   rotation = 0.25,
                   comparison = TRUE,
                   labelsize = 1.5, 
                   min_count = 1, #Minimum frequency
                   color = ggsci::pal_lancet(palette = "lanonc")(8)) #pal_lancet() returns a palette function, so we call it to get 8 colors
#color = RColorBrewer::brewer.pal(10, "Spectral")) 


```
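
The filtering in section 10.03 boils down to selecting rows by part-of-speech tag and entity type.
Below is a minimal base R sketch on a made-up data frame that mimics a few columns of the parsed output; the rows and the `_toy` names are our own illustration.

```r
# Made-up rows mimicking some columns of a spaCy parse
df_parse_toy <- data.frame(
  doc_id = "ep1",
  token  = c("Ted", "is", "awesome", "and", "Barney", "is", "legendary"),
  lemma  = c("Ted", "be", "awesome", "and", "Barney", "be", "legendary"),
  pos    = c("PROPN", "AUX", "ADJ", "CCONJ", "PROPN", "AUX", "ADJ"),
  entity = c("PERSON_B", "", "", "", "PERSON_B", "", ""),
  stringsAsFactors = FALSE
)

# Keep only the adjectives
df_adj_toy <- df_parse_toy[df_parse_toy$pos == "ADJ", ]

# Keep only proper nouns that start a PERSON entity
df_person_toy <- df_parse_toy[df_parse_toy$pos == "PROPN" &
                                startsWith(df_parse_toy$entity, "PERSON_B"), ]

unique(df_adj_toy$lemma)   # Lemmas we would pass to tokens_keep()
df_person_toy$token        # Names of people
```
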


**Step 6.2** Get frequency of adjectives


```{r, eval=F}
## 10.05.- Get frequency of adjectives------------------------------------------

### 10.05.01.- Remember our third step: DFM object------------------------------

freq_gen_dfm <- toks_himym_ADJ %>%
  dfm()

#Generate dataframe
df_freq_gen_dfm <-  as.data.frame(textstat_frequency(freq_gen_dfm, # Our DFM object
                                                     n = 10, #Number of observations displayed
                                                     groups = Season)) #Grouped by season
                                  
df_freq_gen_dfm_match <- df_freq_gen_dfm %>% mutate(total = 1) %>% 
                                  group_by(feature) %>% 
                                  summarise(total = sum(total)) %>% 
                                  filter(total== 9)

df_freq_gen_dfm_final <- right_join(df_freq_gen_dfm, df_freq_gen_dfm_match,
                                   by = "feature") %>% rename(Word = feature) %>% 
                                   mutate(Word = str_to_title(Word))

### 10.05.02.- Plot frequency of adjectives-------------------------------------

ggplot2 <- ggplot(df_freq_gen_dfm_final, aes(x = group, y = frequency, group = Word, color = Word)) +
  geom_line(size = 1.5, show.legend = TRUE) +
  scale_color_manual(values = rev(brewer.pal(n = 7, name = "Dark2"))) +
  geom_point(size = 3.2) +
  theme_minimal(base_size = 14) +
  labs(x = "Number of Season",
       y = "Frequencies of words",
       title = "Frequency of adjectives",
       caption="Description: This plot shows the top adjectives that appear in every season of HIMYM")+
  theme(panel.grid.major=element_line(colour="#cfe7f3"),
        panel.grid.minor=element_line(colour="#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        #axis.text.x=element_text(size=15),
        #axis.text.y=element_text(size=15),
        plot.caption=element_text(size=12, hjust=.1, color="#939393"),
        legend.position="bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 40,  # Bottom margin
                             l = 10), # Left margin
        text=element_text()) + 
  #geom_segment(aes(x = 8.5, y = 75, xend = 8.8, yend = 70),
  #             arrow = arrow(length = unit(0.1, "cm")))+
  guides(colour = guide_legend(ncol = 4)) +
  gghighlight(max(frequency) > 140,
              keep_scales = TRUE,
              unhighlighted_params = list(colour = NULL, alpha = 0.2))
  

ggdraw(ggplot2) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)


```
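
The "match" step above keeps only the words that make it into the top list of every season.
Here is the same idea as a minimal base R sketch with three made-up seasons; it assumes each word appears at most once per season, as it does in a frequency table.

```r
# Made-up top words per season (group)
df_top_toy <- data.frame(
  feature = c("good", "great", "good", "little", "good", "great"),
  group   = c("1", "1", "2", "2", "3", "3"),
  stringsAsFactors = FALSE
)

# Number of seasons in the data
n_seasons <- length(unique(df_top_toy$group))

# In how many seasons does each word appear?
v_counts <- table(df_top_toy$feature)

# Keep the words that appear in all seasons
v_in_all <- names(v_counts)[v_counts == n_seasons]
v_in_all

# Rows that would go into the plot
df_top_toy[df_top_toy$feature %in% v_in_all, ]
```
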

# 7. Network plots

**Step 7.1** Network plot


```{r, eval=F}
# How are the characters related to each other? 

## 11.01.- Network steps------------------------------------------------------

### 11.01.01.- Second step: Tokens----------------------------------------------


token_characters_himym <- tokens(corp_himym, #corpus from all the episodes from all season
                                 remove_punct = TRUE, #Remove punctuation of our texts
                                 remove_separators = TRUE, #Remove separators of our texts
                                 remove_numbers = TRUE, #Remove numbers of our texts
                                 remove_symbols = TRUE) %>%  #Remove symbols of our texts
  tokens_keep(c(unique(df_characters_w$name))) %>% #We want to keep all the characters
  tokens_tolower() #We want lower cases in our tokens


### 11.01.02.- Extra step: create a feature co-occurrence matrix (FCM)-----------

fcm_characters_himym <- token_characters_himym %>%
                        fcm(context = "window", window = 5, tri = FALSE)

## 11.02.- Network plot of all characters----------------------------------------

#Vector with all the characters
v_top_characters <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym, 70)))

set.seed(100)

textplot_network(fcm_select(fcm_characters_himym, v_top_characters),
                 edge_color = "#008eed", 
                 edge_size = 2, 
                 vertex_labelcolor = "#006fba", 
                 omit_isolated = TRUE,
                 min_freq = .1)


```
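
A feature co-occurrence matrix counts how often two tokens appear within a given window of each other.
The sketch below uses a made-up three-word text and a window of 1 so that the counts are easy to check by hand.

```r
library(quanteda)

# Made-up text: three names in a row
toks_toy <- tokens("ted robin barney")

# Count co-occurrences of tokens that are at most 1 position apart
fcm_toy <- fcm(toks_toy, context = "window", window = 1, tri = FALSE)

# Convert to a regular matrix to inspect it
m_toy <- as.matrix(fcm_toy)
m_toy
```

"ted" and "robin" are neighbours, so they co-occur once; "ted" and "barney" are two positions apart, so with a window of 1 they do not co-occur.
With `tri = FALSE` the matrix is symmetric.
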


**Step 7.2** Network plot with 30 principal characters


```{r, eval=F}
## 11.03.- Network plot with 30 principal characters----------------------------

#Vector with 30 characters
v_top_characters_2 <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym, 30)))

textplot_network(fcm_select(fcm_characters_himym, v_top_characters_2),
                 edge_color = "#008eed", 
                 edge_size = 5, 
                 vertex_labelcolor = "#006fba",
                 omit_isolated = TRUE,
                 min_freq = .1)


```


**Step 7.3** Network plot of Ted


```{r, eval=F}
## 11.04.- Network plot of Ted -------------------------------------------------

fcm_characters_himym_ted <- token_characters_himym %>%
  tokens_remove(c("marshall", "lily", "barney", "robin")) %>% #Here we just want Ted, that is why we remove the other principal characters
  fcm(context = "window", window = 5, tri = FALSE)

#Vector with 30 characters
v_top_characters_3 <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym_ted, 30)))

#Create a FCM matrix with our characters
vertex_size_f <- fcm_select(fcm_characters_himym_ted, pattern = v_top_characters_3)

#Create a proportion 
v_proportion <- rowSums(vertex_size_f)/min(rowSums(vertex_size_f))

#Vector of Ted
x_p <- c("ted")

#Replace that proportion in our vector
final_v <- replace(v_proportion, names(v_proportion) %in% x_p, 
                   v_proportion[names(v_proportion) %in% x_p]/15)

textplot_network(fcm_select(fcm_characters_himym_ted, v_top_characters_3),
                 edge_color = "#008eed", 
                 edge_size = 5, 
                 vertex_labelcolor = "#006fba",
                 omit_isolated = TRUE,
                 vertex_labelsize = final_v,
                 min_freq = .1)


```
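
The label sizes above are proportions relative to the least connected character, with Ted scaled down so that his label does not dominate the plot.
Here is a minimal base R sketch of that rescaling with made-up totals.

```r
# Made-up co-occurrence totals per character
v_row_sums <- c(ted = 300, lily = 20, barney = 10)

# Size of each character relative to the smallest one
v_proportion <- v_row_sums / min(v_row_sums)

# Character whose label we want to shrink
x_p <- c("ted")

# Divide only that character's proportion by 15
final_v <- replace(v_proportion, names(v_proportion) %in% x_p,
                   v_proportion[names(v_proportion) %in% x_p] / 15)
final_v
```

The divisor 15 is the one used in the chunk above; it is a visual choice, not a statistical one.
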

# 8. Collocations

**Step 8.1** Text stat collocation


```{r, eval=F}
# 12.- Text stat collocation ---------------------------------------------------

#Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
#textstat_collocations()

### 12.01.01.- Second step: Tokens----------------------------------------------

toks_himym_s1 <- tokens(corp_himym_s1_simil, #Define our corpus for the first season
                        padding = TRUE) %>% #Leave an empty string where the removed tokens previously existed
  tokens_remove(stopwords("english")) #Remove stopwords of our token


## 12.02.- textstat_collocations function --------------------------------------

himym_s1_collocations <-textstat_collocations(toks_himym_s1, #Our token object
                                              tolower = F) #Keep capital letters


df_himym_s1_coll <- data.frame(himym_s1_collocations) %>% 
                        rename(`Total of collocations` = count)

## 12.03.- Plot collocations --------------------------------------

ggplot3 <- ggplot(df_himym_s1_coll, aes(x = z, y = lambda, label = collocation)) +
  geom_point(alpha = 0.2, aes(size = `Total of collocations`), color = "#00578a")+
  geom_point(data = df_himym_s1_coll %>% filter(z > 15), 
             aes(x = z, y = lambda, size = `Total of collocations`),
             color = '#00578a') + 
  geom_text_repel(data = df_himym_s1_coll %>% filter(z > 15), #Function from ggrepel package. Show scatterplots with text.
                  aes(label = collocation), size = 3, #Fixed label size (a size mapping here would be ignored)
                  box.padding = unit(0.35, "lines"),
                  point.padding = unit(0.3, "lines")) + 
  scale_y_continuous(breaks = seq(0, 16, by = 1), limits = c(0,16))+
  theme_minimal(base_size = 14) +
  labs(x = "Z statistic",
       y = "Lambda",
       title = "Collocations of words in the first season",
       caption = "Description: This plot identifies and scores multi-word expressions of the 1st season")+
  theme(panel.grid.major = element_line(colour = "#cfe7f3"),
        panel.grid.minor = element_line(colour = "#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        #axis.text.x=element_text(size=15),
        #axis.text.y=element_text(size=15),
        plot.caption = element_text(size=12, hjust=.1, color="#939393"),
        legend.position="bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 10,  # Bottom margin
                             l = 10))

ggdraw(ggplot3) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)


#lambda is the collocation scoring metric
#count is simply the number of times a given collocation appears


```
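
To get a feel for the `count`, `lambda` and `z` columns before plotting them, the sketch below scores two-word collocations on a few made-up sentences. The texts and object names (`toy_texts`, `toy_toks`, `toy_coll`) are hypothetical and not part of the HIMYM data; only `tokens()` and `textstat_collocations()` come from quanteda and quanteda.textstats.

```{r, eval=F}
library(quanteda)
library(quanteda.textstats)

# Toy texts (hypothetical, not taken from the episode summaries)
toy_texts <- c(d1 = "Barney loves New York. Ted loves New York too.",
               d2 = "Robin moved to New York from Canada.")

# Tokenize and drop punctuation
toy_toks <- tokens(toy_texts, remove_punct = TRUE)

# Score two-word sequences that appear at least twice
toy_coll <- textstat_collocations(toy_toks, size = 2, min_count = 2)

# One row per collocation with its frequency and scores
toy_coll[, c("collocation", "count", "lambda", "z")]
```

With real data, `min_count` would usually be set higher so that rare word pairs do not clutter the plot.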

# 9. Keywords in context

**Step 9.1** Locate keywords in context with the `kwic()` function


```{r, eval=F}
# 13.- Locate keywords-in-context ----------------------------------------------


## 13.01.- Set dataframe to merge with other information--------------------------

df_title_s_chp <- df_himym_final_doc %>% 
                  select(Title, Season, Chapter, No.overall, 
                         Season_w, US.viewers.millions.)

### 13.02.01.- First step: Define a corpus --------------------------------------

corp_himym <- corpus(df_himym_final_doc)  # build a new corpus from the texts

docnames(corp_himym) <- df_himym_final_doc$Title #Rename docnames with Title of the episode

corp_himym_s5 <- corpus_subset(corp_himym, #our corpus
                               Season == 5) #Filter by season


### 13.02.02.- Second step: Tokenize the corpus ---------------------------------

toks_himym_s5 <- tokens(corp_himym_s5, #Corpus of season 5
                        padding = TRUE) #Leave an empty string where removed tokens previously existed

## 13.03- kwic function---------------------------------------------------------


kw_himym_s5_love <- kwic(toks_himym_s5, #token object.
                         pattern = "love*", #pattern that we want to look for.
                         window = 10) #how many words you want before and after your pattern.

### 13.03.01- Wrangle dataframe of kwic output----------------------------------


df_kw_himym_s5_love <- as.data.frame(kw_himym_s5_love)  %>% 
  rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>% 
  rename_with(str_to_title, .cols = everything()) %>%  left_join(df_title_s_chp, 
                                                                 by ="Title") %>% 
  relocate(Title, Season, Chapter)

df_kw_himym_s5_love

### 13.04.01.- Tokenize all seasons and search for "legendary" ------------------

toks_himym <- tokens(corp_himym,  #Define our corpus for all seasons
                     padding = TRUE) #Leave an empty string where the removed tokens previously existed

kw_himym_legendary <- kwic(toks_himym, #token object.
                           pattern = "legendary*",  #pattern that we want to look for.
                           window = 10) #how many words you want before and after your pattern.

### 13.04.02.- Wrangle dataframe of kwic output----------------------------------

df_kw_himym_legendary <- as.data.frame(kw_himym_legendary)  %>% 
  rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>% 
  rename_with(str_to_title, .cols = everything()) %>%  left_join(df_title_s_chp, 
                                                                 by = "Title") %>% 
  relocate(Title, Season, Chapter)

df_kw_himym_legendary


### 13.05.01.- Search for a multi-word phrase -----------------------------------


kw_himym_wait_for <- kwic(toks_himym, #token object.
                          pattern = phrase("wait for it"),  #Here we can specify even a phrase
                          window = 10) #how many words you want before and after your pattern.

### 13.05.02.- Wrangle dataframe of kwic output----------------------------------

df_kw_himym_wait_for <- as.data.frame(kw_himym_wait_for)  %>% 
  rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>% 
  rename_with(str_to_title, .cols = everything()) %>%  left_join(df_title_s_chp, 
                                                                 by = "Title") %>% 
  relocate(Title, Season, Chapter)


df_kw_himym_wait_for


```
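
The phrase search above can be tried on a small scale first. This is a minimal sketch on two made-up sentences; the texts and the names `toy_texts`, `toy_kw` and `toy_df` are hypothetical, while `kwic()`, `phrase()` and `tokens()` are the same quanteda functions used in the chunk above.

```{r, eval=F}
library(quanteda)

# Toy texts (hypothetical), one "episode" per element
toy_texts <- c(ep1 = "It is going to be legen wait for it dary",
               ep2 = "Just wait for it Ted")

# phrase() makes kwic() match the three words as one sequence
toy_kw <- kwic(tokens(toy_texts),
               pattern = phrase("wait for it"),
               window = 2) # two words of context on each side

# Convert to a data frame: one row per match
toy_df <- as.data.frame(toy_kw)
toy_df[, c("docname", "pre", "keyword", "post")]
```

Without `phrase()`, a pattern containing spaces is compared against single tokens, so it would not match a multi-word sequence.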

# 10. Sentiment analysis

**Step 10.1** Count positive and negative words per episode


```{r, eval=F}
# 14.- Sentiment analysis --------------------------------------------------------


## 14.01.- Tokenize the corpus and remove stopwords ---------------------------

toks_himym <- tokens(corp_himym, #Our corpus object
                     remove_punct = TRUE, #Remove punctuation in our texts
                     remove_separators = TRUE, #Remove separators in our texts
                     remove_numbers = TRUE, #Remove numbers in our texts
                     remove_symbols = TRUE) %>% #Remove symbols in our texts
  tokens_remove(stopwords("english")) #Remove English stopwords

#tidy_sou <- df_himym_final_doc %>%
#  unnest_tokens(word, text) This is another way, using tidytext

## 14.02- Get positive and negative words --------------------------------------

df_positive_words <- get_sentiments("bing") %>% #We have four options: "bing", "afinn", "loughran", "nrc" 
  filter(sentiment == "positive")

df_negative_words <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

## 14.03.- Define a dictionary with positive and negative words from bing --------------------------------------

l_sentiment_dictionary <- dictionary(list(positive = df_positive_words, 
                                        negative = df_negative_words))

#The dfm loaded in the next step can be built like this:
#dfm_sentiment_himym <- dfm(toks_himym) %>% dfm_lookup(dictionary = l_sentiment_dictionary)


## 14.04.- Load a file --------------------------------------
#It is a dfm object built from the tokens of all seasons of HIMYM

load(file = "data/dfm_sentiment_himym.Rdata")

#Rename doc_id with the title of every episode
docnames(dfm_sentiment_himym) <- df_himym_final_doc$Title


## 14.05.- Wrangle dataframe --------------------------------------

#Format in long to plot positive and negative words
df_sentiment_himym <- convert(dfm_sentiment_himym, "data.frame") %>% 
  gather(positive.word, negative.word, key = "Polarity", value = "Words") %>% 
  rename(Title = doc_id) %>% 
  mutate(Title = as_factor(Title)) %>% 
  left_join(df_title_s_chp, by ="Title") %>%
  mutate(Polarity = str_replace_all(Polarity, #Relabel the polarity categories for the plot
                                    c("negative.word" = "Negative words",
                                      "positive.word" = "Positive words")))

## 14.06.- Plot total of positive and negative words per season and episode -----

ggplot3 <- ggplot(df_sentiment_himym, aes(x = Chapter, y = Words, fill = Polarity, group = Polarity)) + 
  geom_bar(stat = 'identity', position = position_dodge(), size = 1) + 
  facet_wrap(~ Season_w)+
  scale_fill_manual(values = c("#c6006f", "#004383")) + 
  scale_y_continuous(breaks = seq(0, 250, by = 50))+
  theme_minimal(base_size = 14) +
  labs(x = "Episodes",
       y = "Frequency of words",
       title = "Total of positive and negative words per season",
       caption="Description: This plot shows the total number of positive and negative words \n per season and episode")+
  theme(panel.grid.major = element_line(colour="#cfe7f3"),
        panel.grid.minor = element_line(colour="#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        #axis.text.x=element_text(size=15),
        #axis.text.y=element_text(size=15),
        plot.caption = element_text(size = 12, hjust = .1, color = "#939393"),
        legend.position = "bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 10,  # Bottom margin
                             l = 10))

ggdraw(ggplot3) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)


```
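
The dictionary lookup behind these counts can be sketched on a toy example. The word lists and texts below are hypothetical stand-ins for the Bing lexicon and the episode summaries. The sketch also assumes plain character vectors as dictionary values, so the resulting features are named `positive` and `negative` rather than `positive.word` and `negative.word` as in the saved dfm.

```{r, eval=F}
library(quanteda)

# Hypothetical mini lexicon (the tutorial uses the Bing lexicon instead)
toy_dict <- dictionary(list(positive = c("love", "awesome", "legendary"),
                            negative = c("hate", "sad")))

# Toy texts (hypothetical), one "episode" per element
toy_texts <- c(ep1 = "I love this awesome legendary night",
               ep2 = "I hate this sad sad day but I love you")

# Build a document-feature matrix, then collapse features into dictionary keys
toy_dfm <- dfm(tokens(toy_texts))
toy_sentiment <- dfm_lookup(toy_dfm, dictionary = toy_dict)

# Counts of positive and negative words per document
toy_counts <- as.matrix(toy_sentiment)
toy_counts
```

Each cell counts how many tokens of a document match one of the words listed under that dictionary key.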

**Step 10.2** Weight the feature frequencies in a dfm

```{r, eval=F}

## 14.07.- Weight the feature frequencies in a dfm -----------------------------

#dfm_weight()

#Same as the previous step, but the counts are converted to proportions within each episode, so that episodes can be compared fairly


dfm_sentiment_himym_prop <- dfm_weight(dfm_sentiment_himym, scheme = "prop")
dfm_sentiment_himym_prop

### 14.07.01- Wrangle dfm weight dataframe--------------------------------------

df_sentiment_himym_prop <- convert(dfm_sentiment_himym_prop, "data.frame") %>% 
  gather(positive.word, negative.word, key = "Polarity", value = "Words") %>% 
  rename(Title = doc_id) %>% 
  mutate(Title = as_factor(Title)) %>% 
  left_join(df_title_s_chp, by = "Title") %>%
  mutate(Polarity = str_replace_all(Polarity, #Relabel the polarity categories for the plot
                                    c("negative.word" = "Negative words",
                                      "positive.word" = "Positive words")))

### 14.07.02.- Plot total of positive and negative words per season and episode -----

#Same plot as in the previous step, but using the proportions instead of the raw counts

ggplot4 <- ggplot(df_sentiment_himym_prop, aes(x = Chapter, y = Words, fill = Polarity, group = Polarity)) + 
  geom_bar(stat = 'identity', position = position_dodge(), size = 1) + 
  facet_wrap(~ Season_w) +
  scale_fill_manual(values = c("#c6006f", "#004383")) + 
  scale_y_continuous(breaks = seq(0, .8, by = .2))+
  theme_minimal(base_size = 14) +
  labs(x = "Episodes",
       y = "Proportion of words",
       title = "Weighted positive and negative words per season",
       caption = "Description: This plot shows the weighted share of positive and negative words \n per season and episode")+
  theme(panel.grid.major = element_line(colour = "#cfe7f3"),
        panel.grid.minor = element_line(colour = "#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        #axis.text.x=element_text(size=15),
        #axis.text.y=element_text(size=15),
        plot.caption = element_text(size = 12, hjust = .1, color = "#939393"),
        legend.position = "bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 10,  # Bottom margin
                             l = 10))

ggdraw(ggplot4) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)

```
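
Why the weighting matters can be seen with two made-up episodes of very different length. The counts and the names `toy_counts`, `toy_dfm` and `toy_prop` are hypothetical; `as.dfm()` and `dfm_weight()` are quanteda functions.

```{r, eval=F}
library(quanteda)

# Hypothetical counts: ep1 is ten times longer than ep2,
# but both have the same mix of positive and negative words
toy_counts <- matrix(c(30, 3, 10, 1), nrow = 2,
                     dimnames = list(c("ep1", "ep2"),
                                     c("positive", "negative")))

toy_dfm <- as.dfm(toy_counts)

# scheme = "prop" divides every count by the total of its document (row)
toy_prop <- dfm_weight(toy_dfm, scheme = "prop")

# Both episodes now show 0.75 positive and 0.25 negative
as.matrix(toy_prop)
```

The raw counts would make `ep1` look far more emotional than `ep2`, while the proportions show that the balance is identical.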


**Step 10.3** Compute a positivity measure from the weighted dfm
```{r, eval=F}

## 14.08.- Wrangle dfm weight dataframe with measures---------------------------

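#Measure based on: Will Lowe, Kenneth Benoit, Slava Mikhaylov and Michael Laver,
#"Scaling Policy Preferences from Coded Political Texts"
#Balance between positive and negative words on a log scale

#Here we compute log((positive + 0.5) / (negative + 0.5)); values above 0 mean more positive than negative words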
df_sentiment_himym_prop_measure <- convert(dfm_sentiment_himym_prop, "data.frame") %>% 
  rename(Sentiment = positive.word)  %>% rename(Title = doc_id) %>% 
  left_join(df_title_s_chp, by = "Title")  %>%
  mutate(measure = log((Sentiment + 0.5)/(negative.word + .5))) %>%
  select(-Season) %>% 
  rename(Season = Season_w)


## 14.09.- Plot measure of positivity across seasons----------------------------


ggplot5 <- ggplot(df_sentiment_himym_prop_measure, aes(x = No.overall, y = measure, 
                                            color = Season, group = Season)) +
  scale_color_manual(values = brewer.pal(n = 9, name = "Set1"))+
  geom_line(size = 1.5) +
  geom_point(size = 3.2) + 
  scale_x_continuous(breaks = seq(0, 208, by = 20))+
  theme_minimal(base_size = 14) +
  labs(x = "Episode number",
       y = "Log ratio of positive to negative words",
       title = "Measure of positivity across episodes",
       caption="Description: This plot shows the positivity measure of every episode")+
  theme(panel.grid.major = element_line(colour = "#cfe7f3"),
        panel.grid.minor = element_line(colour = "#cfe7f3"),
        plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
        plot.caption = element_text(size=12, hjust = .1, color = "#939393"),
        legend.position = "bottom",
        plot.margin = margin(t = 20,  # Top margin
                             r = 50,  # Right margin
                             b = 40,  # Bottom margin
                             l = 10), # Left margin
        text = element_text()) + 
  guides(colour = guide_legend(ncol = 3)) +
  geom_hline(yintercept = 0, linetype = "dashed", 
             color = "red", size = 1)


ggdraw(ggplot5) + draw_image(obj_img, x = .97, y = .97, 
                             hjust = 1.1, vjust = .7, 
                             width = 0.11, height = 0.1)

```
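
The log ratio used above is easy to check by hand. This is a small worked example with hypothetical proportions; the data frame `toy_prop` and its values are made up for illustration.

```{r, eval=F}
# Hypothetical shares of positive and negative words in three episodes
toy_prop <- data.frame(Title    = c("ep1", "ep2", "ep3"),
                       positive = c(0.75, 0.50, 0.20),
                       negative = c(0.25, 0.50, 0.80))

# Same formula as in the chunk above: adding 0.5 to both sides
# keeps the ratio defined when one of the shares is zero
toy_prop$measure <- log((toy_prop$positive + 0.5) /
                        (toy_prop$negative + 0.5))

# ep1 is above 0 (more positive), ep2 is exactly 0 (balanced),
# ep3 is below 0 (more negative)
toy_prop
```

This is why the plot draws a dashed line at zero: episodes above it lean positive and episodes below it lean negative.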

# Sources
