
Semdistflow is designed to read in one or more language transcripts, clean and format them, and then output a vector of semantic distances corresponding to every running bigram in the language sample. The workflow has three steps:

  1. readme() reads the txt file into R, appends a document id based on its filename and formats the text as a dataframe.
  2. cleanme() uses many regular expressions to clean and format the text. These steps include expanding contractions, converting to lowercase, omitting numbers, omitting stopwords, etc.
  3. distme() computes two metrics of semantic distance for each running pair of words in the language sample you just cleaned and outputs them alongside each word pair. A minimal end-to-end sketch of all three steps appears right after this list.
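Before walking through each step, here is a minimal end-to-end sketch of the pipeline, assuming your txt files sit in a folder called 'my_texts' (the folder and object names are illustrative):

mytexts      <- readme("./my_texts")   # read every txt file into a doc_id / doc_text dataframe
mycleantexts <- cleanme(mytexts)       # clean and format the raw text, adding a doc_clean column
mydistances  <- distme(mycleantexts)   # semantic distances for every running word pair (lemmatized by default)

Each step is unpacked below.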

Step 1: Prep your texts

Semdistflow works on language transcripts formatted as txt files. The first function you will use, readme(), appends the filename of each txt file as a document id (doc_id) to the dataframe it reads into R. This allows users to process many distinct documents at once while maintaining the independence of each language transcript.

We recommend being systematic and deliberate about your processing pipeline and file-naming conventions, because language transcripts can easily proliferate. A sensible approach is to create a folder such as 'my_texts' in the project directory where you plan to run your text analysis, and to store the files awaiting processing in that folder.
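If you prefer to set this up from R, something like the following works (the folder name is only an example):

# create a 'my_texts' folder in the current project directory, if it does not already exist
dir.create("my_texts", showWarnings = FALSE)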

Remember that readme() will look for any txt file in whatever folder name you specify, so once you are finished processing a txt file, move it out of 'my_texts' and into another folder of your choice (e.g., 'texts_processed').

You have several options when reading text files into R using our readme() function. If you run readme() with no argument, it will look for txt files in its default folder, 'my_texts' (see the function code below). The more systematic approach is to put your target files into a specific subfolder you create (e.g., my_texts) and pass that folder to the function:

readme("./my_texts")

Step 2: readme() reads in your texts

mytexts <- readme()  # all txt files in the default 'my_texts' folder
mytexts <- readme("./my_texts")  # all txt files in the 'my_texts' folder of your project directory

If readme() did its job, the resulting dataframe will have two columns, "doc_id" and "doc_text". doc_id is a distinct identifier corresponding to the filename of each text file you read into R. doc_text contains the raw (uncleaned and untouched) text from each file. The output should have a structure like this:
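As an illustration, here is a small dataframe with the same structure (the filenames and text are made up, not real output):

# a toy dataframe mirroring what readme() returns: one row per txt file
data.frame(
  doc_id   = c("interview_01.txt", "interview_02.txt"),
  doc_text = c("the dogs and cats are friendly",
               "another raw transcript exactly as it appeared in the file")
)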


As this illustrates, readme() reads in each text file, appends a doc_id, and stores the raw text under the column name doc_text. Getting data into R is often half the battle, especially when working with string data. It is not crucial that you use the readme() function: you can use semdistflow to clean and compute distances on any character vector, as long as it is nested within a dataframe with the column names "doc_id" and "doc_text". Any other document identifiers will be retained but ignored in subsequent steps.
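If your transcripts are already in R as a character vector, you can skip readme() entirely and build that two-column structure yourself (the object names here are illustrative):

# wrap an existing character vector in the doc_id / doc_text structure semdistflow expects
my_strings <- c("first transcript text here", "second transcript text here")
mytexts <- data.frame(doc_id = c("talk_01", "talk_02"), doc_text = my_strings)
mycleantexts <- cleanme(mytexts)  # ready for Step 3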

For the intrepid, here's the code that operates under the hood of readme():

readme <- function(folder_name = "my_texts"){
  # list every file ending in .txt in the target folder (searching subfolders too)
  file_list <- list.files(path = folder_name, pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
  # read each file and collapse its lines into a single string; warn = FALSE suppresses
  # the warning readLines throws when the last line has no trailing newline
  textdata <- lapply(file_list, function(x) {
    paste(readLines(x, warn = FALSE), collapse = " ")
  })
  # name each element of the list after the file it came from
  data.table::setattr(textdata, "names", file_list)
  # convert to a dataframe: the names attribute becomes doc_id, the text becomes doc_text
  df1 <- data.frame(doc_id = rep(names(textdata), lengths(textdata)),
                    doc_text = unlist(textdata),
                    row.names = NULL)
  return(df1)
}

Step 3: cleanme() cleans and formats texts

Now you are cooking. Your text is in a dataframe with an identifier, and you are ready to clean the data. The cleanme() step does lots of substitutions and transformations using regular expressions. We adapted a custom stopword list that removes function words (e.g., the, a) and many other 'empty' lexical forms. If you want to know exactly what is getting cleaned and swapped during this process, see our preprint on PsyArXiv.

mycleantexts <- cleanme(mytexts)  # mytexts is the dataframe prepped by readme()

Look what cleanme() did to our original doc_text. First you'll notice that the dataframe has a new column called doc_clean. If you look closely at doc_clean, you'll also notice some differences: there is no punctuation, and the stopwords 'and' and 'are' have been omitted. Magic!
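Using the toy example from Step 2 (made-up content, not real package output), the cleaned dataframe has roughly this shape:

# sketch of cleanme() output: the original columns plus a new doc_clean column,
# with punctuation and stopwords such as 'and' and 'are' stripped out
data.frame(
  doc_id    = "interview_01.txt",
  doc_text  = "the dogs and cats are friendly",
  doc_clean = "dogs cats friendly"
)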



Here's the annotated code showing the omissions, substitutions, and all other steps under the hood of the cleanme() function. Some of these code snippets rely on package dependencies such as stringi or textclean; most are custom regex.

cleanme <- function(x) {
  # load the custom omission (stopword) and replacement lists that ship with the package
  load(here::here("data", "omissions_2023.rda"))
  load(here::here("data", "replacements_2023.rda"))
  omissions <- omissions_2023
  replacements <- replacements_2023
  message("Performing text cleaning.... sit tight!")
  y <- x                                   # keep the original dataframe to bind back onto later
  x <- x$doc_text                          # work on the raw text column
  x <- tolower(x)                          # to lower
  x <- gsub("\"", " ", x)                  # strip double quotes
  x <- gsub("\n", " ", x)                  # strip newlines
  x <- textclean::replace_date(x)          # spell out dates
  x <- stringi::stri_replace_all_regex(x, paste0("^", replacements$target, "$"),
                                       replacements$replacement, vectorize_all = FALSE)
  x <- gsub("`", "'", x)                   # replace tick marks with apostrophes for contractions
  x <- gsub("(\\d)(\\.)", "", x)           # remove digit-period sequences (e.g., '5.')
  x <- textclean::replace_contraction(x)   # expand contractions
  x <- gsub("([[:alpha:]])([[:punct:]])", "\\1 \\2", x) # add a space between a letter and punctuation
  x <- gsub("-", " ", x)                   # replace all hyphens with spaces
  x <- tm::removeWords(x, omissions$target) # remove custom stopwords
  x <- gsub("\\d+(st|nd|rd|th)", " ", x)   # omit ordinal numbers such as 6th, 23rd
  x <- gsub("[^a-zA-Z;.,]", " ", x)        # omit numbers and most punctuation; keep letters, comma, semicolon, period
  x <- gsub("\\b[a-z]\\b", " ", x)         # omit any singleton alphabetic character
  x <- gsub("\\;", "\\.", x)               # replace semicolons with periods
  x <- gsub("\\s{2,}", " ", x)             # collapse two or more spaces into one
  x <- vapply(strsplit(x, " "), paste, character(1), collapse = " ") # re-normalize spacing within each document
  x <- gsub("person person", "person", x)  # de-duplicate replacement tokens
  x <- gsub("people people", "people", x)
  x <- gsub("(.*)(, and* )(\\w{1,}\\s\\w{1,})", "\\1. \\3", x)    # split clauses at conjunctions into sentences
  x <- gsub("(.*)(, but* )(\\w{1,}\\s\\w{1,})", "\\1. \\3", x)
  x <- gsub("(.*)(,* because* )(\\w{1,}\\s\\w{1,})", "\\1. \\3", x)
  x <- gsub("(.*)(,* then* )(\\w{1,}\\s\\w{1,})", "\\1. \\3", x)
  x <- gsub(",", "", x)                    # remove commas
  x <- gsub("([[:punct:]])", "", x)        # remove any remaining punctuation
  doc_clean <- tm::stripWhitespace(x)
  cleandoc <- cbind(y, doc_clean)          # bind the cleaned text onto the original dataframe
  return(cleandoc)
}

Step 4: distme() computes semantic distances

distme() works some serious magic. It takes your cleaned text, lemmatizes it by default, and transforms it into a running vector of content words. To each running pair of content words it appends two metrics of semantic distance.

mydistances <- distme(mycleantexts)

distme() takes two arguments. The first is the input dataframe. The second, lemmatize, controls whether the text is lemmatized before the distance norms are applied. By default lemmatize = TRUE, which converts every word to its dictionary form; this helps with term aggregation so there are as few missing observations from morphological derivatives as possible. If you want distme() to compute distances on exactly the words you feed it, run distme(dat, lemmatize = FALSE).
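For example (the object names are placeholders):

mydistances     <- distme(mycleantexts)                     # lemmatize defaults to TRUE
mydistances_raw <- distme(mycleantexts, lemmatize = FALSE)  # use the words exactly as supplied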

The output is a tibble. Let's break it down by column name. As the function code below shows, the output keeps doc_id, doc_text, and the tokenized word column, adds lemma1 and lemma2 (each word and the word that follows it), and appends four distance columns: Sd15_Cosine and Glo_Cosine hold the cosine between the two words' vectors in the SemDist15 and Glowca lookup databases, while Sd15_CosRev0Score and Glowca_CosRev0Score are the same values reverse-scored as 1 minus the cosine, so larger values indicate greater semantic distance.

The function for distme() is a bit complex. It computes the cosine between the semantic vectors of every running pair of words in the language sample you feed it.
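Stripped of the bookkeeping, the core computation is a cosine between adjacent rows of a word-by-dimension matrix, reverse-scored so that larger values mean greater distance. Here is a minimal sketch with made-up vectors (not the real SemDist15 or Glowca values):

# toy semantic vectors for three consecutive words (values are invented)
mat <- rbind(dog  = c(0.2, 0.7, 0.1),
             cat  = c(0.3, 0.6, 0.2),
             jazz = c(0.9, 0.1, 0.8))

# cosine similarity between each running pair of rows, as distme() does
sims <- sapply(2:nrow(mat), function(i) lsa::cosine(mat[i - 1, ], mat[i, ]))

# reverse-score: 1 - cosine, so 0 means identical vectors and larger means more distant
1 - sims

And here is the full function: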

distme <- function(targetdf, lemmatize=TRUE){
  message("Loading lookup databases and joining your data to SemDist15 and Glowca")
  #load lookup databases
  glowca_v1 <- readRDS(here::here("data", "glowca_vol1_2023.rda")) #rounded 5, subtlex matched 60k
  glowca_v2 <- readRDS(here::here("data", "glowca_vol2_2023.rda"))
  glowca <- rbind(glowca_v1, glowca_v2)
  sd15 <- readRDS(here::here("data", "semdist15_2023.rda"))
  if (lemmatize == TRUE) {
    #groups by factor variables and unlists the string, one word per row
    dat <- targetdf %>% group_by(doc_id, doc_text) %>% tidytext::unnest_tokens(word, doc_clean)
    #lemmatizes target dataframe on column labeled 'lemma1'
    dat2 <- dat %>% mutate(lemma1 = textstem::lemmatize_words(word))
    #join semdist15 and glowca lookup databases with target input dataframe
    joindf_semdist15 <- left_join(dat2, sd15, by=c("lemma1"="word")) %>% data.frame()
    joindf_glowca <- left_join(dat2, glowca, by=c("lemma1"="word")) %>% data.frame()
    #Select numeric columns for cosine calculations, eliminate columns with string data
    dat_sd15 <- joindf_semdist15 %>% select_if(is.numeric)
    datglo <- joindf_glowca %>% select_if(is.numeric)
    #convert join dataframes containing hyperparameter values (glowca and sd15) to matrices
    mat_sd15 <- as.matrix(dat_sd15)
    mat_glo <- as.matrix(datglo)
    #compute cosine distance for each running pair of words in sd15 and glowca
    vals_sd15 <- unlist(lapply(2:nrow(mat_sd15), function(i){
      lsa::cosine(mat_sd15[i-1,],  mat_sd15[i,])
    }))
    vals_glo <- unlist(lapply(2:nrow(mat_glo), function(i){
      lsa::cosine(mat_glo[i-1,], mat_glo[i,])
    }))
    #Convert matrices back to dataframes
    vals_sd15 <- data.frame(vals_sd15)
    vals_glo <- data.frame(vals_glo)
    #Rename first column of cosine distance values
    names(vals_sd15)[1] <- "Sd15_Cosine"
    names(vals_glo)[1] <- "Glo_Cosine"
    #Add NA to final row to match length, there is no pairwise distance for final observation
    vals_sd15[nrow(vals_sd15)+1, ] <- NA
    vals_glo[nrow(vals_glo)+1, ] <- NA
    #Reverse scale the cosine values for gloca and sd15, subtract each obs from max (1)
    sd15vals <- vals_sd15 %>% mutate(Sd15_CosRev0Score= 1-Sd15_Cosine)
    glowcavals <- vals_glo %>% mutate(Glowca_CosRev0Score = 1-Glo_Cosine)
    #Rebuild the dataframe creating bigram columns, cbind words to their cosine values
    dat3 <- dat2 %>% ungroup() %>% mutate(lemma2 = lead(lemma1, 1)) #rebuild dataframe, cast lead (+1) bigram
    #Bind the sd15 cosine data to the original dataframe
    mydists_lemmatized <- cbind(dat3, sd15vals, glowcavals)
    #Make last word of each text NA so we don't include cosine calculations between texts
    mydists_lemmatized %>%
      group_by(doc_id) %>%
      mutate(lemma2 = replace(lemma2, n(), NA)) %>%
      mutate(Sd15_Cosine = replace(Sd15_Cosine, n(), NA)) %>%
      mutate(Sd15_CosRev0Score = replace(Sd15_CosRev0Score, n(), NA)) %>%
      mutate(Glo_Cosine = replace(Glo_Cosine, n(), NA)) %>%
      mutate(Glowca_CosRev0Score = replace(Glowca_CosRev0Score, n(), NA)) %>%
      ungroup -> mydists_lemmatized_NAatend
    #output a formatted dataframe
    return(as_tibble(mydists_lemmatized_NAatend))
  }

  if (lemmatize == FALSE) {
    message("Loading lookup databases and joining your data to SemDist15 and Glowca")

    #groups by factor variables and unlists the string, one word per row
    dat <- targetdf %>% group_by(doc_id, doc_text) %>% tidytext::unnest_tokens(word, doc_clean)
    #join semdist15 and glowca lookup databases with target input dataframe (no lemmatization)
    joindf_semdist15 <- left_join(dat, sd15, by=c("word"="word")) %>% data.frame()
    joindf_glowca <- left_join(dat, glowca, by=c("word"="word")) %>% data.frame()
    #Select numeric columns for cosine calculations, eliminate columns with string data
    dat_sd15 <- joindf_semdist15 %>% select_if(is.numeric)
    datglo <- joindf_glowca %>% select_if(is.numeric)
    message("Computing Distances.... Be patient!!!")
    #convert join dataframes containing hyperparameter values (glowca and sd15) to matrices
    mat_sd15 <- as.matrix(dat_sd15)
    mat_glo <- as.matrix(datglo)
    #compute cosine distance for each running pair of words in sd15 and glowca
    vals_sd15 <- unlist(lapply(2:nrow(mat_sd15), function(i){
      lsa::cosine(mat_sd15[i-1,],  mat_sd15[i,])
    }))
    vals_glo <- unlist(lapply(2:nrow(mat_glo), function(i){
      lsa::cosine(mat_glo[i-1,], mat_glo[i,])
    }))
    #Convert matrices back to dataframes
    message("So far so good... Now building output dataframe")
    vals_sd15 <- data.frame(vals_sd15)
    vals_glo <- data.frame(vals_glo)
    #Rename first column of cosine distance values
    names(vals_sd15)[1] <- "Sd15_Cosine"
    names(vals_glo)[1] <- "Glo_Cosine"
    #Add NA to final row to match length, there is no pairwise distance for final observation
    vals_sd15[nrow(vals_sd15)+1, ] <- NA
    vals_glo[nrow(vals_glo)+1, ] <- NA
    #Reverse scale the cosine values for gloca and sd15, subtract each obs from max (1)
    sd15vals <- vals_sd15 %>% mutate(Sd15_CosRev0Score= 1-Sd15_Cosine)
    glowcavals <- vals_glo %>% mutate(Glowca_CosRev0Score = 1-Glo_Cosine)
    #Rebuild the dataframe creating bigram columns, cbind words to their cosine values
    dat3 <- dat %>% ungroup() %>% mutate(word2 = lead(word, 1)) #rebuild dataframe, cast lead (+1) bigram
    #Bind the sd15 cosine data to the original dataframe
    mydists_unlemmatized <- cbind(dat3, sd15vals, glowcavals)
    mydists_unlemmatized %>%
      group_by(doc_id) %>%
      mutate(word2 = replace(word2, n(), NA)) %>%
      mutate(Sd15_Cosine = replace(Sd15_Cosine, n(), NA)) %>%
      mutate(Sd15_CosRev0Score = replace(Sd15_CosRev0Score, n(), NA)) %>%
      mutate(Glo_Cosine = replace(Glo_Cosine, n(), NA)) %>%
      mutate(Glowca_CosRev0Score = replace(Glowca_CosRev0Score, n(), NA)) %>%
      ungroup -> mydists_unlemmatized_NAatend
    #output a formatted dataframe
    return(as_tibble(mydists_unlemmatized_NAatend))
  }
}