Cleans and formats language transcripts from the read stage. Removes non-alphabetic characters and stopwords. Transcripts are lemmatized when lemmatize = TRUE. Vectorizes each utterance and reports the total word count and mean word length by interlocutor in each dyad, along with the number of words in each turn.
Usage
clean_dyads(read_ts_df, lemmatize = TRUE, stop_words_df = "default")
Arguments
- read_ts_df
data frame produced by the read_dyads() function
- lemmatize
logical, should words be lemmatized (reduced to their base morphological form); TRUE in the usage shown above
- stop_words_df
defaults to the built-in list of stopwords; otherwise supply a file path to a csv file with a column of stopwords titled 'Word'
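As a minimal sketch of the stop_words_df argument, the code below builds a custom stopword file in the shape described above: a csv with a single column titled 'Word'. The stopwords themselves and the object names custom_stops and stop_path are illustrative assumptions, not part of the package. The clean_dyads() call is left commented out because it needs the package and the output of read_dyads().

```r
# Hypothetical custom stopword list; these words are illustrative only.
custom_stops <- data.frame(
  Word = c("um", "uh", "like"),
  stringsAsFactors = FALSE
)

# Write the list to a temporary csv file with the single column 'Word'.
stop_path <- tempfile(fileext = ".csv")
write.csv(custom_stops, stop_path, row.names = FALSE)

# Read it back to confirm the column name matches what the argument expects.
check <- read.csv(stop_path, stringsAsFactors = FALSE)

# Not run: requires the package and a data frame from read_dyads().
# cleaned <- clean_dyads(read_ts_df, lemmatize = TRUE, stop_words_df = stop_path)
```

Passing a file path rather than an in-memory object follows the argument description; whether other input types are accepted is not stated here.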
Value
data frame with cleaned text data, formatted with one word per row
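To illustrate the one-word-per-row format, here is a base R sketch of the cleaning steps described at the top of this page: stripping non-alphabetic characters, dropping stopwords, and reshaping to one word per row. It is not the package's implementation. The column names (speaker, turn, word), the sample text, and the helper clean_one() are assumptions made for the example, and lemmatization is omitted.

```r
# Toy transcript; column names and text are hypothetical.
raw <- data.frame(
  speaker = c("A", "B"),
  text = c("Well, I went to the store!", "Did you buy 2 apples?"),
  stringsAsFactors = FALSE
)

# Small illustrative stopword list (not the package's built-in list).
stops <- c("i", "to", "the", "did", "you", "well")

# Lowercase, replace non-alphabetic characters with spaces, split into
# words, and drop empty strings and stopwords.
clean_one <- function(utterance, stops) {
  lowered <- tolower(utterance)
  letters_only <- gsub("[^a-z ]", " ", lowered)
  words <- strsplit(letters_only, " +")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% stops]
}

word_lists <- lapply(raw$text, clean_one, stops = stops)

# Reshape to one word per row, keeping speaker and turn number.
long <- data.frame(
  speaker = rep(raw$speaker, lengths(word_lists)),
  turn = rep(seq_along(word_lists), lengths(word_lists)),
  word = unlist(word_lists),
  stringsAsFactors = FALSE
)

# Per-speaker summaries analogous to those the function reports.
word_count <- table(long$speaker)
mean_length <- tapply(nchar(long$word), long$speaker, mean)
```

In this sketch each turn contributes as many rows as it has surviving words, so word counts per turn or per speaker can be recovered by counting rows.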