word_mimicry module
- features.word_mimicry.Content_mimicry_score(df, column_count_frequency, column_count_mimic)
Combine the steps to compute the content word mimicry score. Normalizes the frequency of words by how often they appear across the entire dataset.
- Parameters:
df (DataFrame) – The input dataframe.
column_count_frequency (str) – The column with content words to calculate frequency.
column_count_mimic (str) – The column with content word mimicry.
- Returns:
A series with content word accommodation scores.
- Return type:
Series
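A minimal sketch of how these steps might fit together, using plain Python lists in place of DataFrame columns. The helper names (`count_words`, `score_message`) are hypothetical, and the normalization shown (each mimicked word's count in the message divided by its count across the dataset) is an assumption about what "normalizes" means here, not a description of the module's actual implementation.

```python
from collections import Counter

def count_words(messages):
    """Count how often each content word appears across all messages."""
    counts = Counter()
    for words in messages:
        counts.update(words)
    return dict(counts)

def score_message(mimic_words, frequency):
    """Sum, over distinct mimicked words, (count in message / count in dataset).

    Assumed normalization: words that are rare in the dataset contribute more.
    """
    in_message = Counter(mimic_words)
    return sum(n / frequency[word] for word, n in in_message.items())

# Hypothetical data: content words per message, and the subset of those
# words that were mimicked from the previous turn.
content_words = [["budget", "report"], ["budget", "deadline"], ["deadline", "budget"]]
mimicked = [[], ["budget"], ["deadline", "budget"]]

frequency = count_words(content_words)
scores = [score_message(words, frequency) for words in mimicked]
```

Under this assumption, a message that mimics no content words scores 0, and a mimicked word that is common across the dataset adds less than one that is rare.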
- features.word_mimicry.Content_mimicry_score_per_conv(df, column_count_frequency, column_count_mimic, conversation_id)
Computes the content word mimicry score, but normalizes the term frequency of the words by how often they appear within a given conversation. This version of the score may be more useful in cases where different conversations in the dataset cover very different subject matter, and therefore one may not wish to normalize across the full dataset.
- Parameters:
df (DataFrame) – The input dataframe.
column_count_frequency (str) – The column with content words to calculate frequency.
column_count_mimic (str) – The column with content word mimicry.
conversation_id (str) – The column name that should be selected as the conversation ID.
- Returns:
A series with content word accommodation scores.
- Return type:
Series
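To illustrate why per-conversation normalization can differ from dataset-wide normalization, here is a small sketch that builds one frequency dictionary per conversation. It uses plain `(conversation_id, words)` pairs rather than a DataFrame, and the helper name `count_words_per_conversation` is hypothetical.

```python
from collections import Counter, defaultdict

def count_words_per_conversation(rows):
    """Map each conversation ID to its own word-frequency dictionary.

    `rows` is a list of (conversation_id, content_words) pairs.
    """
    per_conv = defaultdict(Counter)
    for conv_id, words in rows:
        per_conv[conv_id].update(words)
    return {conv_id: dict(counts) for conv_id, counts in per_conv.items()}

# Hypothetical data: two conversations on different topics.
rows = [
    ("conv_a", ["budget", "report"]),
    ("conv_a", ["budget"]),
    ("conv_b", ["protein", "budget"]),
]
frequencies = count_words_per_conversation(rows)
# "budget" appears twice in conv_a but once in conv_b, so a mimicked
# "budget" would be normalized differently in each conversation.
```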
- features.word_mimicry.computeTF(column_mimc, frequency_dict)
Compute the term frequency of each content mimic word, then sum them up.
- Parameters:
column_mimc (list) – Each entry under the content_word_mimicry column.
frequency_dict (dict) – A dictionary of content word frequency across the dataset.
- Returns:
The sum of term frequencies for the content mimic words.
- Return type:
float
- features.word_mimicry.compute_frequency(df, on_column)
Compute the frequency of each content word across the whole dataset.
- Parameters:
df (DataFrame) – The input dataframe.
on_column (str) – The column with which we calculate content word frequency.
- Returns:
A dictionary with content words as keys and their frequencies as values.
- Return type:
dict
- features.word_mimicry.compute_frequency_per_conv(df, on_column)
Compute the frequency of each content word within a single conversation, rather than across the whole dataset.
- Parameters:
df (DataFrame) – The input dataframe.
on_column (str) – The column with which we calculate content word frequency.
- Returns:
A dictionary with content words as keys and their frequencies as values.
- Return type:
dict
- features.word_mimicry.function_mimicry_score(function_mimic_words)
Compute the function word mimicry score by counting the mimicked function words, i.e. taking len() of the list.
- Parameters:
function_mimic_words (list) – Each entry under the function_word_mimicry column.
- Returns:
The number of function mimic words.
- Return type:
int
- features.word_mimicry.get_content_words_in_message(text, function_word_reference)
Extract the non-function words in a given message.
- Parameters:
text (str) – The input text to be analyzed.
function_word_reference (list) – A list of function words to reference against.
- Returns:
A list of content words found in the input text.
- Return type:
list
- features.word_mimicry.get_function_words_in_message(text, function_word_reference)
Extract the function words in a given message.
- Parameters:
text (str) – The input text to be analyzed.
function_word_reference (list) – A list of function words to reference against.
- Returns:
A list of function words found in the input text.
- Return type:
list
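The two extraction helpers above partition a message's words by membership in the function word list. A rough sketch of that idea follows. It assumes lowercase, whitespace-separated tokens with punctuation already removed, and the name `split_words` is hypothetical. The module itself may tokenize or match words differently.

```python
def split_words(text, function_word_reference):
    """Return (function_words, content_words) for a whitespace-tokenized message."""
    reference = set(function_word_reference)  # set gives fast membership checks
    function_words, content_words = [], []
    for word in text.lower().split():
        if word in reference:
            function_words.append(word)
        else:
            content_words.append(word)
    return function_words, content_words

function_words, content_words = split_words(
    "I think the plan is good", ["i", "the", "is"]
)
```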
- features.word_mimicry.get_mimicry_bert(chat_data, vect_data, conversation_id)
Uses SBERT vectors to get the cosine similarity between each message and the previous message.
- Parameters:
chat_data (DataFrame) – The input chat dataframe.
vect_data (DataFrame) – The dataframe containing SBERT vectors.
conversation_id (str) – The column name that should be selected as the conversation ID.
- Returns:
A list of cosine similarity scores between each message and the previous message.
- Return type:
list
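As an illustration of the cosine-similarity step, the sketch below compares each message vector with the one before it in the same conversation. It uses plain lists instead of DataFrames and short toy vectors instead of real SBERT embeddings. Assigning 0.0 to the first message of a conversation is an assumption, and `similarity_to_previous` is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid dividing by zero for all-zero vectors
    return dot / (norm_a * norm_b)

def similarity_to_previous(conversation_ids, vectors):
    """Score each message against the previous message in its conversation.

    Rows are assumed to be in chronological order, grouped by conversation.
    """
    scores = []
    for i, vec in enumerate(vectors):
        if i == 0 or conversation_ids[i] != conversation_ids[i - 1]:
            scores.append(0.0)  # assumed default: no previous message to compare
        else:
            scores.append(cosine_similarity(vec, vectors[i - 1]))
    return scores

# Toy 2-dimensional vectors standing in for SBERT embeddings.
conversation_ids = ["a", "a", "a", "b", "b"]
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
scores = similarity_to_previous(conversation_ids, vectors)
```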
- features.word_mimicry.get_moving_mimicry(chat_data, vect_data, conversation_id)
Calculate the moving average of mimicry scores using SBERT vectors.
- Parameters:
chat_data (DataFrame) – The input chat dataframe.
vect_data (DataFrame) – The dataframe containing SBERT vectors.
conversation_id (str) – The column name that should be selected as the conversation ID.
- Returns:
A list of moving average mimicry scores for each message in the conversation.
- Return type:
list
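The exact form of the moving average is not specified here. One possible reading is a running mean of the per-message similarity scores that resets at the start of each conversation. The sketch below shows that interpretation only. `running_average` is a hypothetical name, and the input scores are made up.

```python
def running_average(conversation_ids, scores):
    """Running mean of scores within each conversation (assumed interpretation)."""
    averages = []
    total, count = 0.0, 0
    for i, score in enumerate(scores):
        # Reset the accumulator whenever a new conversation begins.
        if i == 0 or conversation_ids[i] != conversation_ids[i - 1]:
            total, count = 0.0, 0
        total += score
        count += 1
        averages.append(total / count)
    return averages

conversation_ids = ["a", "a", "a", "b"]
scores = [0.0, 1.0, 0.5, 0.25]
averages = running_average(conversation_ids, scores)
```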
- features.word_mimicry.mimic_words(df, on_column, conversation_id)
Return, for each message, the list of words that were also used in the other speaker’s previous turn.
- Parameters:
df (DataFrame) – The dataset with all punctuation removed.
on_column (str) – The column that we want to find mimicry on.
conversation_id (str) – The column name that should be selected as the conversation ID.
- Returns:
A list of lists, where each sublist contains words mimicked from the previous turn.
- Return type:
list
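A small sketch of the idea, assuming each row is one turn, rows are in chronological order, and consecutive rows within a conversation come from different speakers. It works on plain lists of words rather than a DataFrame, and `words_mimicked_from_previous` is a hypothetical name.

```python
def words_mimicked_from_previous(conversation_ids, word_lists):
    """For each turn, keep the words that also occur in the preceding turn."""
    result = []
    for i, words in enumerate(word_lists):
        if i == 0 or conversation_ids[i] != conversation_ids[i - 1]:
            result.append([])  # first turn of a conversation has nothing to mimic
        else:
            previous = set(word_lists[i - 1])
            result.append([w for w in words if w in previous])
    return result

conversation_ids = ["a", "a", "b", "b"]
word_lists = [
    ["budget", "report"],
    ["budget", "deadline"],
    ["protein"],
    ["protein", "protein"],
]
mimicked = words_mimicked_from_previous(conversation_ids, word_lists)
```

Repeated words are kept in this sketch, so a word mimicked twice in one turn appears twice in that turn's sublist.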