calculate_chat_level_features module
- class utils.calculate_chat_level_features.ChatLevelFeaturesCalculator(chat_data: DataFrame, vect_data: DataFrame, bert_sentiment_data: DataFrame, ner_training: DataFrame, ner_cutoff: int, conversation_id_col: str, message_col: str, timestamp_col: str | tuple[str, str], timestamp_unit: str, custom_liwc_dictionary: dict)
Bases: object
Initialize variables and objects used by the ChatLevelFeaturesCalculator class.
This class uses various feature modules to define chat-level features. It reads input data and initializes variables required to compute the features.
- Parameters:
chat_data (pd.DataFrame) – Pandas dataframe of chat-level features read from the input dataset
vect_data (pd.DataFrame) – Pandas dataframe containing vector data
bert_sentiment_data (pd.DataFrame) – Pandas dataframe containing BERT sentiment data
ner_training (pd.DataFrame) – Pandas dataframe of training data for the named entity recognition feature
ner_cutoff (int) – Cutoff value for the prediction confidence of each named entity
conversation_id_col (str) – A string representing the column name that should be selected as the conversation ID. Defaults to “conversation_num”.
message_col (str) – A string representing the column name that should be selected as the message. Defaults to “message”.
timestamp_col (str | tuple[str, str]) – A string representing the column name that should be selected as the timestamp. Defaults to “timestamp”.
custom_liwc_dictionary (dict) – The user’s own LIWC dictionary. Defaults to an empty dictionary.
- calculate_chat_level_features(feature_methods: list) DataFrame
Main driver function for creating chat-level features.
This function computes various chat-level features using different modules and appends them as new columns to the input chat data.
- Returns:
The chat-level dataset with new columns for each chat-level feature
- Return type:
pd.DataFrame
- calculate_hedge_features() None
Calculate features related to expressing hesitation (or ‘hedge’).
This function identifies whether a message contains hedge words using a naive approach and appends this information as a new column to the chat data.
- Returns:
None
- Return type:
None
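The naive approach described above can be sketched as a simple lexicon lookup. This is a minimal illustration only: the word list `HEDGE_WORDS` and the helper `contains_hedge` are hypothetical, and the module's actual lexicon and column name are not shown in this reference.

```python
import re

# Hypothetical hedge lexicon for illustration; the module's real word list may differ.
HEDGE_WORDS = {"maybe", "perhaps", "possibly", "probably", "somewhat", "might", "seems"}


def contains_hedge(message: str) -> int:
    """Return 1 if any token of the message is in HEDGE_WORDS, otherwise 0."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return int(any(token in HEDGE_WORDS for token in tokens))


messages = ["Maybe we should wait.", "We ship on Friday."]
hedge_flags = [contains_hedge(m) for m in messages]
```

A lookup like this ignores context, which is why the approach is called naive: "might" counts as a hedge whether it is a modal verb or a noun.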
- calculate_politeness_sentiment() None
Calculate politeness strategies using the Politeness module from Convokit.
This function applies the Convokit politeness strategies to the chat messages and appends all outputted features to the chat data.
- Returns:
None
- Return type:
None
- calculate_politeness_v2() None
Calculate politeness features using the System for Encouraging Conversational Receptiveness (SECR).
Source: https://www.mikeyeomans.info/papers/receptiveness.pdf
This function applies the SECR module to the chat messages and appends the calculated politeness features to the chat data.
- Returns:
None
- Return type:
None
- calculate_textblob_sentiment() None
Calculate features related to sentiment using TextBlob.
This function calculates and appends two TextBlob sentiment features to the chat data: the subjectivity score and the polarity score.
- Returns:
None
- Return type:
None
- calculate_vector_word_mimicry() None
Compute the mimicry relative to the previous chat(s) using SBERT vectors.
- Returns:
None
- Return type:
None
- calculate_word_mimicry() None
Calculate features related to word mimicry.
This function calculates the number of function words and the sum of inverse frequency of content words that also appear in the other’s prior turn.
It extracts function and content words from each message, identifies the function and content words mimicked from the immediately preceding turn, and computes function word accommodation (the number of mimicked function words) and content word accommodation (the sum of inverse frequency of mimicked content words).
It then drops the intermediate columns related to function and content words.
- Returns:
None
- Return type:
None
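The two accommodation scores can be sketched as follows. This is an illustration under several assumptions: `FUNCTION_WORDS` is a hypothetical stand-in list, word frequency is taken as the raw count of a content word across the whole input, and consecutive messages are assumed to come from different speakers. The module's own definitions may differ.

```python
from collections import Counter

# Hypothetical function-word list; the module's real list is not shown here.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "we", "i", "you"}
PUNCTUATION = ".,!?;:\"'()"


def tokenize(message):
    """Lowercase the message and strip surrounding punctuation from each word."""
    stripped = (word.strip(PUNCTUATION).lower() for word in message.split())
    return [word for word in stripped if word]


def mimicry_features(messages):
    """Return one (function_accommodation, content_accommodation) pair per message."""
    tokenized = [tokenize(m) for m in messages]
    # Assumed frequency definition: count of each content word over all messages.
    frequency = Counter(w for tokens in tokenized for w in tokens if w not in FUNCTION_WORDS)
    rows = []
    previous = set()  # words used in the immediately preceding turn
    for tokens in tokenized:
        mimicked = [w for w in tokens if w in previous]
        function_acc = sum(1 for w in mimicked if w in FUNCTION_WORDS)
        content_acc = sum(1.0 / frequency[w] for w in mimicked if w not in FUNCTION_WORDS)
        rows.append((function_acc, content_acc))
        previous = set(tokens)
    return rows


rows = mimicry_features(["We need the budget report.", "The budget is late."])
```

In the second message, "the" is a mimicked function word and "budget" is a mimicked content word that occurs twice overall, so it contributes 1/2 to the content score. Weighting by inverse frequency means that echoing a rare word counts for more than echoing a common one.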
- concat_bert_features() None
Concatenate RoBERTa sentiment features to the chat data.
This function appends RoBERTa sentiment data (which are pre-processed beforehand to save computation) as new columns to the existing chat data.
- Returns:
None
- Return type:
None
- get_certainty_score() None
Calculate the certainty score of a statement.
This function uses the formula published in Rocklage et al. (2023) to calculate the certainty score of a chat message and appends it to the chat data.
Source: https://journals.sagepub.com/doi/pdf/10.1177/00222437221134802
- Returns:
None
- Return type:
None
- get_dale_chall_score_and_classfication() None
Calculate the readability of a text according to its Dale-Chall score.
This function calculates and appends two Dale-Chall readability features to the chat data: the Dale-Chall score and the Dale-Chall classification.
- Returns:
None
- Return type:
None
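For orientation, the standard (new) Dale-Chall formula combines the percentage of difficult words with the average sentence length. The sketch below applies that formula with a tiny hypothetical `EASY_WORDS` set standing in for the real list of roughly 3,000 familiar words. The sentence splitting is simplistic, and the module's own tokenization and classification thresholds are not shown in this reference.

```python
import re

# Hypothetical stand-in for the Dale-Chall list of familiar words.
EASY_WORDS = {"the", "cat", "sat", "on", "mat", "a", "dog", "ran", "to", "it", "was"}


def dale_chall_score(text: str) -> float:
    """Dale-Chall score: 0.1579 * (% difficult words) + 0.0496 * (words per sentence)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return 0.0
    pct_difficult = 100.0 * sum(w not in EASY_WORDS for w in words) / len(words)
    score = 0.1579 * pct_difficult + 0.0496 * (len(words) / len(sentences))
    # The formula adds a constant when more than 5% of the words are difficult.
    if pct_difficult > 5.0:
        score += 3.6365
    return score


easy = dale_chall_score("The cat sat on the mat.")
hard = dale_chall_score("The cat contemplated existential philosophy.")
```

Higher scores indicate harder text; a classification step would then map score ranges to difficulty labels.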
- get_forward_flow() None
Calculate the chat-level forward flow.
This function compares the current chat to the average of the previous chats and appends the forward flow score to the chat data.
- Returns:
None
- Return type:
None
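The comparison against the average of the previous chats can be sketched with plain Python lists standing in for message embeddings. The sketch assumes the comparison is cosine similarity with the running mean of earlier vectors in the same conversation, and that the first message gets 0.0. Whether the module reports a similarity or a distance, and how it treats the first message, are assumptions here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def forward_flow(vectors):
    """Compare each vector with the mean of all vectors that precede it."""
    scores = []
    running_sum = None
    for count, vector in enumerate(vectors):
        if count == 0:
            scores.append(0.0)  # assumed placeholder: nothing to compare against yet
            running_sum = list(vector)
            continue
        mean_previous = [total / count for total in running_sum]
        scores.append(cosine_similarity(vector, mean_previous))
        running_sum = [total + x for total, x in zip(running_sum, vector)]
    return scores


scores = forward_flow([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Keeping a running sum avoids recomputing the mean over all earlier messages at every step.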
- get_named_entity() None
Calculate the number of named entities in a chat.
- Returns:
None
- Return type:
None
- get_reddit_features() None
Calculate a suite of features common in online communication.
This function calculates and appends the following counts to the chat data: all-caps words, links, user references (Reddit format), emphases (bold, italics), bullet points, numbered points, line breaks, quotes, responses to someone else (using “>”), ellipses, parentheses, and emojis.
- Returns:
None
- Return type:
None
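Several of these counts reduce to pattern matching. The sketch below covers a subset with regular expressions; the patterns and the dictionary keys are illustrative assumptions, not the module's actual patterns or column names.

```python
import re


def reddit_features(message: str) -> dict:
    """Count a few surface features of a message (illustrative patterns only)."""
    return {
        # Words made of two or more capital letters
        "num_all_caps": len(re.findall(r"\b[A-Z]{2,}\b", message)),
        "num_links": len(re.findall(r"https?://\S+", message)),
        # Reddit-style user references such as u/alice
        "num_user_references": len(re.findall(r"(?<!\w)u/\w+", message)),
        "num_ellipses": len(re.findall(r"\.\.\.", message)),
        "num_line_breaks": message.count("\n"),
        # Lines that begin with ">" quote an earlier message
        "num_quote_responses": sum(
            line.lstrip().startswith(">") for line in message.split("\n")
        ),
    }


features = reddit_features("> old point\nSee https://example.com u/alice... THIS is BIG")
```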
- get_temporal_features() None
Calculate features relevant to the timestamps of each chat.
This function calculates and appends one temporal feature to the chat data: the time difference between messages sent.
It assumes the ‘timestamp’ column is available, which is checked in feature_builder.py.
- Returns:
None
- Return type:
None
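A time difference of this kind can be sketched as the gap between each message and the previous message in the same conversation. The helper `time_diffs` is hypothetical. It assumes rows arrive in chronological order and uses `None` for the first message of a conversation; how the module fills that first value is not stated in this reference.

```python
from datetime import datetime


def time_diffs(rows):
    """rows: (conversation_id, timestamp) pairs in chronological order.

    Returns the seconds elapsed since the previous message in the same
    conversation, or None for the first message of each conversation.
    """
    last_seen = {}
    diffs = []
    for conversation_id, timestamp in rows:
        previous = last_seen.get(conversation_id)
        if previous is None:
            diffs.append(None)
        else:
            diffs.append((timestamp - previous).total_seconds())
        last_seen[conversation_id] = timestamp
    return diffs


rows = [
    ("A", datetime(2024, 1, 1, 12, 0, 0)),
    ("A", datetime(2024, 1, 1, 12, 0, 30)),
    ("B", datetime(2024, 1, 1, 12, 5, 0)),
    ("A", datetime(2024, 1, 1, 12, 2, 0)),
]
diffs = time_diffs(rows)
```

Tracking the last timestamp per conversation keeps interleaved conversations from contaminating each other's gaps.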
- info_exchange() None
Extract different types of z-scores from the chats.
This function calculates and appends the following info exchange features to the chat data: the modified word count (total word count minus first-person singular pronouns), the z-score of the modified word count across all chats, and the z-score of the modified word count within each conversation.
It then drops the intermediate info_exchange_wordcount column as it is a linear combination of the z-score columns.
- Returns:
None
- Return type:
None
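The three quantities can be sketched in plain Python. The pronoun list `FIRST_PERSON_SINGULAR` is a hypothetical stand-in, and the sketch uses the population standard deviation; the module's own pronoun list and choice of standard deviation are assumptions here.

```python
from statistics import mean, pstdev

# Hypothetical pronoun list for illustration.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
PUNCTUATION = ".,!?;:\"'()"


def modified_word_count(message):
    """Total word count minus first-person singular pronouns."""
    words = [w.strip(PUNCTUATION).lower() for w in message.split()]
    return sum(1 for w in words if w and w not in FIRST_PERSON_SINGULAR)


def zscores(values):
    """Z-score of each value against the whole list (0.0 when there is no spread)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def zscores_within(conversation_ids, values):
    """Z-score of each value against the other values in its own conversation."""
    positions = {}
    for index, conversation_id in enumerate(conversation_ids):
        positions.setdefault(conversation_id, []).append(index)
    result = [0.0] * len(values)
    for indexes in positions.values():
        group_scores = zscores([values[i] for i in indexes])
        for index, score in zip(indexes, group_scores):
            result[index] = score
    return result


conversation_ids = ["A", "A", "B", "B"]
messages = ["I think we should go", "Yes", "My plan is simple and short", "Ok then"]
counts = [modified_word_count(m) for m in messages]
z_all = zscores(counts)
z_conv = zscores_within(conversation_ids, counts)
```

The two z-scores answer different questions: `z_all` says whether a message is long relative to the whole dataset, while `z_conv` says whether it is long relative to its own conversation.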
- lexical_features() None
Implement lexical features.
This driver function calls relevant functions to compute lexical features and appends them to the chat data.
- Returns:
None
- Return type:
None
- other_lexical_features() None
Extract various lexical features from the chats.
This function calculates and appends the following lexical features to the chat data: the number of questions (naive approach using question marks and question words), a classification of whether the message contains clarification questions, the word type-to-token ratio (TTR), and the proportion of first-person pronouns.
It also drops the raw number of first-person pronouns from the chat data as it is proportional to other columns.
- Returns:
None
- Return type:
None
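Three of these features can be sketched with simple token counts. The word lists and the question rule (a sentence counts as a question if it ends with a question mark or starts with a question word) are illustrative assumptions; the module's exact rules are not shown in this reference.

```python
import re

# Hypothetical word lists for illustration.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}
FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves"}


def count_questions(message: str) -> int:
    """Count sentences that end with '?' or begin with a question word."""
    count = 0
    for sentence in re.findall(r"[^.!?]+[.!?]*", message):
        sentence = sentence.strip()
        if not sentence:
            continue
        first_word = sentence.split()[0].lower().strip(".,!?")
        if sentence.endswith("?") or first_word in QUESTION_WORDS:
            count += 1
    return count


def type_token_ratio(message: str) -> float:
    """Number of distinct words divided by the total number of words."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def first_person_proportion(message: str) -> float:
    """Share of words that are first-person pronouns."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return sum(t in FIRST_PERSON for t in tokens) / len(tokens) if tokens else 0.0


num_questions = count_questions("What time is it? I think we are late. Why wait")
ttr = type_token_ratio("We like what we like")
proportion = first_person_proportion("We like what we like")
```

Reporting the pronoun feature as a proportion normalizes for message length, which is consistent with dropping the raw pronoun count afterwards.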
- positivity_zscore() None
Calculate the z-score of a message’s positivity (as measured by RoBERTa).
This function calculates and appends two positivity z-score features to the chat data: the z-score of the positivity across all chats and the z-score of the positivity within each conversation.
- Returns:
None
- Return type:
None
- text_based_features() None
Implement common text-based features.
This function calculates and appends the following text-based features to the chat data: the number of words, the number of characters, and the number of messages.
- Returns:
None
- Return type:
None