calculate_chat_level_features module

class utils.calculate_chat_level_features.ChatLevelFeaturesCalculator(chat_data: DataFrame, vect_data: DataFrame, bert_sentiment_data: DataFrame, ner_training: DataFrame, ner_cutoff: int, conversation_id_col: str, message_col: str, timestamp_col: str | tuple[str, str], timestamp_unit: str, custom_liwc_dictionary: dict)

Bases: object

Initialize variables and objects used by the ChatLevelFeaturesCalculator class.

This class uses various feature modules to define chat-level features. It reads input data and initializes variables required to compute the features.

Parameters:
  • chat_data (pd.DataFrame) – Pandas dataframe of chat-level features read from the input dataset

  • vect_data (pd.DataFrame) – Pandas dataframe containing vector data

  • bert_sentiment_data (pd.DataFrame) – Pandas dataframe containing BERT sentiment data

  • ner_training (pd.DataFrame) – Pandas dataframe of training data for the named entity recognition feature

  • ner_cutoff (int) – Cutoff value for the confidence of prediction for each named entity

  • conversation_id_col (str) – A string representing the column name that should be selected as the conversation ID. Defaults to “conversation_num”.

  • message_col (str) – A string representing the column name that should be selected as the message. Defaults to “message”.

  • timestamp_col (str | tuple[str, str]) – A string (or, per the signature, a tuple of two strings) representing the column name that should be selected as the timestamp. Defaults to “timestamp”.

  • timestamp_unit (str) – A string representing the unit of the timestamp values.

  • custom_liwc_dictionary (dict) – This is the user’s own LIWC dictionary. Defaults to empty dictionary.

calculate_chat_level_features(feature_methods: list) DataFrame

Main driver function for creating chat-level features.

This function computes various chat-level features using different modules and appends them as new columns to the input chat data.

Returns:

The chat-level dataset with new columns for each chat-level feature

Return type:

pd.DataFrame

calculate_hedge_features() None

Calculate features related to expressing hesitation (or ‘hedge’).

This function identifies whether a message contains hedge words using a naive approach and appends this information as a new column to the chat data.

Returns:

None

Return type:

None
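
As an illustration of what a naive hedge check can look like, the sketch below flags a message when any of its tokens appears in a small word list. The list HEDGE_WORDS and the helper contains_hedge are hypothetical names introduced for this example; the word list the module actually uses is not documented here.

```python
import re

# Hypothetical, deliberately small hedge lexicon (an assumption for this sketch).
HEDGE_WORDS = {"maybe", "perhaps", "possibly", "probably", "might", "seems", "somewhat"}


def contains_hedge(message: str) -> int:
    """Return 1 if any lowercase token of the message is in HEDGE_WORDS, else 0."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return int(any(token in HEDGE_WORDS for token in tokens))


messages = ["Maybe we should wait.", "We ship on Friday."]
hedge_flags = [contains_hedge(m) for m in messages]
```

A lookup like this ignores context (for example, "might" used as a noun), which is why the approach is described as naive.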

calculate_politeness_sentiment() None

Calculate politeness strategies using the Politeness module from Convokit.

This function applies the Convokit politeness strategies to the chat messages and appends all outputted features to the chat data.

Returns:

None

Return type:

None

calculate_politeness_v2() None

Calculate politeness features using the System for Encouraging Conversational Receptiveness (SECR).

Source: https://www.mikeyeomans.info/papers/receptiveness.pdf

This function applies the SECR module to the chat messages and appends the calculated politeness features to the chat data.

Returns:

None

Return type:

None

calculate_textblob_sentiment() None

Calculate features related to sentiment using TextBlob.

This function calculates and appends the following TextBlob sentiment features to the chat data:

  • Subjectivity score

  • Polarity score

Returns:

None

Return type:

None

calculate_vector_word_mimicry() None

Compute the mimicry relative to the previous chat(s) using SBERT vectors.

Returns:

None

Return type:

None

calculate_word_mimicry() None

Calculate features related to word mimicry.

This function calculates the number of function words and the sum of inverse frequency of content words that also appear in the other’s prior turn.

  • Extracts function and content words from a message

  • Identifies mimicry of function and content words from the immediate previous turn

  • Computes function word accommodation (number of mimicked function words)

  • Computes content word accommodation (sum of inverse frequency of mimicked content words)

Drops the intermediate columns related to function and content words after calculation.

Returns:

None

Return type:

None
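
The steps listed for calculate_word_mimicry can be sketched roughly as follows. This is a minimal illustration and not the module's implementation: the FUNCTION_WORDS list, the tokenizer, and the choice to compute content-word frequencies over the turns passed in are all assumptions, and each turn is simply compared with the one immediately before it.

```python
from collections import Counter

# Hypothetical function-word list (an assumption for this sketch).
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "but", "i", "you", "we", "it", "is", "in"}
PUNCTUATION = ".,!?;:\"'()"


def split_words(message):
    """Split a message into (function words, content words), lowercased."""
    tokens = [w.strip(PUNCTUATION).lower() for w in message.split()]
    tokens = [t for t in tokens if t]
    function = [t for t in tokens if t in FUNCTION_WORDS]
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return function, content


def mimicry_scores(turns):
    """Score each turn against the immediately preceding turn."""
    # Frequency of each content word across all turns (assumed reference corpus).
    content_freq = Counter()
    for turn in turns:
        content_freq.update(split_words(turn)[1])

    scores = []
    prev_function, prev_content = set(), set()
    for turn in turns:
        function, content = split_words(turn)
        mimicked_function = [w for w in function if w in prev_function]
        mimicked_content = [w for w in content if w in prev_content]
        scores.append({
            # Number of function-word tokens repeated from the previous turn.
            "function_word_accommodation": len(mimicked_function),
            # Sum of inverse frequencies of repeated content words.
            "content_word_accommodation": sum(1 / content_freq[w] for w in mimicked_content),
        })
        prev_function, prev_content = set(function), set(content)
    return scores


turns = [
    "I think the budget is too small.",
    "The budget is fine, but the deadline is not.",
]
scores = mimicry_scores(turns)
```

Weighting by inverse frequency means that repeating a rare content word counts for more than repeating a common one.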

concat_bert_features() None

Concatenate RoBERTa sentiment features to the chat data.

This function appends RoBERTa sentiment data (which are pre-processed beforehand to save computation) as new columns to the existing chat data.

Returns:

None

Return type:

None

get_certainty_score() None

Calculate the certainty score of a statement.

This function uses the formula published in Rocklage et al. (2023) to calculate the certainty score of a chat message and appends it to the chat data.

Source: https://journals.sagepub.com/doi/pdf/10.1177/00222437221134802

Returns:

None

Return type:

None

get_dale_chall_score_and_classfication() None

Calculate the readability of a text according to its Dale-Chall score.

This function calculates and appends the following Dale-Chall readability features to the chat data:

  • Dale-Chall score

  • Dale-Chall classification

Returns:

None

Return type:

None

get_forward_flow() None

Calculate the chat-level forward flow.

This function compares the current chat to the average of the previous chats and appends the forward flow score to the chat data.

Returns:

None

Return type:

None
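
To make the "current chat versus average of previous chats" comparison concrete, here is a minimal sketch over plain Python lists. It assumes the comparison is a cosine similarity between the current message vector and the mean of all earlier vectors in the same conversation, and it assigns 0.0 to the first message; both choices are assumptions, since the exact formula is not stated here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def forward_flow(vectors):
    """Compare each vector with the mean of all vectors that came before it."""
    scores = []
    running_sum = None
    for i, vec in enumerate(vectors):
        if i == 0:
            # No history yet; 0.0 is an assumed placeholder for the first message.
            scores.append(0.0)
            running_sum = list(vec)
            continue
        mean_previous = [s / i for s in running_sum]
        scores.append(cosine_similarity(vec, mean_previous))
        running_sum = [s + x for s, x in zip(running_sum, vec)]
    return scores


# Toy 2-dimensional vectors standing in for message embeddings.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
flow = forward_flow(vectors)
```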

get_named_entity() None

Calculate the number of named entities in a chat.

Returns:

None

Return type:

None

get_reddit_features() None

Calculate a suite of features common in online communication.

This function calculates and appends the following features to the chat data:

  • Number of all caps words

  • Number of links

  • Number of user references (Reddit format)

  • Number of emphases (bold, italics)

  • Number of bullet points

  • Number of numbered points

  • Number of line breaks

  • Number of quotes

  • Number of responses to someone else (using “>”)

  • Number of ellipses

  • Number of parentheses

  • Number of emojis

Returns:

None

Return type:

None
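
Several of the counts above can be approximated with regular expressions. The sketch below covers five of them; the patterns and helper names are assumptions chosen for illustration and may differ from the ones the module uses.

```python
import re


def count_all_caps(text):
    """Count words made of two or more uppercase letters."""
    return len(re.findall(r"\b[A-Z]{2,}\b", text))


def count_links(text):
    """Count http:// and https:// URLs."""
    return len(re.findall(r"https?://\S+", text))


def count_user_references(text):
    """Count Reddit-style user mentions such as u/alice."""
    return len(re.findall(r"\bu/\w+", text))


def count_ellipses(text):
    """Count runs of three or more periods."""
    return len(re.findall(r"\.{3,}", text))


def count_quoted_responses(text):
    """Count lines that begin with '>' (a reply quoting someone else)."""
    return len(re.findall(r"^>", text, flags=re.MULTILINE))


message = "> earlier point\nGREAT idea u/alice... see https://example.com"
features = {
    "num_all_caps": count_all_caps(message),
    "num_links": count_links(message),
    "num_user_references": count_user_references(message),
    "num_ellipses": count_ellipses(message),
    "num_quoted_responses": count_quoted_responses(message),
}
```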

get_temporal_features() None

Calculate features relevant to the timestamps of each chat.

This function calculates the time difference between messages sent and appends it to the chat data.

It assumes the ‘timestamp’ column is available, which is checked in feature_builder.py.

Returns:

None

Return type:

None
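
A time-difference feature of this kind could be computed as in the sketch below. It assumes rows arrive in chronological order, that timestamps are strings in "%Y-%m-%d %H:%M:%S" format, that differences are taken within a conversation and reported in seconds, and that the first message of each conversation gets None. None of these details is confirmed by the description above.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed input format


def time_diffs(rows):
    """Seconds since the previous message in the same conversation.

    rows is a list of (conversation_id, timestamp_string) pairs.
    """
    last_seen = {}
    diffs = []
    for conversation_id, stamp in rows:
        current = datetime.strptime(stamp, TIMESTAMP_FORMAT)
        previous = last_seen.get(conversation_id)
        if previous is None:
            diffs.append(None)  # first message in this conversation
        else:
            diffs.append((current - previous).total_seconds())
        last_seen[conversation_id] = current
    return diffs


rows = [
    ("A", "2024-01-01 10:00:00"),
    ("A", "2024-01-01 10:00:30"),
    ("B", "2024-01-01 11:00:00"),
    ("A", "2024-01-01 10:02:00"),
]
diffs = time_diffs(rows)
```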

info_exchange() None

Extract different types of z-scores from the chats.

This function calculates and appends the following info exchange features to the chat data:

  • Modified word count (total word count minus first-person singular pronouns)

  • Z-score of the modified word count across all chats

  • Z-score of the modified word count within each conversation

It then drops the intermediate info_exchange_wordcount column as it is a linear combination of the z-score columns.

Returns:

None

Return type:

None
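
The difference between the two z-scores is easiest to see on a toy dataset. The sketch below uses a hypothetical pronoun list and the population standard deviation; the module may use a different pronoun set or the sample standard deviation, so treat the numbers as illustrative only.

```python
from statistics import mean, pstdev

# Hypothetical first-person singular pronoun list (an assumption for this sketch).
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def modified_word_count(message):
    """Word count excluding first-person singular pronouns."""
    tokens = [w.strip(".,!?;:").lower() for w in message.split()]
    return sum(1 for t in tokens if t and t not in FIRST_PERSON_SINGULAR)


def zscores(values):
    """Z-scores using the population standard deviation; all zeros if no variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


chats = [
    ("A", "I like my plan"),
    ("A", "The plan needs more detail here"),
    ("B", "Agreed"),
    ("B", "I think we should start now"),
]
counts = [modified_word_count(message) for _, message in chats]

# Z-score across all chats, regardless of conversation.
z_all = zscores(counts)

# Z-score within each conversation.
by_conversation = {}
for (conversation_id, _), count in zip(chats, counts):
    by_conversation.setdefault(conversation_id, []).append(count)
z_within = {cid: zscores(values) for cid, values in by_conversation.items()}
```

The within-conversation score measures a message against its own conversation's baseline, so a short message in a terse conversation is not penalized the way it would be by the global score.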

lexical_features() None

Implement lexical features.

This driver function calls relevant functions to compute lexical features and appends them to the chat data.

Returns:

None

Return type:

None

other_lexical_features() None

Extract various lexical features from the chats.

This function calculates and appends the following lexical features to the chat data:

  • Number of questions (naive approach using question marks and question words)

  • Classification of whether the message contains clarification questions

  • Word type-to-token ratio (TTR)

  • Proportion of first-person pronouns

It also drops the raw number of first-person pronouns from the chat data as it is proportional to other columns.

Returns:

None

Return type:

None
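
Two of these features, the naive question count and the type-to-token ratio, are simple enough to sketch directly. The QUESTION_WORDS list, the sentence splitter, and the tokenizer below are assumptions made for the example, not the module's actual rules.

```python
import re

# Hypothetical question-word list (an assumption for this sketch).
QUESTION_WORDS = {"who", "what", "where", "when", "why", "how"}


def count_questions(message):
    """Count sentences that end with '?' or start with a question word."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", message) if s.strip()]
    count = 0
    for sentence in sentences:
        first_word = sentence.split()[0].lower().strip(".,!?")
        if sentence.endswith("?") or first_word in QUESTION_WORDS:
            count += 1
    return count


def type_token_ratio(message):
    """Number of distinct words divided by total words; 0.0 for an empty message."""
    tokens = [w.strip(".,!?;:").lower() for w in message.split()]
    tokens = [t for t in tokens if t]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


num_questions = count_questions("What time is it? I am late. How do we fix this")
ttr = type_token_ratio("the cat saw the dog")
```

In the first example two sentences count as questions: one ends with a question mark and one starts with "How". In the second, four distinct words out of five give a ratio of 0.8.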

positivity_zscore() None

Calculate the z-score of a message’s positivity (as measured by RoBERTa).

This function calculates and appends the following positivity z-score features to the chat data:

  • Z-score of the positivity across all chats

  • Z-score of the positivity within each conversation

Returns:

None

Return type:

None

text_based_features() None

Implement common text-based features.

This function calculates and appends the following text-based features to the chat data:

  • Number of words

  • Number of characters

  • Number of messages

Returns:

None

Return type:

None