calculate_user_level_features module

class utils.calculate_user_level_features.UserLevelFeaturesCalculator(chat_data: DataFrame, user_data: DataFrame, vect_data: DataFrame, conversation_id_col: str, speaker_id_col: str, user_aggregation: bool, user_methods: list, user_columns: list, chat_features: list, logger)

Bases: object

Initialize variables and objects used by the UserLevelFeaturesCalculator class.

This class uses various feature modules to define user- (speaker) level features. It reads input data and initializes variables required to compute the features.

Parameters:
  • chat_data (pd.DataFrame) – Pandas dataframe of chat-level features read from the input dataset

  • user_data (pd.DataFrame) – Pandas dataframe of user-level features derived from the chat-level dataframe

  • vect_data (pd.DataFrame) – Pandas dataframe of message embeddings corresponding to each instance of the chat data

  • conversation_id_col (str) – A string representing the column name that should be selected as the conversation ID. Defaults to “conversation_num”.

  • speaker_id_col (str) – A string representing the column name that should be selected as the speaker ID. Defaults to “speaker_nickname”.

  • user_aggregation (bool) – If true, will aggregate features at the user level

  • user_methods (list) – The aggregation functions (e.g., mean, stdev…) to apply at the user level

  • user_columns (list) – The chat-level columns to aggregate at the user level

  • chat_features (list) – Tracks all the chat-level features generated by the toolkit

calculate_user_level_features() DataFrame

Main driver function for creating user-level features.

This function computes various user-level features by aggregating chat-level data, and appends them as new columns to the input user-level data.

Returns:

The user-level dataset with new columns for each user-level feature

Return type:

pd.DataFrame

get_centroids() None

Calculate the centroid of each user’s chats in a given conversation for future discursive metric calculations.

This function computes and appends the mean embedding (centroid) of each user’s chats to the user-level data.

Returns:

None

Return type:

None
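As an illustration of the centroid idea, here is a minimal sketch that averages each speaker's message embeddings within a conversation. The toy data, the `embedding` column name, and the `centroids` dictionary are assumptions made for this example. They are not the toolkit's internal implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per chat, each with a precomputed embedding.
chats = pd.DataFrame({
    "conversation_num": [1, 1, 1, 2],
    "speaker_nickname": ["ann", "ann", "bob", "ann"],
    "embedding": [
        np.array([1.0, 0.0]),
        np.array([0.0, 1.0]),
        np.array([2.0, 2.0]),
        np.array([4.0, 6.0]),
    ],
})

# Map (conversation, speaker) -> mean embedding of that speaker's chats.
centroids = {}
for (conv, speaker), group in chats.groupby(["conversation_num", "speaker_nickname"]):
    # Stack the group's vectors into a 2-D array and average over rows.
    stacked = np.stack(group["embedding"].to_list())
    centroids[(int(conv), str(speaker))] = stacked.mean(axis=0)
```

Each value in `centroids` could then be attached to the matching row of the user-level data, keyed on the conversation and speaker columns.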

get_user_level_summary_statistics_features() None

Aggregate summary statistics of chat-level features into user-level features.

There are many possible ways to aggregate user-level features, for example:

  • Mean of all chats by a given user

  • Max of all chats by a given user

  • Weighted mean (e.g., looking at different time points?)

… and so on.

Which aggregation is most appropriate is an open question, so this is marked as a TODO.
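To make the aggregation options concrete, here is a minimal pandas sketch that applies several methods to selected chat-level columns for each speaker in a conversation. The toy data, the `num_words` column, and the `<method>_user_<column>` naming scheme are assumptions made for this example. The toolkit's actual output names may differ.

```python
import pandas as pd

# Hypothetical chat-level data: one row per chat.
chats = pd.DataFrame({
    "conversation_num": [1, 1, 1, 1],
    "speaker_nickname": ["ann", "ann", "bob", "bob"],
    "num_words": [4, 8, 3, 5],
})

user_columns = ["num_words"]    # chat-level columns to aggregate (assumed)
user_methods = ["mean", "max"]  # aggregation functions to apply (assumed)

# One row per (conversation, speaker), one column per (column, method) pair.
summary = (
    chats.groupby(["conversation_num", "speaker_nickname"])[user_columns]
    .agg(user_methods)
)

# Flatten the (column, method) MultiIndex into single-level column names.
summary.columns = [f"{method}_user_{col}" for col, method in summary.columns]
summary = summary.reset_index()
```

Passing a list of method names to `agg` keeps the set of aggregations configurable, which mirrors the role of the `user_methods` and `user_columns` parameters.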

get_user_level_summed_features() None

Aggregate summary statistics from chat-level features that need to be summed together.

Features for which summing makes sense include:

  • Word count (total number of words)

  • Character count

  • Message count

This function calculates and merges the summed features into the user-level data.

Returns:

None

Return type:

None
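The sum-and-merge step can be sketched as follows. The toy data, the `num_words`, `num_chars`, and `num_messages` column names, and the `sum_` prefix are assumptions made for this example rather than the toolkit's confirmed names.

```python
import pandas as pd

keys = ["conversation_num", "speaker_nickname"]

# Hypothetical chat-level data: one row per chat.
chats = pd.DataFrame({
    "conversation_num": [1, 1, 1],
    "speaker_nickname": ["ann", "ann", "bob"],
    "num_words": [4, 8, 3],
    "num_chars": [20, 35, 12],
    "num_messages": [1, 1, 1],
})

# Hypothetical user-level data: one row per (conversation, speaker).
user_data = chats[keys].drop_duplicates().reset_index(drop=True)

summable = ["num_words", "num_chars", "num_messages"]

# Sum each feature over all chats by the same speaker in the same conversation.
summed = chats.groupby(keys, as_index=False)[summable].sum()
summed = summed.rename(columns={col: f"sum_{col}" for col in summable})

# Attach the summed features to the user-level rows.
user_data = user_data.merge(summed, on=keys, how="left")
```

A left merge keeps every existing user-level row even if a speaker has no matching chats.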

get_user_network() None

Get, for each user in each conversation, the list of other users in that conversation.

This function calculates and appends the list of other users in a given conversation to the user-level data.

Returns:

None

Return type:

None
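A minimal sketch of building the "other users" list for each speaker is shown below. The toy data and the `user_list` column name are assumptions made for this example. They are not taken from the toolkit's source.

```python
import pandas as pd

# Hypothetical user-level data: one row per (conversation, speaker).
user_data = pd.DataFrame({
    "conversation_num": [1, 1, 1, 2, 2],
    "speaker_nickname": ["ann", "bob", "cat", "ann", "dan"],
})

# Collect every speaker that appears in each conversation.
speakers_by_conv = {}
for conv, speaker in zip(user_data["conversation_num"], user_data["speaker_nickname"]):
    speakers_by_conv.setdefault(conv, []).append(speaker)

# For each row, keep everyone in the same conversation except the speaker.
user_list = [
    [other for other in speakers_by_conv[conv] if other != speaker]
    for conv, speaker in zip(user_data["conversation_num"], user_data["speaker_nickname"])
]
user_data["user_list"] = user_list
```

In this toy data, `ann` appears in both conversations and gets a different list in each, because the list is computed per conversation.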