preprocess module

utils.preprocess.compress(turn_df, message_col)

Combine messages in the same turn into a single message.

This function takes a DataFrame representing messages in a single turn and concatenates their ‘message’ and ‘message_lower_with_punc’ columns into single strings if there are multiple messages in the same turn.

Parameters:
  • turn_df (pandas.DataFrame) – The DataFrame containing messages in a single turn.

  • message_col (str) – A string representing the column name that should be selected as the message.

Returns:

A Series with combined messages for the turn.

Return type:

pandas.Series
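
The combining step described above can be illustrated with a minimal sketch. This is not the module's implementation: the name compress_sketch, the space separator, and the choice to keep the first row's values for all other columns are assumptions made for illustration.

```python
import pandas as pd


def compress_sketch(turn_df: pd.DataFrame, message_col: str) -> pd.Series:
    """Collapse all rows of one turn into a single row (illustrative only)."""
    # Start from the first row so that non-message columns keep their first value.
    result = turn_df.iloc[0].copy()
    if len(turn_df) > 1:
        # Join each message column into one string; the separator is an assumption.
        for col in (message_col, "message_lower_with_punc"):
            if col in turn_df.columns:
                result[col] = " ".join(turn_df[col].astype(str))
    return result


turn = pd.DataFrame({
    "speaker": ["A", "A"],
    "message": ["Hello", "there!"],
    "message_lower_with_punc": ["hello", "there!"],
})
combined = compress_sketch(turn, "message")
```

Here combined is a single pandas.Series whose two message fields each hold the joined text of both rows.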

utils.preprocess.create_cumulative_rows(input_df, conversation_id, timestamp_col, grouping_keys, within_task=False)

Generate cumulative rows for chat data to analyze conversations in context.

This function takes chat-level data and duplicates rows to facilitate the analysis of conversations in the context of preceding chats. It enables the inclusion of chats from previous stages or tasks within the same conversation.

NOTE: This function was created in the context of a multi-stage Empirica game (see: https://github.com/Watts-Lab/multi-task-empirica).

It assumes that there are exactly 3 nested columns at different levels: a High, Mid, and Low level. It further assumes that these levels are temporally nested: each group/conversation has one High-level identifier, which contains one or more Mid-level identifiers, each of which contains one or more Low-level identifiers.

This is specifically applicable to a hierarchical conversation in which the same group or pairing does a series of different activities, each of which may have one or more subparts. Thus, the group as a whole will have a “high-level” identifier, each activity will have a “mid-level” identifier, and each sub-part will have a “low-level” identifier.

Parameters:
  • input_df (pandas.DataFrame) – The DataFrame containing chat data.

  • conversation_id (str) – The ID (e.g., stage or round) used for grouping the data.

  • timestamp_col (str) – The column containing the timestamp. Since we assume that the conversation is evolving over time, we use the timestamp column to make the analysis of conversation “cumulative” (that is, to include in our analysis prior discussions for other activities).

  • grouping_keys (list) – A list of three hierarchical keys, which must be passed in the order of (highest level, mid level, lowest level). We assume that, for a given item at the highest level, there are one or more items at the mid level; for each item at the mid level, there are one or more items at the lowest level.

  • within_task (bool, optional) – Flag to determine whether to restrict the analysis to the same activity or “task” (assumed to be the Mid-Level Identifier), defaults to False.

Returns:

The processed DataFrame with cumulative rows added.

Return type:

pandas.DataFrame
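
To make the High/Mid/Low nesting concrete, the following is a rough sketch of one way cumulative rows could be built. It is an assumption-laden illustration, not the module's code: the name cumulative_rows_sketch, the added "cumulative_<low>" column, and the omission of the conversation_id parameter are all choices made here for brevity.

```python
import pandas as pd


def cumulative_rows_sketch(df, timestamp_col, grouping_keys, within_task=False):
    """Duplicate earlier chats into each later low-level unit (illustrative only)."""
    high, mid, low = grouping_keys
    label_col = "cumulative_" + low  # hypothetical name for the new grouping column
    pieces = []
    for (h, m, l), current in df.groupby([high, mid, low], sort=False):
        start = current[timestamp_col].min()
        # Earlier chats from the same high-level group.
        prior = (df[high] == h) & (df[timestamp_col] < start)
        if within_task:
            # Only keep earlier chats from the same mid-level activity ("task").
            prior &= df[mid] == m
        is_current = (df[high] == h) & (df[mid] == m) & (df[low] == l)
        # Label prior and current rows with the unit they will be analyzed under.
        pieces.append(df[prior | is_current].assign(**{label_col: l}))
    return pd.concat(pieces, ignore_index=True)


chats = pd.DataFrame({
    "group": ["g1", "g1", "g1"],
    "task": ["t1", "t1", "t2"],
    "stage": ["s1", "s2", "s3"],
    "timestamp": [1, 2, 3],
    "message": ["a", "b", "c"],
})
keys = ["group", "task", "stage"]
across_tasks = cumulative_rows_sketch(chats, "timestamp", keys)
same_task = cumulative_rows_sketch(chats, "timestamp", keys, within_task=True)
```

With within_task=False, stage s3 is analyzed together with the earlier chats from s1 and s2; with within_task=True, it only sees chats from its own task.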

utils.preprocess.get_turn_id(df, speaker_id_col)

Generate turn IDs for a conversation to identify turns taken by speakers.

This function compares the current speaker with the previous one to identify when a change in speaker occurs, and then assigns a unique ‘turn_id’ that increments whenever the speaker changes within the conversation.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing chat data for a single conversation.

  • speaker_id_col (str) – A string representing the column name that should be selected as the speaker ID.

Returns:

A Series containing the turn IDs.

Return type:

pandas.Series
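
The speaker-change comparison described above is commonly written with shift and cumsum. The sketch below shows that pattern under assumptions: get_turn_id_sketch is a hypothetical name, and whether the module's IDs start at 0 or 1 is not stated in this reference (this sketch starts at 1).

```python
import pandas as pd


def get_turn_id_sketch(df: pd.DataFrame, speaker_id_col: str) -> pd.Series:
    """Assign an ID that increments whenever the speaker changes (illustrative only)."""
    # True wherever the speaker differs from the previous row; the first row
    # has no predecessor, so it also counts as a change.
    speaker_changed = df[speaker_id_col] != df[speaker_id_col].shift()
    # A running count of the changes gives one ID per consecutive run of a speaker.
    return speaker_changed.cumsum()


chat = pd.DataFrame({"speaker": ["A", "A", "B", "A", "B", "B"]})
turn_ids = get_turn_id_sketch(chat, "speaker")
```

Consecutive rows from the same speaker share an ID, so the two opening messages from A form one turn and the two closing messages from B form another.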

utils.preprocess.preprocess_conversation_columns(df: DataFrame, column_names: dict, grouping_keys: list, cumulative_grouping: bool = False, within_task: bool = False) → DataFrame

Preprocesses conversation data by removing special characters from column names and assigning a conversation number.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing conversation data.

  • conversation_id (str, optional) – The column name to use for assigning conversation numbers.

  • timestamp_col (str) – The name of the column containing the timestamp.

  • grouping_keys (list) – A list of multiple identifier keys for a conversation.

  • cumulative_grouping (bool, optional) – Whether to group data cumulatively based on the conversation_id. This option was created in the context of a multi-stage Empirica game (see: https://github.com/Watts-Lab/multi-task-empirica).

  • within_task (bool, optional) – Used only if cumulative_grouping is True, to specify if grouping is within the “task.” This option was created in the context of a multi-stage Empirica game (see: https://github.com/Watts-Lab/multi-task-empirica).

Returns:

The preprocessed DataFrame with a conversation number column.

Return type:

pd.DataFrame
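
As a rough illustration of the two steps named in the summary (cleaning column names and assigning a conversation number), consider the sketch below. Everything specific in it is an assumption: the function name, the regular expression used to decide what counts as a special character, and the "conversation_num" column name are not taken from this reference.

```python
import re

import pandas as pd


def clean_and_number_sketch(df: pd.DataFrame, conversation_id: str) -> pd.DataFrame:
    """Clean column names and number conversations (illustrative only)."""
    out = df.copy()
    # Keep only letters, digits, and underscores in column names (assumed rule).
    out.columns = [re.sub(r"[^a-zA-Z0-9_]+", "", str(c)) for c in out.columns]
    # Number conversations in order of first appearance, starting at 0.
    out["conversation_num"] = pd.factorize(out[conversation_id])[0]
    return out


raw = pd.DataFrame({
    "conversation_id": ["x", "x", "y"],
    "message (raw)": ["hi", "there", "hello"],
})
numbered = clean_and_number_sketch(raw, "conversation_id")
```

The sketch assumes the conversation identifier column already has a clean name, so it can still be looked up after the renaming step.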

utils.preprocess.preprocess_naive_turns(chat_data, column_names)

Combine adjacent rows of the same speaker in the same conversation and compress messages into a “turn”.

This function first generates a ‘turn_id’ for each chat message within the same conversation, indicating turns taken by the active speaker. It then combines messages with the same ‘turn_id’ within each conversation to compress repeated messages from the same speaker.

Parameters:
  • chat_data (pandas.DataFrame) – The chat data to process.

  • column_names (dict) – Columns to preprocess.

Returns:

The processed chat data with combined message turns.

Return type:

pandas.DataFrame
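
The two-step process described above (assign turn IDs, then merge rows that share one) can be sketched as follows. This is a simplified illustration rather than the module's code: it takes plain column names instead of the column_names dict, whose keys are not documented here, and it joins messages with a space.

```python
import pandas as pd


def naive_turns_sketch(chat_data, conversation_col, speaker_col, message_col):
    """Merge consecutive messages from one speaker into a turn (illustrative only)."""
    out = chat_data.copy()
    # A new turn starts when the speaker changes or a new conversation begins.
    changed = (out[speaker_col] != out[speaker_col].shift()) | (
        out[conversation_col] != out[conversation_col].shift()
    )
    # Count the changes separately inside each conversation.
    out["turn_id"] = changed.groupby(out[conversation_col]).cumsum()
    # Collapse every (conversation, turn) pair into a single row.
    return out.groupby([conversation_col, "turn_id"], sort=False, as_index=False).agg(
        {speaker_col: "first", message_col: " ".join}
    )


chats = pd.DataFrame({
    "conversation": ["c1", "c1", "c1", "c2", "c2"],
    "speaker": ["A", "A", "B", "B", "B"],
    "message": ["hi", "there", "hello", "yo", "sup"],
})
turns = naive_turns_sketch(chats, "conversation", "speaker", "message")
```

Note that speaker B's message at the end of c1 is not merged with B's messages in c2, because turns are only compressed within a conversation.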

utils.preprocess.preprocess_text(text: str) → str

Preprocess text by removing non-alphanumeric characters and converting to lowercase.

This function takes a string, removes any characters that are not letters, numbers, or spaces, except certain emojis, and converts the remaining text to lowercase.

Parameters:

text (str) – The input text to process.

Returns:

The processed text containing only alphanumeric characters and spaces in lowercase.

Return type:

str
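
A minimal sketch of this kind of cleaning with a regular expression is shown below. It is an approximation: the reference does not say which emojis are preserved, so this hypothetical preprocess_text_sketch keeps none of them and handles ASCII letters and digits only.

```python
import re


def preprocess_text_sketch(text: str) -> str:
    """Strip punctuation and lowercase the text (illustrative only; no emoji handling)."""
    # Remove every character that is not an ASCII letter, a digit, or a space.
    return re.sub(r"[^a-zA-Z0-9 ]+", "", text).lower()


cleaned = preprocess_text_sketch("Hello, World! It's 9 AM.")
```

Because punctuation is deleted rather than replaced, contractions such as "It's" collapse into a single token ("its").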

utils.preprocess.preprocess_text_lowercase_but_retain_punctuation(text)

Convert the input text to lowercase while retaining punctuation.

This function takes a string and converts all characters to lowercase, keeping any punctuation marks intact.

Parameters:

text (str) – The input text to process.

Returns:

The processed text with all characters in lowercase.

Return type:

str
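
For contrast with preprocess_text, the behavior described here amounts to lowercasing alone. The sketch below uses a hypothetical function name and assumes the input is already a string.

```python
def lowercase_keep_punctuation_sketch(text: str) -> str:
    """Lowercase the text and leave punctuation untouched (illustrative only)."""
    return text.lower()


lowered = lowercase_keep_punctuation_sketch("Wait... REALLY?!")
```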

utils.preprocess.remove_unhashable_cols(df: DataFrame, column_names: dict, warning: bool = True) → DataFrame

Raise an error if a required column contains unhashable types. Otherwise, remove any other columns that contain unhashable types from the DataFrame and print a warning message.

Parameters:
  • df (pd.DataFrame) – Pandas DataFrame to validate

  • column_names (dict) – Dictionary of 4 required columns that must not contain unhashable types

Returns:

Cleaned DataFrame (if columns were removed)

Return type:

pd.DataFrame

Raises:

ValueError – if a required column contains unhashable types
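
The validate-then-drop logic can be sketched as below. This is an illustration under assumptions: it accepts a plain list of required column names instead of the column_names dict (whose keys are not listed here), and it detects unhashable values with collections.abc.Hashable, which catches lists, dicts, and sets but not tuples that contain them.

```python
from collections.abc import Hashable

import pandas as pd


def remove_unhashable_cols_sketch(df, required_cols, warning=True):
    """Drop optional columns holding unhashable values (illustrative only)."""
    # Find every column with at least one unhashable value (list, dict, set, ...).
    bad_cols = [
        col for col in df.columns
        if any(not isinstance(value, Hashable) for value in df[col])
    ]
    required_bad = [col for col in bad_cols if col in required_cols]
    if required_bad:
        raise ValueError(f"Required columns contain unhashable types: {required_bad}")
    if bad_cols and warning:
        print(f"Warning: dropping columns with unhashable types: {bad_cols}")
    return df.drop(columns=bad_cols)


data = pd.DataFrame({
    "message": ["hi", "yo"],
    "metadata": [{"a": 1}, {"b": 2}],
})
cleaned_df = remove_unhashable_cols_sketch(data, required_cols=["message"])
```

Passing required_cols=["metadata"] instead would raise ValueError, since the dict-valued column could then not be dropped.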

utils.preprocess.setup_logger(name: str, log_file_path: str, level: int = 20)

Set up a logger.

Parameters:
  • name (str) – The name of the logger.

  • log_file_path (str) – Path to the log file, such as ‘./output/logs/feature_builder.log’.

  • level (int, optional) – Logging level, defaults to logging.INFO. All levels: 0: NOTSET, 10: DEBUG, 20: INFO, 30: WARNING, 40: ERROR, 50: CRITICAL.

Returns:

Configured logger.

Return type:

logging.Logger
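
A file-backed logger of this shape can be built with the standard logging module. The sketch below is an assumption about what such a helper might look like, not the module's code: the function name, the log format string, and the creation of the parent directory are illustrative choices.

```python
import logging
import os
import tempfile


def setup_logger_sketch(name: str, log_file_path: str, level: int = logging.INFO):
    """Create a logger that writes to a file (illustrative only)."""
    log_dir = os.path.dirname(log_file_path)
    if log_dir:
        # Make sure the target directory exists before opening the file.
        os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Avoid attaching a second handler if this logger was configured before.
    if not logger.handlers:
        handler = logging.FileHandler(log_file_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Write to a temporary directory so the example leaves the working tree untouched.
log_path = os.path.join(tempfile.mkdtemp(), "logs", "feature_builder.log")
logger = setup_logger_sketch("feature_builder_demo", log_path)
logger.info("preprocessing started")
```

Since level defaults to logging.INFO (20), a call to logger.debug(...) would be filtered out unless level=logging.DEBUG (10) is passed.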