assign_chunk_nums module

utils.assign_chunk_nums.assign_chunk_nums(chat_data, num_chunks, conversation_id_col)

Assign a chunk number to each message in the chat data, splitting each conversation into roughly equal pieces.

This functionality is necessary for some conversational features that examine what happens throughout the course of a conversation (e.g., in the beginning, middle, and end).

This function behaves slightly differently depending on whether timestamps are available and on user specifications.

If a timestamp column exists and use_time_if_possible is True, the function will chunk based on the timestamp. Otherwise, it will chunk based on the number of messages.

Parameters:
  • chat_data (pd.DataFrame) – The input chat data

  • num_chunks (int) – The number of chunks desired

  • conversation_id_col (str) – The name of the column containing the unique conversation identifier

  • timestamp_col (str) – The name of the column containing the timestamp

  • use_time_if_possible (bool, optional) – If a timestamp exists, chunk based on the timestamp rather than based on the number of messages. Defaults to True.

Returns:

DataFrame with chunk numbers assigned

Return type:

pd.DataFrame
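
The choice between the two chunking modes described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name choose_chunking_strategy is hypothetical, and it assumes that "a timestamp column exists" means the column named by timestamp_col is present in the DataFrame.

```python
import pandas as pd


def choose_chunking_strategy(chat_data: pd.DataFrame,
                             timestamp_col: str,
                             use_time_if_possible: bool = True) -> str:
    """Hypothetical helper: report which chunking mode would apply.

    Assumption: time-based chunking is used only when the user allows it
    AND the timestamp column is actually present in the data.
    """
    if use_time_if_possible and timestamp_col in chat_data.columns:
        return "time"
    return "messages"


chats = pd.DataFrame({
    "conversation_num": ["A", "A"],
    "message": ["hello", "hi there"],
    "timestamp": ["2024-01-01 00:00:00", "2024-01-01 00:00:05"],
})
print(choose_chunking_strategy(chats, "timestamp"))
print(choose_chunking_strategy(chats, "timestamp", use_time_if_possible=False))
```

In this sketch, dropping the timestamp column or passing use_time_if_possible=False both fall back to message-based chunking.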

utils.assign_chunk_nums.create_chunks(df, num_chunks, conversation_id_col, timestamp_col)

Assign chunk numbers to the chats within each conversation based on time.

This function divides each conversation into time-based chunks, ensuring each chunk spans an equal duration.

Parameters:
  • df (pd.DataFrame) – DataFrame containing chat data with a timestamp column (named by timestamp_col)

  • num_chunks (int) – Number of chunks to divide the conversation into

  • conversation_id_col (str) – The name of the column containing the unique conversation identifier

  • timestamp_col (str) – The name of the column containing the timestamp

Returns:

DataFrame with an additional ‘chunk_num’ column indicating time-based chunk assignments

Return type:

pd.DataFrame
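
One way to picture equal-duration chunking is the sketch below. It is an illustration under stated assumptions, not the library's code: the function name chunk_by_time_sketch is hypothetical, chunk numbers are assumed to be 0-indexed integers, the last message of a conversation is assumed to fall into the final chunk, and a conversation with zero duration is assumed to land entirely in chunk 0.

```python
import pandas as pd


def chunk_by_time_sketch(df: pd.DataFrame, num_chunks: int,
                         conversation_id_col: str,
                         timestamp_col: str) -> pd.DataFrame:
    """Hypothetical sketch: split each conversation into equal time spans."""
    out = df.copy()
    ts = pd.to_datetime(out[timestamp_col])

    # Per-conversation start and end times, aligned to each row.
    grouped = ts.groupby(out[conversation_id_col])
    start = grouped.transform("min")
    end = grouped.transform("max")

    elapsed = (ts - start).dt.total_seconds()
    total = (end - start).dt.total_seconds()
    # Assumption: avoid dividing by zero for single-instant conversations.
    total = total.where(total > 0, 1.0)

    # Floor of (elapsed / total) * num_chunks; the final message would
    # otherwise get index num_chunks, so cap it at the last chunk.
    chunk = (elapsed * num_chunks // total).astype(int)
    out["chunk_num"] = chunk.clip(upper=num_chunks - 1)
    return out


chats = pd.DataFrame({
    "conversation_num": ["A"] * 5 + ["B"],
    "timestamp": [
        "2024-01-01 00:00:00", "2024-01-01 00:00:05",
        "2024-01-01 00:00:12", "2024-01-01 00:00:25",
        "2024-01-01 00:00:30", "2024-01-01 09:00:00",
    ],
})
result = chunk_by_time_sketch(chats, 3, "conversation_num", "timestamp")
print(result)
```

Here conversation A lasts 30 seconds, so each of the three chunks covers 10 seconds. Chunks defined this way can hold very different numbers of messages when activity is bursty.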

utils.assign_chunk_nums.create_chunks_messages(chat_data, num_chunks, conversation_id_col)

Assign chunk numbers to the chats within each conversation based on the number of messages.

This function aims for a roughly equal number of messages per chunk by calculating the chunk size for each conversation and adjusting the chunk number accordingly.

Parameters:
  • chat_data (pd.DataFrame) – DataFrame containing chat data

  • num_chunks (int) – Initial maximum number of chunks

  • conversation_id_col (str) – The name of the column containing the unique conversation identifier

Returns:

DataFrame with an additional ‘chunk_num’ column indicating chunk assignments

Return type:

pd.DataFrame
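
A count-based split can be sketched as below. This is a minimal illustration and may differ from the library's actual assignment rule: the name chunk_by_messages_sketch is hypothetical, rows are assumed to already be in chronological order within each conversation, chunk numbers are assumed to be 0-indexed integers, and the per-conversation chunk count is assumed to be capped so that each chunk holds at least two messages where possible.

```python
import pandas as pd


def chunk_by_messages_sketch(chat_data: pd.DataFrame, num_chunks: int,
                             conversation_id_col: str) -> pd.DataFrame:
    """Hypothetical sketch: split each conversation by message position."""
    out = chat_data.copy()
    by_conv = out.groupby(conversation_id_col)

    # Number of messages in each row's conversation, and the row's
    # 0-based position inside that conversation.
    sizes = by_conv[conversation_id_col].transform("size")
    position = by_conv.cumcount()

    # Assumption: short conversations get fewer chunks (at least 1,
    # at most num_chunks, aiming for two or more messages per chunk).
    effective_chunks = (sizes // 2).clip(lower=1, upper=num_chunks)

    # Integer arithmetic maps positions proportionally onto chunks.
    out["chunk_num"] = position * effective_chunks // sizes
    return out


chats = pd.DataFrame({
    "conversation_num": ["A"] * 6 + ["B"] * 5,
    "message": [f"msg {i}" for i in range(11)],
})
result = chunk_by_messages_sketch(chats, 3, "conversation_num")
print(result)
```

With three chunks requested, the six-message conversation is split 2/2/2, while the five-message conversation is reduced to two chunks and split 3/2.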

utils.assign_chunk_nums.reduce_chunks(num_rows, max_num_chunks)

Reduce the number of chunks based on the number of rows.

This function adjusts the number of chunks to ensure that each chunk has at least two rows. If the total number of rows is less than twice the maximum number of chunks, it reduces the number of chunks accordingly.

Parameters:
  • num_rows (int) – Total number of rows

  • max_num_chunks (int) – Initial maximum number of chunks

Returns:

Adjusted number of chunks

Return type:

int
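
The rule described above can be sketched in a few lines. This is an illustration of the stated behavior, not necessarily the library's code: the name reduce_chunks_sketch is hypothetical, and the floor of one chunk for very short conversations is an assumption.

```python
def reduce_chunks_sketch(num_rows: int, max_num_chunks: int) -> int:
    """Hypothetical sketch: cap the chunk count at two rows per chunk."""
    if num_rows < max_num_chunks * 2:
        # Too few rows for the requested chunks: allow one chunk per
        # two rows, but (by assumption) never fewer than one chunk.
        return max(num_rows // 2, 1)
    return max_num_chunks


for rows in (10, 6, 5, 4, 1):
    print(rows, reduce_chunks_sketch(rows, 3))
```

For example, with a maximum of three chunks, six or more rows keep all three chunks, while four or five rows are reduced to two.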