assign_chunk_nums module
- utils.assign_chunk_nums.assign_chunk_nums(chat_data, num_chunks, conversation_id_col)
Assign a chunk number to each message in the chat data, splitting each conversation into roughly equal pieces.
This functionality is necessary for some conversational features that examine what happens throughout the course of a conversation (e.g., in the beginning, middle, and end).
This function behaves slightly differently depending on whether timestamps are available and on user specifications.
If a timestamp column exists and use_time_if_possible is True, the function will chunk based on the timestamp. Otherwise, it will chunk based on the number of messages.
- Parameters:
chat_data (pd.DataFrame) – The input chat data
num_chunks (int) – The number of chunks desired
timestamp_col (str) – The name of the column containing the timestamp
use_time_if_possible (bool, optional) – If a timestamp exists, chunk based on the timestamp rather than based on the number of messages. Defaults to True.
- Returns:
DataFrame with chunk numbers assigned
- Return type:
pd.DataFrame
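The choice between the two chunking modes described above can be sketched as a small decision helper. This is only an illustration of the documented rule, not the module's actual code: the name `choose_chunking_strategy` and the default column name `"timestamp"` are hypothetical.

```python
def choose_chunking_strategy(columns, timestamp_col="timestamp",
                             use_time_if_possible=True):
    """Return "time" or "messages" following the documented rule.

    Hypothetical helper: `columns` is any iterable of column names,
    for example `chat_data.columns`.
    """
    # Time-based chunking needs both a timestamp column and the user's consent.
    if use_time_if_possible and timestamp_col in columns:
        return "time"
    # Otherwise fall back to chunking by message count.
    return "messages"


cols = ["conversation_num", "message", "timestamp"]
strategy_default = choose_chunking_strategy(cols)
strategy_opt_out = choose_chunking_strategy(cols, use_time_if_possible=False)
strategy_no_ts = choose_chunking_strategy(["conversation_num", "message"])
```

With these inputs, only the first call selects time-based chunking; opting out or lacking a timestamp column both fall back to message counts.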
- utils.assign_chunk_nums.create_chunks(df, num_chunks, conversation_id_col, timestamp_col)
Assign chunk numbers to the chats within each conversation based on time.
This function divides each conversation into time-based chunks, ensuring each chunk spans an equal duration.
- Parameters:
df (pd.DataFrame) – DataFrame containing chat data, including the timestamp column named by timestamp_col
num_chunks (int) – Number of chunks to divide the conversation into
conversation_id_col (str) – The name of the column containing the unique conversation identifier
timestamp_col (str) – The name of the column containing the timestamp
- Returns:
DataFrame with an additional ‘chunk_num’ column indicating time-based chunk assignments
- Return type:
pd.DataFrame
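One way to picture equal-duration chunking is the following minimal sketch. It is not the library's implementation: the function name `sketch_time_chunks`, the 0-based chunk numbering, and the handling of single-timestamp conversations (everything goes to chunk 0) are assumptions made for illustration.

```python
import pandas as pd


def sketch_time_chunks(df, num_chunks, conversation_id_col, timestamp_col):
    """Hypothetical sketch: split each conversation into equal-duration chunks."""
    out = df.copy()
    ts = pd.to_datetime(out[timestamp_col])
    grouped = ts.groupby(out[conversation_id_col])

    # Seconds since the conversation started, and the conversation's total length.
    start = grouped.transform("min")
    elapsed = (ts - start).dt.total_seconds()
    duration = (grouped.transform("max") - start).dt.total_seconds()

    # Floor of (elapsed / duration) * num_chunks; a zero duration yields NaN.
    chunk = elapsed * num_chunks // duration.where(duration > 0)

    # Assumption: zero-duration conversations go to chunk 0, and the final
    # message (elapsed == duration) is folded into the last chunk.
    out["chunk_num"] = chunk.fillna(0).astype(int).clip(upper=num_chunks - 1)
    return out


example = pd.DataFrame({
    "conversation_num": ["a", "a", "a", "a", "b"],
    "timestamp": [
        "2024-01-01 00:00:00", "2024-01-01 00:10:00",
        "2024-01-01 00:20:00", "2024-01-01 00:30:00",
        "2024-01-01 09:00:00",
    ],
})
chunked = sketch_time_chunks(example, 3, "conversation_num", "timestamp")
```

Because the chunks are defined by elapsed time rather than message count, a burst of messages early in a conversation would all land in the first chunk.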
- utils.assign_chunk_nums.create_chunks_messages(chat_data, num_chunks, conversation_id_col)
Assign chunk numbers to the chats within each conversation based on the number of messages.
This function aims for a roughly equal number of messages per chunk by calculating the chunk size for each conversation and adjusting the chunk number accordingly.
- Parameters:
chat_data (pd.DataFrame) – DataFrame containing chat data
num_chunks (int) – Initial maximum number of chunks
conversation_id_col (str) – The name of the column containing the unique conversation identifier
- Returns:
DataFrame with an additional ‘chunk_num’ column indicating chunk assignments
- Return type:
pd.DataFrame
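Count-based chunking can be sketched as follows. This is an illustrative approximation rather than the module's code: `sketch_message_chunks` is a hypothetical name, chunk numbers are assumed to be 0-based, rows are assumed to already be in message order within each conversation, and the cap of two messages per chunk is inlined here as an assumption about how the chunk count is adjusted.

```python
import pandas as pd


def sketch_message_chunks(chat_data, num_chunks, conversation_id_col):
    """Hypothetical sketch: split each conversation into chunks of similar size."""
    out = chat_data.copy()
    ids = out[conversation_id_col]

    # 0-based position of each message within its conversation.
    position = out.groupby(conversation_id_col).cumcount()
    # Number of messages in the conversation each row belongs to.
    size = ids.map(ids.value_counts())

    # Assumption: reduce the chunk count so each chunk can hold two messages,
    # but never go below one chunk or above the requested maximum.
    n_chunks = (size // 2).clip(lower=1, upper=num_chunks)

    # Ceiling division: messages per chunk for that conversation.
    chunk_size = -(-size // n_chunks)
    out["chunk_num"] = position // chunk_size
    return out


msgs = pd.DataFrame({
    "conversation_num": ["a"] * 7 + ["b"] * 3,
    "message": [f"m{i}" for i in range(10)],
})
by_count = sketch_message_chunks(msgs, 3, "conversation_num")
```

In this example the seven-message conversation is split into chunks of 3, 3, and 1 messages, while the three-message conversation is too short for more than one chunk.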
- utils.assign_chunk_nums.reduce_chunks(num_rows, max_num_chunks)
Reduce the number of chunks based on the number of rows.
This function adjusts the number of chunks to ensure that each chunk has at least two rows. If the total number of rows is less than twice the maximum number of chunks, it reduces the number of chunks accordingly.
- Parameters:
num_rows (int) – Total number of rows
max_num_chunks (int) – Initial maximum number of chunks
- Returns:
Adjusted number of chunks
- Return type:
int
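The adjustment rule described above can be written out as a minimal sketch. The name `sketch_reduce_chunks` is hypothetical, and the floor of one chunk for very short conversations is an assumption that the description does not state.

```python
def sketch_reduce_chunks(num_rows, max_num_chunks):
    """Hypothetical sketch of the documented rule: at least two rows per chunk."""
    if num_rows < 2 * max_num_chunks:
        # Largest chunk count that still leaves two rows in every chunk.
        max_num_chunks = num_rows // 2
    # Assumption: always return at least one chunk, even for 0 or 1 rows.
    return max(1, max_num_chunks)


enough_rows = sketch_reduce_chunks(10, 3)   # 10 >= 6, so the count is unchanged
too_few_rows = sketch_reduce_chunks(5, 3)   # 5 < 6, so the count drops to 5 // 2
single_row = sketch_reduce_chunks(1, 3)     # floor of one chunk applies
```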