The Basics (Get Started Here!)

A Light-Touch, One-Function Package

The Team Communication Toolkit is designed to be a light-touch package. This means you should need minimal lines of code to get from text data to structured communication features. We have defaults and sample code to help you quickly run the toolkit on your data.

However, we understand that you might have special requirements and need to customize features. Therefore, we offer adjustable “knobs” in the FeatureBuilder (feature_builder module).

This overview will provide you with a high-level understanding of the key inputs and assumptions of our toolkit. After reading, refer to the walkthrough in the Worked Example for a detailed discussion.

Demo / Sample Code

We provide a simple example file, “featurize.py,” and a demo notebook, “demo.ipynb,” in the examples folder on GitHub.

We also have demos available on Google Colab that you can copy and run on your own.

Key Assumptions and Parameters

Package Assumptions

Pandas DataFrame: Your input should be a pandas DataFrame.
Unique Conversation Identifier: Each conversation in your dataframe needs a unique identifier (defined by conversation_id_col), or it can be generated by grouping multiple columns (defined by grouping_keys).
- conversation_id_col defaults to “conversation_num.”
- If grouping_keys are provided, they override the conversation identifier.
Unique Speaker Identifier: Each speaker in the conversation should have a unique identifier (defined by speaker_id_col).
- speaker_id_col defaults to “speaker_nickname.”
Single Utterance Column: The text of a single utterance should be in one column in the dataframe (defined by message_col).
- message_col defaults to “message.”
- Ensure you’ve segmented and preprocessed conversational transcripts into “utterances” or “turns”. We do not accept open-ended transcripts.
Temporal Order: Messages should be in temporal order. Earlier rows are assumed to be utterances that occurred before later rows.
Timestamps: If each message has a timestamp, it should be stored either in a single column or in two columns (a start time and an end time). Accordingly, timestamp_col can be a string or a tuple of (start, end), with the first item as the start time and the second item as the end time.
Timestamp Unit: We accept all timestamp formats compatible with pandas.to_datetime. If your timestamp is an integer or float number, we default to treating the units as milliseconds (‘ms’). This default can be changed using the timestamp_unit parameter; acceptable values are (D, s, ms, us, ns).
Metadata Columns: Any column other than the required inputs (conversation identifier, speaker identifier, message, and timestamp column(s)) is assumed to be metadata and won’t be summarized in the featurization process.
Vector Data Cache: Your data’s vector data will be cached in vector_directory. This directory will be created if it doesn’t exist, but its contents should be reserved for cached vector files.
- This parameter defaults to “vector_data/”.
- Note: v0.1.3 and earlier computed vectors using _preprocessed_ text by default, which drops capitalization and punctuation. However, this can affect the interpretation of sentiment vectors; for example, “Hello!” has more positive sentiment than “hello.” Consequently, from v0.1.4 onwards, we compute vectors using the raw input text, including punctuation and capitalization. To restore the earlier behavior (vectors computed from preprocessed text), set compute_vectors_from_preprocessed to True.
- Additionally, we assume that empty messages are equivalent to a “NaN vector,” defined here.
Output File Base: We generate three output files, one at each level of analysis (Utterance/Chat, Speaker/User, and Conversation). We recommend using the output_file_base parameter to give them all a common naming scheme (a string that will be used to automatically name all files). You can also name each of them individually, but there’s some complexity (for now) that we explain in Output File Naming Details.
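
The input assumptions above can be checked before you run the toolkit. The sketch below is illustrative only: check_input_assumptions is a hypothetical helper (it is not part of the package), the column names are the defaults listed above, and the "timestamp" column name and sample data are assumptions made for the example.

```python
import pandas as pd


def check_input_assumptions(
    df,
    conversation_id_col="conversation_num",
    speaker_id_col="speaker_nickname",
    message_col="message",
    timestamp_col="timestamp",  # assumed column name for this example
    timestamp_unit="ms",
):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    # The conversation, speaker, and message columns must all be present.
    required = [conversation_id_col, speaker_id_col, message_col]
    missing = [col for col in required if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems

    if timestamp_col in df.columns:
        # Numeric timestamps need an explicit unit; other values are parsed directly.
        if pd.api.types.is_numeric_dtype(df[timestamp_col]):
            times = pd.to_datetime(df[timestamp_col], unit=timestamp_unit)
        else:
            times = pd.to_datetime(df[timestamp_col])

        # Within each conversation, earlier rows should not have later timestamps.
        out_of_order = [
            convo
            for convo, group in times.groupby(df[conversation_id_col])
            if not group.is_monotonic_increasing
        ]
        if out_of_order:
            problems.append(f"conversations out of temporal order: {out_of_order}")

    return problems


ordered_chats = pd.DataFrame({
    "conversation_num": [1, 1, 2],
    "speaker_nickname": ["ana", "ben", "cy"],
    "message": ["hi", "hello!", "anyone here?"],
    "timestamp": [1000, 2000, 500],  # integers, read as milliseconds
})
shuffled_chats = ordered_chats.assign(timestamp=[2000, 1000, 500])

ordered_problems = check_input_assumptions(ordered_chats)
shuffled_problems = check_input_assumptions(shuffled_chats)
```

Here ordered_chats passes because the timestamps only need to increase within each conversation, while shuffled_chats is flagged for conversation 1.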

Custom Features: To save time, we exclude features that require computing sentence vectors by default. To access these features, use the custom_features parameter in your FeatureBuilder:

# Add any of these features depending on your preference.
custom_features = [
        "(BERT) Mimicry",
        "Moving Mimicry",
        "Forward Flow",
        "Discursive Diversity"]
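
To show where this list goes, the sketch below collects FeatureBuilder arguments in a plain dictionary. The parameter names and defaults come from this page. The import path, the input_df argument name, the featurize() call, and the my_chats variable are assumptions to verify against the feature_builder module, so those lines are left commented out. The output_file_base value is a placeholder.

```python
custom_features = [
    "(BERT) Mimicry",
    "Moving Mimicry",
    "Forward Flow",
    "Discursive Diversity",
]

builder_kwargs = {
    "conversation_id_col": "conversation_num",  # default listed on this page
    "speaker_id_col": "speaker_nickname",       # default listed on this page
    "message_col": "message",                   # default listed on this page
    "vector_directory": "vector_data/",         # default listed on this page
    "output_file_base": "my_study",             # placeholder name
    "custom_features": custom_features,
}

# Assumed usage (not confirmed by this page); uncomment after checking the
# feature_builder module and defining my_chats as your pandas DataFrame.
# from team_comm_tools import FeatureBuilder
# feature_builder = FeatureBuilder(input_df=my_chats, **builder_kwargs)
# feature_builder.featurize()
```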

Summarizing Numeric Features: All numeric utterance-level features are summarizable. Aggregations (e.g., “mean level of positivity”) will appear in the Conversation-level data.

Customizable Parameters

Here are some parameters that can be customized. For more details, refer to the Worked Example and feature_builder module.

turns: Combine successive messages by the same individual into a single “turn.”
cumulative_grouping and within_task: Perform nested grouping, analyzing “sub-conversations” within a larger conversation together.
ner_training_df and ner_cutoff: Measure the number of named entities in each utterance (see Named Entity Recognition).
regenerate_vectors: Force-regenerate vector data even if it already exists.
use_gpu: If set to True and a GPU is available, the package will generate sentence vectors (SBERT) and RoBERTa sentiments using the GPU. Defaults to False (which means the package will only use the CPU).
compute_vectors_from_preprocessed: Computes vectors using preprocessed text (that is, with capitalization and punctuation removed). This was the default behavior for v0.1.3 and earlier, but we now default to computing vectors on the unpreprocessed text (which INCLUDES capitalization and punctuation), so this parameter defaults to False.
custom_liwc_dictionary_path: Allows the user to “bring their own” LIWC dictionary, and thereby access more recent versions of the LIWC features. Our default version of LIWC is 2007, but users can obtain more recent versions of the lexicon by contacting Ryan Boyd and Jamie Pennebaker. For more information on using the custom LIWC dictionary, please see Linguistic Inquiry and Word Count (LIWC) and Other Lexicons.
Custom Aggregation of Utterance (Chat)-Level Attributes (convo_aggregation, convo_methods, convo_columns, user_aggregation, user_methods, and user_columns): Customize how attributes at a lower level of analysis (for example, the number of words in a given message) are aggregated to a higher level of analysis (for example, the total number of words in an entire conversation). See the Worked Example (Custom Aggregation) for details.
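
As an illustration of the turns option in the list above, the sketch below merges successive messages by the same speaker within a conversation. It is not the toolkit's implementation: collapse_into_turns is a hypothetical helper, and joining messages with a single space is an assumption.

```python
import pandas as pd


def collapse_into_turns(
    df,
    conversation_id_col="conversation_num",
    speaker_id_col="speaker_nickname",
    message_col="message",
):
    """Merge consecutive rows by the same speaker in the same conversation."""
    # A new turn starts whenever the conversation or the speaker differs
    # from the previous row (the first row always starts a turn).
    new_turn = (
        df[conversation_id_col].ne(df[conversation_id_col].shift())
        | df[speaker_id_col].ne(df[speaker_id_col].shift())
    )
    turn_id = new_turn.cumsum()

    return (
        df.groupby(turn_id.values, sort=False)
        .agg({
            conversation_id_col: "first",
            speaker_id_col: "first",
            message_col: " ".join,  # separator is an assumption
        })
        .reset_index(drop=True)
    )


chats = pd.DataFrame({
    "conversation_num": [1, 1, 1, 1, 2],
    "speaker_nickname": ["ana", "ana", "ben", "ana", "ana"],
    "message": ["hi", "there", "ok", "bye", "yo"],
})
turns = collapse_into_turns(chats)
```

The first two rows collapse into one turn, while the last two stay separate because they belong to different conversations.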

Custom Aggregation Example Usage:

# Aggregate ONLY "positive_bert" at the conversation level, using max and median.
convo_methods = ['max', 'median']
convo_columns = ['positive_bert']
# Aggregate ONLY "negative_bert" at the speaker/user level, using mean.
user_methods = ['mean']
user_columns = ['negative_bert']

To turn off aggregation, set the following parameters to False. By default, both are True as aggregation is performed automatically:

convo_aggregation = False
user_aggregation = False
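
To make the custom aggregation settings above concrete, the sketch below reproduces the same idea with plain pandas on made-up scores. It only illustrates what "max and median of positive_bert per conversation" and "mean of negative_bert per speaker" mean. The sample values are invented, and the column names the toolkit writes to its output files may differ.

```python
import pandas as pd

# Invented utterance-level scores for two conversations.
chat_level = pd.DataFrame({
    "conversation_num": [1, 1, 1, 2, 2],
    "speaker_nickname": ["ana", "ben", "ana", "cy", "cy"],
    "positive_bert": [0.2, 0.8, 0.5, 0.1, 0.3],
    "negative_bert": [0.6, 0.1, 0.2, 0.7, 0.5],
})

# Analogue of convo_methods=['max', 'median'], convo_columns=['positive_bert'].
convo_summary = (
    chat_level.groupby("conversation_num")["positive_bert"].agg(["max", "median"])
)

# Analogue of user_methods=['mean'], user_columns=['negative_bert'].
user_summary = (
    chat_level.groupby(["conversation_num", "speaker_nickname"])["negative_bert"]
    .agg(["mean"])
)
```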

Reducing Redundant Features (drop_redundant_columns, corr_thresh, min_na_ratio, min_zero_ratio, min_group_size, and treat_zero_as_na): New in v0.1.8. The FeatureBuilder can automatically detect groups of highly correlated features and retain only one representative per group, as well as drop columns with too many missing (NA) or zero values. See the Worked Example (Reducing Redundant Features) for details.

Reducing Redundant Features Example Usage:

# By default, drop_redundant_columns is False, so redundant columns are only logged, not removed.
# Set it to True to actually drop them from the output.
drop_redundant_columns = True
corr_thresh = 0.9    # Treat features correlated at >= 0.9 (absolute Spearman) as redundant.
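
For intuition about what corr_thresh controls, the sketch below flags redundant columns using absolute Spearman correlation in plain pandas. It is a simplified greedy illustration, not the toolkit's algorithm: find_redundant_columns is a hypothetical helper, the feature names and values are invented, and the package's actual grouping and choice of representative may differ.

```python
import pandas as pd


def find_redundant_columns(features, corr_thresh=0.9):
    """Keep a column unless it correlates >= corr_thresh with an already kept one."""
    corr = features.corr(method="spearman").abs()
    kept, dropped = [], []
    for col in corr.columns:
        if any(corr.loc[col, other] >= corr_thresh for other in kept):
            dropped.append(col)
        else:
            kept.append(col)
    return kept, dropped


# Invented feature values: word_count and char_count rise together,
# while positivity is only weakly related to either.
features = pd.DataFrame({
    "word_count": [1, 2, 3, 4, 5],
    "char_count": [2, 4, 6, 8, 11],
    "positivity": [5, 1, 4, 2, 3],
})
kept, dropped = find_redundant_columns(features, corr_thresh=0.9)
```

Because char_count is a monotonic function of word_count in this sample, their Spearman correlation is 1 and only the first of the two is kept.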