.. _basics:

The Basics (Get Started Here!)
================================

A Light-Touch, One-Function Package
*************************************

The Team Communication Toolkit is designed to be a **light-touch package**. This means you should need minimal lines of code to get from text data to structured communication features. We have defaults and sample code to help you quickly run the toolkit on your data.

However, we understand that you might have special requirements and need to customize features. Therefore, we offer adjustable "knobs" in the FeatureBuilder (:ref:`feature_builder`).

This overview will provide you with a high-level understanding of the key inputs and assumptions of our toolkit. After reading, refer to the walkthrough in the :ref:`examples` for a detailed discussion.

Demo / Sample Code
*******************

We have provided a simple example file, "featurize.py", and a demo notebook, "demo.ipynb", under our `examples folder <https://github.com/Watts-Lab/team_comm_tools/tree/main/examples>`_ on GitHub.

We also have demos available on Google Colab that you can copy and run on your own:

- `Demo 1: Overview of Team Communication Toolkit and 3 Levels of Features <https://colab.research.google.com/drive/1e8D5h_prRJsGs_N563EvpoQK0uZIAYsJ?usp=sharing>`_

- `Demo 2: Sample Analysis with the Group Affect and Performance Corpus <https://colab.research.google.com/drive/1wnuUC5yg6uQH0TYP1AXVPGRgfp2-npgJ?usp=sharing>`_


Key Assumptions and Parameters
*******************************

Package Assumptions 
++++++++++++++++++++

1. **Pandas DataFrame**: Your input should be a `Pandas dataframe <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_.

2. **Unique Conversation Identifier**: Each conversation in your dataframe needs a unique identifier (defined by ``conversation_id_col``), or it can be generated by grouping multiple columns (defined by ``grouping_keys``).

   * ``conversation_id_col`` defaults to ``"conversation_num"``.
   * If ``grouping_keys`` are provided, they override the conversation identifier.

3. **Unique Speaker Identifier**: Each speaker in the conversation should have a unique identifier (defined by ``speaker_id_col``).

   * ``speaker_id_col`` defaults to ``"speaker_nickname"``.

4. **Single Utterance Column**: The text of a single utterance should be in one column in the dataframe (defined by ``message_col``).

   * ``message_col`` defaults to ``"message"``.
   * Ensure you've segmented and preprocessed conversational transcripts into "utterances" or "turns". We do not accept open-ended transcripts.

5. **Temporal Order**: Messages should be in **temporal order**. Earlier rows are assumed to be utterances that occurred before later rows.

6. **Timestamps**: If each message has a timestamp, it should be stored either in a single column or in two columns (a start time and an end time). Accordingly, ``timestamp_col`` can be a string (one column name) or a tuple of ``(start, end)``, with the first item as the start time column and the second item as the end time column.

7. **Timestamp Unit**: We accept all timestamp formats compatible with `pandas.to_datetime <https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html>`_. If your timestamp is an integer or float number, we default to treating the units as milliseconds ('ms'). This default can be changed using the ``timestamp_unit`` parameter; acceptable values are ``(D, s, ms, us, ns)``.

8. **Metadata Columns**: Columns not required as inputs (conversation identifier, speaker identifier, message, and timestamp column(s)) are assumed to be metadata and won't be summarized in the featurization process.

9. **Vector Data Cache**: The vectors computed from your data will be cached in ``vector_directory``. This directory will be created if it doesn't exist, but its contents should be reserved for cached vector files.
   
   * This parameter defaults to "vector_data/".

   * Note: v0.1.3 and earlier compute vectors using *preprocessed* text by default, which drops capitalization and punctuation. However, this can affect the interpretation of sentiment vectors; for example, "Hello!" has more positive sentiment than "hello." Consequently, from v0.1.4 onwards, we compute vectors using the raw input text, including punctuation and capitalization. To restore the earlier behavior (vectors computed from preprocessed text), set ``compute_vectors_from_preprocessed`` to True.

   * Additionally, we treat empty messages as equivalent to the "NaN vector", defined `here <https://raw.githubusercontent.com/Watts-Lab/team_comm_tools/refs/heads/main/src/team_comm_tools/features/assets/nan_vector.txt>`_.

10. **Output File Base**: We generate three output files, one for each level of analysis (Utterance/Chat, Speaker/User, and Conversation). We recommend using the ``output_file_base`` parameter to give them all a common naming scheme (a string that will be used to automatically name all files). You can also name each of them individually, but there's some complexity (for now) that we explain in :ref:`output_file_details`.

11. **Custom Features**: To save time, we exclude features that require computing sentence vectors by default. To access these features, use the ``custom_features`` parameter in your FeatureBuilder:

    .. code-block:: python

       # Add any of these features depending on your preference.
       custom_features = [
               "(BERT) Mimicry",
               "Moving Mimicry",
               "Forward Flow",
               "Discursive Diversity"]

12. **Summarizing Numeric Features**: All numeric utterance-level features are **summarizable**. Aggregations (e.g., "mean level of positivity") will appear in the Conversation-level data.
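
As a rough illustration of the input assumptions above, the sketch below builds a tiny DataFrame that uses the default column names and checks a few of the assumptions with plain pandas. The ``timestamp`` column name, the sample values, and the ``check_basic_assumptions`` helper are hypothetical and are not part of the toolkit. The commented-out FeatureBuilder call at the end is an assumption about typical usage; see :ref:`feature_builder` for the actual signature.

.. code-block:: python

     import pandas as pd

     # Toy data: one row per utterance, already in temporal order.
     chat_df = pd.DataFrame({
         "conversation_num": [1, 1, 1, 2, 2],
         "speaker_nickname": ["alice", "bob", "alice", "carol", "dave"],
         "message": ["Hello!", "Hi, how are you?", "Good, thanks.", "Ready?", "Yes."],
         # Numeric timestamps; the toolkit treats these as milliseconds by default.
         "timestamp": [0, 1500, 4000, 0, 2500],
     })

     def check_basic_assumptions(df, conversation_id_col="conversation_num",
                                 speaker_id_col="speaker_nickname",
                                 message_col="message", timestamp_col=None,
                                 timestamp_unit="ms"):
         """Hypothetical helper: return a list of problems found in df."""
         required = [conversation_id_col, speaker_id_col, message_col]
         if timestamp_col is not None:
             required.append(timestamp_col)
         missing = [col for col in required if col not in df.columns]
         if missing:
             return [f"missing columns: {missing}"]

         problems = []
         if timestamp_col is not None:
             # Only numeric timestamps need a unit; strings are parsed directly.
             is_numeric = pd.api.types.is_numeric_dtype(df[timestamp_col])
             unit = timestamp_unit if is_numeric else None
             times = pd.to_datetime(df[timestamp_col], unit=unit)
             # Within each conversation, a later row should not be earlier in time.
             gaps = times.groupby(df[conversation_id_col]).diff()
             if (gaps.dropna() < pd.Timedelta(0)).any():
                 problems.append("rows are not in temporal order")
         return problems

     print(check_basic_assumptions(chat_df, timestamp_col="timestamp"))

     # Assumed usage (check the FeatureBuilder docs for the exact signature):
     # from team_comm_tools import FeatureBuilder
     # FeatureBuilder(input_df=chat_df, timestamp_col="timestamp",
     #                output_file_base="toy_output").featurize()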

Customizable Parameters
++++++++++++++++++++++++

Here are some parameters that can be customized. For more details, refer to the :ref:`examples` and :ref:`feature_builder`.

1. ``turns``: Combine successive messages by the same individual into a single "turn."

2. ``cumulative_grouping`` and ``within_task``: Perform nested grouping, analyzing "sub-conversations" within a larger conversation together.

3. ``ner_training_df`` and ``ner_cutoff``: Measure the number of named entities in each utterance (see :ref:`named_entity_recognition`).

4. ``regenerate_vectors``: Force-regenerate vector data even if it already exists.

5. ``use_gpu``: If set to True and a GPU is available, the package will generate sentence vectors (SBERT) and RoBERTa sentiments using the GPU. Defaults to False (which means the package will only use the CPU).

6. ``compute_vectors_from_preprocessed``: Computes vectors using preprocessed text (that is, with capitalization and punctuation removed). This was the default behavior for v0.1.3 and earlier, but we now default to computing vectors on the unpreprocessed text (which **includes** capitalization and punctuation), so this parameter now defaults to False.

7. ``custom_liwc_dictionary_path``: Allows the user to "bring their own" LIWC dictionary, and thereby access more recent versions of the LIWC features. Our default version of LIWC is 2007, but users can obtain more recent versions of the lexicon by contacting `Ryan Boyd <https://www.ryanboyd.io/>`_ and `Jamie Pennebaker <https://liberalarts.utexas.edu/psychology/faculty/pennebak>`_. For more information on using the custom LIWC dictionary, please see :ref:`liwc`.

8. **Custom Aggregation of Utterance (Chat)-Level Attributes** (``convo_aggregation``, ``convo_methods``, ``convo_columns``, ``user_aggregation``, ``user_methods``, and ``user_columns``): Customize the ways in which attributes at a lower level of analysis (for example, the number of words in a given message) get aggregated to a higher level of analysis (for example, the total number of words in an entire conversation). See the Worked Example (:ref:`custom_aggregation`) for details.

Custom Aggregation Example Usage:

.. code-block:: python

     # Aggregate ONLY "positive_bert" at the conversation level, using max and median.
     convo_methods = ['max', 'median']
     convo_columns = ['positive_bert']
     # Aggregate ONLY "negative_bert" at the speaker/user level, using mean.
     user_methods = ['mean']
     user_columns = ['negative_bert']
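
To make the intent of these settings concrete, the following sketch performs the same kind of aggregation with plain pandas on a toy utterance-level table. It illustrates the concept only and is not the toolkit's implementation; the sample values are made up, and the column names in the toolkit's actual output may differ.

.. code-block:: python

     import pandas as pd

     # Toy utterance-level features: one row per message.
     chats = pd.DataFrame({
         "conversation_num": [1, 1, 1, 2, 2],
         "speaker_nickname": ["alice", "bob", "alice", "carol", "carol"],
         "positive_bert": [0.2, 0.8, 0.5, 0.1, 0.3],
         "negative_bert": [0.6, 0.1, 0.2, 0.7, 0.5],
     })

     # Conversation level: max and median of "positive_bert" only.
     convo_summary = (
         chats.groupby("conversation_num")["positive_bert"].agg(["max", "median"])
     )

     # Speaker/user level: mean of "negative_bert" only,
     # computed per speaker within each conversation.
     user_summary = (
         chats.groupby(["conversation_num", "speaker_nickname"])["negative_bert"]
         .agg(["mean"])
     )

     print(convo_summary)
     print(user_summary)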

To turn off aggregation, set the following parameters to ``False``. By default, both are ``True`` as aggregation is performed automatically:

.. code-block:: python

     convo_aggregation = False
     user_aggregation = False

9. **Reducing Redundant Features** (``drop_redundant_columns``, ``corr_thresh``, ``min_na_ratio``, ``min_zero_ratio``, ``min_group_size``, and ``treat_zero_as_na``): **New in v0.1.8.** The FeatureBuilder can automatically detect groups of highly correlated features and retain only one representative per group, as well as drop columns with too many missing (NA) or zero values. See the Worked Example (:ref:`reducing_redundant_features`) for details.

Reducing Redundant Features Example Usage:

.. code-block:: python

     # By default, drop_redundant_columns is False, so redundant columns are only logged, not removed.
     # Set it to True to actually drop them from the output.
     drop_redundant_columns = True
     corr_thresh = 0.9    # Treat features correlated at >= 0.9 (absolute Spearman) as redundant.
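
The idea behind correlation-based pruning can be sketched in a few lines of pandas. This is a simplified illustration under stated assumptions, not the FeatureBuilder's algorithm: the ``find_redundant`` helper, its greedy keep-first rule, and the toy feature names are all hypothetical.

.. code-block:: python

     import pandas as pd

     def find_redundant(features, corr_thresh=0.9):
         """Hypothetical helper: keep the first column of each correlated group."""
         corr = features.corr(method="spearman").abs()
         kept, dropped = [], []
         for col in features.columns:
             # Drop the column if it is highly correlated with one already kept.
             if any(corr.loc[col, k] >= corr_thresh for k in kept):
                 dropped.append(col)
             else:
                 kept.append(col)
         return kept, dropped

     features = pd.DataFrame({
         "num_words": [3, 10, 7, 1, 5],
         "num_chars": [14, 52, 33, 6, 27],   # same rank order as num_words
         "positivity": [0.9, 0.1, 0.5, 0.2, 0.4],
     })

     kept, dropped = find_redundant(features, corr_thresh=0.9)
     print("kept:", kept)
     print("dropped:", dropped)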