Information Diversity

High-Level Intuition

This conversation-level feature measures the diversity of different topics discussed in a conversation.

Citation

Riedl and Woolley (2018)

Implementation Basics

Preprocessing and Generating Topics. We first preprocess the data by ensuring all utterance are in lowercase, lemmatized, and by removing stop words and words shorter than three caracters. We then use the gensim package to create an LDA model for each conversation, generating a corresponding topic space with its in which the number of dimensions equals the number of topics.

Determining Number of Topics. To determine the number of topics used, we use the square root of the number of utterances (rows) in the conversation, rounded to the nearest integer.

Computing the Measure. A team’s information diversity is then computed by examining the average cosine distance (where cosine distance is defined as 1 - cosine similarity) between the “topic vector” associated with a given utterance and the mean topic vector across the entire conversation.

Implementation Notes/Caveats

As of September 9, 2024, this feature uses a LDA-based topic model. However, because LDA is stochastic, it does not generate consistent results. An updated version of this feature with more stable topic extraction is currently under development.

Interpreting the Feature

The output ranges between 0 and 1, with higher values indicating a higher level of diversity in the topics discussed throughout the conversation. As discussed in Riedl and Woolley (2018), typical information diversity values are quite small, with the paper having a mean score of 0.04 and standard deviation of 0.05.