information_diversity module

features.information_diversity.calculate_ID_score(doc_topics, num_topics)

Computes the information diversity score suggested by Riedl & Woolley (2017). The function determines a topic vector for every message using an LDA model, computes the mean topic vector across all messages, and measures the average cosine similarity between the message vectors and the mean vector.

Source: https://www.circlelytics.com/wp-content/uploads/2022/05/Riedl-Woolley-2017-Teams-vs-Crowds-A-field-test-of-the-realitive-contribution-of-incentives-member-abilities.pdf

Parameters:
  • doc_topics (list) – the list of topic vectors for the team’s chat messages, as produced by the LDA model.

  • num_topics (int) – the number of topics; set to be the square root of the number of rows, rounded to the nearest integer (this is a design decision on our part to be robust to datasets of varying sizes).

Returns:

The information diversity score, given the list of topic vectors and the number of topics.

Return type:

float
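
The computation described above can be sketched in plain Python. This is a minimal illustration that follows the description only, assuming each topic vector is a dense list of floats of equal length; the helper names (mean_vector, cosine_similarity, average_similarity_to_mean) are hypothetical and are not part of this module. Any further transform the module applies to the similarities is not stated here and is not modelled.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_similarity_to_mean(topic_vectors):
    """Average cosine similarity between each vector and the mean vector.

    Returns 0.0 for empty input (an assumption made for this sketch).
    """
    if not topic_vectors:
        return 0.0
    centre = mean_vector(topic_vectors)
    similarities = [cosine_similarity(v, centre) for v in topic_vectors]
    return sum(similarities) / len(similarities)


# Two messages on entirely different topics versus two on the same topic.
different = average_similarity_to_mean([[1.0, 0.0], [0.0, 1.0]])
same = average_similarity_to_mean([[1.0, 0.0], [1.0, 0.0]])
```

In this sketch, messages that share a topic sit closer to the mean vector than messages that do not, so same is larger than different.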

features.information_diversity.get_info_diversity(df, conversation_id_col, message_col)

Computes information diversity (value between 0 and 1 inclusive) for all conversations.

Parameters:
  • df (pd.DataFrame) – The utterance (chat)-level dataframe.

  • conversation_id_col (str) – This is a string with the name of the column containing the unique identifier of a conversation.

  • message_col (str) – This is a string with the name of the column containing the message / text.

Returns:

the grouped conversational dataframe, with a new column (“info_diversity”) representing the conversation’s information diversity score.

Return type:

pd.DataFrame
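
The per-conversation shape of this function can be illustrated without pandas. The sketch below is a standard-library stand-in for the grouping step, not the module's implementation: rows are plain dictionaries, score_per_conversation and placeholder_score are hypothetical names, the column names are made up for the example, and the scores returned are arbitrary placeholders rather than real information diversity values.

```python
from collections import defaultdict


def score_per_conversation(rows, conversation_id_col, message_col, score_fn):
    """Group messages by conversation and attach one score per conversation."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[conversation_id_col]].append(row[message_col])
    return [
        {conversation_id_col: conv_id, "info_diversity": score_fn(messages)}
        for conv_id, messages in grouped.items()
    ]


def placeholder_score(messages):
    """Placeholder scorer (NOT the real metric): 0.0 for fewer than two messages."""
    return 0.0 if len(messages) < 2 else 0.5


# Hypothetical utterance-level data: one row per chat message.
rows = [
    {"conversation_num": "A", "message": "hello there"},
    {"conversation_num": "A", "message": "what is the plan"},
    {"conversation_num": "B", "message": "single message"},
]

result = score_per_conversation(
    rows, "conversation_num", "message", placeholder_score
)
```

The result holds one entry per conversation, which mirrors the grouped dataframe with its “info_diversity” column.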

features.information_diversity.info_diversity(df, message_col)

Preprocesses the data and then creates a numeric mapping of the words in the dataset to pass into the LDA model. Uses the square root of the number of rows as the number of topics.

Parameters:
  • df (pd.DataFrame) – The input dataframe, grouped by the conversation index, to which this function is being applied.

  • message_col (str) – This is a string with the name of the column containing the message / text.

Returns:

The information diversity score, obtained from calling calculate_ID_score on the chat’s topics; defaults to zero in case of empty data

Return type:

float
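
The two preparation steps named above, the numeric word mapping and the choice of topic count, can be sketched as follows. This is a standard-library illustration under stated assumptions: build_vocabulary, to_bag_of_words, and choose_num_topics are hypothetical helpers and do not reproduce the library calls the module uses to build its LDA input.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_messages):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_messages:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(tokens, vocab):
    """Represent one message as sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


def choose_num_topics(num_rows):
    """Square root of the row count, rounded to the nearest integer."""
    return round(math.sqrt(num_rows))


# Hypothetical, already preprocessed messages from one conversation.
messages = [["budget", "deadline", "budget"], ["deadline", "design"]]
vocab = build_vocabulary(messages)
corpus = [to_bag_of_words(m, vocab) for m in messages]
num_topics = choose_num_topics(len(messages))
```

An empty conversation yields an empty vocabulary and corpus, which is the case in which the score defaults to zero.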

features.information_diversity.preprocessing(data)

Preprocesses the data by lowercasing, lemmatizing, and removing stopwords and words shorter than 4 characters.

Parameters:

data (str) – The utterance being analyzed (here, the text to be preprocessed for the LDA model).

Returns:

A list of lemmatized tokens, with stopwords and words shorter than 4 characters removed.

Return type:

list
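
A rough sketch of these preprocessing steps, using only the standard library, is shown below. It rests on several assumptions: the stopword set is a tiny hypothetical sample, the lemmatizer is a pluggable function that defaults to leaving tokens unchanged, and filtering is applied before lemmatization. The function name preprocess is illustrative and is not the module's preprocessing function.

```python
import re

# Tiny hypothetical stopword sample; a real list would be far longer.
STOPWORDS = {"this", "that", "with", "have", "from", "about"}


def preprocess(text, lemmatize=lambda word: word, min_length=4):
    """Lowercase, tokenize, drop stopwords and short words, then lemmatize.

    `lemmatize` is a stand-in hook; by default tokens pass through unchanged.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = []
    for token in tokens:
        if token in STOPWORDS or len(token) < min_length:
            continue
        kept.append(lemmatize(token))
    return kept


tokens = preprocess("We have to ship THIS design by Friday")

# A toy lemmatizer that strips trailing "s" characters, for illustration only.
lemmatized = preprocess("The Reports", lemmatize=lambda word: word.rstrip("s"))
```

With the default hook the first call keeps only “ship”, “design”, and “friday”; the second call shows where a real lemmatizer would plug in.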