information_diversity module

features.information_diversity.calculate_ID_score(doc_topics, num_topics)

Computes the information diversity score proposed by Riedl & Woolley (2017). The function determines a topic vector for every message using an LDA model, computes the mean topic vector across all messages, and measures the average squared cosine distance (dissimilarity) between each message vector and the mean vector. Higher values indicate greater topical diversity.

Implements Eq. (1) of Riedl & Woolley (2017): ID = sum((1 - cos(d_j, M))^2) / N, where cos(d_j, M) is the cosine similarity between message vector d_j and the mean topic vector M. Note that scipy’s cosine() already returns the cosine distance (1 - similarity), so it maps directly onto the (1 - cos(d_j, M)) term and must NOT be subtracted from 1 again (doing so would yield similarity, inverting the metric).

Source: https://ssrn.com/abstract=2384068

Parameters:

doc_topics (list) – The list of topic vectors for the team’s chat messages, as produced by the LDA model.
num_topics (int) – the number of topics; set to be the square root of the number of rows, rounded to the nearest integer (this is a design decision on our part to be robust to datasets of varying sizes).

Returns:

The information diversity score, given the list of topic vectors and the number of topics.

Return type:

float
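
The formula above can be illustrated with a minimal, dependency-free sketch. It is not the module's implementation: it assumes each message is already a dense, non-zero topic vector (a plain list of floats), so it has no need for num_topics, and the names cosine_distance and id_score_sketch are hypothetical. The helper computes the same quantity that scipy’s cosine() returns, the distance 1 - similarity, and squares it directly.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v), the quantity scipy's cosine() returns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def id_score_sketch(topic_vectors):
    """Mean squared cosine distance between each vector and the mean vector."""
    if not topic_vectors:
        # Assumption: empty input falls back to 0.
        return 0.0
    n = len(topic_vectors)
    dim = len(topic_vectors[0])
    # Mean topic vector M, computed component by component.
    mean_vec = [sum(vec[k] for vec in topic_vectors) / n for k in range(dim)]
    # The distance is squared as-is; it is NOT subtracted from 1 again.
    return sum(cosine_distance(vec, mean_vec) ** 2 for vec in topic_vectors) / n


identical = [[0.5, 0.5], [0.5, 0.5]]  # every message has the same topic mix
spread = [[1.0, 0.0], [0.0, 1.0]]     # each message covers a different topic
print(id_score_sketch(identical))
print(id_score_sketch(spread))
```

Identical topic vectors give a score of (numerically) zero, while the two orthogonal vectors give (1 - 1/√2)², roughly 0.086. If topic vectors are non-negative, as topic probabilities are, each cosine distance lies between 0 and 1, and so does the score.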

features.information_diversity.get_info_diversity(df, conversation_id_col, message_col)

Computes information diversity (value between 0 and 1 inclusive) for all conversations.

Parameters:

df (pd.DataFrame) – The utterance (chat)-level dataframe.
conversation_id_col (str) – This is a string with the name of the column containing the unique identifier of a conversation.
message_col (str) – This is a string with the name of the column containing the message / text.

Returns:

the grouped conversational dataframe, with a new column (“info_diversity”) representing the conversation’s information diversity score.

Return type:

pd.DataFrame

features.information_diversity.info_diversity(df, message_col)

Preprocesses the data and then creates a numeric mapping of the words in the dataset to pass into the LDA model. Uses the square root of the number of rows as the number of topics.

Parameters:

df (pd.DataFrame) – The input dataframe, grouped by the conversation index, to which this function is being applied.
message_col (str) – This is a string with the name of the column containing the message / text.

Returns:

The information diversity score, obtained from calling calculate_ID_score on the chat’s topics; defaults to zero in case of empty data

Return type:

float
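
As a rough illustration of the "numeric mapping of words" and the topic-count rule, the sketch below builds a bag-of-words representation by hand. It is a stand-in under stated assumptions, not the module's code: the function names are hypothetical, the messages are assumed to be tokenized already, and the floor of one topic in choose_num_topics is an added safeguard rather than documented behavior.

```python
import math


def build_vocabulary(tokenized_messages):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_messages:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(tokens, vocab):
    """Return sorted (token_id, count) pairs for one message."""
    counts = {}
    for token in tokens:
        token_id = vocab[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


def choose_num_topics(num_rows):
    """Square root of the row count, rounded to the nearest integer.

    Assumption: never fewer than one topic.
    """
    return max(1, round(math.sqrt(num_rows)))


tokenized = [["budget", "meeting", "budget"], ["meeting", "schedule"]]
vocab = build_vocabulary(tokenized)
corpus = [to_bag_of_words(tokens, vocab) for tokens in tokenized]
print(vocab)
print(corpus)
print(choose_num_topics(9), choose_num_topics(10))
```

A topic model would then be trained on a corpus of this shape, with the number of topics growing slowly as the conversation gets longer.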

features.information_diversity.preprocessing(data)

Preprocesses the data by lowercasing, lemmatizing, and removing stopwords and words shorter than 4 characters.

Parameters:

data (str) – The utterance being analyzed (in this case, preprocessed for the LDA model.)

Returns:

A list of lemmatized text with stopwords and shorter words removed.

Return type:

list
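
The filtering steps can be sketched without external dependencies as follows. This is an approximation under several assumptions, not the module's code: the name preprocess_sketch, the regex tokenizer, and the tiny STOPWORDS set are all hypothetical, the order of filtering and lemmatizing is a guess, and lemmatization is left as a pluggable callable that defaults to a no-op, since a real lemmatizer requires a third-party library.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a full one.
STOPWORDS = {"the", "and", "with", "this", "that", "from", "have", "were"}


def preprocess_sketch(text, lemmatize=lambda word: word, min_length=4):
    """Lowercase, tokenize, drop stopwords and short words, then lemmatize."""
    # Assumption: tokens are runs of ASCII letters in the lowercased text.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = []
    for token in tokens:
        # Skip stopwords and words with fewer than min_length characters.
        if token in STOPWORDS or len(token) < min_length:
            continue
        kept.append(lemmatize(token))
    return kept


result = preprocess_sketch("The team discussed this budget with the new manager")
print(result)  # ['team', 'discussed', 'budget', 'manager']
```

Passing a real lemmatizer as the lemmatize argument would additionally reduce inflected forms such as "discussed" to their base form.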