information_diversity module
- features.information_diversity.calculate_ID_score(doc_topics, num_topics)
Computes the information diversity score as suggested in Riedl & Woolley (2017): determines a topic vector for every message using an LDA model, computes the mean topic vector across all messages, and measures the average cosine similarity between the message vectors and the mean vector.
- Parameters:
doc_topics (list) – The list of topic vectors produced by the LDA model for the team’s chat messages.
num_topics (int) – The number of topics, set to the square root of the number of rows, rounded to the nearest integer (a design decision on our part to be robust to datasets of varying sizes).
- Returns:
The information diversity score, given the list of topic vectors and the number of topics.
- Return type:
float
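The steps described for calculate_ID_score can be illustrated with a minimal sketch. It assumes each topic vector is a dense, equal-length list of floats; the helper names cosine_similarity and mean_similarity_to_centroid are hypothetical and not part of this module, and the sketch stops at the average similarity without claiming how the library maps it to the final score.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity_to_centroid(topic_vectors):
    """Average cosine similarity between each vector and the mean vector.

    Hypothetical helper: assumes dense vectors of identical length.
    """
    if not topic_vectors:
        return 0.0
    n = len(topic_vectors)
    dim = len(topic_vectors[0])
    # Element-wise mean across all message vectors.
    mean_vector = [sum(vec[i] for vec in topic_vectors) / n for i in range(dim)]
    return sum(cosine_similarity(vec, mean_vector) for vec in topic_vectors) / n


# Messages with identical topic mixes sit exactly on the mean vector.
identical = [[0.5, 0.5], [0.5, 0.5]]
# Messages on disjoint topics sit further from the mean vector.
spread = [[1.0, 0.0], [0.0, 1.0]]
```

In this sketch, a conversation whose messages share one topic mix yields an average similarity of about 1, while messages spread over disjoint topics yield a lower value.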
- features.information_diversity.get_info_diversity(df, conversation_id_col, message_col)
Computes information diversity (value between 0 and 1 inclusive) for all conversations.
- Parameters:
df (pd.DataFrame) – The utterance (chat)-level dataframe.
conversation_id_col (str) – This is a string with the name of the column containing the unique identifier of a conversation.
message_col (str) – This is a string with the name of the column containing the message / text.
- Returns:
The grouped conversational dataframe, with a new column (“info_diversity”) representing the conversation’s information diversity score.
- Return type:
pd.DataFrame
- features.information_diversity.info_diversity(df, message_col)
Preprocesses the data and creates a numeric mapping of the words in the dataset to pass into the LDA model. Uses the square root of the number of rows as the number of topics.
- Parameters:
df (pd.DataFrame) – The input dataframe, grouped by the conversation index, to which this function is being applied.
message_col (str) – This is a string with the name of the column containing the message / text.
- Returns:
The information diversity score, obtained by calling calculate_ID_score on the chat’s topics; defaults to zero if the data is empty.
- Return type:
float
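The topic-count rule (square root of the number of rows, rounded to the nearest integer) can be sketched as follows. The function name choose_num_topics is hypothetical and only illustrates the arithmetic; it makes no claim about how the module handles very small or empty inputs beyond what is stated above.

```python
import math


def choose_num_topics(num_rows):
    """Square root of the row count, rounded to the nearest integer.

    Hypothetical helper illustrating the documented rule only.
    """
    return round(math.sqrt(num_rows))


# A 16-message conversation would get 4 topics under this rule,
# and a 10-message conversation would get 3 (sqrt(10) is about 3.16).
small = choose_num_topics(10)
larger = choose_num_topics(16)
```

Because the topic count grows with the square root of the row count, longer conversations get more topics without the count scaling linearly with size.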
- features.information_diversity.preprocessing(data)
Preprocesses the data by lowercasing, lemmatizing, and removing stopwords and words shorter than 4 characters.
- Parameters:
data (str) – The utterance being analyzed (here, preprocessed for the LDA model).
- Returns:
A list of lemmatized tokens with stopwords and short words removed.
- Return type:
list
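A rough sketch of this kind of preprocessing is shown below. It is not the module's implementation: the stopword set is a tiny illustrative sample, the tokenizer is a simple regular expression, and lemmatization is left as a pluggable callable (identity by default) so the example runs without an NLP library. The names preprocess_sketch, STOPWORDS, lemmatize, and min_length are all assumptions introduced here.

```python
import re

# Tiny illustrative stopword sample (assumption); a real pipeline would use a fuller list.
STOPWORDS = {"this", "that", "with", "from", "have", "were"}


def preprocess_sketch(text, lemmatize=lambda token: token, min_length=4):
    """Lowercase, tokenize, drop stopwords and short tokens, then lemmatize.

    `lemmatize` is a hypothetical hook: pass in a real lemmatizer's function
    to replace the identity default.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = []
    for token in tokens:
        # Skip tokens shorter than min_length and tokens in the stopword sample.
        if len(token) < min_length or token in STOPWORDS:
            continue
        kept.append(lemmatize(token))
    return kept


tokens = preprocess_sketch("This is a Sample message with Topics")
```

Keeping the lemmatizer as a parameter makes the filtering logic easy to test on its own; the order of filtering and lemmatizing in the actual module may differ.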