check_embeddings module

utils.check_embeddings.check_embeddings(chat_data: DataFrame, vect_path: str, bert_path: str, need_sentence: bool, need_sentiment: bool, regenerate_vectors: bool, message_col: str = 'message')

Check if embeddings and required lexicons exist, and generate them if they don’t.

This function ensures the necessary vector and BERT embeddings are available. It also checks for the presence of certainty and lexicon files, generating them if needed.

Parameters:
  • chat_data (pd.DataFrame) – Dataframe containing chat data

  • vect_path (str) – Path to the vector embeddings file (by default, SBERT vectors, with one embedding per utterance)

  • bert_path (str) – Path to the RoBERTa sentiment inference output file

  • need_sentence (bool) – Whether at least one feature will require SBERT vectors; we will not need to calculate them otherwise.

  • need_sentiment (bool) – Whether at least one feature will require the RoBERTa sentiments; we will not need to calculate them otherwise.

  • regenerate_vectors (bool) – If True, regenerates the vector data even if it already exists

  • message_col (str, optional) – A string representing the column name that should be selected as the message. Defaults to “message”.

Returns:

None

Return type:

None
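The "generate only if missing, or if regeneration is forced" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: `ensure_file` and `fake_generate` are hypothetical names, and the real function may perform additional checks on existing files.

```python
import os
import tempfile


def ensure_file(path, generate, needed=True, regenerate=False):
    """Call generate(path) when the file is needed and is missing or must be rebuilt.

    Hypothetical helper: returns True if the file was (re)generated.
    """
    if not needed:
        # No feature requires this output, so skip the computation entirely.
        return False
    if regenerate or not os.path.isfile(path):
        generate(path)
        return True
    return False


def fake_generate(path):
    # Hypothetical stand-in for an expensive generator such as generate_vect.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("placeholder\n")


with tempfile.TemporaryDirectory() as tmp:
    vect_path = os.path.join(tmp, "vectors.csv")
    first = ensure_file(vect_path, fake_generate)                    # file missing
    second = ensure_file(vect_path, fake_generate)                   # file present
    forced = ensure_file(vect_path, fake_generate, regenerate=True)  # forced rebuild
    skipped = ensure_file(os.path.join(tmp, "bert.csv"), fake_generate, needed=False)
```

In this sketch, `needed` plays the role of need_sentence / need_sentiment and `regenerate` the role of regenerate_vectors.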

utils.check_embeddings.fix_abbreviations(dicTerm: str) → str

Helper function to fix abbreviations that contain punctuation. Source: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L714

This function goes over a list of hardcoded exceptions for the tokenizer / sentence parser built into LIWC so that it doesn’t convert them into separate strings (e.g., we want “i.e.” to not be seen as two words and two sentences [i, e]).

Parameters:

dicTerm (str) – The lexicon term

Returns:

dicTerm

Return type:

str
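One way to keep a tokenizer from splitting an abbreviation is to swap its punctuation for a placeholder before tokenizing. The sketch below illustrates that idea only; the `PROTECTED` table, the `<dot>` placeholder, and the name `protect_abbreviations` are all assumptions and do not reflect the actual exception list or replacement scheme used by fix_abbreviations.

```python
# Hypothetical exception table: abbreviation -> form that is safe to tokenize.
PROTECTED = {
    "i.e.": "i<dot>e<dot>",
    "e.g.": "e<dot>g<dot>",
}


def protect_abbreviations(term):
    """Replace each known abbreviation in the term with its placeholder form."""
    for abbreviation, placeholder in PROTECTED.items():
        term = term.replace(abbreviation, placeholder)
    return term


protected = protect_abbreviations("i.e.")
unchanged = protect_abbreviations("hello")
```

After this step the term contains no periods, so a splitter that breaks on "." sees a single token.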

utils.check_embeddings.generate_bert(chat_data, output_path, message_col, batch_size=64)

Generates RoBERTa sentiment scores for the given chat data and saves them to a CSV file.

Parameters:
  • chat_data (pd.DataFrame) – Contains message data to be analyzed for sentiments.

  • output_path (str) – Path to save the CSV file containing sentiment scores.

  • message_col (str) – The name of the column containing the messages.

  • batch_size (int, optional) – The number of messages in each batch processed for sentiment analysis. Defaults to 64.

Raises:

FileNotFoundError – If the output path is invalid.

Returns:

None

Return type:

None
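The batching described by batch_size can be sketched as below. This is only an illustration under stated assumptions: `batched` and `fake_scores` are hypothetical, the column names are placeholders, and an in-memory buffer stands in for the CSV file at output_path. No model is loaded here.

```python
import csv
import io


def batched(items, batch_size=64):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_scores(texts):
    # Hypothetical stand-in for the sentiment model: one row of scores per text.
    return [{"positive": 0.0, "negative": 0.0, "neutral": 1.0} for _ in texts]


messages = [f"message {i}" for i in range(150)]
batch_sizes = []

buffer = io.StringIO()  # stands in for the CSV file at output_path
writer = csv.DictWriter(buffer, fieldnames=["positive", "negative", "neutral"])
writer.writeheader()
for batch in batched(messages, batch_size=64):
    batch_sizes.append(len(batch))
    writer.writerows(fake_scores(batch))

line_count = len(buffer.getvalue().strip().splitlines())
```

With 150 messages and a batch size of 64, the last batch is smaller than the others, and the output holds one header line plus one row per message.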

utils.check_embeddings.generate_certainty_pkl()

Helper function for generating the pickle file containing the certainty lexicon.

This function reads in lexicon files from a specified directory, processes the content, and appends the cleaned lexicon patterns to a dictionary.

Parameters:

None. This function takes no arguments; directory (Path), the directory containing the lexicon files, and lexicons_dict (dict), the dictionary that stores the processed lexicon patterns, are the arguments of read_in_lexicons.

Returns:

None

Return type:

None

utils.check_embeddings.generate_lexicon_pkl()

Helper function for generating the pickle file containing lexicons.

This function reads in lexicon files from a specified directory, processes the content, and appends the cleaned lexicon patterns to a dictionary.

Parameters:

None. This function takes no arguments; directory (Path), the directory containing the lexicon files, and lexicons_dict (dict), the dictionary that stores the processed lexicon patterns, are the arguments of read_in_lexicons.

Returns:

None

Return type:

None

utils.check_embeddings.generate_vect(chat_data, output_path, message_col, batch_size=64)

Generates sentence vectors for the given chat data and saves them to a CSV file.

Parameters:
  • chat_data (pd.DataFrame) – Contains message data to be vectorized.

  • output_path (str) – Path to save the CSV file containing message embeddings.

  • message_col (str) – The name of the column containing the messages.

  • batch_size (int, optional) – The number of messages in each batch processed when generating embeddings. Defaults to 64.

Raises:

FileNotFoundError – If the output path is invalid.

Returns:

None

Return type:

None

utils.check_embeddings.get_nan_vector()

Gets the default vector for an empty string (the “NaN vector”) and returns it as a 1D NumPy array.

utils.check_embeddings.get_sentiment(texts)

Analyzes the sentiment of the given list of texts using a BERT model and returns a DataFrame with scores for positive, negative, and neutral sentiments.

Parameters:

texts (list of str) – The list of input texts to analyze.

Returns:

A DataFrame with sentiment scores.

Return type:

pd.DataFrame

utils.check_embeddings.is_valid_term(dicTerm: str) → bool

Check if a dictionary term is valid.

This function returns True if the term matches the regex pattern and False otherwise. The pattern matches the following criteria:

  • Alphanumeric characters (a-zA-Z0-9)

  • Valid symbols: -, ', *, /

  • The * symbol can only appear once at the end of a word

  • 8 emojis are valid only when they appear alone

  • The / symbol can only appear once after alphanumeric characters

  • Spaces are allowed between valid words

Parameters:

dicTerm (str) – The dictionary term

Returns:

True if the term is valid, False otherwise

Return type:

bool
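The criteria above can be approximated with a single regular expression. The sketch below is a simplified stand-in, not the pattern used by is_valid_term: `WORD`, `TERM`, and `looks_valid` are hypothetical names, and the emoji rule is omitted because the source does not list the eight emojis.

```python
import re

# One word: alphanumerics, hyphens and apostrophes, optionally one "/" followed
# by more such characters, optionally one trailing "*".
WORD = r"[A-Za-z0-9'\-]+(?:/[A-Za-z0-9'\-]+)?\*?"

# A term: one or more words separated by single spaces.
TERM = re.compile(rf"{WORD}(?: {WORD})*")


def looks_valid(term):
    """Return True if the whole term matches the simplified pattern."""
    return TERM.fullmatch(term) is not None


results = {
    "happ*": looks_valid("happ*"),        # wildcard at the end of a word
    "a*b": looks_valid("a*b"),            # wildcard in the middle
    "and/or": looks_valid("and/or"),      # one slash after alphanumerics
    "a/b/c": looks_valid("a/b/c"),        # more than one slash
    "good day": looks_valid("good day"),  # space between valid words
    "hello!": looks_valid("hello!"),      # symbol outside the allowed set
}
```

Using `fullmatch` rather than `search` matters here: the term is valid only if every character is accounted for by the pattern.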

utils.check_embeddings.load_liwc_dict(dicText: str) → dict

Loads a dictionary that is in the LIWC 2007/2015 format. Source: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L81

This function reads the content of a LIWC dictionary file in the official format and converts it to a dictionary in lexicon: regular expression format. We assume that dicText has two parts, delimited by ‘%’ lines: the header, which maps numbers to category names, and the body, which maps each word in the lexicon to one or more category numbers. Below is an example:

'''
%
1 function
2 pronoun
3 ppron
%
again 1 2
against 1 2 3
'''

Note that the elements in each line are separated by ' '.

Parameters:

dicText (str) – The content of a .dic file

Returns:

dicCategories

Return type:

dict
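The header/body layout described above can be parsed roughly as follows. This is a simplified sketch, not the library's parser: `parse_liwc_text` is a hypothetical name, it splits on any whitespace (so multi-word terms are not handled), and it returns plain word lists per category rather than regular expressions.

```python
def parse_liwc_text(dic_text):
    """Map each category name to the list of terms assigned to it."""
    categories = {}  # category number (str) -> category name
    lexicon = {}     # category name -> list of terms
    percent_seen = 0

    for raw_line in dic_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "%":
            # The first "%" opens the header; the second one opens the body.
            percent_seen += 1
            continue
        parts = line.split()
        if percent_seen == 1:
            number, name = parts[0], parts[1]
            categories[number] = name
            lexicon[name] = []
        elif percent_seen >= 2:
            term, numbers = parts[0], parts[1:]
            for number in numbers:
                if number in categories:
                    lexicon[categories[number]].append(term)
    return lexicon


sample = """%
1 function
2 pronoun
3 ppron
%
again 1 2
against 1 2 3
"""

parsed = parse_liwc_text(sample)
```

Each per-category word list could then be turned into a regular expression, which is the step the documented function performs and this sketch leaves out.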

utils.check_embeddings.read_in_lexicons(directory, lexicons_dict)

utils.check_embeddings.sort_words(lexicons: list) → str

Sorts the dictionary terms in a list.

This function sorts the dictionary terms in a list by their length in descending order. The hyphenated words are sorted first, followed by the non-hyphenated words.

Parameters:

lexicons (list) – List of dictionary terms

Returns:

dicTerms

Return type:

str
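The ordering described above (hyphenated terms first, then longest terms first within each group) can be expressed with a two-part sort key. This is a sketch of the ordering only: `order_terms` is a hypothetical name and it returns a list, whereas the documented function returns a str, so the real function evidently performs a further step that is not shown here.

```python
def order_terms(terms):
    """Sort hyphenated terms before the rest, longest first within each group."""
    # False sorts before True, so hyphenated terms ("-" in term) come first;
    # the negated length puts longer terms ahead of shorter ones.
    return sorted(terms, key=lambda term: ("-" not in term, -len(term)))


ordered = order_terms(["go", "well-being", "happy", "co-op"])
```

Placing longer terms first is a common precaution when terms are later combined into one alternation pattern, since it keeps a short term from matching before a longer term that contains it.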

utils.check_embeddings.str_to_vec(str_vec)

Takes a string representation of a vector and returns it as a 1D NumPy array.
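A minimal sketch of this kind of parsing is shown below, using only the standard library. It assumes the string looks like a bracketed list of numbers separated by commas or spaces, which the source does not specify; `parse_vector` is a hypothetical name, and it returns a plain list of floats that could be wrapped in a NumPy array afterwards.

```python
def parse_vector(str_vec):
    """Parse a string such as "[0.1, 0.2, 0.3]" into a list of floats."""
    inner = str_vec.strip().strip("[]")
    # Treat commas as whitespace so both "1, 2" and "1 2" are accepted.
    return [float(value) for value in inner.replace(",", " ").split()]


from_commas = parse_vector("[0.1, 0.2, 0.3]")
from_spaces = parse_vector("[1 2 3]")
```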