check_embeddings module

utils.check_embeddings.check_embeddings(chat_data: DataFrame, vect_path: str, bert_path: str, need_sentence: bool, need_sentiment: bool, regenerate_vectors: bool, use_gpu: bool, message_col: str, logger)

Check if embeddings and required lexicons exist, and generate them if they don’t.

This function ensures that the vector (SBERT) embeddings and the RoBERTa sentiment outputs are available whenever a feature needs them. It also checks that the certainty and lexicon files are present, and generates them if they are missing.

Parameters:

chat_data (pd.DataFrame) – Dataframe containing chat data
vect_path (str) – Path to the vector embeddings file. By default, these are SBERT vectors, with one embedding per utterance.
bert_path (str) – Path to the RoBERTa sentiment inference output file
need_sentence (bool) – Whether at least one feature will require SBERT vectors; we will not need to calculate them otherwise.
need_sentiment (bool) – Whether at least one feature will require the RoBERTa sentiments; we will not need to calculate them otherwise.
regenerate_vectors (bool, optional) – If true, will regenerate vector data even if it already exists
use_gpu (bool) – If true, will use GPU for embeddings if available; otherwise, will use CPU.
message_col (str) – A string representing the column name that should be selected as the message.
logger (logging.Logger) – Logger for logging messages

Returns:

None

Return type:

None
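
The check-then-generate behaviour described above can be sketched as follows. This is a minimal illustration rather than the module's implementation: the helper name needs_generation is hypothetical, and the sketch assumes that a file counts as available simply when it exists on disk.

```python
import os
import tempfile


def needs_generation(path: str, needed: bool, regenerate: bool) -> bool:
    """Hypothetical helper: decide whether an output file must be (re)built.

    Assumption: a file is considered available if it exists on disk.
    """
    if not needed:
        # No requested feature depends on this file, so skip the work.
        return False
    # Rebuild when forced, or when the file is missing.
    return regenerate or not os.path.isfile(path)


# Usage: an existing file is rebuilt only when regeneration is requested.
with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
    existing = tmp.name
    skip = needs_generation(existing, needed=True, regenerate=False)
    force = needs_generation(existing, needed=True, regenerate=True)

missing = needs_generation(existing + ".missing", needed=True, regenerate=False)
unneeded = needs_generation(existing + ".missing", needed=False, regenerate=False)
```

In this sketch, need_sentence and need_sentiment would play the role of needed for vect_path and bert_path respectively, and regenerate_vectors the role of regenerate.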

utils.check_embeddings.fix_abbreviations(dicTerm: str) → str

Helper function to fix abbreviations that contain punctuation. Source: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L714

This function checks the term against a hardcoded list of exceptions for the tokenizer / sentence parser built into LIWC, so that the parser does not split these abbreviations into separate strings (e.g., we do not want “i.e.” to be treated as two words and two sentences [i, e]).

Parameters: dicTerm (str) – The lexicon term
Returns: dicTerm
Return type: str
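
As a rough illustration of the idea, the sketch below protects a few abbreviations by replacing their periods with a placeholder character. The function name protect_abbreviation, the short abbreviation list, and the choice of a hyphen as the placeholder are all assumptions made for this example; the real function uses its own hardcoded list and rules.

```python
# Illustrative subset only; the real function has its own hardcoded list.
ABBREVIATIONS = ["i.e", "e.g", "ph.d", "u.s.a"]


def protect_abbreviation(term: str) -> str:
    """Hypothetical sketch: rewrite a known abbreviation so a tokenizer keeps it whole."""
    # Accept the term with or without its trailing period ("i.e" or "i.e.").
    bare = term[:-1] if term.endswith(".") else term
    if bare.lower() in ABBREVIATIONS:
        # Assumption: a hyphen stands in for each period.
        return bare.replace(".", "-")
    # Anything not on the list is returned unchanged.
    return term


fixed = protect_abbreviation("i.e.")
untouched = protect_abbreviation("hello.")
```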

utils.check_embeddings.generate_bert(chat_data, output_path, message_col, device, batch_size=64)

Generates RoBERTa sentiment scores for the given chat data and saves them to a CSV file.

Parameters:

chat_data (pd.DataFrame) – Contains message data to be analyzed for sentiments.
output_path (str) – Path to save the CSV file containing sentiment scores.
message_col (str) – A string representing the column name that should be selected as the message.
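device (str) – A string representing the device to use for computation, either “cpu” or “cuda”.
batch_size (int) – The number of messages in each batch sent through sentiment analysis. Defaults to 64.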

Raises:

FileNotFoundError – If the output path is invalid.

Returns:

None

Return type:

None
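
The batch_size parameter implies that messages are processed in fixed-size chunks. A minimal sketch of that chunking, assuming plain list slicing (the helper iter_batches is a hypothetical name, and the actual function may batch differently):

```python
from typing import Iterator, List


def iter_batches(messages: List[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` messages."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(messages), batch_size):
        yield messages[start:start + batch_size]


# 150 messages split into batches of 64: two full batches and a remainder.
messages = [f"message {i}" for i in range(150)]
batch_lengths = [len(batch) for batch in iter_batches(messages, batch_size=64)]
```

A larger batch size generally means fewer model calls at the cost of more memory per call, which matters most when device is “cuda”.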

utils.check_embeddings.generate_certainty_pkl()

Helper function for generating the pickle file containing the certainty lexicon.

This function reads in lexicon files from a specified directory, processes the content, and appends the cleaned lexicon patterns to a dictionary.

Parameters:

directory (Path) – The directory containing the lexicon files
lexicons_dict (dict) – Dictionary to store the processed lexicon patterns

Returns:

None

Return type:

None

utils.check_embeddings.generate_lexicon_pkl()

Helper function for generating the pickle file containing lexicons.

This function reads in lexicon files from a specified directory, processes the content, and appends the cleaned lexicon patterns to a dictionary.

Parameters:

directory (Path) – The directory containing the lexicon files
lexicons_dict (dict) – Dictionary to store the processed lexicon patterns

Returns:

None

Return type:

None

utils.check_embeddings.generate_vect(chat_data, output_path, message_col, device, batch_size=64)

Generates sentence vectors for the given chat data and saves them to a CSV file.

Parameters:

chat_data (pd.DataFrame) – Contains message data to be vectorized.
output_path (str) – Path to save the CSV file containing message embeddings.
message_col (str) – A string representing the column name that should be selected as the message.
device (str) – A string representing the device to use for computation, either “cpu” or “cuda”.
batch_size (int) – The number of messages in each batch sent through the embedding model. Defaults to 64.

Raises:

FileNotFoundError – If the output path is invalid.

Returns:

None

Return type:

None

utils.check_embeddings.get_nan_vector(): Gets a default value for an empty string (the “NaN vector”) and returns it as a 1D np array.

utils.check_embeddings.get_sentiment(texts, model_bert, device)

Analyzes the sentiment of the given list of texts using a BERT model and returns a DataFrame with scores for positive, negative, and neutral sentiments.

Parameters: texts (list of str) – The list of input texts to analyze.
Returns: A DataFrame with sentiment scores.
Return type: pd.DataFrame

utils.check_embeddings.is_valid_term(dicTerm: str) → bool

Check if a dictionary term is valid.

This function returns True if the term matches the regex pattern and False otherwise. The pattern matches the following criteria:

Alphanumeric characters (a-zA-Z0-9)
Valid symbols: -, ‘, *, /
The * symbol can only appear once at the end of a word
8 emojis are valid only when they appear alone
The / symbol can only appear once after alphanumeric characters
Spaces are allowed between valid words

Parameters: dicTerm (str) – The dictionary term
Returns: True if the term is valid, False otherwise
Return type: bool
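
The criteria listed above can be approximated with a regular expression. The sketch below is one possible reading of those rules, not the pattern the module actually uses: the name looks_valid is hypothetical, and the set of eight emoticons is an assumption chosen for illustration.

```python
import re

# One word: letters, digits, hyphens, apostrophes, optionally followed by
# either a single trailing "*" or one "/" plus a further alphanumeric run.
WORD = r"[a-zA-Z0-9\-']+(?:\*|/[a-zA-Z0-9]+)?"

# Assumption: an illustrative set of eight two-character emoticons.
EMOTICONS = [":)", ":(", ";)", ":/", "(:", "):", "(;", "/:"]
EMOTICON = "|".join(re.escape(e) for e in EMOTICONS)

# Either space-separated words, or a single emoticon on its own.
TERM_PATTERN = re.compile(rf"(?:{WORD}(?: {WORD})*|{EMOTICON})")


def looks_valid(term: str) -> bool:
    """Return True if the whole term matches the sketched pattern."""
    return TERM_PATTERN.fullmatch(term) is not None
```

Under this reading, looks_valid("happy*") and looks_valid("and/or") are accepted, while "ha*ppy" (asterisk not at the end of the word) and "good :)" (emoticon not alone) are rejected.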

utils.check_embeddings.load_liwc_dict(dicText: str) → dict

Loads a dictionary that is in the LIWC 2007/2015 format. Source: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L81

This function reads the content of a LIWC dictionary file in the official format and converts it to a dictionary in lexicon: regular expression format. We assume that dicText has two parts, delimited by ‘%’ lines: the header, which maps numbers to “category names,” and the body, which maps each word in the lexicon to one or more category numbers. Below is an example:

    %
    1 function
    2 pronoun
    3 ppron
    %
    again 1 2
    against 1 2 3

Note that the elements in each line are separated by ‘ ‘.

Parameters: dicText (str) – The content of a .dic file
Returns: dicCategories
Return type: dict
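
To make the header/body layout concrete, here is a simplified parsing sketch. It is not the module's implementation: the name parse_liwc_text is hypothetical, the sample input assumes tab-separated fields, and the result maps each category name to a plain list of words rather than to a regular expression.

```python
from collections import defaultdict
from typing import Dict, List


def parse_liwc_text(dic_text: str) -> Dict[str, List[str]]:
    """Hypothetical sketch: map each category name to the words assigned to it."""
    # The text looks like "%<header>%<body>", so splitting on "%" yields an
    # empty leading piece, the header, and the body.
    _, header, body = dic_text.split("%", 2)

    # Header lines: "<number> <category name>".
    names = {}
    for line in header.strip().splitlines():
        if not line.strip():
            continue
        number, name = line.split(maxsplit=1)
        names[number] = name.strip()

    # Body lines: "<word> <number> <number> ...".
    categories = defaultdict(list)
    for line in body.strip().splitlines():
        if not line.strip():
            continue
        word, *numbers = line.split()
        for number in numbers:
            categories[names[number]].append(word)
    return dict(categories)


example = "%\n1\tfunction\n2\tpronoun\n3\tppron\n%\nagain\t1\t2\nagainst\t1\t2\t3\n"
parsed = parse_liwc_text(example)
```

For the example input, both words land in the function and pronoun categories, and only “against” lands in ppron.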

utils.check_embeddings.read_in_lexicons(directory, lexicons_dict)

utils.check_embeddings.sort_words(lexicons: list) → str

Sorts the dictionary terms in a list.

This function sorts the dictionary terms in a list by their length in descending order. The hyphenated words are sorted first, followed by the non-hyphenated words.

Parameters: dicTerms (list) – List of dictionary terms
Returns: dicTerms
Return type: str
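
One way to read the ordering described above is: split the terms into hyphenated and non-hyphenated groups, sort each group longest-first, and concatenate. The sketch below follows that reading; the name order_terms is hypothetical, and returning a list is an assumption of this example.

```python
from typing import List


def order_terms(terms: List[str]) -> List[str]:
    """Hypothetical sketch: hyphenated terms first, each group longest-first."""
    hyphenated = [t for t in terms if "-" in t]
    plain = [t for t in terms if "-" not in t]
    # sorted() is stable, so equal-length terms keep their input order.
    return (
        sorted(hyphenated, key=len, reverse=True)
        + sorted(plain, key=len, reverse=True)
    )


ordered = order_terms(["no", "self-aware", "nothing", "co-op", "not"])
```

Putting longer terms first is a common choice when the terms are later joined into a regular-expression alternation, because it keeps a short term such as “no” from matching before a longer one such as “nothing” is tried.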

utils.check_embeddings.str_to_vec(str_vec): Takes a string representation of a vector and returns it as a 1D np array.