check_embeddings module
- utils.check_embeddings.check_embeddings(chat_data: DataFrame, vect_path: str, bert_path: str, need_sentence: bool, need_sentiment: bool, regenerate_vectors: bool, use_gpu: bool, message_col: str, logger)
Check if embeddings and required lexicons exist, and generate them if they don’t.
This function ensures the necessary vector and BERT embeddings are available. It also checks for the presence of certainty and lexicon files, generating them if needed.
- Parameters:
chat_data (pd.DataFrame) – Dataframe containing chat data
vect_path (str) – Path to the vector embeddings file (by default, SBERT embeddings, one per utterance).
bert_path (str) – Path to the RoBERTa sentiment inference output file
need_sentence (bool) – Whether at least one feature will require SBERT vectors; we will not need to calculate them otherwise.
need_sentiment (bool) – Whether at least one feature will require the RoBERTa sentiments; we will not need to calculate them otherwise.
regenerate_vectors (bool, optional) – If true, will regenerate vector data even if it already exists
use_gpu (bool) – If true, will use GPU for embeddings if available; otherwise, will use CPU.
message_col (str) – A string representing the column name that should be selected as the message.
logger (logging.Logger) – Logger for logging messages
- Returns:
None
- Return type:
None
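The check-then-generate behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `ensure_artifact` and the `generate` callback are hypothetical names introduced here, standing in for the existence check and for generators such as generate_vect or generate_bert.

```python
import os


def ensure_artifact(path, needed, regenerate, generate):
    """Hypothetical helper: call `generate(path)` only when the file is needed
    and is either missing or explicitly marked for regeneration.

    Returns True if `generate` was called, False otherwise.
    """
    if not needed:
        # No feature depends on this file, so skip the expensive computation.
        return False
    if regenerate or not os.path.isfile(path):
        # Make sure the parent directory exists before the generator writes.
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        generate(path)
        return True
    # The file already exists and regeneration was not requested.
    return False
```

In this sketch, `needed` plays the role of need_sentence or need_sentiment, and `regenerate` the role of regenerate_vectors.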
- utils.check_embeddings.fix_abbreviations(dicTerm: str) → str
Helper function to fix abbreviations that contain punctuation. src: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L714
This function goes over a list of hardcoded exceptions for the tokenizer / sentence parser built into LIWC so that it doesn’t convert them into separate strings (e.g., we want “i.e.” to not be seen as two words and two sentences [i, e]).
- Parameters:
dicTerm (str) – The lexicon term
- Returns:
dicTerm
- Return type:
str
- utils.check_embeddings.generate_bert(chat_data, output_path, message_col, device, batch_size=64)
Generates RoBERTa sentiment scores for the given chat data and saves them to a CSV file.
- Parameters:
chat_data (pd.DataFrame) – Contains message data to be analyzed for sentiments.
output_path (str) – Path to save the CSV file containing sentiment scores.
message_col (str) – A string representing the column name that should be selected as the message.
device (str) – A string representing the device to use for computation, either “cpu” or “cuda”.
batch_size (int) – The size of each batch for processing sentiment analysis. Defaults to 64.
- Raises:
FileNotFoundError – If the output path is invalid.
- Returns:
None
- Return type:
None
- utils.check_embeddings.generate_certainty_pkl()
Helper function for generating the pickle file containing the certainty lexicon.
This function reads in lexicon files from a specified directory, processes the content, and appends the cleaned lexicon patterns to a dictionary.
- Parameters:
directory (Path) – The directory containing the lexicon files
lexicons_dict (dict) – Dictionary to store the processed lexicon patterns
- Returns:
None
- Return type:
None
- utils.check_embeddings.generate_lexicon_pkl()
Helper function for generating the pickle file containing lexicons.
This function reads in lexicon files from a specified directory, processes the content, and appends the cleaned lexicon patterns to a dictionary.
- Parameters:
directory (Path) – The directory containing the lexicon files
lexicons_dict (dict) – Dictionary to store the processed lexicon patterns
- Returns:
None
- Return type:
None
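The read-clean-store flow described for the lexicon pickles might look roughly like the sketch below. The file layout is an assumption made for illustration (one term per line, with the file stem used as the dictionary key), and `read_lexicon_files` and `write_lexicon_pickle` are hypothetical names, not functions from this module.

```python
import pickle
from pathlib import Path


def read_lexicon_files(directory, lexicons_dict):
    """Add one entry per file in `directory`: file stem -> list of cleaned terms.

    Assumes (for illustration only) one term per line in each file.
    """
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        # Clean each line: strip whitespace, lowercase, drop blank lines.
        lexicons_dict[path.stem] = [ln.strip().lower() for ln in lines if ln.strip()]


def write_lexicon_pickle(directory, output_path):
    """Collect all lexicons under `directory` and pickle them to `output_path`."""
    lexicons = {}
    read_lexicon_files(directory, lexicons)
    with open(output_path, "wb") as f:
        pickle.dump(lexicons, f)
    return lexicons
```

Pickling the processed dictionary once means later runs can load it directly instead of re-reading and re-cleaning every lexicon file.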
- utils.check_embeddings.generate_vect(chat_data, output_path, message_col, device, batch_size=64)
Generates sentence vectors for the given chat data and saves them to a CSV file.
- Parameters:
chat_data (pd.DataFrame) – Contains message data to be vectorized.
output_path (str) – Path to save the CSV file containing message embeddings.
message_col (str) – A string representing the column name that should be selected as the message.
device (str) – A string representing the device to use for computation, either “cpu” or “cuda”.
batch_size (int) – The size of each batch for computing sentence vectors. Defaults to 64.
- Raises:
FileNotFoundError – If the output path is invalid.
- Returns:
None
- Return type:
None
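The role of batch_size can be illustrated with a small batching sketch. `iter_batches`, `encode_in_batches`, and the `encode_batch` callback are hypothetical names; in practice the callback would be the embedding or sentiment model's batch call, which is not shown here.

```python
def iter_batches(items, batch_size=64):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(messages, encode_batch, batch_size=64):
    """Apply `encode_batch` to each batch and concatenate results in input order."""
    results = []
    for batch in iter_batches(messages, batch_size):
        results.extend(encode_batch(batch))
    return results
```

Larger batches generally make better use of a GPU but need more memory, which is why the batch size is exposed as a parameter.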
- utils.check_embeddings.get_nan_vector()
Gets a default value for an empty string (the “NaN vector”) and returns it as a 1D np array.
- utils.check_embeddings.get_sentiment(texts, model_bert, device)
Analyzes the sentiment of the given list of texts using a BERT model and returns a DataFrame with scores for positive, negative, and neutral sentiments.
- Parameters:
texts (list of str) – The list of input texts to analyze.
- Returns:
A DataFrame with sentiment scores.
- Return type:
pd.DataFrame
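As a rough illustration of how per-class sentiment scores are typically derived from a classifier's raw outputs, the sketch below applies a softmax to each row of logits. The label order in `LABELS` and the helper names are assumptions made for this example; the model call itself is omitted, and the sketch returns plain dictionaries rather than a DataFrame.

```python
import math

# Assumed label order, for illustration only; check the model's own config.
LABELS = ("negative", "neutral", "positive")


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores_from_logits(batch_logits, labels=LABELS):
    """Return one {label: probability} dict per row of logits."""
    return [dict(zip(labels, softmax(row))) for row in batch_logits]
```

Passing the resulting list of dictionaries to `pd.DataFrame(...)` would give one column per sentiment label, matching the documented return type.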
- utils.check_embeddings.is_valid_term(dicTerm: str) → bool
Check if a dictionary term is valid.
This function returns True if the term matches the regex pattern and False otherwise. The pattern matches the following criteria:
Alphanumeric characters (a-zA-Z0-9)
Valid symbols: -, ', *, /
The * symbol can only appear once at the end of a word
8 emojis are valid only when they appear alone
The / symbol can only appear once after alphanumeric characters
Spaces are allowed between valid words
- Parameters:
dicTerm (str) – The dictionary term
- Returns:
True if the term is valid, False otherwise
- Return type:
bool
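The criteria listed above can be approximated with a regular expression along these lines. This is a sketch, not the pattern the module uses: the apostrophe is assumed to be a plain `'`, the two entries in `_STANDALONE_EMOJIS` are placeholders (the documentation does not say which eight emojis are accepted), and `is_valid_term_sketch` is a hypothetical name.

```python
import re

# Characters allowed inside a word: alphanumerics, apostrophe, hyphen.
_CHARS = r"[A-Za-z0-9'\-]"
# A word: allowed characters, then at most one "/" that directly follows an
# alphanumeric character, then at most one trailing "*".
_WORD = rf"{_CHARS}+(?:(?<=[A-Za-z0-9])/{_CHARS}*)?\*?"
# A term: one or more words separated by single spaces.
_TERM = re.compile(rf"{_WORD}(?: {_WORD})*")

# Placeholder set; the real list of eight emojis is not given in this document.
_STANDALONE_EMOJIS = {":)", ":("}


def is_valid_term_sketch(term):
    """Return True if `term` satisfies the approximate rules above."""
    if term in _STANDALONE_EMOJIS:
        # Emojis count as valid only when they are the entire term.
        return True
    return _TERM.fullmatch(term) is not None
```

Using `fullmatch` rather than `search` is what enforces the "only at the end of a word" rule for `*`: a term such as `ha*ppy` leaves unmatched characters after the wildcard and is rejected.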
- utils.check_embeddings.load_liwc_dict(dicText: str) → dict
Loads up a dictionary that is in the LIWC 2007/2015 format. src: https://github.com/ryanboyd/ContentCoder-Py/blob/main/ContentCodingDictionary.py#L81
This function reads the content of a LIWC dictionary file in the official format and converts it to a dictionary in lexicon: regular expression format. We assume dicText has two parts, delimited by ‘%’ lines: the header, which maps numbers to category names, and the body, which maps words in the lexicon to category numbers. Below is an example:
%
1 function
2 pronoun
3 ppron
%
again 1 2
against 1 2 3
Note that the elements in each line are separated by ‘ ‘.
- Parameters:
dicText (str) – The content of a .dic file
- Returns:
dicCategories
- Return type:
dict
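The header/body layout described above can be parsed with a small state machine, sketched below. `parse_liwc_text` is a hypothetical name, and the sketch stops at mapping each category name to its list of terms; the actual function goes further and produces regular expressions. Splitting on any whitespace is an assumption chosen so the sketch works whether fields are separated by spaces or tabs; it would not handle multi-word terms.

```python
def parse_liwc_text(dic_text):
    """Map each category name to the list of terms assigned to it."""
    id_to_name = {}
    categories = {}
    percent_seen = 0  # 1 = inside the header, 2 or more = inside the body
    for raw_line in dic_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "%":
            percent_seen += 1
            continue
        fields = line.split()
        if percent_seen == 1:
            # Header line: "<category number> <category name>"
            id_to_name[fields[0]] = fields[1]
            categories[fields[1]] = []
        elif percent_seen >= 2:
            # Body line: "<term> <category number> <category number> ..."
            term, category_ids = fields[0], fields[1:]
            for category_id in category_ids:
                categories[id_to_name[category_id]].append(term)
    return categories
```

Applied to the example dictionary text, this yields `function` and `pronoun` each containing both terms, and `ppron` containing only `against`.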
- utils.check_embeddings.read_in_lexicons(directory, lexicons_dict)
- utils.check_embeddings.sort_words(lexicons: list) → str
Sorts the dictionary terms in a list.
This function sorts the dictionary terms in a list by their length in descending order. The hyphenated words are sorted first, followed by the non-hyphenated words.
- Parameters:
lexicons (list) – List of dictionary terms
- Returns:
dicTerms
- Return type:
str
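The ordering described above (hyphenated terms first, then longest to shortest within each group) can be expressed with a single sort key. `sort_terms_sketch` is a hypothetical name, and the sketch returns a list; how the sorted terms are combined into the documented string return value is not specified here, so that step is left out.

```python
def sort_terms_sketch(terms):
    """Sort hyphenated terms first; within each group, longest terms first."""
    # False sorts before True, so terms containing "-" come first.
    # Negating the length gives descending order by length.
    return sorted(terms, key=lambda term: ("-" not in term, -len(term)))
```

Placing longer terms first is a common choice when terms are later tried in order, because it lets a longer term such as `well-being` be matched before a shorter term that is a prefix of it.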
- utils.check_embeddings.str_to_vec(str_vec)
Takes a string representation of a vector and returns it as a 1D np array.
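A minimal sketch of this kind of conversion is shown below. It assumes the stored string is a bracketed, comma-separated list such as `"[0.1, -0.2, 3]"`, which is an assumption about the file format rather than something stated in this document, and `parse_vector_string` is a hypothetical name. It returns a plain list of floats to stay dependency-free.

```python
import ast


def parse_vector_string(str_vec):
    """Parse a string such as "[0.1, -0.2, 3]" into a flat list of floats."""
    # literal_eval only evaluates Python literals, so it is safer than eval().
    values = ast.literal_eval(str_vec.strip())
    return [float(v) for v in values]
```

Wrapping the result in `np.asarray(...)` would give the 1D NumPy array that the documented function returns.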