oats.nlp package

Submodules

oats.nlp.preprocess module

concatenate_texts(texts)

Combines multiple description strings into a single string. This differs from a simple join with whitespace because it also applies formatting that is assumed to be necessary for texts that are either fragments or full sentences. This includes removing duplicates that differ only by punctuation or capitalization, retaining the original order of the texts, and making sure they are capitalized and punctuated in a standard way that will be parseable by other packages and functions that deal with text.

Args:
texts (list of str): A list of arbitrary strings.
Returns:
str: The text string that results from concatenating and formatting these text strings.
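The formatting steps described above can be sketched roughly as follows. This is a minimal illustration and not the oats implementation: the function name `concatenate_texts_sketch`, the choice of a period as the end character, and the single-space join are all assumptions.

```python
import string

# Translation table that deletes ASCII punctuation (assumed definition of "punctuation").
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def concatenate_texts_sketch(texts):
    """Hypothetical stand-in for concatenate_texts, for illustration only."""
    seen = set()
    sentences = []
    for text in texts:
        stripped = text.strip()
        # Key that ignores case, punctuation, and whitespace differences.
        key = " ".join(stripped.lower().translate(_PUNCT_TABLE).split())
        if not key or key in seen:
            continue
        seen.add(key)
        # Capitalize the first character and make sure the text ends a sentence.
        sentence = stripped[0].upper() + stripped[1:]
        if sentence[-1] not in ".;":
            sentence += "."
        sentences.append(sentence)
    return " ".join(sentences)


print(concatenate_texts_sketch(["small leaves", "Small leaves.", "late flowering"]))
# Small leaves. Late flowering.
```

The first occurrence of each text fixes both its position and its surface form, which is why the second input above is dropped rather than replacing the first.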
concatenate_with_delim(delim, elements)

Concatenates the strings in the passed-in list with a specific delimiter and returns the resulting string. This is useful when preparing strings that are intended to be placed within a table object or a delimiter-separated text file. Any of the input strings may themselves already represent delimiter-separated lists, and this is accounted for.

Args:

delim (str): The delimiter used to separate the elements in the returned string.

elements (list of str): A list of strings that represent either lists or list elements.

Returns:
str: A text string representing a list that is delimited by the provided delimiter.
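One way to account for inputs that are already delimited is to split every element before joining. The sketch below assumes that approach; the name `concatenate_with_delim_sketch` is hypothetical, and whether the real function also strips whitespace or drops empty items is not stated in this reference.

```python
def concatenate_with_delim_sketch(delim, elements):
    """Hypothetical stand-in: flatten any nested delimited lists, then join."""
    flattened = []
    for element in elements:
        # An element may itself be a delimited list, so split it first.
        for part in element.split(delim):
            part = part.strip()
            if part:  # assumption: empty items are dropped
                flattened.append(part)
    return delim.join(flattened)


print(concatenate_with_delim_sketch("|", ["a|b", "c", ""]))
# a|b|c
```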
remove_text_duplicates_retain_order(texts)

Removes the duplicates from a list of text strings, where duplicates are defined as two text strings that differ only by punctuation, capitalization, or the length of whitespace. This is useful for not retaining extra text information just because it is not perfectly identical to some existing string. Duplicates are removed such that the first occurrence is retained, and that determines the final ordering. The texts that are returned are not processed, and are a subset of the original list of text strings. The first occurrence of each duplicate therefore determines which version, in terms of punctuation, capitalization, and whitespace, is retained in the final list.

Args:
texts (list of str): A list of arbitrary strings.
Returns:
list of str: A subset of the original list, with duplicates as defined above removed.
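The behaviour described above amounts to an order-preserving deduplication keyed on a normalized form of each string. A minimal sketch, assuming ASCII punctuation and a hypothetical helper named `normalize`:

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def remove_duplicates_sketch(texts):
    """Hypothetical stand-in: keep the first string seen for each normalized key."""
    seen = set()
    kept = []
    for text in texts:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)  # the original, unprocessed string is returned
    return kept


print(remove_duplicates_sketch(["Short roots.", "short  roots", "Pale leaves", "short roots!"]))
# ['Short roots.', 'Pale leaves']
```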
replace_delimiter(text, old_delim, new_delim)

Takes a string that uses one delimiter to represent a list, and returns a new string that represents a list using a different delimiter.

Args:

text (str): A string that is representing a list using the old delimiter.

old_delim (str): Any arbitrary string.

new_delim (str): Any arbitrary string.

Returns:
str: A string representing a list using the new delimiter.
subtract_string_lists(delim, string_list_1, string_list_2)

Treats the two input strings as lists that are delimited by the provided delimiter, and returns a new delimited string list that results from treating each list as a set and subtracting the second set from the first.

Args:

delim (str): A delimiter for parsing the strings that represent lists.

string_list_1 (str): A string that represents a list.

string_list_2 (str): A string that represents a list.

Returns:
str: A string that represents the list resulting from the operation.
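As a rough illustration of the set subtraction described above, the sketch below splits both strings, removes every item of the second list from the first, and joins the remainder. The name `subtract_string_lists_sketch` is hypothetical, and keeping the items in their original order is an assumption, since a pure set operation would not define an order.

```python
def subtract_string_lists_sketch(delim, string_list_1, string_list_2):
    """Hypothetical stand-in: items of list 1 that do not occur in list 2."""
    to_remove = {item.strip() for item in string_list_2.split(delim)}
    kept = []
    for item in string_list_1.split(delim):
        item = item.strip()
        # Set semantics: skip removed items and repeated items.
        if item and item not in to_remove and item not in kept:
            kept.append(item)
    return delim.join(kept)


print(subtract_string_lists_sketch("|", "a|b|c|b", "b|d"))
# a|c
```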

oats.nlp.search module

all_fuzzy_matches(patterns, txt, threshold, local=1)

Returns the sublist of patterns that appear in the text. A fuzzy match does not need to be a perfect character-for-character match between a pattern and the larger text string; the tolerance for mismatches is controlled by the threshold parameter. The underlying metric is Levenshtein distance.

Args:

patterns (list): The shorter text strings to search for.

txt (str): The larger text to search within.

threshold (float): Value between 0 and 1 at which matches are considered positive.

local (int, optional): Alignment method, 0 for global 1 for local.

Returns:
list: A sublist of the patterns argument containing only the patterns that were found.
all_rabinkarp_matches(patterns, text, q=1193)

Returns the sublist of patterns that appear in the text. The Rabin-Karp algorithm is a fast algorithm for finding exact matches between a pattern and a longer string, so a match is only considered real if it matches character for character.

Args:

patterns (list): The shorter text strings to search for.

text (str): The larger text to search within.

q (int, optional): A prime number used for hashing.

Returns:
list: A sublist of the patterns argument containing only the patterns that were found.
any_fuzzy_matches(patterns, text, threshold, local=1)

Returns true if any pattern from a list of patterns is in the text, false otherwise. A fuzzy match does not need to be a perfect character-for-character match between a pattern and the larger text string; the tolerance for mismatches is controlled by the threshold parameter. The underlying metric is Levenshtein distance.

Args:

patterns (list): The shorter text strings to search for.

text (str): The larger text to search within.

threshold (float): Value between 0 and 1 at which matches are considered positive.

local (int, optional): Alignment method, 0 for global 1 for local.

Returns:
boolean: True if any of the patterns were found, false if none were.
any_rabinkarp_matches(patterns, text, q=1193)

Returns true if any pattern from a list of patterns is in the text, false otherwise. The Rabin-Karp algorithm is a fast algorithm for finding exact matches between a pattern and a longer string, so a match is only considered real if it matches character for character.

Args:

patterns (list): The shorter text strings to search for.

text (str): The larger text to search within.

q (int, optional): A prime number used for hashing.

Returns:
boolean: True if any of the patterns were found, false if none were.
binary_fuzzy_match(pat, txt, threshold, local=1)

Searches for fuzzy matches to a pattern in a longer string. A fuzzy match does not need to be a perfect character-for-character match between a pattern and the larger text string; the tolerance for mismatches is controlled by the threshold parameter. The underlying metric is Levenshtein distance.

Args:

pat (str): The shorter text to search for.

txt (str): The larger text to search within.

threshold (float): Value between 0 and 1 at which matches are considered real.

local (int, optional): Alignment method, 0 for global 1 for local.

Returns:
boolean: True if the pattern was found, false if it was not.
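To make the threshold and the local/global distinction concrete, here is a minimal sketch built directly on Levenshtein distance. It is not the oats implementation: the function names are hypothetical, and converting a distance to a 0 to 1 similarity by dividing by the pattern length (local) or the longer string (global) is an assumption.

```python
def edit_distance(pat, txt, local):
    """Levenshtein distance; with local=True the pattern may align to any substring."""
    # Row 0: cost of aligning an empty pattern with each text prefix.
    # Local alignment lets a match start anywhere, so those costs are zero.
    previous = [0 if local else j for j in range(len(txt) + 1)]
    for i, p in enumerate(pat, start=1):
        current = [i]
        for j, t in enumerate(txt, start=1):
            current.append(min(
                previous[j] + 1,             # pattern character left unmatched
                current[j - 1] + 1,          # extra text character
                previous[j - 1] + (p != t),  # match or substitution
            ))
        previous = current
    # Local alignment may also end anywhere in the text.
    return min(previous) if local else previous[-1]


def binary_fuzzy_match_sketch(pat, txt, threshold, local=1):
    """Hypothetical stand-in: True if the similarity reaches the threshold."""
    if not pat:
        return False  # assumption: an empty pattern never matches
    distance = edit_distance(pat, txt, bool(local))
    longest = len(pat) if local else max(len(pat), len(txt))
    return 1 - distance / longest >= threshold


print(binary_fuzzy_match_sketch("leaf", "the leef is small", 0.7))           # True
print(binary_fuzzy_match_sketch("leaf", "the leef is small", 0.7, local=0))  # False
```

With local alignment, one substitution in a four-character pattern gives a similarity of 0.75, so the first call passes a threshold of 0.7. The global call compares the pattern with the whole text and therefore fails.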
binary_robinkarp_match(pat, txt, q=1193)

Searches for exact matches to a pattern in a longer string. Adapted from an implementation by Bhavya Jain at https://www.geeksforgeeks.org/rabin-karp-algorithm-for-pattern-searching/. The Rabin-Karp algorithm is a fast algorithm for finding exact matches between a pattern and a longer string, so a match is only considered real if it matches character for character.

Args:

pat (str): The shorter text to search for.

txt (str): The larger text to search within.

q (int, optional): A prime number used for hashing.

Returns:
boolean: True if the pattern was found, false if it was not.
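For readers unfamiliar with the algorithm, the following is a generic rolling-hash sketch of Rabin-Karp search, with thin wrappers showing how the "all" and "any" variants relate to the single-pattern search. It is not the code used in oats: the function names, the alphabet size `d=256`, and the handling of empty patterns are assumptions.

```python
def rabin_karp_search(pat, txt, q=1193, d=256):
    """Return True if pat occurs in txt, using a rolling hash modulo the prime q."""
    m, n = len(pat), len(txt)
    if m == 0 or m > n:
        return False  # assumption: an empty pattern never matches
    h = pow(d, m - 1, q)  # weight of the leading character in a window
    pat_hash = txt_hash = 0
    for i in range(m):
        pat_hash = (d * pat_hash + ord(pat[i])) % q
        txt_hash = (d * txt_hash + ord(txt[i])) % q
    for start in range(n - m + 1):
        # Compare characters only when the hashes agree, to rule out collisions.
        if pat_hash == txt_hash and txt[start:start + m] == pat:
            return True
        if start < n - m:
            # Roll the hash: drop txt[start], append txt[start + m].
            txt_hash = (d * (txt_hash - ord(txt[start]) * h) + ord(txt[start + m])) % q
    return False


def all_matches(patterns, txt, q=1193):
    """Sublist of patterns that occur in txt."""
    return [p for p in patterns if rabin_karp_search(p, txt, q)]


def any_matches(patterns, txt, q=1193):
    """True if at least one pattern occurs in txt."""
    return any(rabin_karp_search(p, txt, q) for p in patterns)


print(all_matches(["root", "leaf", "short"], "short roots"))
# ['root', 'short']
```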

oats.nlp.small module

add_end_character(text)

Adds a period to the end of the text. This can be useful when concatenating text while still retaining the sentence or phrase boundaries that other processing steps, such as part-of-speech analysis, take into account. Accounts for text that already ends in a period or semicolon or is an empty string.

Args:
text (str): Any piece of text.
Returns:
str: That text with a period added to the end.
add_prefix_safely(token, prefix)

Attaches the passed-in prefix argument to the front of the token, unless the token is an empty string, in which case nothing happens and the token is returned unchanged. This can be important for avoiding accidentally making a meaningless token meaningful by modifying it with an additional text component.

Args:

token (str): Any arbitrary string.

prefix (str): Any arbitrary string.

Returns:
str: The token with the prefix added to the beginning of the string.
capitalize_sentence(text)

Makes the first character of a text string capital if it is a letter.

Args:
text (str): Any arbitrary text string.
Returns:
str: The text string with the first letter capitalized.
get_ontology_ids(text)

Finds all ontology IDs inside of some text. This assumes that exactly seven digits of the ontology term ID number are included and that the abbreviation for the ontology name is in all caps, but makes no assumption about the length of the name.

Args:
text (str): Any string of text.
Returns:
list: A list of the ontology term IDs mentioned in it.
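The assumptions above translate naturally into a regular expression. The sketch below is illustrative only: the colon separator between the ontology abbreviation and the digits (as in `GO:0009414`) is an assumption not stated in this reference, and `get_ontology_ids_sketch` is a hypothetical name.

```python
import re

# One or more capital letters, a colon (assumed separator), then exactly seven digits.
ONTOLOGY_ID_PATTERN = re.compile(r"[A-Z]+:\d{7}(?!\d)")


def get_ontology_ids_sketch(text):
    """Hypothetical stand-in: all ontology term IDs in order of appearance."""
    return ONTOLOGY_ID_PATTERN.findall(text)


print(get_ontology_ids_sketch("Annotated with PATO:0000587 and GO:0009414."))
# ['PATO:0000587', 'GO:0009414']
```

The negative lookahead rejects identifiers with more than seven digits instead of silently truncating them.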
remove_enclosing_brackets(text)

Removes square brackets if they are enclosing the text string.

Args:
text (str): Any arbitrary string.
Returns:
str: The same string but with the enclosing brackets removed if they were there.
remove_newlines(text)

Remove all newline characters from a piece of text.

Args:
text (str): Any piece of text.
Returns:
str: That text without newline characters.
remove_punctuation(text)

Remove all punctuation from a piece of text.

Args:
text (str): Any piece of text.
Returns:
str: That text without any characters that were punctuation.

oats.nlp.vocabulary module

get_overrepresented_tokens(interesting_text, background_text, max_features)

See https://liferay.de.dariah.eu/tatom/feature_selection.html. This approach uses the difference in the rate of each particular word between the interesting text and the background text to determine what the vocabulary of relevant words should be. This means we are selecting as features things that are important in some particular domain but are not as important in the general language. This is potentially one method of removing the general words from the text that is parsed for a particular domain. A potential problem is that we actually want words (features) that are good at differentiating different phenotypes, which is a slightly different question.

Args:

interesting_text (str): A string of many tokens coming from examples of interest.

background_text (str): A string of many tokens coming from some background examples.

max_features (int): The maximum number of features (tokens) in the returned vocabulary.

Returns:
list: A list of features, which are tokens or words, that are overrepresented in the interesting text relative to the background text.
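The rate-difference idea can be sketched in a few lines. This is a simplified illustration and not the oats implementation: whitespace tokenization, the alphabetical tie-break, and the name `overrepresented_tokens_sketch` are assumptions.

```python
from collections import Counter


def overrepresented_tokens_sketch(interesting_text, background_text, max_features):
    """Hypothetical stand-in: rank tokens by rate in interesting minus rate in background."""
    interesting = Counter(interesting_text.split())
    background = Counter(background_text.split())
    n_interesting = sum(interesting.values()) or 1  # avoid dividing by zero
    n_background = sum(background.values()) or 1

    def rate_difference(token):
        return interesting[token] / n_interesting - background[token] / n_background

    # Largest difference first; ties are broken alphabetically for determinism.
    ranked = sorted(interesting, key=lambda token: (-rate_difference(token), token))
    return ranked[:max_features]


print(overrepresented_tokens_sketch("leaf leaf root the the", "the the the of of leaf", 2))
# ['leaf', 'root']
```

Here "the" is frequent in both texts, so its rate difference is negative and it ranks last despite being common in the interesting text.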
get_vocab_from_tokens(tokens)

Generates a mapping between each token and some indices 0 to n that can place that token at a particular index within a vector. This is a vocabulary dict that is in the format necessary for passing as an argument to the sklearn classes for generating feature vectors from input text.

Args:
tokens (list): A list of tokens that belong in the vocabulary.
Returns:
dict: A mapping between each token and an index from zero to n.
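A minimal sketch of such a mapping is shown below. The name `get_vocab_from_tokens_sketch` is hypothetical, and assigning indices in order of first appearance while ignoring repeats is an assumption.

```python
def get_vocab_from_tokens_sketch(tokens):
    """Hypothetical stand-in: map each distinct token to a column index."""
    unique_tokens = list(dict.fromkeys(tokens))  # drops repeats, keeps first-seen order
    return {token: index for index, token in enumerate(unique_tokens)}


print(get_vocab_from_tokens_sketch(["leaf", "root", "leaf", "stem"]))
# {'leaf': 0, 'root': 1, 'stem': 2}
```

A dictionary of this shape can be passed as the `vocabulary` argument of scikit-learn's `CountVectorizer` or `TfidfVectorizer`, which fixes the column that each token occupies in the resulting feature vectors.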

Get a dictionary mapping tokens in some text to related words found with Word2Vec. Note that these are not necessarily truly synonyms, but may just be words that are strongly or weakly related to a given word, depending on how strict the threshold parameters are that are used.

Args:

description (str): Any string of text, a description of something.

model (Word2Vec): The actual model object that has already been loaded.

threshold (float): Similarity threshold that must be satisfied to add a word as related.

max_qty (int): Maximum number of related words to accept for a single token.

Returns:
dict: A mapping from a string to a list of strings, the found related words.

Method to generate a list of words that are found to be related to the input word through assessing similarity to other words in a word2vec model of word embeddings. The model can be learned from relevant text data or can be pre-trained on an existing source. All words that satisfy the threshold provided up to the quantity specified as the maximum are added.

Args:

word (str): The word for which we want to find other related words.

model (Word2Vec): The actual model object that has already been loaded.

threshold (float): Similarity threshold that must be satisfied to add a word as related.

max_qty (int): Maximum number of related words to accept.

Returns:
list: The list of related words that were found, could be empty if nothing was found.

Get a dictionary mapping tokens in some text to related words found with WordNet. Note that these could not only be synonyms but also hypernyms and hyponyms depending on what parameters are used.

Args:

description (str): Any string of text, a description of something.

synonyms (int, optional): Set to 1 to include synonyms in the set of related words.

hypernyms (int, optional): Set to 1 to include hypernyms in the set of related words.

hyponyms (int, optional): Set to 1 to include hyponyms in the set of related words.

Returns:
dict: A mapping from a string to a list of strings, the found related words.

Method to generate a list of words that are found to be related to the input word through the WordNet ontology and resource. The correct sense of the input word to be used within the context of WordNet is picked based on disambiguation from the PyWSD package which takes the surrounding text (or whatever text is provided as context) into account. All synonyms, hypernyms, and hyponyms are considered to be related words in this case.

Args:

word (str): The word for which we want to find related words.

context (str): Text to use for word-sense disambiguation, usually the sentence the word is in.

synonyms (int, optional): Set to 1 to include synonyms in the set of related words.

hypernyms (int, optional): Set to 1 to include hypernyms in the set of related words.

hyponyms (int, optional): Set to 1 to include hyponyms in the set of related words.

Returns:
list: The list of related words that were found, could be empty if nothing was found.
reduce_vocab_connected_components(descriptions, tokens, distance_matrix, threshold)

Reduces the vocabulary size for a dataset of provided tokens by looking at a provided distance matrix between all the words and creating new tokens to represent groups of words that have a small distance (less than the threshold) between two of the members of that group. This problem is solved here as a connected components problem by creating a graph where nodes are words, and each word is connected to itself and to any word where the distance to that word is less than the threshold. Note that the Linhares Pontes algorithm is generally preferable to this approach because if the threshold is too high the connected components can quickly become very large.

Args:

descriptions (dict): A mapping between IDs and text descriptions.

tokens (list): A list of tokens from which to construct the vocabulary.

distance_matrix (np.array): An n by n square matrix of distances, where n must be the length of the tokens list and indices must correspond.

threshold (float): The value where a distance of less than this threshold indicates the words should be collapsed to a new token.

Returns:

dict: Mapping between IDs and text descriptions with reduced vocabulary, matches input.

dict: Mapping between tokens present in the original texts and corresponding tokens from the reduced vocabulary.

dict: Mapping between tokens present in the reduced vocab and lists of corresponding original vocabulary tokens.
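The connected components formulation can be sketched with a small union-find structure. This is an illustration under stated assumptions and not the oats implementation: the function name is hypothetical, and naming each merged token by joining its members with underscores is an invented convention.

```python
def reduce_vocab_connected_components_sketch(descriptions, tokens, distance_matrix, threshold):
    """Hypothetical stand-in: merge tokens linked by distances below the threshold."""
    parent = list(range(len(tokens)))

    def find(i):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if distance_matrix[i][j] < threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for i, token in enumerate(tokens):
        groups.setdefault(find(i), []).append(token)

    # Assumed naming scheme: the new token is the group members joined by "_".
    reduced_to_original = {"_".join(members): members for members in groups.values()}
    original_to_reduced = {
        token: name for name, members in reduced_to_original.items() for token in members
    }
    reduced_descriptions = {
        id_: " ".join(original_to_reduced.get(token, token) for token in text.split())
        for id_, text in descriptions.items()
    }
    return reduced_descriptions, original_to_reduced, reduced_to_original


tokens = ["leaf", "leaves", "root"]
distances = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8], [0.9, 0.8, 0.0]]
texts = {1: "leaf small", 2: "leaves and root"}
print(reduce_vocab_connected_components_sketch(texts, tokens, distances, 0.3)[0])
# {1: 'leaf_leaves small', 2: 'leaf_leaves and root'}
```

Because membership is transitive, a chain of pairwise-close words ends up in one component even if its endpoints are far apart, which is how components grow quickly when the threshold is generous.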

reduce_vocab_linares_pontes(descriptions, tokens, distance_matrix, n)

Implementation of the algorithm described in the paper cited below. In short, this returns the descriptions with each word replaced by the most frequently used token in the set of tokens that consists of that word and the n most similar words as given by the distance matrix provided. Some values of n that are used in the paper are 1, 2, and 3. Note that the descriptions in the passed-in dictionary should already be preprocessed in whatever way is necessary, but they should at least be formatted as lowercase tokens that are separated by a single space in each description. The tokens in the list of tokens should be pulled directly from those descriptions and be found by splitting by a single space. They are passed in as a separate list, though, because the index of the token in the list has to correspond to the index of that token in the distance matrix. Any tokens in the descriptions that are not present in the tokens list are left unchanged when the other tokens in the descriptions are replaced.

Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno, Andréa Carneiro Linhares. Automatic Text Summarization with a Reduced Vocabulary Using Continuous Space Vectors. 21st International Conference on Applications of Natural Language to Information Systems (NLDB), 2016, Salford, United Kingdom. pp. 440-446. doi:10.1007/978-3-319-41754-7_46. hal-01779440.

Args:

descriptions (dict): A mapping between IDs and text descriptions.

tokens (list): A list of strings which are tokens that appear in the descriptions.

distance_matrix (np.array): A square array of distances between the ith and jth token in the tokens list.

n (int): The number of most similar words to consider when replacing a word in building the reduced vocabulary.

Returns:

dict: Mapping between IDs and text descriptions with reduced vocabulary, matches input.

dict: Mapping between tokens present in the original vocab and the token it is replaced with in the reduced vocabulary.

dict: Mapping between tokens present in the reduced vocab and lists of corresponding original vocabulary tokens.
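The replacement rule summarized above can be sketched as follows. This is a simplified reading of the description and not the oats implementation: the function name is hypothetical, frequencies are counted over the descriptions passed in, and ties are broken in favour of the word itself and then its nearest neighbours, which is an assumption.

```python
from collections import Counter


def reduce_vocab_linares_pontes_sketch(descriptions, tokens, distance_matrix, n):
    """Hypothetical stand-in: replace each token by the most frequent of its n nearest tokens."""
    counts = Counter(tok for text in descriptions.values() for tok in text.split(" "))

    original_to_reduced = {}
    for i, token in enumerate(tokens):
        # Indices of all other tokens, nearest first.
        others = sorted(
            (j for j in range(len(tokens)) if j != i),
            key=lambda j: distance_matrix[i][j],
        )
        candidates = [token] + [tokens[j] for j in others[:n]]
        # max() returns the first maximal item, so ties favour the word itself.
        original_to_reduced[token] = max(candidates, key=lambda t: counts[t])

    reduced_to_original = {}
    for original, reduced in original_to_reduced.items():
        reduced_to_original.setdefault(reduced, []).append(original)

    reduced_descriptions = {
        id_: " ".join(original_to_reduced.get(tok, tok) for tok in text.split(" "))
        for id_, text in descriptions.items()
    }
    return reduced_descriptions, original_to_reduced, reduced_to_original


tokens = ["leaf", "leaves", "root", "roots"]
distances = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
texts = {1: "leaf leaf root", 2: "leaves roots roots"}
print(reduce_vocab_linares_pontes_sketch(texts, tokens, distances, 1)[0])
# {1: 'leaf leaf roots', 2: 'leaf roots roots'}
```

Unlike the connected components approach, each word only ever considers a fixed number of neighbours, so the size of every group is bounded by n + 1.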

token_enrichment(all_ids_to_texts, group_ids)

Obtains a dataframe with the results of a token enrichment analysis using Fisher's exact test, with the results sorted by p-value.

Args:

all_ids_to_texts (dict of int:str): A mapping between unique integer IDs (for genes) and some string of text.

group_ids (list of int): The IDs which should be a subset of the dictionary argument that refer to those belonging to the group to be tested.

Returns:
pandas.DataFrame: A dataframe sorted by p-value that contains the results of the enrichment analysis with one row per token.
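To show what a per-token enrichment test involves, the sketch below builds a 2x2 contingency table for each token (in group or not, text contains the token or not) and computes a one-sided Fisher's exact p-value from the hypergeometric distribution. It is not the oats implementation: the function names are hypothetical, the one-sided alternative is an assumption, and a plain sorted list of tuples stands in for the returned dataframe.

```python
from math import comb


def fisher_greater_p(a, b, c, d):
    """One-sided Fisher p-value for the table [[a, b], [c, d]]: P(X >= a)."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    # Hypergeometric tail: k of the row1 group members contain the token.
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(total, row1)


def token_enrichment_sketch(all_ids_to_texts, group_ids):
    """Hypothetical stand-in: (token, p-value) pairs sorted by p-value."""
    group = set(group_ids) & set(all_ids_to_texts)
    token_sets = {id_: set(text.split()) for id_, text in all_ids_to_texts.items()}
    vocabulary = set().union(*token_sets.values())
    rows = []
    for token in vocabulary:
        a = sum(1 for id_ in group if token in token_sets[id_])  # in group, has token
        b = len(group) - a                                       # in group, lacks token
        c = sum(1 for id_, toks in token_sets.items() if id_ not in group and token in toks)
        d = len(token_sets) - len(group) - c
        rows.append((token, fisher_greater_p(a, b, c, d)))
    # Smallest p-value first; ties are broken alphabetically for determinism.
    return sorted(rows, key=lambda row: (row[1], row[0]))


texts = {1: "dwarf plant", 2: "dwarf leaves", 3: "tall plant", 4: "tall leaves"}
for token, p_value in token_enrichment_sketch(texts, [1, 2]):
    print(token, round(p_value, 3))
```

In this toy example "dwarf" appears in both group texts and in none of the others, so it receives the smallest p-value (1/6 with only four texts).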

Module contents