oats.distances package¶
Submodules¶
oats.distances.distances module¶
-
class RectangularPairwiseDistances(metric_str, edgelist, row_vector_dictionary, col_vector_dictionary, id_to_row_index, id_to_col_index, row_index_to_id, col_index_to_id, array, vectorizing_function, vectorizing_function_kwargs, vectorizer_object=None)¶
Bases: object
An object that contains the results of doing a pairwise comparison between two groups of texts.
- Attributes:
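array (numpy.array): The distance values that were calculated.
col_index_to_id (dict of int:int): Mapping between column indices in the array and ID integers.
col_vector_dictionary (dict of int:numpy.array): Mapping between column node IDs and vector representations.
edgelist (pandas.DataFrame): The edges of the graph, one per row, with format (from,to,value).
id_to_col_index (dict of int:int): Mapping between ID integers and column indices in the array.
id_to_row_index (dict of int:int): Mapping between ID integers and row indices in the array.
metric_str (str): The distance metric that was/should be used, compatible with sklearn.
row_index_to_id (dict of int:int): Mapping between row indices in the array and ID integers.
row_vector_dictionary (dict of int:numpy.array): Mapping between row node IDs and vector representations.
vectorizer_object (optional): The vectorizer object used for embedding each node.
vectorizing_function (function): The function called to convert text to a vector compatible with this graph.
vectorizing_function_kwargs (dict of str:obj): Keyword arguments that are passed to the vectorizing function.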
This class creates objects that hold the specification of a pairwise similarity or distance graph between a set of objects with IDs (such as genes in some dataset). It also remembers how the graph was constructed, so that a new piece of text or set of annotations can be placed within the context of the graph without rebuilding the entire graph. To this end, the class provides a method for vectorizing new instances of text in the same way that text was vectorized when the graph was built, and a method for finding the k nearest neighbors in the graph to a new instance of text. Note that preprocessing is not currently included: the class remembers only how vectorization is done once the text has been preprocessed, not how the text must be preprocessed to be fully compatible with these vectors.
- Args:
metric_str (str): A string indicating which distance metric was/should be used, compatible with sklearn.
edgelist (pandas.DataFrame): Each row is an edge in the graph with format (from,to,value).
row_vector_dictionary (dict of int:numpy.array): A mapping between row node IDs and vector representations.
col_vector_dictionary (dict of int:numpy.array): A mapping between column node IDs and vector representations.
id_to_row_index (dict of int:int): A mapping between ID integers and row indices in the array.
id_to_col_index (dict of int:int): A mapping between ID integers and column indices in the array.
row_index_to_id (dict of int:int): A mapping between row indices in the array and ID integers.
col_index_to_id (dict of int:int): A mapping between column indices in the array and ID integers.
array (numpy.array): A numpy array containing all the distance values that were calculated.
vectorizing_function (function): A function to call to convert text to vector compatible with this graph.
vectorizing_function_kwargs (dict of str:obj): A dictionary of keyword arguments that are passed to the vectorizing function.
vectorizer_object (None, optional): The vectorizer object used for embedding each node.
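As a rough illustration of how these arguments relate to one another, the sketch below builds a small rectangular distance array from two vector dictionaries, together with the four index mappings. The helper names (`euclidean`, `build_rectangular`) and the use of plain Python lists in place of numpy arrays are assumptions made for illustration; they are not part of oats.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_rectangular(row_vector_dictionary, col_vector_dictionary):
    """Hypothetical helper: returns the distance array and the four index mappings."""
    row_ids = sorted(row_vector_dictionary)
    col_ids = sorted(col_vector_dictionary)
    id_to_row_index = {node_id: idx for idx, node_id in enumerate(row_ids)}
    id_to_col_index = {node_id: idx for idx, node_id in enumerate(col_ids)}
    row_index_to_id = {idx: node_id for node_id, idx in id_to_row_index.items()}
    col_index_to_id = {idx: node_id for node_id, idx in id_to_col_index.items()}
    # One row per row ID and one column per column ID.
    array = [
        [euclidean(row_vector_dictionary[r], col_vector_dictionary[c]) for c in col_ids]
        for r in row_ids
    ]
    return array, id_to_row_index, id_to_col_index, row_index_to_id, col_index_to_id

rows = {1: [0.0, 0.0], 2: [3.0, 4.0]}
cols = {10: [0.0, 0.0], 11: [6.0, 8.0], 12: [3.0, 0.0]}
array, id_to_row, id_to_col, row_to_id, col_to_id = build_rectangular(rows, cols)

# Look up the distance between row node 2 and column node 11 through the mappings.
distance = array[id_to_row[2]][id_to_col[11]]
```

The two pairs of mappings are inverses of each other, which is what allows results found by position in the array to be reported back as IDs.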
-
get_nearest_neighbor_ids(text, k)¶ Returns a list of k IDs which are the closest to a given string of text. Currently written to accept a single string of text, not a list of strings. The KNN model is also generated inside this method. TODO: move this outside of the method if it is too slow.
- Args:
text (str): Any string of text.
k (int): The number of neighbor IDs to return.
- Returns:
- list: A list of the IDs for the nearest neighbors to the input text.
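The lookup described here can be sketched as a brute-force search. This is a minimal sketch, not the oats implementation: it assumes the input text has already been converted to a vector, it ranks stored vectors by cosine distance rather than fitting a KNN model, and the names `cosine_distance` and `nearest_neighbor_ids` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbor_ids(query_vector, vector_dictionary, k):
    """Return the k IDs whose stored vectors are closest to query_vector."""
    ranked = sorted(
        vector_dictionary,
        key=lambda node_id: cosine_distance(query_vector, vector_dictionary[node_id]),
    )
    return ranked[:k]

# Toy vectors standing in for the vectors stored with the graph.
vectors = {101: [1.0, 0.0], 102: [0.0, 1.0], 103: [1.0, 1.0]}
query = [0.9, 0.1]  # stands in for the vector of a new text string
neighbors = nearest_neighbor_ids(query, vectors, k=2)
```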
-
get_vector(text)¶
-
class SquarePairwiseDistances(metric_str, edgelist, vector_dictionary, id_to_index, index_to_id, array, vectorizing_function=None, vectorizing_function_kwargs=None, vectorizer_object=None)¶
Bases: object
An object that contains the results of doing a pairwise comparison within one group of texts.
- Attributes:
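array (numpy.array): The distance values that were calculated.
edgelist (pandas.DataFrame): The edges of the graph, one per row, with format (from,to,value).
id_to_index (dict of int:int): Mapping between node IDs and indices in the distance matrix.
index_to_id (dict of int:int): Mapping between indices in the distance matrix and node IDs.
metric_str (str): The distance metric that was/should be used, compatible with sklearn.
vector_dictionary (dict of int:numpy.array): Mapping between node IDs and vector representations.
vectorizer_object (optional): The vectorizer object used for embedding each node.
vectorizing_function (function): The function called to convert text to a vector compatible with this graph.
vectorizing_function_kwargs (dict of str:obj): Keyword arguments that are passed to the vectorizing function.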
This class creates objects that hold the specification of a pairwise similarity or distance matrix between a set of objects with IDs (such as genes in some dataset). It also remembers how the graph was constructed, so that a new piece of text or set of annotations can be placed within the context of the graph without rebuilding the entire graph. To this end, the class provides a method for vectorizing new instances of text in the same way that text was vectorized when the graph was built, and a method for finding the k nearest neighbors in the graph to a new instance of text. Note that preprocessing is not currently included: the class remembers only how vectorization is done once the text has been preprocessed, not how the text must be preprocessed to be fully compatible with these vectors.
- Args:
metric_str (str): A string indicating which distance metric was/should be used, compatible with sklearn.
vectorizing_function (function): A function to call to convert text to vector compatible with this graph.
vectorizing_function_kwargs (dict of str:obj): A dictionary of keyword arguments that are passed to the vectorizing function.
edgelist (pandas.DataFrame): Each row is an edge in the graph with format (from,to,value).
vector_dictionary (dict of int:numpy.array): A mapping between node IDs and vector representations.
vectorizer_object (None, optional): The vectorizer object used for embedding each node.
id_to_index (dict of int:int): A mapping between node IDs and indices in the distance matrix.
index_to_id (dict of int:int): A mapping between indices in the distance matrix and node IDs.
array (numpy.array): A numpy array containing all the distance values that were calculated.
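The way `id_to_index`, `index_to_id`, `array`, and `edgelist` fit together can be sketched with toy values. Plain lists stand in for the numpy array and the pandas DataFrame, `lookup_distance` is a hypothetical helper, and listing each unordered pair only once in the edgelist is an assumption rather than documented behavior.

```python
ids = [7, 3, 9]
id_to_index = {node_id: idx for idx, node_id in enumerate(ids)}
index_to_id = {idx: node_id for node_id, idx in id_to_index.items()}

# Symmetric matrix with zeros on the diagonal (toy distance values).
array = [
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]

def lookup_distance(id1, id2):
    """Hypothetical lookup: translate IDs to indices, then read the matrix."""
    return array[id_to_index[id1]][id_to_index[id2]]

# One (from, to, value) row per unordered pair of distinct nodes (an assumption).
edgelist = [
    (index_to_id[i], index_to_id[j], array[i][j])
    for i in range(len(ids))
    for j in range(i + 1, len(ids))
]
```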
-
get_distance(id1, id2)¶
-
get_distances(text)¶
-
get_nearest_neighbor_ids(text, k)¶ Returns a list of k IDs which are the closest to a given string of text. Currently written to accept a single string of text, not a list of strings. The KNN model is also generated inside this method. TODO: move this outside of the method if it is too slow.
- Args:
text (str): Any string of text.
k (int): The number of neighbor IDs to return.
- Returns:
- list: A list of the IDs for the nearest neighbors to the input text.
-
get_vector(text)¶
oats.distances.pairwise module¶
-
vectorize_with_bert(text, model, tokenizer, method='sum', layers=4)¶ This function uses a pretrained BERT model to infer a document-level vector for a collection of one or more sentences. The sentences are defined using the nltk sentence parser. This is done because the BERT encoder expects either a single sentence or a pair of sentences. The internal representations are drawn from the last n layers, as specified by the layers argument. Each represents a particular token but accounts for its context, because the entire sentence is input simultaneously. The vectors for the layers can be concatenated or summed together based on the method argument. The vectors obtained for each token are then averaged together to form the document-level vector.
- Args:
text (str): Any arbitrary text string.
model (pytorch model): An already loaded BERT PyTorch model from a file or other source.
tokenizer (bert tokenizer): Object which handles how tokenization specific to BERT is done.
method (str): A string indicating how layers for a token should be combined (concat or sum).
layers (int): An integer saying how many layers should be used for each token.
- Returns:
- numpy.Array: A numpy array which is the vector embedding for the passed in text.
- Raises:
- ValueError: The method argument has to be either ‘concat’ or ‘sum’.
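The layer combination and token averaging described above can be sketched without a real model. This is a minimal sketch assuming each token is given as a list of per-layer hidden vectors; the toy numbers are not BERT output, and `combine_layers` and `document_vector` are hypothetical names.

```python
def combine_layers(token_layers, method="sum", layers=4):
    """Combine the last `layers` hidden-layer vectors for one token."""
    selected = token_layers[-layers:]
    if method == "sum":
        # Element-wise sum keeps the hidden size unchanged.
        return [sum(values) for values in zip(*selected)]
    if method == "concat":
        # Concatenation multiplies the vector length by the number of layers.
        return [value for layer in selected for value in layer]
    raise ValueError("method must be 'concat' or 'sum'")

def document_vector(tokens_layers, method="sum", layers=4):
    """Average the combined token vectors into one document-level vector."""
    token_vectors = [combine_layers(t, method, layers) for t in tokens_layers]
    n = len(token_vectors)
    return [sum(values) / n for values in zip(*token_vectors)]

# Two tokens, three layers each, hidden size 2.
toy = [
    [[1.0, 1.0], [2.0, 0.0], [3.0, 1.0]],
    [[0.0, 0.0], [4.0, 2.0], [1.0, 1.0]],
]
summed = document_vector(toy, method="sum", layers=2)
concatenated = document_vector(toy, method="concat", layers=2)
```

With `sum` the document vector keeps the hidden size of the model, while `concat` produces a vector `layers` times as long, which matters when vectors from different settings are compared.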
-
vectorize_with_doc2vec(text, model)¶ Generate a vector representation of a text string using Doc2Vec.
- Args:
text (str): Any arbitrary text string.
model (gensim.models.Doc2Vec): A loaded Doc2Vec model object.
- Returns:
- numpy.Array: A numerical vector with length determined by the model used.
-
vectorize_with_ngrams(strs, training_texts=None, tfidf=False, **kwargs)¶ Create a vector embedding for each passed in text string.
- Args:
strs (list of str): A list of text strings that will each be translated into a numerical vector.
training_texts (list of str, optional): If provided, this is the list of texts that will be used to determine the vocabulary and weights for each token.
tfidf (bool, optional): This value is false by default, set to true to use term-frequency inverse-document-frequency weighting instead of raw counts.
**kwargs: Any other applicable keyword arguments that will be passed to the sklearn vectorization function.
- Returns:
- list of numpy.Array, object: A list of the numerical vector arrays that is the same length as the input list of text strings, and the vectorizing object.
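The role of `training_texts` can be illustrated with a simplified stand-in. The sketch below is an assumption-laden simplification, not the oats implementation: it counts whitespace-separated unigrams in plain Python instead of calling an sklearn vectorizer, it ignores the `tfidf` option, and `count_vectors` is a hypothetical name.

```python
from collections import Counter

def count_vectors(strs, training_texts=None):
    """Unigram count vectors; the vocabulary comes from training_texts if given."""
    source = training_texts if training_texts is not None else strs
    vocabulary = sorted({token for text in source for token in text.split()})
    vectors = []
    for text in strs:
        counts = Counter(text.split())
        # Tokens outside the vocabulary are simply not represented.
        vectors.append([counts[token] for token in vocabulary])
    return vectors, vocabulary

vectors, vocabulary = count_vectors(
    ["small leaves small", "tall plant"],
    training_texts=["small leaves", "small plant"],
)
```

In this toy run the token "tall" does not appear in the training texts, so it has no position in the resulting vectors.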
-
vectorize_with_similarities(text, vocab_tokens, vocab_token_to_index, vocab_matrix)¶ Generate a vector representation of a text string based on a word similarity matrix. The resulting vector has n positions, where n is the number of words or tokens in the full vocabulary. The value at each position indicates the maximum similarity between that corresponding word in the vocabulary and any of the words or tokens in the input text string, as given by the input similarity matrix. Therefore, this is similar to an n-grams approach but uses the similarity between non-identical words or tokens to make the vector semantically meaningful.
- Args:
text (str): Any arbitrary text string.
vocab_tokens (list of str): The words or tokens that make up the entire vocabulary.
vocab_token_to_index (dict of str:int): Mapping between words in the vocabulary and an index in rows and columns of the matrix.
vocab_matrix (numpy.array): A pairwise distance matrix holding the similarity values between all possible pairs of words in the vocabulary.
- Returns:
- numpy.Array: A numerical vector with length equal to the size of the vocabulary.
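The maximum-similarity rule described above can be sketched as follows. This is a sketch under stated assumptions: the text is split on whitespace, tokens missing from the vocabulary are skipped, a text with no in-vocabulary tokens yields zeros, and a nested list stands in for the numpy matrix. The function name `max_similarity_vector` is hypothetical.

```python
def max_similarity_vector(text, vocab_tokens, vocab_token_to_index, vocab_matrix):
    """Position i holds the highest similarity between vocabulary word i
    and any in-vocabulary token of the input text."""
    indices = [vocab_token_to_index[t] for t in text.split() if t in vocab_token_to_index]
    vector = []
    for token in vocab_tokens:
        row = vocab_matrix[vocab_token_to_index[token]]
        vector.append(max((row[j] for j in indices), default=0.0))
    return vector

vocab_tokens = ["leaf", "leaves", "root"]
vocab_token_to_index = {token: idx for idx, token in enumerate(vocab_tokens)}
# Toy word-to-word similarity values.
vocab_matrix = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
vector = max_similarity_vector("leaf shape", vocab_tokens, vocab_token_to_index, vocab_matrix)
```

Although the text contains only "leaf", the position for "leaves" still receives a high value, which is the sense in which this differs from a plain n-grams vector.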
-
vectorize_with_word2vec(text, model, method)¶ Generate a vector representation of a text string using Word2Vec.
- Args:
text (str): Any arbitrary text string.
model (gensim.models.Word2Vec): A loaded Word2Vec model object.
method (str): Either ‘mean’ or ‘max’, indicating how word vectors should combine to form the document vector.
- Returns:
- numpy.Array: A numerical vector with length determined by the model used.
- Raises:
- Error: The string for the method argument is not recognized.
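The two ways of combining word vectors can be sketched with a toy embedding table in place of a loaded Word2Vec model. The name `pool_word_vectors` is hypothetical, and raising `ValueError` for an unrecognized method is an assumption, since the exception type is not specified above.

```python
def pool_word_vectors(word_vectors, method):
    """Combine per-word vectors into a single document vector."""
    if method == "mean":
        n = len(word_vectors)
        return [sum(values) / n for values in zip(*word_vectors)]
    if method == "max":
        # Element-wise maximum across all word vectors.
        return [max(values) for values in zip(*word_vectors)]
    raise ValueError("method must be 'mean' or 'max'")

# Toy embeddings standing in for lookups in a real model.
toy_embeddings = {"small": [1.0, 4.0], "leaves": [3.0, 0.0]}
word_vectors = [toy_embeddings[word] for word in "small leaves".split()]

mean_vector = pool_word_vectors(word_vectors, "mean")
max_vector = pool_word_vectors(word_vectors, "max")
```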
-
with_annotations(ids_to_annotations, ontology, metric, tfidf=False, **kwargs)¶ Find distance between nodes of interest in the input dictionary based on the overlap in the ontology terms that are mapped to those nodes. The input terms for each ID are in the format of lists of term IDs. All inherited terms by all these terms will be added in this function using the provided ontology object so that each node will be represented by the union of all the terms inherited by the terms annotated to it. After that step, the term IDs are simply treated as words in a vocabulary, and the same approach as with n-grams is used to generate the distance matrix.
- Args:
ids_to_annotations (dict): A mapping between IDs and a list of ontology term ID strings.
ontology (Ontology): Ontology object with all necessary fields.
metric (str): A string indicating which distance metric should be used (e.g., cosine).
tfidf (bool, optional): Whether to use TFIDF weighting or not.
**kwargs: All the keyword arguments that can be passed to sklearn.feature_extraction.text.CountVectorizer().
- Returns:
- oats.distances.distances.SquarePairwiseDistances: Distance matrix and accompanying information.
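The step of adding inherited terms can be sketched with a toy hierarchy. Here a plain `parents` dictionary stands in for the Ontology object, and the term IDs and the helper `with_inherited` are made up for illustration.

```python
def with_inherited(term_ids, parents):
    """Return term_ids plus every ancestor reachable through the parents mapping."""
    seen = set()
    stack = list(term_ids)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen

# Toy hierarchy: T:3 -> T:2 -> T:1 and T:4 -> T:1.
parents = {"T:3": ["T:2"], "T:2": ["T:1"], "T:4": ["T:1"]}
ids_to_annotations = {1: ["T:3"], 2: ["T:4", "T:2"]}

ids_to_terms = {
    node_id: with_inherited(terms, parents)
    for node_id, terms in ids_to_annotations.items()
}

# Joined into space-separated strings so that term IDs can be treated as words.
ids_to_joined = {node_id: " ".join(sorted(terms)) for node_id, terms in ids_to_terms.items()}
```

After this step the two nodes share the terms T:1 and T:2 even though their direct annotations only partly overlap, which is what lets the n-grams approach capture their relatedness.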
-
with_bert(ids_to_texts, model, tokenizer, metric, method, layers)¶ Find distance between strings of text in some input data using BERT. The preprocessing done to the text strings here is complex, and uses the passed-in tokenizer object as well. For this reason, in most cases the text passed to this function should be the raw, relatively unprocessed sentence of interest. Splitting up multiple sentences is handled in the helper function for this function.
- Args:
ids_to_texts (dict): A mapping between IDs and strings of text.
model (pytorch model): An already loaded BERT PyTorch model from a file or other source.
tokenizer (bert tokenizer): Object which handles how tokenization specific to BERT is done.
metric (str): A string indicating which distance metric should be used (e.g., cosine).
method (str): A string indicating how layers for a token should be combined (concat or sum).
layers (int): An integer saying how many layers should be used for each token.
- Returns:
- oats.distances.distances.SquarePairwiseDistances: Distance matrix and accompanying information.
-
with_doc2vec(ids_to_texts, model, metric)¶ Find distance between strings of text in some input data using Doc2Vec. Note that only very simple preprocessing is done here (case normalizing and splitting on whitespace) so any preprocessing steps on the text strings should be done prior to passing them in a dictionary to this function.
- Args:
ids_to_texts (dict): A mapping between IDs and strings of text.
model (gensim.models.doc2vec): An already loaded Doc2Vec model from a file or training.
metric (str): A string indicating which distance metric should be used (e.g., cosine).
- Returns:
- oats.distances.distances.SquarePairwiseDistances: Distance matrix and accompanying information.
-
with_ngrams(ids_to_texts, metric, training_texts=None, tfidf=False, **kwargs)¶ Find distance between strings of text in some input data using n-grams. Note that only very simple preprocessing is done after this point (splitting on whitespace only), so all necessary processing of the text should be done prior to passing it to this function.
- Args:
ids_to_texts (dict): A mapping between IDs and strings of text.
metric (str): A string indicating which distance metric should be used (e.g., cosine).
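training_texts (list of str, optional): If provided, this is the list of texts that will be used to determine the vocabulary and weights for each token.
tfidf (bool, optional): Whether to use TFIDF weighting or not.
**kwargs: All the keyword arguments that can be passed to sklearn.feature_extraction.text.CountVectorizer().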
- Returns:
- oats.distances.distances.SquarePairwiseDistances: Distance matrix and accompanying information.
-
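with_precomputed_vectors(ids_to_vectors, metric)¶ Find distance between nodes of interest using vectors that have already been computed for them.
- Args:
ids_to_vectors (dict): A mapping between IDs and precomputed vector representations.
metric (str): A string indicating which distance metric should be used (e.g., cosine).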
- Returns:
- TYPE: Description
-
with_similarities(ids_to_texts, vocab_tokens, vocab_matrix, metric)¶
-
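with_topic_model(ids_to_texts, metric, training_texts=None, seed=124134, num_topics=10, algorithm='lda', **kwargs)¶ Find distance between strings of text in some input data using a topic model.
- Args:
ids_to_texts (dict): A mapping between IDs and strings of text.
metric (str): A string indicating which distance metric should be used (e.g., cosine).
seed (int, optional): The random seed, 124134 by default.
num_topics (int, optional): The number of topics, 10 by default.
algorithm (str, optional): The topic modeling algorithm to use, 'lda' by default.
**kwargs: Additional keyword arguments.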
- Returns:
- TYPE: Description
- Raises:
- ValueError: Description
-
with_word2vec(ids_to_texts, model, metric, method='mean')¶ Find distance between strings of text in some input data using Word2Vec. Note that only very simple preprocessing is done here (case normalizing and splitting on whitespace) so any preprocessing steps on the text strings should be done prior to passing them in a dictionary to this function. Note that if no words in a description are in the model vocabulary, then a random word will be selected to represent the text. This avoids using one default value which will force all these descriptions to cluster, and prevents an error being raised due to no vector appearing. This should rarely or never happen as long as the text has been preprocessed into reasonable tokens and the model is large.
- Args:
ids_to_texts (dict): A mapping between IDs and strings of text.
model (gensim.models.word2vec): An already loaded Word2Vec model from a file or training.
metric (str): A string indicating which distance metric should be used (e.g., cosine).
method (str, optional): Should the word embeddings be combined with mean or max.
- Returns:
- oats.distances.distances.SquarePairwiseDistances: Distance matrix and accompanying information.
- No Longer Raises:
- Error: The ‘method’ argument has to be one of “mean” or “max”.