oats.biology package¶
Submodules¶
oats.biology.dataset module¶
-
class
Dataset(data=None, keep_ids=False, case_sensitive=False)¶
Bases: object
A class that wraps a dataframe containing gene names, text, and ontology annotations.
- Attributes:
- df (pandas.DataFrame): The dataframe containing the dataset accessed through this class.
- Args:
- data (pandas.DataFrame or str, optional): A dataframe containing the data to be added to this dataset, or the path to a csv file that can be loaded as a comparable dataframe. The dataframe must contain the columns “species”, “gene_names”, “gene_synonyms”, “description”, “term_ids”, and “sources”. Any of those columns can contain any number of missing values, but the columns must exist. Columns outside of that list are ignored. Gene names, symbols, and ontology term IDs in the “gene_names”, “gene_synonyms”, and “term_ids” columns should be pipe delimited.
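As a rough illustration of the input layout described above, the sketch below writes a small csv file with the six required columns and pipe-delimited values, using only the standard library. The gene names, term IDs, file name, and the helper `missing_columns` are hypothetical and not part of oats. The commented-out lines at the end show how such a file might then be passed to `Dataset`.

```python
import csv
import os
import tempfile

# Column names required by Dataset, as listed in the Args above.
REQUIRED_COLUMNS = ["species", "gene_names", "gene_synonyms",
                    "description", "term_ids", "sources"]


def missing_columns(fieldnames):
    """Return the required columns that are absent from fieldnames."""
    present = set(fieldnames)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Two made-up records; multi-valued fields are pipe delimited and
# missing values are left as empty strings.
rows = [
    {"species": "ath", "gene_names": "GENE1|AT1G00001", "gene_synonyms": "G1",
     "description": "Short plants with pale leaves.",
     "term_ids": "PATO:0000001|PO:0000001", "sources": "example"},
    {"species": "zma", "gene_names": "GENE2", "gene_synonyms": "",
     "description": "", "term_ids": "", "sources": "example"},
]

path = os.path.join(tempfile.mkdtemp(), "example_dataset.csv")
with open(path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)

# Read the header back and confirm every required column exists.
with open(path, newline="") as handle:
    header = next(csv.reader(handle))
print(missing_columns(header))

# With oats installed, the file could then be loaded (not run here):
# from oats.biology.dataset import Dataset
# dataset = Dataset(data=path)
```

Checking the header before constructing the dataset is only a convenience; how `Dataset` itself reacts to a missing column is not described on this page.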
-
add_data(new_data, case_sensitive=False)¶ Add additional data to this dataset.
- Args:
- new_data (pandas.DataFrame or str): A dataframe containing the data to be added to this dataset, or the path to a csv file that can be loaded as a comparable dataframe. The dataframe must contain the columns “species”, “gene_names”, “gene_synonyms”, “description”, “term_ids”, and “sources”. Any of those columns can contain any number of missing values, but the columns must exist. Columns outside of that list are ignored. Gene names, symbols, and ontology term IDs in the “gene_names”, “gene_synonyms”, and “term_ids” columns should be pipe delimited.
-
annotations(ontologies=None)¶ Get a mapping from IDs to lists of ontology term IDs.
- Returns:
- dict of int:list of str: Mapping between record IDs and lists of ontology term IDs.
- Args:
- ontologies (list of str, optional): Names of ontologies to subset the annotations.
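The sketch below shows one way such a subset could work on a plain dictionary. It assumes term IDs carry the ontology name as a prefix before a colon (for example `PATO:0000001`), which this page does not state. `subset_annotations` is a hypothetical helper, not the method's implementation.

```python
def subset_annotations(id_to_term_ids, ontologies=None):
    """Keep only term IDs whose prefix names one of the given ontologies.

    Assumes term IDs look like "PATO:0000001". With ontologies=None the
    mapping is returned with every term ID kept.
    """
    if ontologies is None:
        return {i: list(terms) for i, terms in id_to_term_ids.items()}
    wanted = {name.upper() for name in ontologies}
    return {
        i: [t for t in terms if t.split(":", 1)[0].upper() in wanted]
        for i, terms in id_to_term_ids.items()
    }


# Made-up record IDs and term IDs.
annotations = {0: ["PATO:0000001", "PO:0000001"], 1: ["GO:0000001"], 2: []}
print(subset_annotations(annotations, ontologies=["PATO", "PO"]))
```

In this sketch, records with no matching terms keep an empty list rather than being dropped; whether `annotations()` behaves the same way is not specified here.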
-
describe()¶ Returns a summarizing dataframe for this dataset.
-
descriptions()¶ Get a mapping from record IDs to text descriptions.
- Returns:
- dict of int:str: Mapping between record IDs and text descriptions.
-
filter_by_species(species)¶ Remove all records not related to these species.
- Args:
- species (list of str): A list of strings referring to the species names.
-
filter_has_annotation(ontology_name=None)¶ Remove all records that don’t have at least one related ontology term annotation.
- Args:
- ontology_name (str, optional): The name of an ontology (e.g., “PATO”, “PO”). If this ontology name is provided, then only annotations from that ontology are considered when filtering the dataset.
-
filter_has_description()¶ Remove all records that don’t have a related text description.
-
filter_randomly(k, seed=1483)¶ Remove all but k randomly sampled records from the dataset.
- Args:
k (int): The number of records (IDs) to retain.
seed (int, optional): A seed value for the random subsampling.
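To illustrate why a seed makes the subsampling repeatable, here is a minimal sketch using the standard library. `sample_ids` is a hypothetical helper and only mirrors the documented parameters; it is not how oats performs the sampling.

```python
import random


def sample_ids(ids, k, seed=1483):
    """Return k IDs drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return sorted(rng.sample(list(ids), k))


all_ids = list(range(100))
kept = sample_ids(all_ids, k=10)
print(kept)
```

Calling `sample_ids` again with the same inputs and seed returns the same IDs, which is the property a fixed seed is meant to provide.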
-
filter_with_ids(ids)¶ Remove all records with IDs not in the provided list.
- Args:
- ids (list of int): A list of the unique integer IDs for the records to retain.
-
genes()¶ Get a mapping from record IDs to their corresponding gene objects.
- Returns:
- dict of int:oats.biology.gene.Gene: Mapping from record IDs to gene objects.
-
get_name_to_id_dictionary(unambiguous=True)¶ Get a mapping between gene names or identifiers and the corresponding ID in this dataset.
- Args:
- unambiguous (bool, optional): When True, the only keys present in the dictionary are names that map to a single gene from the dataset, which excludes names that map to genes from multiple species. This method is intended for mapping gene names (with no species information) in other files to IDs from this dataset, so those files should only be used when they contain unambiguous names or accessions that encode the species information. When False, this check is not performed, and values may be overwritten if the same key appears more than once across multiple species.
- Returns:
- dict of str:int: Mapping between gene names or identifiers and unique integer IDs.
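The effect of the `unambiguous` argument can be sketched on plain dictionaries as follows. This is an assumption-laden illustration: `name_to_id` and the input layout (record IDs mapped to lists of names) are hypothetical, and ambiguity is approximated as "the same name appears under more than one record".

```python
from collections import defaultdict


def name_to_id(id_to_names, unambiguous=True):
    """Map each name to a record ID.

    id_to_names maps record IDs to lists of names. With unambiguous=True,
    names attached to more than one record are left out; otherwise a later
    record overwrites an earlier one for the same name.
    """
    if not unambiguous:
        return {name: i for i, names in id_to_names.items() for name in names}
    name_to_ids = defaultdict(set)
    for i, names in id_to_names.items():
        for name in names:
            name_to_ids[name].add(i)
    return {name: next(iter(ids))
            for name, ids in name_to_ids.items() if len(ids) == 1}


# Made-up records: "GENE1" is shared by records 0 and 1.
records = {0: ["GENE1", "AT1G00001"], 1: ["GENE1", "GRMZM0001"], 2: ["GENE2"]}
print(name_to_id(records))
print(name_to_id(records, unambiguous=False))
```

With the check enabled, the shared name is simply absent from the result, so a lookup for it fails loudly instead of silently returning the wrong record.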
-
get_species_to_name_to_ids_dictionary(include_synonyms=False, lowercase=False)¶ Summary
- Args:
- include_synonyms (bool, optional): Description
lowercase (bool, optional): Description
- Returns:
- TYPE: Description
-
ids()¶ Get a list of the IDs for all records in this dataset.
- Returns:
- list of int: Unique integer IDs for all the records in this dataset.
-
species()¶ Get a mapping from record IDs to species names.
- Returns:
- dict of int:str: Mapping between record IDs and species names.
-
to_csv(path)¶ Writes the dataset to a csv file.
- Args:
- path (str): Path of the csv file that will be created.
-
to_json()¶ Creates a nested dictionary from this dataset.
- Returns:
- defaultdict: A nested dictionary representation of this dataset.
-
to_pandas()¶ Creates a pandas.DataFrame object from this dataset.
- Returns:
- pandas.DataFrame: A dataframe representation of this dataset.
oats.biology.gene module¶
-
class
Gene(species, unique_identifiers, other_identifiers, gene_models, primary_identifier=None)¶
Bases: object
A class representing a single gene with information about its species and different identifiers.
- Attributes:
- all_identifiers (list of str): The combined list of all the strings which represent this gene, both uniquely and not uniquely.
gene_models (list of str): Strings that refer specifically to gene model names which map to this gene, not necessarily uniquely.
other_identifiers (list of str): Names, aliases, synonyms, and gene models that are mapped to this gene but are not necessarily unique to it.
primary_identifier (str): The primary identifier for this gene.
species (str): A string referencing what species this particular gene is in.
unique_identifiers (list of str): A list of all the strings that uniquely identify this gene, such as names, accessions, or identifiers.
Summary
- Args:
species (str): Description
unique_identifiers (str): Description
other_identifiers (str): Description
gene_models (TYPE): Description
primary_identifier (str): Description
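How the documented attributes relate to one another can be sketched with a small stand-in class. `GeneSketch` is hypothetical and is not the oats `Gene` class; in particular, building `all_identifiers` as the unique identifiers followed by the other identifiers, with duplicates dropped, is an assumption based on the attribute descriptions above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeneSketch:
    """Hypothetical stand-in mirroring the documented Gene attributes."""
    species: str
    unique_identifiers: List[str]
    other_identifiers: List[str]
    gene_models: List[str]
    primary_identifier: Optional[str] = None

    @property
    def all_identifiers(self) -> List[str]:
        # Assumed: unique identifiers first, then the others, duplicates dropped.
        combined = self.unique_identifiers + self.other_identifiers
        return list(dict.fromkeys(combined))


# Made-up identifiers for illustration.
gene = GeneSketch(
    species="ath",
    unique_identifiers=["AT1G00001"],
    other_identifiers=["GENE1", "AT1G00001"],
    gene_models=["AT1G00001.1"],
)
print(gene.all_identifiers)
```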
oats.biology.groupings module¶
-
class
Groupings(path, case_sensitive=False)¶
Bases: object
This is a class for creating and containing mappings between information in the dataset of interest and relationships to other types of information such as biochemical pathways or protein-protein interactions.
- Attributes:
- df (TYPE): Description
- Args:
path (str): The path to the CSV file that is used to build this instance.
case_sensitive (bool, optional): Set to true to account for differences in capitalization between gene names or identifiers.
-
describe()¶ Returns a summarizing dataframe for this object.
- Returns:
- pandas.DataFrame: The summarizing dataframe for this object.
-
static
get_dataframe_for_kegg(paths)¶ Create a dictionary mapping KEGG pathways to lists of genes. Code is adapted from the example of parsing pathway files obtained through the KEGG REST API, which can be found here: https://biopython-tutorial.readthedocs.io/en/latest/notebooks/18%20-%20KEGG.html. The specifications for those files state that the first 12 characters of each line are reserved for the string which specifies the section, like “GENE”, and the remainder of the line is for everything else.
- Args:
- paths (TYPE): Description
- Returns:
- pandas.DataFrame: The dataframe containing all relevant information about all applicable KEGG pathways.
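The fixed-width layout described above (12 characters for the section name, the rest for content) can be parsed roughly as follows. This is a sketch, not the code used by `get_dataframe_for_kegg`: `parse_kegg_sections` is a hypothetical helper, the record is made up, and the handling of blank section names as continuation lines and of `///` as the end of a record are assumptions about the file format.

```python
def parse_kegg_sections(text):
    """Group the content of one flat-file record by section name.

    The first 12 characters of a line hold the section name (assumed blank
    on continuation lines); the rest of the line is the content.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("///"):  # assumed record terminator
            break
        name, content = line[:12].strip(), line[12:].strip()
        if name:
            current = name
        if current is not None and content:
            sections.setdefault(current, []).append(content)
    return sections


# Made-up record; ljust(12) pads each section name to the reserved width.
example_lines = [
    ("ENTRY", "xyz00001 Pathway"),
    ("NAME", "Example pathway"),
    ("GENE", "1001  GENE1; example protein one"),
    ("", "1002  GENE2; example protein two"),
    ("COMPOUND", "C00001  Example compound"),
]
example = "\n".join(n.ljust(12) + c for n, c in example_lines) + "\n///\n"

sections = parse_kegg_sections(example)
print(sections["GENE"])
```

Slicing at a fixed column rather than splitting on whitespace keeps multi-word content intact and lets continuation lines inherit the previous section name.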
-
static
get_dataframe_for_plantcyc(paths)¶
-
get_gene_names_from_group_id(species, group_id)¶ Given a group ID and species code (three letters), return a list of all the gene names which are associated with that ID in this instance of this class.
- Args:
species (TYPE): Description
group_id (TYPE): Description
- Returns:
- TYPE: Description
-
get_group_id_to_ids_dict(gene_dict)¶ Returns a mapping from group IDs to lists of IDs. Note that groups are only retained as keys if they are mapped to at least one ID.
- Args:
- gene_dict (dict of int:oats.biology.gene.Gene): Mapping between unique integer IDs from the dataset and the corresponding gene objects.
- Returns:
- dict of obj:list of int: Mapping between integers or strings (whatever datatype the group IDs are given as) and lists of unique integer IDs from the dataset.
-
get_group_ids_from_gene_name(species, gene_name)¶ Given a species code (three letters) and a gene name, return a list of group IDs that the gene might belong to.
- Args:
- species (TYPE): Description
gene_name (TYPE): Description
- Returns:
- TYPE: Description
-
get_group_ids_from_gene_obj(gene_obj)¶ Given a gene object, return a list of group IDs it belongs in.
- Args:
- gene_obj (TYPE): Description
- Returns:
- TYPE: Description
-
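get_groupings_for_dataset(dataset)¶ Returns the groupings for a dataset as a pair of mappings, one in each direction between IDs and group identifiers.
- Args:
- dataset (oats.biology.dataset.Dataset): A dataset object.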
- Returns:
- (dict,dict): A mapping from IDs to lists of group identifiers, and a mapping from group identifiers to lists of IDs.
-
get_id_to_group_ids_dict(gene_dict)¶ Returns a mapping from IDs to lists of group IDs. Note that this retains as keys even IDs that don’t map to any groups.
- Args:
- gene_dict (dict of int:oats.biology.gene.Gene): Mapping between unique integer IDs from the dataset and the corresponding gene objects.
- Returns:
- dict of int:list of obj: Mapping between unique integer IDs from the dataset and list of group IDs.
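The two directions of this mapping differ in which keys they keep: IDs without groups stay in the ID-to-groups dictionary, while a group only appears in the group-to-IDs dictionary if at least one ID maps to it. A minimal sketch of that relationship on plain dictionaries, with a hypothetical helper `invert_to_group_ids` and made-up group names:

```python
def invert_to_group_ids(id_to_group_ids):
    """Build group ID -> list of record IDs from record ID -> list of group IDs."""
    group_to_ids = {}
    for record_id, group_ids in id_to_group_ids.items():
        for group_id in group_ids:
            group_to_ids.setdefault(group_id, []).append(record_id)
    return group_to_ids


# Record 2 belongs to no group but is still a key in this direction.
id_to_group_ids = {0: ["pathway_a", "pathway_b"], 1: ["pathway_a"], 2: []}
group_id_to_ids = invert_to_group_ids(id_to_group_ids)
print(group_id_to_ids)
```

Because the inverted dictionary is built only from memberships that exist, a group with no members never becomes a key.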
-
static
save_all_kegg_pathway_files(paths)¶ Uses the KEGG REST API to find and save all pathway data files for each species in the input dictionary.
- Args:
- paths (dict of str:str): A mapping between strings referencing species and paths to the output directory for each.
-
to_pandas()¶ Returns the dataframe that this object was constructed with. This dataframe is unchanged from how it was read in from the CSV file provided as the main argument. The first three columns are fixed. The remaining columns are unused and could contain any information; they are not removed when the object is constructed.
- Returns:
- pandas.DataFrame: The internal dataframe used to define the groupings.
oats.biology.relationships module¶
-
class
AnyInteractions(name_to_id_dictionary, filename)¶
Bases: object
This is a class for accessing information about relationships between genes in a dataset of interest by parsing csv files that may contain information about relationships between some or all of those genes. The first and second columns should contain strings that refer to gene names, and will only be used if those strings match strings which are given as keys in the provided dictionary. The third column should be a numerical value indicating a weight associated with the edge, relationship, or interaction between those two given genes.
- Attributes:
df: A dataframe containing all this known information but with protein names replaced with unique IDs.
ids: A list of all the IDs which map to a protein which was mentioned at least once in the file passed in.
- Args:
name_to_id_dictionary (dict of str:int): Mapping between gene name strings and unique integer IDs from a dataset.
filename (str): Path to a csv file containing lines that identify edges between strings mentioned in the dictionary.
-
get_df()¶ Get a dataframe specifying relationships between IDs and their weights identified from this file.
- Returns:
- pandas.DataFrame: The dataframe of identified relationships.
-
get_ids()¶ Get a list of all the IDs from the passed in dictionary that were successfully associated to this file. Note that there may be IDs in this list that are not present in the dataframe of identified relationships. This happens when an ID was associated to a gene within the file, but that gene only has relationships to genes that are not mapped to an ID within the passed in dictionary.
- Returns:
- list: The list of IDs.
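The difference between the relationships dataframe and the list of IDs can be seen in a small standard-library sketch. `read_edges` is a hypothetical helper rather than the class's implementation, the file content is made up, and the absence of a header row is an assumption.

```python
import csv
import io


def read_edges(handle, name_to_id):
    """Return (edges, ids) from rows of name, name, weight.

    edges holds (id_1, id_2, weight) for rows where both names are keys of
    name_to_id; ids holds every ID whose name appeared in the file at all.
    """
    edges, seen = [], set()
    for row in csv.reader(handle):
        if len(row) < 3:
            continue
        first, second = name_to_id.get(row[0]), name_to_id.get(row[1])
        seen.update(i for i in (first, second) if i is not None)
        if first is not None and second is not None:
            edges.append((first, second, float(row[2])))
    return edges, sorted(seen)


name_to_id = {"GENE1": 0, "GENE2": 1, "GENE3": 2}
# GENE3 only interacts with a name that is not in the dictionary.
text = "GENE1,GENE2,0.9\nGENE3,UNKNOWN,0.4\n"
edges, ids = read_edges(io.StringIO(text), name_to_id)
print(edges)
print(ids)
```

Here ID 2 is reported as found in the file, yet it has no edge, which is the situation the note on `get_ids()` describes.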
-
class
ProteinInteractions(id_to_gene_dict, name_mapping_file, *string_data_files)¶
Bases: object
This is a class for accessing information about relationships between genes in a dataset of interest by looking up these genes in the STRING protein-protein interaction database.
- Attributes:
name_map (TYPE): Description
df: A dataframe containing all the STRING information but with protein names replaced with unique IDs.
ids: A list of all the IDs which map to a protein which was mentioned at least once in the STRING files passed in.
- Args:
id_to_gene_dict (dict of int:oats.biology.gene.Gene): Mapping between unique integer IDs from a dataset and gene objects.
name_mapping_file (str): The path to a file linking gene names with protein names used in STRING, available from STRING.
*string_data_files (str): Any number of paths to protein-protein interaction files obtained from STRING.
-
get_df()¶ Summary
- Returns:
- TYPE: Description
-
get_ids()¶ Summary
- Returns:
- TYPE: Description