GSForge.models package

Module contents

There are two core data models in GSForge, both of which store their associated data in an xarray.Dataset object under a data attribute. Consult the xarray documentation for how to perform any transform or selection not provided by GSForge.

Core Data Classes


AnnotatedGEM

Contains the gene expression matrix, which is indexed by ‘Gene’ and ‘Sample’ coordinates. This xarray.Dataset object can also hold other data, such as phenotype information.

GeneSet

A GeneSet is a set of genes and any associated values. A GeneSet can have a set of ‘supported’ genes, i.e. genes that are ‘within’ a given GeneSet.

These core data classes depend on only a limited set of packages:

  • numpy

  • pandas

  • xarray

  • param

This allows the creation of container images without interactive visualization libraries.

class GSForge.models.AnnotatedGEM(*args, **params)

Bases: param.parameterized.Parameterized

A data class for a gene expression matrix and any associated sample or gene annotations.

This model holds the count expression matrix and any associated labels or annotations as an xarray.Dataset object under the .data attribute. By default this dataset is expected to have its indexes named “Gene” and “Sample”, although parameters are available to override the array and index names used.

data = param.ClassSelector(readonly=False)

An xarray.Dataset object that contains the Gene Expression Matrix, and any needed annotations. This xarray.Dataset object is expected to have a count array named ‘counts’, that has coordinates (‘Gene’, ‘Sample’).

count_array_name = param.String(readonly=False)

This parameter controls which variable from the xarray.Dataset should be considered to be the ‘count’ variable. Consider using this if your count array is not named ‘counts’, or if you wish to control which count array among many should be used by default.

sample_index_name = param.String(readonly=False)

This parameter controls which variable from the xarray.Dataset should be considered to be the ‘sample’ coordinate. Consider using this if you require different coordinate names.

gene_index_name = param.String(readonly=False)

This parameter controls which variable from the xarray.Dataset should be considered to be the ‘gene index’ coordinate. Consider using this if you require different coordinate names.

data = None
count_array_name = 'counts'
sample_index_name = 'Sample'
gene_index_name = 'Gene'
property gene_index: xarray.core.dataarray.DataArray

Returns the entire gene index of this AnnotatedGEM object as an xarray.DataArray.

The variable or coordinate that this returns is controlled by the gene_index_name parameter.


Returns

The complete gene index of this AnnotatedGEM.

Return type

xarray.DataArray

property sample_index: xarray.core.dataarray.DataArray

Returns the entire sample index of this AnnotatedGEM object as an xarray.DataArray.

The actual variable or coordinate that this returns is controlled by the sample_index_name parameter.


Returns

The complete sample index of this AnnotatedGEM.

Return type

xarray.DataArray

property count_array_names: List[str]

Returns a list of all available count arrays contained within this AnnotatedGEM object.

This is done simply by returning all data variables that have the same dimension set as the default count array.


Returns

A list of available count arrays in this AnnotatedGEM.

Return type

List[str]

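The rule described above can be sketched directly against xarray. The dataset below is a made-up stand-in (the variable names log_counts and treatment are hypothetical), and the comprehension only illustrates matching dimension sets; it is not necessarily how GSForge implements this property.

```python
import numpy as np
import xarray as xr

# A small stand-in dataset: two arrays share the (Sample, Gene) dimensions,
# while 'treatment' is a per-sample annotation.
ds = xr.Dataset(
    {
        "counts": (("Sample", "Gene"), np.ones((2, 3))),
        "log_counts": (("Sample", "Gene"), np.zeros((2, 3))),
        "treatment": (("Sample",), np.array(["control", "heat"])),
    },
    coords={"Sample": ["s1", "s2"], "Gene": ["g1", "g2", "g3"]},
)

# Dimension set of the default count array.
default_dims = set(ds["counts"].dims)

# Keep every data variable whose dimension set matches it.
count_array_names = [
    name for name, var in ds.data_vars.items() if set(var.dims) == default_dims
]
print(count_array_names)
```
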
infer_variables(quantile_size: int = 10, skip: Optional[bool] = None) Dict[str, numpy.ndarray]

Infer categories for the variables in the AnnotatedGEM’s labels.

  • quantile_size (int) – The maximum number of unique elements before a variable is no longer considered as a quantile-able set of values.

  • skip (bool) – The variables to be skipped.


Returns

A dictionary of the inferred value types.

classmethod from_netcdf(netcdf_path: Union[str, pathlib.Path, IO], **params) GSForge.models._AnnotatedGEM.AnnotatedGEM

Construct an AnnotatedGEM object from a netcdf (.nc) file path.


netcdf_path (Union[str, Path, IO[AnyStr]]) – A path to a netcdf file. If this file uses names other than the defaults (Gene, Sample, counts), be sure to explicitly set the corresponding parameters (gene_index_name, sample_index_name, count_array_name).


Returns

A new instance of the AnnotatedGEM class.

classmethod from_pandas(count_df: pandas.core.frame.DataFrame, label_df: Optional[pandas.core.frame.DataFrame] = None, **params) GSForge.models._AnnotatedGEM.AnnotatedGEM

Reads in a GEM pandas.DataFrame and an optional annotation DataFrame. These must share the same sample index.

  • count_df (pd.DataFrame) – The gene expression matrix as a pandas.DataFrame. This DataFrame is assumed to have genes as rows and samples as columns.

  • label_df (pd.DataFrame) – The annotation data as a pandas.DataFrame. This DataFrame is assumed to have samples as rows and annotation observations as columns.


Returns

A new instance of the AnnotatedGEM class.

static xrarray_gem_from_pandas(count_df: pandas.core.frame.DataFrame, label_df: Optional[pandas.core.frame.DataFrame] = None, transpose_counts: bool = True) xarray.core.dataset.Dataset

Stitch together gene expression and annotation DataFrames into a single xarray.Dataset object.

  • count_df (pd.DataFrame) – The gene expression matrix as a pandas.DataFrame; assumed to have genes as rows and samples as columns.

  • label_df (pd.DataFrame) – The annotation data as a pandas.DataFrame; assumed to have samples as rows and annotations as columns.

  • transpose_counts (bool) – Transpose the count matrix from (genes as rows, samples as columns) to (samples as rows, genes as columns).


Returns

An xarray.Dataset containing the gene expression matrix and the annotation data.
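
As a rough sketch of what such stitching can look like, the example below combines a small count DataFrame and a label DataFrame into one xarray.Dataset using plain pandas and xarray calls. The gene, sample and column names are invented, and this is an assumed approach rather than the actual body of xrarray_gem_from_pandas.

```python
import pandas as pd
import xarray as xr

# Hypothetical inputs: genes as rows and samples as columns for the counts,
# samples as rows and annotations as columns for the labels.
count_df = pd.DataFrame(
    [[5, 0, 3], [2, 7, 1]],
    index=pd.Index(["gene_a", "gene_b"], name="Gene"),
    columns=pd.Index(["s1", "s2", "s3"], name="Sample"),
)
label_df = pd.DataFrame(
    {"treatment": ["control", "heat", "heat"]},
    index=pd.Index(["s1", "s2", "s3"], name="Sample"),
)

# Transpose the counts so that samples become rows and genes become columns.
counts = xr.DataArray(
    count_df.T.values,
    dims=("Sample", "Gene"),
    coords={"Sample": count_df.columns.values, "Gene": count_df.index.values},
    name="counts",
)

# The label frame becomes a Dataset indexed by 'Sample'; merging aligns
# both objects on that shared coordinate.
labels = label_df.to_xarray()
dataset = xr.merge([counts, labels])
print(dataset)
```

Both inputs must share the same sample index for the merge to line the labels up with the count rows.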

classmethod from_files(count_path: Union[str, pathlib.Path, IO], label_path: Optional[Union[str, pathlib.Path, IO]] = None, count_kwargs: Optional[dict] = None, label_kwargs: Optional[dict] = None, transpose_counts: bool = True, **params) GSForge.models._AnnotatedGEM.AnnotatedGEM

Construct an AnnotatedGEM object from file paths and optional parsing arguments.

  • count_path (Union[str, Path, IO[AnyStr]]) – Path to the gene expression matrix.

  • label_path (Union[str, Path, IO[AnyStr]]) – Path to the annotation data.

  • count_kwargs (dict) – A dictionary of arguments to be passed to pandas.read_csv for the count matrix.

  • label_kwargs (dict) – A dictionary of arguments to be passed to pandas.read_csv for the annotations.



Returns

A new instance of the AnnotatedGEM class.

classmethod from_geo_id(geo_id: str, destination: str = './') GSForge.models._AnnotatedGEM.AnnotatedGEM
save(path: Union[str, pathlib.Path, IO], **kwargs) str

Save as a netcdf (.nc) file at path.


path (Union[str, Path, IO[AnyStr]]) – The filepath to save to. This should use the .nc extension.



Returns

The path to which the file was saved.

name = 'AnnotatedGEM'
class GSForge.models.GeneSet(*args, **params)

Bases: param.parameterized.Parameterized

A data class for the result of a gene selection or analysis.

A GeneSet can also be a measurement or ranking of a set of genes, and this could include all of the ‘available’ genes. In such cases a boolean array ‘support’ indicates membership in the GeneSet.

Create a GeneSet from a .netcdf file path, pandas.DataFrame, np.ndarray or list of genes:

# Supply any of the above objects along with any other parameters to create a GeneSet.
my_geneset = GeneSet(<pandas.DataFrame, xarray.Dataset, numpy.ndarray, str>)

# One can also explicitly call the constructors for the types above, e.g.:
my_geneset = GeneSet.from_pandas(<pandas.DataFrame>)

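Get supported Genes:

my_geneset.get_support()

Set the support with a list or array of genes:

my_geneset = my_geneset.set_support_by_genes(<list or numpy.ndarray of genes>)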

data = param.Parameter(readonly=False)

An xarray.Dataset object indexed by gene. It should either include only those genes considered ‘within’ the GeneSet in its index, or contain a boolean variable named ‘support’.

support_index_name = param.String(readonly=False)

This parameter controls which variable should be considered to be the (boolean) variable indicating membership in this GeneSet.

gene_index_name = param.String(readonly=False)

This parameter controls which variable from the xarray.Dataset should be considered to be the ‘gene index’ coordinate. Consider using this if you require different coordinate names.

data = None
support_index_name = 'support'
gene_index_name = 'Gene'
classmethod from_pandas(dataframe: pandas.core.frame.DataFrame, genes: Optional[numpy.ndarray] = None, attrs=None, **params)

Create a GeneSet from a pandas.DataFrame.

  • dataframe (pd.DataFrame) – A pandas.DataFrame object. Assumed to be indexed by gene names.

  • genes (np.ndarray) – If you have a separate (but ordered the same!) gene array that corresponds to your data, it can be passed here to be set as the index appropriately.

  • attrs (dict) – A dictionary of attributes to be added to the xarray.Dataset.attrs attribute.

  • params (dict) – Other parameters to set.


Returns

A new GeneSet object.

classmethod from_GeneSets(*gene_sets: GSForge.models._GeneSet.GeneSet, mode: str = 'union', attrs=None, **params) GSForge.models._GeneSet.GeneSet

Create a new GeneSet by combining all the genes in the given GeneSets.

No variables or attributes from the original GeneSets are maintained in this process.

  • *gene_sets (GeneSet) – One or more GSForge.GeneSet objects.

  • mode (str) – Mode by which to combine the given GeneSet objects.

  • attrs (dict) – A dictionary of attributes to be added to the xarray.Dataset.attrs attribute.

  • params (dict) – Other parameters to set.



Returns

A new GeneSet built from the given GeneSets as described by mode.

classmethod from_bool_array(bool_array: numpy.ndarray, complete_gene_index: numpy.ndarray, attrs=None, **params) GSForge.models._GeneSet.GeneSet

Create a GeneSet object from a boolean support array. This requires a matching gene index array.

  • bool_array (np.ndarray) – A boolean array representing support within this GeneSet.

  • complete_gene_index (np.ndarray) – The complete gene index.

  • attrs (dict) – A dictionary of attributes to be added to the xarray.Dataset.attrs attribute.

  • params (dict) – Other parameters to set.



Returns

A new GeneSet object.

classmethod from_gene_array(selected_gene_array: numpy.ndarray, complete_gene_index=None, attrs=None, **params) GSForge.models._GeneSet.GeneSet

Parses arguments for a new GeneSet from an array or list of ‘selected’ genes. Such genes are assumed to be within the optionally supplied complete_gene_index.

  • selected_gene_array (np.ndarray) – The genes ‘selected’ to be within the support of this GeneSet.

  • complete_gene_index (np.ndarray) – Optional. The complete gene index to which those selected genes belong.

  • attrs (dict) – A dictionary of attributes to be added to the xarray.Dataset.attrs attribute.

  • params (dict) – Other parameters to set.



Returns

A new GeneSet object.

classmethod from_xarray_dataset(data: xarray.core.dataset.Dataset, **params) GSForge.models._GeneSet.GeneSet

Create a GeneSet from an xarray.Dataset.

  • data (xr.Dataset) – An xarray.Dataset object. See the .data parameter of this class.

  • params (dict) – Other parameters to set.



Returns

A new GeneSet object.

classmethod from_netcdf(path: Union[str, pathlib.Path, IO], **params)

Create a GeneSet object from a netcdf file path.

  • path (Union[str, Path, IO[AnyStr]]) – The path to the .netcdf file to be used.

  • params (dict) – Other parameters to set.



Returns

A new GeneSet object.

property gene_index: xarray.core.dataarray.DataArray

Returns the entire gene index of this GeneSet object as an xarray.DataArray.

The variable or coordinate that this returns is controlled by the gene_index_name parameter.



Returns

A copy of the entire gene index of this GeneSet as an xarray.DataArray.

get_support() numpy.ndarray

Returns the list of genes ‘supported’ in this GeneSet.

The value returned is (by default) controlled by the self.support_index_name parameter.


Returns

A numpy array of the genes ‘supported’ by this GeneSet.

property support_exists: bool

Returns True if a support array exists and has at least one member; returns False otherwise.

set_support_by_genes(genes: numpy.ndarray) GSForge.models._GeneSet.GeneSet

Set this GeneSet support to the given genes. This function calculates the boolean support array for the gene index via np.isin(gene_index, genes). Returns an updated copy of the GeneSet.


genes (np.ndarray) – An array of genes which represent the “supported” subset within the entire gene index.



Returns

An updated copy of the GeneSet.
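
The np.isin step mentioned above can be illustrated with plain numpy. The gene names below are invented, and the snippet only shows how a boolean support array relates to a gene index; it does not reproduce the GeneSet internals.

```python
import numpy as np

# Hypothetical complete gene index and a list of genes to mark as supported.
gene_index = np.array(["gene_a", "gene_b", "gene_c", "gene_d"])
genes = np.array(["gene_d", "gene_b"])

# One boolean per entry of the gene index: True where the gene is supported.
support = np.isin(gene_index, genes)

# Indexing with the boolean array recovers the supported genes,
# in the order of the gene index rather than the order of 'genes'.
supported_genes = gene_index[support]
print(support, supported_genes)
```
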

set_support_from_boolean_array(boolean_array: numpy.ndarray) GSForge.models._GeneSet.GeneSet

Set this GeneSet support based on the given boolean array, which must be the same length as the existing gene index. Returns an updated copy of the GeneSet.

This function calculates the boolean support array for the gene index via np.isin(gene_index, genes).


boolean_array (numpy.ndarray) – A boolean numpy.ndarray.



Returns

An updated copy of the GeneSet.

get_genes_by_threshold(threshold, score_variable: str, comparison: str = 'ge', within_support: bool = True, absolute: bool = True) numpy.ndarray
get_top_n_genes(score_variable: str, n: int = 1000, within_support: bool = True, absolute: bool = True) numpy.ndarray
to_dataframe(only_supported: bool = True) pandas.core.frame.DataFrame

Convert this GeneSet’s data to a pandas.DataFrame. By default, the data returned is restricted to those genes that are returned by GeneSet.get_support().


only_supported (bool) – Defaults to True; set to False if you want all GeneSet data to be in the DataFrame returned.


Returns

A pandas.DataFrame of this GeneSet’s data.
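
The effect of only_supported can be pictured with plain pandas. The frame below is a hypothetical stand-in for a GeneSet’s data, with a boolean ‘support’ column and an invented score column; the actual method works from the underlying xarray.Dataset and may differ in detail.

```python
import pandas as pd

# Hypothetical GeneSet data: one row per gene, with a boolean support flag.
frame = pd.DataFrame(
    {"support": [True, False, True], "score": [0.9, 0.1, 0.4]},
    index=pd.Index(["g1", "g2", "g3"], name="Gene"),
)

# only_supported=True corresponds to keeping the rows flagged as supported.
supported_only = frame[frame["support"]]

# only_supported=False corresponds to returning every row.
everything = frame
print(supported_only)
```
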

save_as_netcdf(target_dir=None, name=None) str

Save this GeneSet as a netcdf (.nc) file in the target_dir directory.

The default filename will be: {}.nc. If the GeneSet does not have a name, one must be provided via the name argument.

  • target_dir (str) – The directory to place the saved GeneSet into.

  • name (str) – The name to give the GeneSet upon saving.



Returns

The path to which the file was saved.

name = 'GeneSet'
class GSForge.models.GeneSetCollection(**params)

Bases: param.parameterized.Parameterized

An interface class which contains an AnnotatedGEM and a dictionary of GeneSet objects.

gem = param.ClassSelector(readonly=False)

A GSForge.AnnotatedGEM object.

gem = None
summarize_gene_sets() Dict[str, int]

Summarize this GeneSetCollection. Returns a dictionary of {gene_set_name: support_length}. This is used to generate the display produced by the __repr__ function.

get_support(key: str) numpy.ndarray

Get the support array for a given key.


key (str) – The GeneSet from which to get the gene support.



Returns

An array of the genes that make up the support of this GeneSet.

gene_sets_to_dataframes(keys: Optional[List[str]] = None, only_supported: bool = True) Dict[str, pandas.core.frame.DataFrame]

Returns a dictionary of {key: pd.DataFrame} of the GeneSet data. By default each DataFrame is limited to only those genes that are ‘supported’ within its GeneSet.

  • keys (List[str]) – An optional list of gene_set keys to return; by default all keys are selected.

  • only_supported (bool) – Whether to return a subset defined by each GeneSet support, or the complete data frame.


Returns

A dictionary of {key: pd.DataFrame} of the GeneSet data.

gene_sets_to_csv_files(target_dir: Optional[str] = None, keys: Optional[List[str]] = None, only_supported: bool = True) None

Writes the GeneSet data as .csv files.

By default this creates a folder within the current working directory and saves the .csv files within. By default only genes that are “supported” by a GeneSet are included.

  • target_dir – The target directory to save the .csv files to. This defaults to the name of this GeneSetCollection, which creates a folder in the current working directory.

  • keys (List[str]) – An optional list of gene_set keys to write; by default all keys are selected.

  • only_supported (bool) – Whether to write a subset defined by each GeneSet support, or the complete data frame.


Return type

None

gene_sets_to_excel_sheet(name: Optional[str] = None, keys: Optional[List[str]] = None, only_supported: bool = True) None

Writes the GeneSet data within this GeneSetCollection as a single Excel worksheet.

By default this sheet is named using the .name of this GeneSetCollection. By default only genes that are “supported” by a GeneSet are included.

  • name (str) – The name of the Excel sheet. .xlsx will be appended to the given name.

  • keys (List[str]) – An optional list of gene_set keys to write; by default all keys are selected.

  • only_supported (bool) – Whether to write a subset defined by each GeneSet support, or the complete data frame.


Return type

None

as_dict(keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None, empty_supports: bool = False) Dict[str, numpy.ndarray]

Returns a dictionary of {name: supported_genes} for each GeneSet, or those specified by the keys argument.

  • keys (List[str]) – An optional list of gene_set keys to return; by default all keys are selected.

  • exclude (List[str]) – An optional list of GeneSet keys to exclude from the returned dictionary.

  • empty_supports (bool) – Whether to include GeneSets that have no support array, or no genes supported within the support array.


Returns

Dictionary of {name: supported_genes} for each GeneSet.

intersection(keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None) numpy.ndarray

Return the intersection of supported genes in this GeneSet collection.

  • keys (List[str]) – An optional list of gene_set keys to include; by default all keys are selected.

  • exclude (List[str]) – An optional list of GeneSet keys to exclude from the intersection.


Returns

Intersection of the supported genes within GeneSets.

union(keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None) numpy.ndarray

Get the union of supported genes in this GeneSet collection.

  • keys (List[str]) – An optional list of gene_set keys to include; by default all keys are selected.

  • exclude (List[str]) – An optional list of GeneSet keys to exclude from the union.


Returns

Union of the supported genes within GeneSets.
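
For intuition, the union and intersection of several supports can be computed with numpy set routines. The dictionary below is a hypothetical stand-in for what as_dict() returns (the keys and gene names are invented), and the reduce-based approach is an assumption for illustration, not a statement of how these methods are implemented.

```python
from functools import reduce

import numpy as np

# Hypothetical {name: supported_genes} mapping.
supports = {
    "lasso": np.array(["g1", "g2", "g3"]),
    "boruta": np.array(["g2", "g3", "g4"]),
    "deseq2": np.array(["g3", "g5"]),
}

# Fold the pairwise set operations over every support array.
union = reduce(np.union1d, supports.values())
intersection = reduce(np.intersect1d, supports.values())
print(union, intersection)
```

Both numpy routines return sorted, de-duplicated arrays, so the order of the GeneSets does not affect the result.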

difference(primary_key: str, other_keys: Optional[List[str]] = None, mode: str = 'union') numpy.ndarray

Finds the genes within primary_key that are not within the combined set (joined according to mode) of the sets given in other_keys.

If no other_keys are provided, all remaining keys are used. The default mode is union.

  • primary_key (str) – The key of the primary GeneSet.

  • other_keys (List[str]) – An optional list of GeneSet keys…

  • mode (str) – Mode by which to join the GeneSets given by other_keys.


Return type

numpy.ndarray

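A minimal numpy sketch of this kind of difference, using the default union mode, is shown below. The supports dictionary, its keys and the gene names are hypothetical, and the snippet is an illustration under those assumptions rather than the library code.

```python
from functools import reduce

import numpy as np

# Hypothetical {name: supported_genes} mapping.
supports = {
    "lasso": np.array(["g1", "g2", "g3"]),
    "boruta": np.array(["g2", "g3", "g4"]),
    "deseq2": np.array(["g3", "g5"]),
}

primary_key = "lasso"

# With no other_keys given, every remaining key is used.
other_keys = [key for key in supports if key != primary_key]

# Join the other sets by union, then keep genes found only in the primary set.
others_union = reduce(np.union1d, [supports[key] for key in other_keys])
unique_to_primary = np.setdiff1d(supports[primary_key], others_union)
print(unique_to_primary)
```
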
joint_difference(primary_keys: List[str], other_keys: Optional[List[str]] = None, primary_join_mode: str = 'union', others_join_mode: str = 'union')
  • primary_keys

  • other_keys

  • primary_join_mode

  • others_join_mode

pairwise_unions(keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None) Dict[Tuple[str, str], numpy.ndarray]

Construct pairwise permutations of GeneSets within this collection, and return the union of each pair in a dictionary.

  • keys (List[str]) – An optional list of gene_set keys to include; by default all keys are selected.

  • exclude (List[str]) – An optional list of GeneSet keys to exclude from the returned dictionary.


Returns

A dictionary of {(GeneSet name, GeneSet name): gene support union}.

pairwise_intersection(keys: Optional[List[str]] = None, exclude: Optional[List[str]] = None) Dict[Tuple[str, str], numpy.ndarray]

Construct pairwise combinations of GeneSets within this collection, and return the intersection of each pair in a dictionary.

  • keys (List[str]) – An optional list of gene_set keys to include; by default all keys are selected.

  • exclude (List[str]) – An optional list of GeneSet keys to exclude from the returned dictionary.


Returns

A dictionary of {(GeneSet name, GeneSet name): intersection of the two supports}.
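
The shape of such a result can be sketched with itertools.combinations, which yields each unordered pair once. The supports dictionary below is hypothetical, and the snippet is an assumed illustration of the pairwise idea, not the method’s actual code.

```python
from itertools import combinations

import numpy as np

# Hypothetical {name: supported_genes} mapping.
supports = {
    "lasso": np.array(["g1", "g2", "g3"]),
    "boruta": np.array(["g2", "g3", "g4"]),
    "deseq2": np.array(["g3", "g5"]),
}

# One entry per unordered pair of GeneSet names.
pairwise = {
    (name_a, name_b): np.intersect1d(supports[name_a], supports[name_b])
    for name_a, name_b in combinations(supports, 2)
}

for pair, genes in pairwise.items():
    print(pair, genes)
```

Using itertools.permutations instead would produce both orderings of every pair, doubling the number of keys.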

pairwise_percent_intersection(keys=None, exclude=None) List[Tuple[str, str, float]]

Construct pairwise permutations of GeneSets within this collection, and return the percent intersection of each pair as a list of tuples.

  • keys (List[str]) – An optional list of gene_set keys to include; by default all keys are selected.

  • exclude (List[str]) – An optional list of GeneSet keys to exclude from the returned list.


Returns

A list of (GeneSet name, GeneSet name, percent gene intersection) tuples.

construct_standard_specification(include: Optional[List[str]] = None, exclude=None) dict

Construct a standard specification that can be used to view unions, intersections and differences (unique genes) of the sets within this collection.

  • include (List[str]) – An optional list of gene_set keys to include; by default all keys are selected.

  • exclude (List[str]) – An optional list of GeneSet keys to exclude from the specification.


Returns

A specification dictionary.

static merge_specifications(*specs)

Merges sets of defaultdict(list) objects with common keys.
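
The exact merge semantics are not spelled out here. One plausible reading, assumed in the sketch below, is that lists stored under the same key are concatenated. The helper name merge_list_dicts and the example keys and values are hypothetical and are not taken from GSForge.

```python
from collections import defaultdict


def merge_list_dicts(*specs):
    """Concatenate the lists stored under each key across all given mappings."""
    merged = defaultdict(list)
    for spec in specs:
        for key, values in spec.items():
            merged[key].extend(values)
    return merged


# Two hypothetical specifications that share the 'union' key.
spec_a = defaultdict(list, {"union": ["set_a", "set_b"], "difference": ["set_a"]})
spec_b = defaultdict(list, {"union": ["set_c"]})

merged = merge_list_dicts(spec_a, spec_b)
print(dict(merged))
```
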

process_set_operation_specification(specification: Optional[dict] = None) dict

Calls and stores the results from a specification. The specification must declare set operation functions and their arguments.


specification (Dict) –

classmethod from_specification(source_collection, specification=None, name='processed_specification')
classmethod from_folder(gem: GSForge.models._AnnotatedGEM.AnnotatedGEM, target_dir: Union[str, pathlib.Path, IO], glob_filter: str = '*.nc', filter_func: Optional[Callable] = None, **params) GSForge.models._GeneSetCollection.GeneSetCollection

Create a GeneSetCollection from a directory of saved GeneSet objects.

The file name of each file will be used as the key in the gene_sets dictionary.

  • gem (AnnotatedGEM) – A GSForge.AnnotatedGEM object.

  • target_dir (Union[str, Path, IO[AnyStr]]) – The directory which contains the saved GeneSet .netcdf files.

  • glob_filter (str) – A glob by which to restrict the files found within target_dir.

  • filter_func (Callable) – A function by which to filter which xarray.Dataset objects are included. This function should take an xarray.Dataset and return a boolean.

  • params – Parameters to configure the GeneSetCollection.



Returns

A new GeneSetCollection.

save(target_dir: str, keys: Optional[List[str]] = None) None

Save this collection to target_dir. Each GeneSet will be saved as a separate .netcdf file within this directory.

  • target_dir (str) – The path to which GeneSet xarray.Dataset .netcdf files will be written.

  • keys (List[str]) – The list of GeneSet keys that should be saved. If this is not provided, all GeneSet objects are saved.


Return type

None

name = 'GeneSetCollection'
class GSForge.models.Interface(*args, **params)

Bases: param.parameterized.Parameterized

The Interface provides common API access for interacting with the AnnotatedGEM and GeneSetCollection objects.

gem = param.ClassSelector(readonly=False)

An AnnotatedGEM object.

gene_set_collection = param.ClassSelector(readonly=False)

A GeneSetCollection object.

selected_gene_sets = param.ListSelector(readonly=False)

A list of keys from the provided GeneSetCollection (stored in gene_set_collection) that are to be used for selecting sets of genes from the count matrix.

selected_genes = param.Parameter(readonly=False)

A list of genes to use in indexing from the count matrix. This parameter takes priority over all other gene selecting methods. That means that selected GeneSets (or combinations thereof) will have no effect.

gene_set_mode = param.ObjectSelector(readonly=False)

Controls how any selected gene sets are returned by the interface.

  • complete – Returns the entire gene set of the AnnotatedGEM.

  • union – Returns the union of the selected gene sets’ support.

  • intersection – Returns the intersection of the selected gene sets’ support.

sample_subset = param.Parameter(readonly=False)

A list of samples to use in a given operation. These can be supplied directly as a list of samples, or can be drawn from a given GeneSet.

count_variable = param.ObjectSelector(readonly=False)

The name of the count matrix used.

annotation_variables = param.List(readonly=False)

The name of the active annotation variable(s). These are the annotation columns that will control the subset returned by y_annotation_data.

count_mask = param.ObjectSelector(readonly=False)

The type of mask to use for the count matrix.

  • complete – Returns the entire count matrix as numbers.

  • masked – Returns the entire count matrix with zero or missing values as NaN.

  • dropped – Returns the count matrix without genes that have zero or missing values.
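
The three mask options can be pictured with a small numpy array. The sketch below assumes samples as rows and genes as columns, uses invented values, and only illustrates the described behaviour; the Interface itself operates on xarray objects and may be implemented differently.

```python
import numpy as np

# Hypothetical count matrix: 3 samples (rows) by 3 genes (columns),
# with one zero count and one missing value.
counts = np.array(
    [
        [5.0, 0.0, 3.0],
        [2.0, 7.0, np.nan],
        [1.0, 4.0, 6.0],
    ]
)

# 'complete': the matrix as-is.
complete = counts

# 'masked': zero counts become NaN; values already missing stay NaN.
masked = np.where(counts == 0, np.nan, counts)

# 'dropped': keep only genes (columns) with no zero or missing value.
keep_genes = ~np.isnan(masked).any(axis=0)
dropped = counts[:, keep_genes]
print(dropped)
```
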

annotation_mask = param.ObjectSelector(readonly=False)

The type of mask to use for the target array.

  • complete – Returns the entire target array.

  • dropped – Returns the target array without samples that have zero or missing values.

count_transform = param.Callable(readonly=False)

A transform that will be run on the x_data that is supplied by this Interface. The transform runs on the subset of the matrix that has been selected.

gem = None
gene_set_collection = None
selected_gene_sets = [None]
selected_genes = None
gene_set_mode = 'union'
sample_subset = None
count_variable = None
annotation_variables = [None]
count_mask = 'complete'
annotation_mask = 'complete'
count_transform = None
property active_count_variable: str

Returns the name of the currently active count matrix.

property gene_index_name: str

Returns the name of the gene index.

property sample_index_name: str

Returns the name of the sample index.

get_sample_index() numpy.ndarray

Get the currently selected sample index as a numpy array.


Returns

An array of the currently selected samples.

Return type

numpy.ndarray

property get_selection_indices: dict

Returns the currently selected indexes as a dictionary.

property x_count_data: Optional[xarray.core.dataarray.DataArray]

Returns the currently selected ‘x_data’. Usually this will be a subset of the active count array.

Note: In constructing a gene index, the count data is constructed first in order to infer coordinate selection based on masking.


Returns

The selection of the currently active count data.

Return type

Optional[xarray.DataArray]

get_gene_index() numpy.array

Get the currently selected gene index as a numpy array.


Returns

An array of the currently selected genes.

Return type

numpy.ndarray

property y_annotation_data: Optional[Union[xarray.core.dataset.Dataset, xarray.core.dataarray.DataArray]]

Returns the currently selected ‘y_data’, or None, based on the annotation_variables parameter.


Returns

An xarray.Dataset of the currently selected y_data.

get_gem_data(single_object=False, output_type='xarray', **params)

Returns count [and annotation] data based on the current parameters.

Users should call gsf.get_gem_data

name = 'Interface'
class GSForge.models.CallableInterface(**kwargs)

Bases: GSForge.models._Interface.Interface, param.parameterized.ParameterizedFunction

Parameters inherited from:

GSForge.models._Interface.Interface: gem, gene_set_collection, selected_gene_sets, selected_genes, gene_set_mode, sample_subset, count_variable, annotation_variables, count_mask, annotation_mask, count_transform

name = 'CallableInterface'