Notebook setup

from os import environ
from pathlib import Path
import numpy as np
import pandas as pd
import GSForge as gsf
rng = np.random.default_rng(0)

Creating Feature Sets and Collections

A GeneSet is one of the three core data structures provided by GSForge. It stores the results of a selection method, provided those results can be indexed by gene. Any number of measures can be stored, but the support attribute is special: it is a boolean array that marks which genes are ‘selected’ by a given GeneSet.
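To make the role of support concrete, the sketch below mimics the idea with plain NumPy arrays rather than GSForge itself. The gene names and scores are invented for illustration; only the boolean-mask pattern is the point.

```python
import numpy as np

# Hypothetical gene identifiers and a per-gene score (illustrative values only).
genes = np.array(["gene_a", "gene_b", "gene_c", "gene_d"])
scores = np.array([0.2, 1.7, -0.4, 2.3])

# 'support' is a boolean array aligned with the gene index:
# True marks a gene as selected by this mock set.
support = scores > 1.0

# Boolean indexing keeps only the selected genes.
selected_genes = genes[support]
print(selected_genes)
```

All genes and their scores remain available; support only records which of them count as selected.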

A GeneSetCollection is the final core data structure. It stores one AnnotatedGEM and any number of GeneSet objects.

In this example we have an AnnotatedGEM already constructed:

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
agem = gsf.AnnotatedGEM(GEM_PATH)
agem
<GSForge.AnnotatedGEM>
Name: Oryza Sativa
Selected GEM Variable: 'counts'
    Gene   66338
    Sample 475

Creating GeneSets

See the API reference of GSForge.GeneSet for all available creation functions, which include:

  • from_pandas

  • from_GeneSets

  • from_bool_array

  • from_gene_array

  • from_xarray_dataset

  • from_netcdf

GeneSets from Lists or Arrays

A minimal GeneSet is just that: a set of genes.

Here we draw random features to demonstrate this:

random_features = rng.choice(agem.data.Gene.values, 10, replace=False)
random_features
array(['LOC_Os10g39450.1', 'LOC_Os10g12660.1', 'LOC_Os07g34630.1',
       'LOC_Os06g05610.1', 'LOC_Os03g28010.1', 'LOC_Os01g26410.1',
       'LOC_Os01g10420.1', 'LOC_Os03g53690.1', 'LOC_Os02g34160.1',
       'LOC_Os01g49154.1'], dtype=object)

Provide this to the from_gene_array() constructor to create a simple GeneSet.

set_example = gsf.GeneSet.from_gene_array(random_features, name='Random Set')
set_example
<GSForge.GeneSet>
Name: Random Set
    Supported Genes:  10

From pandas.DataFrame or xarray.Dataset objects

Commonly there is some information associated with a set of features. Differential gene expression results often contain information about many (or all) of the genes, but only identify a few as ‘differentially expressed’. We can store all such information in the GeneSet object, and indicate which genes are selected by setting a boolean array named support.

Here we simulate an example DataFrame.

n = 100
random_features = rng.choice(agem.data.Gene.values, n, replace=False)

df = pd.DataFrame(
    {
        'sim_LFC': rng.normal(size=n),
        'sim_pvalue': np.abs(rng.normal(size=n)),
    },
    index=random_features
)

df['support'] = (df['sim_pvalue'] < 0.05) | (df['sim_LFC'] > 1.0)

df.head()
                    sim_LFC  sim_pvalue  support
LOC_Os12g34470.1   0.621018    1.292646    False
LOC_Os10g42500.1  -2.250141    0.471813    False
LOC_Os04g50110.1   0.386370    1.377951    False
LOC_Os09g37520.1  -0.581641    0.135731    False
LOC_Os05g26610.1   0.109280    2.310363    False
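Once a support column exists, the selected genes can be recovered by boolean indexing. The sketch below uses a small hand-written DataFrame with the same column layout as the simulated frame above. The gene names and values are hypothetical, and the code does not depend on GSForge.

```python
import pandas as pd

# Hypothetical stand-in for a differential expression result table.
results = pd.DataFrame(
    {
        "sim_LFC": [0.6, -2.3, 1.4, 0.1],
        "sim_pvalue": [1.29, 0.47, 0.30, 0.02],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)

# Same rule as in the simulated example: small p-value OR large positive LFC.
results["support"] = (results["sim_pvalue"] < 0.05) | (results["sim_LFC"] > 1.0)

# Boolean indexing on the index yields the names of the supported genes.
supported_genes = results.index[results["support"]].tolist()
print(supported_genes)
```

The rule here is deliberately loose. A stricter variant that requires both conditions and uses the absolute fold change can be had by replacing `|` with `&` and calling `.abs()` on the `sim_LFC` column.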

Create the GeneSet using the from_pandas() constructor.

dge_gs = gsf.GeneSet.from_pandas(df, name='Sim DGE')
dge_gs
<GSForge.GeneSet>
Name: Sim DGE
    Supported Genes:  19
dge_gs.data
<xarray.Dataset>
Dimensions:     (Gene: 100)
Coordinates:
  * Gene        (Gene) object 'LOC_Os12g34470.1' ... 'LOC_Os04g44990.1'
Data variables:
    sim_LFC     (Gene) float64 0.621 -2.25 0.3864 ... 0.8061 -0.4764 0.1633
    sim_pvalue  (Gene) float64 1.293 0.4718 1.378 ... 0.2606 0.02545 0.147
    support     (Gene) bool False False False False ... False False True False

Creating and Saving GeneSetCollections

We only need to provide an AnnotatedGEM and a name to create a GeneSetCollection. GeneSet objects are then added the same way entries are added to a dictionary:

sample_coll = gsf.GeneSetCollection(gem=agem, name='Literature DGE')
sample_coll['set example'] = set_example
sample_coll['simulated DGE example'] = dge_gs
sample_coll
<GSForge.GeneSetCollection>
Literature DGE
GeneSets (2 total): Support Count
    set example: 10
    simulated DGE example: 19

A GeneSetCollection is saved as a directory, with each GeneSet stored as a separate netCDF file.

save = False  # Set to True to write the collection to disk.
if save:
    sample_coll.save('my_collection')