--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.10.2 kernelspec: display_name: gsfenv language: python name: gsfenv --- ***Notebook setup*** ```{code-cell} ipython3 from os import environ from pathlib import Path import numpy as np import pandas as pd import GSForge as gsf rng = np.random.default_rng(0) ``` # Creating Feature Sets and Collections A ``GeneSet`` is one of the three core data structures provided by ``GSForge``, it stores the result of a selection method so long as those results can be indexed by gene. Any number of measures can be stored, but the ``support`` attribute is a special boolean array that indicates 'selection' by a given ``GeneSet``. A ``GeneSetCollection`` is the final core data structure, it stores one ``AnnotatedGEM`` and any number of ``GeneSet`` objects. In this example we have an ``AnnotatedGEM`` already constructed: ```{code-cell} ipython3 OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa") GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc") ``` ```{code-cell} ipython3 agem = gsf.AnnotatedGEM(GEM_PATH) agem ``` ## Creating GeneSets See the API reference of [`GSForge.GeneSet`](../API/GSForge.models) for all availble creation functions, which include: + `from_pandas` + `from_GeneSets` + `from_bool_array` + `from_gene_array` + `from_xarray_dataset` + `from_netcdf` ### GeneSets from Lists or Arrays A minimal GeneSet is just that: a set of genes. Here we draw random features to demonstrate this: ```{code-cell} ipython3 random_features = rng.choice(agem.data.Gene.values, 10, replace=False) random_features ``` Provide this to the `from_gene_array()` constructor to create a simple GeneSet. ```{code-cell} ipython3 set_example = gsf.GeneSet.from_gene_array(random_features, name='Random Set') set_example ``` ### From `pandas.DataFrames` or `xarray.DataSet` objects Commonly there is some information associated with a set of features. Differential gene expression results often contain information about many (or all) of the genes, but only identify a few as 'differentially expressed'. We can store all such information in the `GeneSet` object, and indicate which genes are selected by setting a boolean array named `support`. Here we simulate an example `DataFrame`. ```{code-cell} ipython3 n = 100 random_features = rng.choice(agem.data.Gene.values, n, replace=False) df = pd.DataFrame( { 'sim_LFC': rng.normal(size=n), 'sim_pvalue': np.abs(rng.normal(size=n)), }, index=random_features ) df['support'] = (df['sim_pvalue'] < 0.05) | (df['sim_LFC'] > 1.0) df.head() ``` Create the `GeneSet` using the `from_pandas()` constructor. ```{code-cell} ipython3 dge_gs = gsf.GeneSet.from_pandas(df, name='Sim DGE') dge_gs ``` ```{code-cell} ipython3 dge_gs.data ``` ## Creating and Saving GeneSetCollections We only need to provided an AnnotatedGEM and a name to create a `GeneSetCollection`. Then add `GeneSet` objects like you would entries to a dictionary: ```{code-cell} ipython3 sample_coll = gsf.GeneSetCollection(gem=agem, name='Literature DGE') sample_coll['set example'] = dge_gs sample_coll['simulated DGE example'] = set_example sample_coll ``` GeneSetCollections are saved as a directory, each set saved as a separate netcdf file. ```{code-cell} ipython3 save = False if save == True: sample_coll.save('my_collection') ```