Notebook setup
from os import environ
from pathlib import Path
import numpy as np
import pandas as pd
import GSForge as gsf
rng = np.random.default_rng(0)
Creating Feature Sets and Collections
A GeneSet is one of the three core data structures provided by GSForge. It stores the result of a selection method, so long as those results can be indexed by gene. Any number of measures can be stored, but the support attribute is a special boolean array that indicates ‘selection’ by a given GeneSet.
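The idea of a boolean support array can be sketched with plain NumPy. This is only an illustration of the concept, not GSForge's internal implementation; the gene IDs and the `score` measure are hypothetical.

```python
import numpy as np

# Hypothetical gene IDs and a made-up per-gene measure (not from the rice data).
genes = np.array(["gene_a", "gene_b", "gene_c", "gene_d"])
score = np.array([0.9, 0.1, 0.7, 0.2])

# The support mask holds one boolean per gene, aligned with `genes`.
support = score > 0.5

# Boolean indexing recovers the IDs of the selected genes.
selected = genes[support]
print(selected.tolist())   # ['gene_a', 'gene_c']
print(int(support.sum()))  # 2
```

The measure is kept for every gene, while the mask alone records which genes count as selected.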
A GeneSetCollection is the final core data structure. It stores one AnnotatedGEM and any number of GeneSet objects.
In this example we have an AnnotatedGEM already constructed:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
agem = gsf.AnnotatedGEM(GEM_PATH)
agem
<GSForge.AnnotatedGEM>
Name: Oryza Sativa
Selected GEM Variable: 'counts'
Gene 66338
Sample 475
Creating GeneSets
See the API reference of GSForge.GeneSet for all available creation functions, which include:
from_pandas
from_GeneSets
from_bool_array
from_gene_array
from_xarray_dataset
from_netcdf
GeneSets from Lists or Arrays
A minimal GeneSet is just that: a set of genes.
Here we draw random features to demonstrate this:
random_features = rng.choice(agem.data.Gene.values, 10, replace=False)
random_features
array(['LOC_Os10g39450.1', 'LOC_Os10g12660.1', 'LOC_Os07g34630.1',
'LOC_Os06g05610.1', 'LOC_Os03g28010.1', 'LOC_Os01g26410.1',
'LOC_Os01g10420.1', 'LOC_Os03g53690.1', 'LOC_Os02g34160.1',
'LOC_Os01g49154.1'], dtype=object)
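As a side note on the sampling call, the following sketch checks two properties of `rng.choice` with `replace=False`: the drawn IDs are distinct, and a fixed seed reproduces the draw. The `pool` array is a hypothetical stand-in for `agem.data.Gene.values`.

```python
import numpy as np

# Hypothetical ID pool standing in for agem.data.Gene.values.
pool = np.array([f"gene_{i}" for i in range(1000)])

# Two generators created with the same seed.
draw_a = np.random.default_rng(0).choice(pool, 10, replace=False)
draw_b = np.random.default_rng(0).choice(pool, 10, replace=False)

# replace=False guarantees that no ID is drawn twice.
print(len(set(draw_a)) == 10)
# The same seed yields the same draw.
print(bool((draw_a == draw_b).all()))
```

In the notebook a single generator is reused, so each later call continues its random stream and draws a different sample.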
Provide this to the from_gene_array() constructor to create a simple GeneSet.
set_example = gsf.GeneSet.from_gene_array(random_features, name='Random Set')
set_example
<GSForge.GeneSet>
Name: Random Set
Supported Genes: 10
From pandas.DataFrame or xarray.Dataset objects
Commonly there is some information associated with a set of features.
Differential gene expression results often contain information about many (or all) of the genes, but only identify a few as ‘differentially expressed’.
We can store all such information in the GeneSet object, and indicate which genes are selected by setting a boolean array named support.
Here we simulate an example DataFrame.
n = 100
random_features = rng.choice(agem.data.Gene.values, n, replace=False)
df = pd.DataFrame(
    {
        'sim_LFC': rng.normal(size=n),
        'sim_pvalue': np.abs(rng.normal(size=n)),
    },
    index=random_features
)
df['support'] = (df['sim_pvalue'] < 0.05) | (df['sim_LFC'] > 1.0)
df.head()
 | sim_LFC | sim_pvalue | support |
---|---|---|---|
LOC_Os12g34470.1 | 0.621018 | 1.292646 | False |
LOC_Os10g42500.1 | -2.250141 | 0.471813 | False |
LOC_Os04g50110.1 | 0.386370 | 1.377951 | False |
LOC_Os09g37520.1 | -0.581641 | 0.135731 | False |
LOC_Os05g26610.1 | 0.109280 | 2.310363 | False |
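The simulated support column marks a gene when either criterion holds. Real differential expression workflows often require both a significance cutoff and a minimum absolute fold change. The sketch below contrasts the two rules on a small hypothetical table; the column names, gene IDs, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical results for five made-up genes.
df = pd.DataFrame(
    {
        "lfc": [2.1, -1.8, 0.2, 1.5, -0.1],
        "pvalue": [0.001, 0.03, 0.02, 0.40, 0.90],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"],
)

# Either criterion is enough (the rule used in the simulation above).
either = (df["pvalue"] < 0.05) | (df["lfc"] > 1.0)

# Both criteria required, with the fold change tested in both directions.
both = (df["pvalue"] < 0.05) & (df["lfc"].abs() > 1.0)

print(int(either.sum()))      # 4
print(list(df.index[both]))   # ['gene_a', 'gene_b']
```

Whichever rule is chosen, the result is a boolean column aligned with the gene index, which is the shape a support column needs.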
Create the GeneSet using the from_pandas() constructor.
dge_gs = gsf.GeneSet.from_pandas(df, name='Sim DGE')
dge_gs
<GSForge.GeneSet>
Name: Sim DGE
Supported Genes: 19
dge_gs.data
<xarray.Dataset>
Dimensions:     (Gene: 100)
Coordinates:
  * Gene        (Gene) object 'LOC_Os12g34470.1' ... 'LOC_Os04g44990.1'
Data variables:
    sim_LFC     (Gene) float64 0.621 -2.25 0.3864 ... 0.8061 -0.4764 0.1633
    sim_pvalue  (Gene) float64 1.293 0.4718 1.378 ... 0.2606 0.02545 0.147
    support     (Gene) bool False False False False ... False False True False
Creating and Saving GeneSetCollections
We only need to provide an AnnotatedGEM and a name to create a GeneSetCollection. Then add GeneSet objects as you would add entries to a dictionary:
sample_coll = gsf.GeneSetCollection(gem=agem, name='Literature DGE')
sample_coll['set example'] = set_example
sample_coll['simulated DGE example'] = dge_gs
sample_coll
<GSForge.GeneSetCollection>
Literature DGE
GeneSets (2 total): Support Count
set example: 10
simulated DGE example: 19
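The dictionary analogy can be made concrete with ordinary Python containers. The sketch below is not the GSForge API; it uses a plain dict of hypothetical gene-ID sets to show the kind of bookkeeping a named collection of sets makes possible.

```python
# Hypothetical gene sets, keyed by name as in a dictionary.
collection = {
    "random": {"gene_a", "gene_b", "gene_c"},
    "dge": {"gene_b", "gene_c", "gene_d", "gene_e"},
}

# Per-set support counts, similar in spirit to the summary printed above.
counts = {name: len(genes) for name, genes in collection.items()}

# Genes selected by every set, and genes selected by at least one set.
shared = set.intersection(*collection.values())
combined = set.union(*collection.values())

print(counts)          # {'random': 3, 'dge': 4}
print(sorted(shared))  # ['gene_b', 'gene_c']
print(len(combined))   # 5
```

Keeping sets under descriptive names makes comparisons between selection methods easy to read and repeat.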
A GeneSetCollection is saved as a directory, with each GeneSet stored in a separate NetCDF file.
save = False  # set to True to write the collection to disk
if save:
    sample_coll.save('my_collection')