Selecting and Comparing Genes with GSForge¶

Our goal in examining RNA-seq data sets often reduces to “feature selection” – to borrow a term from machine learning. Examining the count data should give us an idea of which genes correlate (by a given measure) with a phenotype of interest. We can then examine that ‘selected’ set more closely and form biological or chemical hypothesis that explain the expression-phenotype link. Unfortunately there is not a gold standard measure to preform this selection, and practicing researchers must juggle a number of different methods and results.

Enter GSForge: A tool that helps collate and compare gene ‘selection’ results from a given method. Here we present a brief analysis of a Oryza sativa cultivar set[1], with the explicit purpose of demonstrating GSForge.

from os import  environ
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

import patsy
import GSForge as gsf
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import quantile_transform
from sklearn import model_selection
from sklearn import linear_model
import umap
import umap.plot

import matplotlib.pyplot as plt
import seaborn as sns
import holoviews as hv
hv.extension('matplotlib')


# Declare paths.
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
RAW_COUNT_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.raw.txt")
HYDRO_LABEL_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "raw_annotation_data", "PRJNA301554.hydroponic.annotations.txt")
SI_FILE_1_PATH = OSF_PATH.joinpath('GEMmakerGEMs', 'raw_annotation_data', 'TPC2016-00158-LSBR2_Supplemental_File_1.csv')
# Output paths.
TOUR_DGE = OSF_PATH.joinpath("GeneSetCollections", "tour_DGE")
TOUR_BORUTA = OSF_PATH.joinpath("GeneSetCollections", "tour_boruta")
LIT_DGE_GSC = OSF_PATH.joinpath("GeneSetCollections", "literature", "DGE")
LIT_TF = OSF_PATH.joinpath("GeneSetCollections", "literature", "TF")

/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/holoviews/plotting/mpl/__init__.py:220: UserWarning: Trying to register the cmap 'fire' which already exists.
  register_cmap("fire", cmap=fire_cmap)

import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR) # Supresses verbose R output.

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/tmp/ipykernel_5983/527480914.py in <module>
----> 1 import rpy2.rinterface_lib.callbacks
      2 import logging
      3 from rpy2.robjects import pandas2ri
      4 get_ipython().run_line_magic('load_ext', 'rpy2.ipython')
      5 pandas2ri.activate()

ModuleNotFoundError: No module named 'rpy2'

%%R
library("edgeR")

2. Select genes / features / tags¶

While we can use the formula interface in R, it is often desirable to re-use the same design matrix and contrasts across feature selection methods. We use the patsy package to accomplish this.

I end up creating multiple `GeneSetCollection` objects here, while I could have just as easily used one.
How many collection objects used depends on your preferences and the complexity of your analysis.

Create Design Matrices and Contrasts¶

Here I make some simple contrasts that select for treatment from each of our designs. Newer users may find it easier to use the makeContrasts function provided by the popular limma package. I create ordered, nested lists for easier transfer to and from R.

counts, labels = gsf.get_gem_data(agem, annotation_variables=['treatment', 'time', 'genotype'])
labels = labels.to_dataframe()
ri_counts = gsf.utils.R_interface.Py_counts_to_R(counts)

forumla_designs = [
    "0 + treatment",
    "0 + treatment:genotype",
    # These models take too long to run for a demo-notebook.
    # "0 + treatment:genotype:scale(time, center=False)",
    # "0 + treatment:genotype:C(time):C(replicate)",
]

# I use nested lists instead of dictionaries to facilitate transfering to and from R.
design_list = [[f, patsy.dmatrix(f, labels)] for f in forumla_designs]
dmatrix_list = [[f, np.asarray(v)] for f, v in design_list]


def make_contrast(design, factors):
    """Constructs a contrast array for each item in factors. Returns a one-hot array."""
    return np.sum([(np.core.defchararray.find(design.design_info.column_names, f)!=-1).astype(int)
                   for f in factors], axis=0)


treatment_factors = design_list[0][1].design_info.column_names[1:]
control_factor = design_list[0][1].design_info.column_names[0]

contrast_list = list()

for i, (formula, design) in enumerate(design_list):
    design_contrast = list()
    for j, col_name in enumerate(treatment_factors):
        contrast = make_contrast(design, [col_name]) - make_contrast(design, [control_factor])
        design_contrast.append([col_name, contrast])
    contrast_list.append(design_contrast)

Let’s view the simplest model and the contrasts created:

dmatrix_list[0]

['0 + treatment',
 array([[1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        ...,
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0.]])]

contrast_list[0]

[['treatment[DROUGHT]', array([-1,  1,  0,  0,  0])],
 ['treatment[HEAT]', array([-1,  0,  1,  0,  0])],
 ['treatment[RECOV_DROUGHT]', array([-1,  0,  0,  1,  0])],
 ['treatment[RECOV_HEAT]', array([-1,  0,  0,  0,  1])]]

Select via `edgeR`¶

Here we present a ‘standard’ feature selection with edgeR, see the edgeR Bioconductor page for more information.

%%R -i ri_counts -i dmatrix_list -i contrast_list -o results -o keep

results <- list()

di <- 1

print('Applying filter.')
# Apply default gene filter.
keep <- filterByExpr(ri_counts, design=dmatrix_list[[1]][[2]])
Y <- ri_counts[keep,]

for (design in dmatrix_list)
{
    name <- design[[1]]
    design_matrix <- design[[2]]
    print(sprintf('Examing design %s', name))
    
    print('Construct the edgeR DGEList object.')
    # Construct the edge list object.
    dge_list <- DGEList(counts=Y)
    dge_list <- calcNormFactors(dge_list)
    dge_list <- estimateDisp(dge_list, design=design_matrix)

    # Plot variation.
    dir.create('disp_plots', showWarnings = FALSE)
    png(sprintf("disp_plots/%s_counts_disp.png", name))
    plotBCV(dge_list)
    dev.off()

    # Fit the negative binomial model.
    fit <- glmQLFit(dge_list, design=design_matrix)
    
    contrast_results <- list()
    ci <- 1
    
    for (contrast in contrast_list[[di]])
    {
        cname <- contrast[[1]]
        contrast_array <- contrast[[2]]
        print(sprintf('Examing contrast %s', cname))
        contrast_result  = glmQLFTest(fit, contrast=contrast_array)
        p_values <- contrast_result$table$PValue
        FDR <- p.adjust(p_values, method="BH")
        contrast_result$table$FDR <- FDR
        contrast_result$table$support_dir <- c(decideTests(contrast_result, p.value=0.001, lfc=2.0))
        contrast_result$table$support <- abs(c(decideTests(contrast_result, p.value=0.001, lfc=2.0)))
        contrast_results[[ ci ]] = contrast_result$table
        ci <- ci + 1
    }
    
    results[[di]] <- contrast_results
    di <- di + 1
}

[1] "Applying filter."
[1] "Examing design 0 + treatment"
[1] "Construct the edgeR DGEList object."
[1] "Examing contrast treatment[DROUGHT]"
[1] "Examing contrast treatment[HEAT]"
[1] "Examing contrast treatment[RECOV_DROUGHT]"
[1] "Examing contrast treatment[RECOV_HEAT]"
[1] "Examing design 0 + treatment:genotype"
[1] "Construct the edgeR DGEList object."
[1] "Examing contrast treatment[DROUGHT]"
[1] "Examing contrast treatment[HEAT]"
[1] "Examing contrast treatment[RECOV_DROUGHT]"
[1] "Examing contrast treatment[RECOV_HEAT]"

dge_collection = gsf.GeneSetCollection(gem=agem, name='DGE Results')

dge_collection['edgeR filter'] = gsf.GeneSet.from_bool_array(
    np.asarray(keep).astype(bool), 
    agem.data.Gene.values,
    name='edgeR filter')

for result_dfs, formula, contrasts in  zip(results, forumla_designs, contrast_list):
    for contrast_df, (contrast_name, _) in zip(list(result_dfs), contrasts):
        key = f"'{formula}'__{contrast_name}"
        dge_collection[key] = gsf.GeneSet(contrast_df, name=key)

dge_collection

<GSForge.GeneSetCollection>
DGE Results
GeneSets (9 total): Support Count
    edgeR filter: 21915
    '0 + treatment:genotype'__treatment[HEAT]: 4071
    '0 + treatment:genotype'__treatment[RECOV_HEAT]: 3719
    '0 + treatment:genotype'__treatment[DROUGHT]: 3703
    '0 + treatment:genotype'__treatment[RECOV_DROUGHT]: 2806
    ... and 4 more.

dge_collection['edgeR filter']

<GSForge.GeneSet>
Name: edgeR filter
    Supported Genes:  21915

gsf.plots.gem.RasterGEM(dge_collection, hue='genotype')

gsf.plots.gem.RasterGEM(dge_collection, selected_gene_sets=['edgeR filter'], hue='genotype')

if not TOUR_DGE.exists():
    dge_collection.save(TOUR_DGE)

Random Forest via Boruta Feature Selection¶

See this FAQ for details on Boruta.

In brief, boruta lets us use a ‘minimal optimal’ feature selection model as an ‘all relevant’ method. Random forest models typically work by finding a minimum optimal set of features. In our use, we are interested in all features that may be relevant, not the minimum required to train a model. GSForge provides a wrapper which returns xarray.DataSet objects suitable for immediate GeneSet construction.

boruta_gsc = gsf.GeneSetCollection(gem=agem, name='Boruta Results')

selecting_model = RandomForestClassifier(
    class_weight='balanced',
    max_depth=4, 
    n_estimators=100, 
    n_jobs=-1)

for target in ["treatment", "genotype"]:
    boruta_treatment_ds = gsf.operations.BorutaProspector(
        agem,
        estimator=selecting_model,
        annotation_variables=target,
        max_iter=50)
    
    boruta_gsc[f"Boruta_{target}"] = gsf.GeneSet(boruta_treatment_ds, name=f"Boruta_{target}")
    
boruta_gsc

<GSForge.GeneSetCollection>
Boruta Results
GeneSets (2 total): Support Count
    Boruta_treatment: 757
    Boruta_genotype: 662

if not TOUR_BORUTA.exists():
    boruta_gsc.save(TOUR_BORUTA)

We can check if boruta still has features to resolve into important or not by examining the ‘support_weak’ variable. In practice one should increase the number of iterations until there are no remaining values to be resolved.

boruta_gsc['Boruta_genotype'].data.support_weak.values.sum()

Gene Sets from Literature¶

In practice there is often a gene set that is externally defined that we wish to compare. Fortunately the only requirement for a GeneSet to be useful is that has an index that matches that in our GEM. See this notebook for more details.

Here we read in a supporting information file from the EGRIN study by Wilkins et. al., and after a bit of parsing it is converted to a series of GeneSet objects. This SI table appears to be counts of significant differential expression for time contrasts within each treatment. The full model is not (exactly) specified in the paper, but we can infer that it was probably:

\[ ~ 0 + genotype:treatment:C(time):C(replicate) \]

Where \(C(\text{label})\) indicates treatment as a categorical, rather than scalar variable. The union for each set of treatment contrasts was probably then used.

si1_df = pd.read_csv(SI_FILE_1_PATH, skiprows=3, index_col=0) 
lit_dge_coll = gsf.GeneSetCollection(gem=agem, name='Literature DGE')

for col in si1_df.columns:
    genes = si1_df[si1_df[col] > 0].index.values
    diff = np.setdiff1d(genes, agem.data.Gene.values)
    if diff.shape[0] > 0:
        print(diff)
    
    lit_dge_coll[col] = gsf.GeneSet.from_gene_array(genes, name=col)

lit_dge_coll.summarize_gene_sets()

{'DROUGHT_UP': 1175,
 'HEAT_UP': 592,
 'RECOV_DROUGHT_UP': 446,
 'DROUGHT_DOWN': 170,
 'HEAT_DOWN': 106,
 'RECOV_HEAT_UP': 76,
 'RECOV_DROUGHT_DOWN': 58,
 'RECOV_HEAT_DOWN': 43}

3. Compare Results¶

While this step is ultimately experiment-specific, it usually entails some combination of:

Comparing selected to unselected features by one or more measure.
- p-values and log-fold-change.
Ranking or comparing within a selected set.
- Random forest feature importance.
- p-values and log-fold-change.
Comparing set memberships, these usually take the form of set operations, such as:
- union
- intersection
- difference
- unique
Comparing between selection sets.
- Model prediction scores.

Entries 1 and 2 are typically routine within the same selection set, as the tool used to create that set should provide the needed measures. More complications emerge with steps 3 and 4, especially when comparing selection sets that derive from different methods. Without additional wet lab experimentation we are limited in declaring our success in feature identification. Instead we can score our selected subset as it preforms in prediction models.

For this demonstration we will combine each collection into its own set by taking the union of their selections. Then we will examine the features selected for all treatments by each method.

union_coll = gsf.GeneSetCollection(gem=agem, name='Combnied Collection')
union_coll.gene_sets.update(boruta_gsc.gene_sets)
lit_geneset = gsf.GeneSet.from_GeneSets(*lit_dge_coll.gene_sets.values(), name='literature_union')
union_coll.gene_sets.update({'literature_union': lit_geneset})

# Here I get the keys for each of the two DGE models we ran above.
dge_keys = pd.Series(dge_collection.gene_sets.keys())
key_sets = [dge_keys[dge_keys.str.startswith(f"'{f}'")].values for f in forumla_designs]

for f, keys in zip(forumla_designs, key_sets):
    name = f"combined '{f}'"
    union_coll[name] = gsf.GeneSet.from_GeneSets(
        *[dge_collection[k] for k in keys],
        name=name)
    
union_coll

<GSForge.GeneSetCollection>
Combnied Collection
GeneSets (5 total): Support Count
    combined '0 + treatment:genotype': 8941
    literature_union: 2061
    Boruta_treatment: 757
    Boruta_genotype: 662
    combined '0 + treatment': 596

Visualize Set Overlap¶

See the upsetplot documentation for more details.

Instead of a Venn diagram we use an ‘Upset plot’. This allows us to view overlaps of sets larger than three.

gsf.plots.collections.UpsetPlotInterface(union_coll)

<upsetplot.plotting.UpSet at 0x7fc3381b3850>

Comparing Selection Sets¶

We can estimate how well a given subset of genes ‘describes’ a sample (phenotype) label by comparing how well they perform using a given machine learning model.

results = dict()

for key in list(union_coll.gene_sets.keys()) + ['all']:
    counts, treatment = gsf.get_gem_data(union_coll, selected_gene_sets=[key], annotation_variables=["treatment"])
    x_train, x_test, y_train, y_test = model_selection.train_test_split(counts, treatment)
    # model = RandomForestClassifier(class_weight='balanced', n_estimators=1000, n_jobs=-1, max_depth=6)
    model = linear_model.Perceptron()
    model.fit(x_train, y_train) 
    results[key] = model.score(x_test, y_test)
    

hv.Bars(results, kdims=["Gene Selection Group"]).opts(
    xrotation=90, invert_axes=True, ylim=(0, 1.1), ylabel='Score', fig_size=150, 
    aspect=2, title='Scores vs Treatment Labels')

Ranking Features within Sets¶

GSForge provides helper functions to extract genes by a score values. Note we get the data from the original dge collection, as those logFC values are intact.

dge_ds = dge_collection["'0 + treatment:genotype'__treatment[HEAT]"]

dge_ds.get_top_n_genes("logFC", 10)

array(['LOC_Os03g14180', 'LOC_Os11g13980', 'LOC_Os04g01740',
       'LOC_Os01g04370', 'LOC_Os01g04360', 'LOC_Os07g47840',
       'LOC_Os02g15930', 'LOC_Os01g04380', 'LOC_Os11g05170',
       'LOC_Os04g36750'], dtype=object)

dge_ds.get_genes_by_threshold(3.0, "logFC")

array(['ChrSy.fgenesh.gene.85', 'ChrSy.fgenesh.gene.86', 'LOC_Os01g01170',
       ..., 'LOC_Os12g43640', 'LOC_Os12g43850', 'LOC_Os12g44250'],
      dtype=object)

Rank Genes with a Random Forest¶

Random forests and feature ranks. Robust enough to function in our case. Some values filtered prior to dge analysis…

counts.coords.dims[1]

'Gene'

counts.coords[counts.coords.dims[1]].values

array(['ChrSy.fgenesh.gene.21', 'ChrSy.fgenesh.gene.25',
       'ChrSy.fgenesh.gene.28', ..., 'LOC_Os12g44310', 'LOC_Os12g44340',
       'LOC_Os12g44370'], dtype=object)

union_coll["'0 + treatment:genotype'__treatment[HEAT]"] = dge_collection["'0 + treatment:genotype'__treatment[HEAT]"]

gene_rank_mdl = RandomForestClassifier(class_weight='balanced', n_estimators=1000, n_jobs=-2)

treatment_nFDR = gsf.operations.nFDR(
    union_coll,
    selected_gene_sets=["Boruta_treatment", "'0 + treatment:genotype'__treatment[HEAT]"],
    gene_set_mode="union",
    annotation_variables=["treatment"],
    model=gene_rank_mdl,
    n_iterations=5
)

treatment_feature_importance = gsf.operations.RankGenesByModel(
    union_coll,
    selected_gene_sets=["Boruta_treatment", "'0 + treatment:genotype'__treatment[HEAT]"],
    gene_set_mode="union",
    annotation_variables=["treatment"],
    model=gene_rank_mdl,
    n_iterations=5
)

treatment_feature_importance

<xarray.DataArray 'feature_importances' (model_iteration: 5, Gene: 4391)>
array([[1.31563510e-04, 6.79631213e-05, 2.29502259e-04, ...,
        8.92052648e-05, 1.45398837e-04, 7.24346971e-05],
       [2.84511837e-04, 6.37763120e-05, 2.20714802e-04, ...,
        1.33545198e-04, 6.46603741e-05, 5.46214076e-05],
       [9.88199062e-05, 2.69261394e-05, 1.67553915e-04, ...,
        5.66013391e-05, 1.20388075e-04, 3.19671445e-05],
       [3.36803404e-04, 4.91265354e-05, 3.29251471e-04, ...,
        5.84467514e-05, 1.42280408e-04, 5.73415577e-05],
       [2.85515835e-04, 2.21027518e-05, 2.11691414e-04, ...,
        7.19938839e-05, 3.35372519e-05, 4.97356344e-05]])
Coordinates:
  * Gene     (Gene) object 'ChrSy.fgenesh.gene.37' ... 'LOC_Os12g44250'
Dimensions without coordinates: model_iteration
Attributes:
    Ranking Model:         RandomForestClassifier(class_weight='balanced', n_...
    count_variable:        counts
    annotation_variables:  treatment

gene_union = union_coll.union(["Boruta_treatment", "'0 + treatment:genotype'__treatment[HEAT]"])

boruta_support = np.isin(gene_union, union_coll["Boruta_treatment"].get_support())
dge_support = np.isin(gene_union, 
                      union_coll["'0 + treatment:genotype'__treatment[HEAT]"].get_support())

support = np.zeros_like(gene_union)
support[boruta_support] = "Boruta"
support[dge_support] = "DGE"
support[(boruta_support * dge_support) == True] = "Both"

df = pd.DataFrame({
    "logFC": dge_ds.data['logFC'].reindex(Gene=gene_union).values,
    "PValue": dge_ds.data['PValue'].reindex(Gene=gene_union).values,
#     "F": dge_ds.data['F'].reindex(Gene=gene_union).values,
    "Gene": gene_union,
    "mean feature importance": treatment_feature_importance.mean(dim="model_iteration").values,
    "mean nFDR": treatment_nFDR.mean(dim="model_iteration").values,
    "support source": support,
}).set_index("Gene")

# sns.pairplot(df, hue="support source", markers=['.', '.', 'X'],
#              vars=["mean feature importance", "logFC", "mean nFDR"],
#              plot_kws=dict(edgecolor=None, alpha=0.25));

fig, axes = plt.subplots(1, 3, figsize=(18, 6))
sns.scatterplot(data=df, x='mean feature importance', y='mean nFDR', hue='support source',
                edgecolor=None, alpha=0.5, markers=['.', '.', 'X'], style='support source', ax=axes[0]);
sns.scatterplot(data=df, x='mean feature importance', y='logFC', hue='support source',
                edgecolor=None, alpha=0.5, markers=['.', '.', 'X'], style='support source', ax=axes[1]);
sns.scatterplot(data=df, x='mean feature importance', y='PValue', hue='support source',
                edgecolor=None, alpha=0.5, markers=['.', '.', 'X'], style='support source', ax=axes[2]);

sns.pairplot(df, hue="support source", markers=['.', '.', 'X'],
             vars=["mean feature importance", "logFC", "mean nFDR"],
             plot_kws=dict(edgecolor=None, alpha=0.25));

/home/tyler/anaconda3/envs/gsfenv/lib/python3.7/site-packages/seaborn/distributions.py:305: UserWarning: Dataset has 0 variance; skipping density estimate.
  warnings.warn(msg, UserWarning)

UMAP Embeddings of Selections¶

counts, labels = gsf.get_gem_data(union_coll, selected_gene_sets=['Boruta_treatment'], count_variable='qt_counts', 
                                  annotation_variables=['treatment', 'genotype', 'time'])
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');

counts, labels = gsf.get_gem_data(union_coll, selected_gene_sets=['literature_union'], count_variable='qt_counts', 
                                  annotation_variables=['treatment', 'genotype', 'time'])
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');

Clustermap Selection¶

def series_to_colors(series, cmap, categorical=True):
    keys = series.unique()
    colors = hv.plotting.util.process_cmap(cmap, len(keys), categorical=categorical)
    mapper = {k: c for k, c in zip(keys, colors)}
    return series.map(mapper)

# az_samples = agem.data.sel(Sample=agem.data.genotype == 'Azuenca (AZ; IRGC#328, Japonica)')['Sample'].values
df, labels = gsf.get_gem_data(union_coll, 
#                               sample_subset=az_samples, 
                              annotation_variables=['treatment', 'genotype'],
                              selected_gene_sets=["Boruta_treatment", 
                                                  "literature_union"],
                              gene_set_mode='intersection',
                              count_transform=lambda counts: np.log2(counts.where(counts > 0)),
                              output_type="pandas")


color_df = pd.DataFrame({
    "treatment": series_to_colors(labels['treatment'], "Set1"),
    "genotype": series_to_colors(labels['genotype'], "Set2")
})

sns.clustermap(df.fillna(0), 
               metric='cityblock', 
               row_colors=color_df,
               row_cluster=False, 
               dendrogram_ratio=0.1,
               cmap='jet',
               figsize=(10, 10));

Functional Enrichment / Gene Set Enrichment Analysis¶

In investigating a set of genes it is common to examine available annotations for those genes. With special interest given to those annotations that appear enriched within our selection.

GSEAPY is useful here.

import gseapy as gp
from gseapy.plot import gseaplot

func_data_path = OSF_PATH.joinpath('GEMmakerGEMs', 'raw_annotation_data', 'oryza_sativa.MSU_v7_0.genes2annots.txt')
fdf = pd.read_csv(func_data_path, sep='\t', low_memory=False)
fdf = fdf[fdf['Gene'].isin(agem.data.Gene.values)]
fdf.head()

	Gene	Term	Definition	CV	Reaction-id	EC	Protein-id	Protein-name	Gene-id
0	LOC_Os12g04990	PWY-321	cutin biosynthesis	PlantCyc	RXN-16389	EC-6.2.1.3	GN7F-26073-MONOMER	NaN	GN7F-26073
1	LOC_Os11g35400	PWY-321	cutin biosynthesis	PlantCyc	RXN-16389	EC-6.2.1.3	GN7F-17597-MONOMER	NaN	GN7F-17597
2	LOC_Os05g04170	PWY-321	cutin biosynthesis	PlantCyc	RXN-16389	EC-6.2.1.3	GN7F-17804-MONOMER	NaN	GN7F-17804
3	LOC_Os06g06350	PWY-321	cutin biosynthesis	PlantCyc	RXN-16389	EC-6.2.1.3	GN7F-18230-MONOMER	NaN	GN7F-18230
4	LOC_Os05g25310	PWY-321	cutin biosynthesis	PlantCyc	RXN-16389	EC-6.2.1.3	GN7F-26550-MONOMER	NaN	GN7F-26550

support = dge_collection["'0 + treatment'__treatment[HEAT]"].get_support()
term_counts = fdf.loc[fdf['Gene'].isin(support)].groupby('Term')['Gene'].count()
term_counts = term_counts[term_counts > 10]
scores = dge_collection["'0 + treatment'__treatment[HEAT]"].data['logFC'].sel(Gene=support).to_dataframe()
term_genes = dict()
for term in term_counts.index.values:
    term_genes[term] = fdf.groupby('Term')['Gene'].unique()[term]
    

pre_res = gp.prerank(rnk=scores, 
                     gene_sets=term_genes,
                     processes=4,
                     permutation_num=100, # reduce number to speed up testing
                     outdir='prerank', 
                     format='png', seed=6)

pre_res.res2d.sort_index().head()

	es	nes	pval	fdr	geneset_size	matched_size	genes	ledge_genes
Term
GO:0005524	0.287134	0.933461	0.543210	0.540046	2218	15	LOC_Os04g01740;LOC_Os05g44340;LOC_Os06g12370;L...	LOC_Os04g01740;LOC_Os05g44340;LOC_Os06g12370
IPR002068	0.821481	2.909723	0.000000	0.000000	39	17	LOC_Os03g14180;LOC_Os11g13980;LOC_Os01g04370;L...	LOC_Os03g14180;LOC_Os11g13980;LOC_Os01g04370;L...
IPR008978	0.793451	2.646324	0.000000	0.000000	50	20	LOC_Os03g14180;LOC_Os11g13980;LOC_Os01g04370;L...	LOC_Os03g14180;LOC_Os11g13980;LOC_Os01g04370;L...
expressed protein	0.355513	1.429853	0.106383	0.117277	14309	38	LOC_Os07g47840;LOC_Os02g15930;LOC_Os11g05170;L...	LOC_Os07g47840;LOC_Os02g15930;LOC_Os11g05170;L...
path:dosa04141	0.670797	2.644200	0.000000	0.000000	199	27	LOC_Os03g14180;LOC_Os11g13980;LOC_Os01g04370;L...	LOC_Os03g14180;LOC_Os11g13980;LOC_Os01g04370;L...

terms = pre_res.res2d.index
terms

Index(['IPR002068', 'IPR008978', 'path:dosa04141', 'expressed protein',
       'GO:0005524'],
      dtype='object', name='Term')

gseaplot(rank_metric=pre_res.ranking, term=terms[0], **pre_res.results[terms[0]])

Conclusion & Next Steps¶

Having introduced the GSForge package by example we hope you are motivated to compare and share gene selection methodologies. Recall that the scope of gene / feature selection is unique to each experiment, no gold standard method or statistical measure yet exists to unambiguously interpret RNA-seq data. In most cases researchers or their collaborators are interested in a reduced list of genes (rather than the entire genome!) from which to form a hypothesis.

We hope you feel free to contribute, share and question methodologies, and that you find GSForge helpful in this endevour.

References¶

Wilkins, O. et al. EGRINs (Environmental gene regulatory influence networks) in rice that function in the response to water deficit, high temperature, and agricultural environments. Plant Cell 28, 2365–2384 (2016).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. (2018).
Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R. & Pfister, H. UpSet: Visualization of Intersecting Sets Europe PMC Funders Group. IEEE Trans Vis Comput Graph 20, 1983–1992 (2014).

Selecting and Comparing Genes with GSForge¶

1. Create an Annotated Gene Expression Matrix¶

Add a Normalized or Transformed Count Matrix¶

Select Counts and Annotations using `get_gem_data()`¶

2. Select genes / features / tags¶

Create Design Matrices and Contrasts¶

Select via `edgeR`¶

Random Forest via Boruta Feature Selection¶

Gene Sets from Literature¶

3. Compare Results¶

Visualize Set Overlap¶

Comparing Selection Sets¶

Ranking Features within Sets¶

Rank Genes with a Random Forest¶

UMAP Embeddings of Selections¶

Clustermap Selection¶

Functional Enrichment / Gene Set Enrichment Analysis¶

Conclusion & Next Steps¶

References¶

Selecting and Comparing Genes with GSForge¶

1. Create an Annotated Gene Expression Matrix¶

Add a Normalized or Transformed Count Matrix¶

Select Counts and Annotations using get_gem_data()¶

2. Select genes / features / tags¶

Create Design Matrices and Contrasts¶

Select via edgeR¶

Random Forest via Boruta Feature Selection¶

Gene Sets from Literature¶

3. Compare Results¶

Visualize Set Overlap¶

Comparing Selection Sets¶

Ranking Features within Sets¶

Rank Genes with a Random Forest¶

UMAP Embeddings of Selections¶

Clustermap Selection¶

Functional Enrichment / Gene Set Enrichment Analysis¶

Conclusion & Next Steps¶

References¶

Select Counts and Annotations using `get_gem_data()`¶

Select via `edgeR`¶