GeneSet Analysis

This notebook gives some simple examples for:

  • Comparing GeneSets to eachother via some machine-learning model score.
  • Ranking genes within a geneset based on some machine-learning model.

Setting up the notebook

In [1]:
import os
import numpy as np

import matplotlib.pyplot as plt
import holoviews as hv
hv.extension("bokeh")
%matplotlib inline

import GSForge as gsf

Declare paths used

In [2]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
In [3]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
BORUTA_COLL_PATH = OSF_PATH.joinpath("osfstorage", "boruta_gene_sets")
assert AGEM_PATH.exists()

Load an AnnotatedGEM

In [4]:
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem
Out[4]:
<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475
In [5]:
gsc = gsf.GeneSetCollection.from_folder(gem=agem, 
                                        target_dir=BORUTA_COLL_PATH, 
                                        name="Boruta Results")
gsc
Out[5]:
<GSForge.GeneSetCollection>
Boruta Results
GeneSets (3 total): Support Count
    Boruta_Treatment: 681
    Boruta_Genotype: 661
    Boruta_Subspecies: 231

Scoring and Judging GeneSets

Comparing Genes within a GeneSet

By feature importance

In [6]:
from sklearn.ensemble import RandomForestClassifier
In [7]:
gene_rank_mdl = RandomForestClassifier(class_weight='balanced', 
                                       n_estimators=1000, n_jobs=-1)
In [8]:
%%time

treatment_feature_importance = gsf.operations.rank_genes_by_model(
    gsc,
    selected_gene_sets=["Boruta_Treatment"],
    annotation_variables="Treatment",
    model=gene_rank_mdl,
    n_iterations=3
)
CPU times: user 18 s, sys: 27.4 s, total: 45.5 s
Wall time: 38.5 s
In [9]:
treatment_feature_importance
Out[9]:
<xarray.Dataset>
Dimensions:                  (Gene: 681, feature_importance_iter: 3)
Coordinates:
  * Gene                     (Gene) object 'LOC_Os03g11550' ... 'LOC_Os10g36690'
  * feature_importance_iter  (feature_importance_iter) int64 0 1 2
    feature_importance_mean  (Gene) float64 0.0009284 0.0007136 ... 0.000881
    feature_importance_std   (Gene) float64 0.0002016 9.827e-05 ... 0.000195
Data variables:
    feature_importances      (feature_importance_iter, Gene) float64 0.001208 ... 0.0007206

Then you can optionally store such results to a GeneSet using xarray update().

In [10]:
gsc.gene_sets["Boruta_Treatment"].data = gsc.gene_sets["Boruta_Treatment"].data.update(treatment_feature_importance)
gsc.gene_sets["Boruta_Treatment"].data 
Out[10]:
<xarray.Dataset>
Dimensions:                  (Gene: 55986, feature_importance_iter: 3)
Coordinates:
  * Gene                     (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
  * feature_importance_iter  (feature_importance_iter) int64 0 1 2
    feature_importance_mean  (Gene) float64 nan nan nan nan ... nan nan nan
    feature_importance_std   (Gene) float64 nan nan nan nan ... nan nan nan
Data variables:
    support                  (Gene) bool False False False ... False False False
    support_weak             (Gene) bool False False False ... False False False
    ranking                  (Gene) int64 10029 34446 34446 ... 34446 3373 34446
    feature_importances      (feature_importance_iter, Gene) float64 nan ... nan
Attributes:
    boruta_model:                   {"two_step": true, "max_iter": 50, "estim...
    ranking_model:                  {"bootstrap": true, "class_weight": "bala...
    selected_count_variable:        counts
    selected_annotation_variables:  Treatment
    __GSForge.GeneSet.params:       {"gene_index_name": "Gene", "name": "Boru...

Estimate False Discovery Rates

nFDR

In [11]:
%%time
treatment_nFDR = gsf.operations.analytics.nFDR(
    gsc,
    selected_gene_sets=["Boruta_Treatment"],
    annotation_variables="Treatment",
    model=gene_rank_mdl,
    n_iterations=3
)
CPU times: user 42 s, sys: 15.9 s, total: 57.9 s
Wall time: 39.5 s
In [12]:
treatment_nFDR
Out[12]:
<xarray.Dataset>
Dimensions:    (Gene: 681, nFDR_iter: 3)
Coordinates:
  * Gene       (Gene) object 'LOC_Os03g11550' ... 'LOC_Os10g36690'
  * nFDR_iter  (nFDR_iter) int64 0 1 2
    nFDR_mean  (Gene) float64 0.8003 0.9711 0.002447 ... 0.7572 0.2888 0.9745
    nFDR_std   (Gene) float64 0.2253 0.02212 0.003461 ... 0.2475 0.249 0.02381
Data variables:
    nFDR       (nFDR_iter, Gene) float64 0.4816 0.9413 0.007342 ... 0.232 0.9868
In [13]:
gsc.gene_sets["Boruta_Treatment"].data = gsc.gene_sets["Boruta_Treatment"].data.update(treatment_nFDR)

mProbes

In [14]:
%%time
treatment_mProbes = gsf.operations.analytics.mProbes(
    gsc,
    selected_gene_sets=["Boruta_Treatment"],
    annotation_variables="Treatment",
    model=gene_rank_mdl,
    n_iterations=3
)
CPU times: user 23.9 s, sys: 15.5 s, total: 39.4 s
Wall time: 29 s
In [15]:
treatment_mProbes
Out[15]:
<xarray.Dataset>
Dimensions:       (Gene: 681, mProbes_iter: 3)
Coordinates:
  * Gene          (Gene) object 'LOC_Os03g11550' ... 'LOC_Os10g36690'
  * mProbes_iter  (mProbes_iter) int64 0 1 2
    mProbes_mean  (Gene) float64 0.0 0.0 0.0 0.0 0.01566 ... 0.0 0.0 0.0 0.0 0.0
    mProbes_std   (Gene) float64 0.0 0.0 0.0 0.0 0.01816 ... 0.0 0.0 0.0 0.0 0.0
Data variables:
    mProbes NRD   (mProbes_iter, Gene) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
In [16]:
gsc.gene_sets["Boruta_Treatment"].data = gsc.gene_sets["Boruta_Treatment"].data.update(treatment_mProbes)

Examine Rankings

In [17]:
# View the xarray dataset of a GeneSet:
# gsc.gene_sets["Boruta_Treatment"].data
In [18]:
# View the list of 'supported' genes.
len(gsc.gene_sets["Boruta_Treatment"].gene_support())
Out[18]:
681

Subset by supported genes

In [19]:
ds = gsc.gene_sets["Boruta_Treatment"].data.sel(Gene=gsc.gene_sets["Boruta_Treatment"].gene_support())
ds
Out[19]:
<xarray.Dataset>
Dimensions:                  (Gene: 681, feature_importance_iter: 3, mProbes_iter: 3, nFDR_iter: 3)
Coordinates:
  * Gene                     (Gene) object 'LOC_Os03g11550' ... 'LOC_Os10g36690'
  * feature_importance_iter  (feature_importance_iter) int64 0 1 2
    feature_importance_mean  (Gene) float64 0.0009284 0.0007136 ... 0.000881
    feature_importance_std   (Gene) float64 0.0002016 9.827e-05 ... 0.000195
  * nFDR_iter                (nFDR_iter) int64 0 1 2
    nFDR_mean                (Gene) float64 0.8003 0.9711 ... 0.2888 0.9745
    nFDR_std                 (Gene) float64 0.2253 0.02212 ... 0.249 0.02381
  * mProbes_iter             (mProbes_iter) int64 0 1 2
    mProbes_mean             (Gene) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    mProbes_std              (Gene) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Data variables:
    support                  (Gene) bool True True True True ... True True True
    support_weak             (Gene) bool False False False ... False False False
    ranking                  (Gene) int64 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
    feature_importances      (feature_importance_iter, Gene) float64 0.001208 ... 0.0007206
    nFDR                     (nFDR_iter, Gene) float64 0.4816 0.9413 ... 0.9868
    mProbes NRD              (mProbes_iter, Gene) float64 0.0 0.0 ... 0.0 0.0
Attributes:
    boruta_model:                   {"two_step": true, "max_iter": 50, "estim...
    ranking_model:                  {"bootstrap": true, "class_weight": "bala...
    selected_count_variable:        counts
    selected_annotation_variables:  Treatment
    __GSForge.GeneSet.params:       {"gene_index_name": "Gene", "name": "Boru...
In [20]:
# hv.Dataset(ds)
# hv.Scatter(hv.Dataset(ds), ["fwe_mean", "nrd_mean"])
In [21]:
df = ds[["feature_importance_mean", "nFDR_mean"]].to_dataframe()
plot = hv.Scatter(df, [("feature_importance_mean",  "Mean Feature Importance Rank"),
                ("nFDR_mean",  "False Discovery Rate")]
          ).opts(padding=0.1, width=400, height=400)
plot
Out[21]:
In [22]:
# hv.save(plot, "fdr_vs_importance.png")
In [23]:
df = ds[["feature_importance_mean", "mProbes_mean"]].to_dataframe()
hv.Scatter(df).opts(padding=0.1, width=400, height=400)
Out[23]:

Receiver Operating Characteristic (ROC)

Another way to evaluate a classification model is the ROC.

Viewing ROC curves for multi-label models is a bit indirect as we have to use a binary classifier for each unique target label, so we provide this walkthrough. Also examine the sklearn demo.

In [24]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score, plot_roc_curve
from sklearn import preprocessing, model_selection

Get the count and annotation data from GSForge.

In [25]:
counts, treatment = gsf.get_data(gsc, annotation_variables="Treatment",selected_gene_sets=["Boruta_Treatment"])
classes = treatment.to_series().unique()
classes
Out[25]:
array(['CONTROL', 'HEAT', 'RECOV_HEAT', 'DROUGHT', 'RECOV_DROUGHT'],
      dtype=object)

Encode the annotation labels with a one hot encoder.

In [26]:
enc = preprocessing.OneHotEncoder().fit(treatment.values.reshape(-1, 1))
treatment_onehot = enc.transform(treatment.values.reshape(-1, 1)).toarray()
treatment_onehot
Out[26]:
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.]])

Split the data and encoded annotations into a train and test set.

In [27]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(counts, treatment_onehot)

Fit the model with the training data.

In [28]:
%%time
roc_rf_model = OneVsRestClassifier(RandomForestClassifier(class_weight='balanced', max_depth=3,
                                   n_estimators=1000, n_jobs=-1))
roc_rf_model = roc_rf_model.fit(x_train, y_train)
CPU times: user 3.97 s, sys: 113 ms, total: 4.09 s
Wall time: 10.2 s

Now predict class probabilities for the test count data.

In [29]:
y_score = roc_rf_model.predict_proba(x_test)
y_score[:5]
Out[29]:
array([[0.87076133, 0.33215912, 0.00134682, 0.03518171, 0.00367744],
       [0.7983976 , 0.32986137, 0.01035656, 0.2528504 , 0.00860827],
       [0.78413877, 0.56685172, 0.02223655, 0.10846641, 0.00160146],
       [0.88358456, 0.30536646, 0.004605  , 0.0241869 , 0.00327239],
       [0.42219986, 0.60130967, 0.00482617, 0.22825346, 0.01285147]])
In [30]:
fpr = dict()
tpr = dict()
roc_auc = dict()
for i, class_ in enumerate(classes):
    fpr[class_], tpr[class_], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[class_] = auc(fpr[class_], tpr[class_])
In [31]:
roc_curves = {class_: hv.Curve((fpr[class_], tpr[class_]))
              for class_ in classes}
In [32]:
hv.NdOverlay(roc_curves).opts(padding=0.05, legend_position="top", width=400, height=450)
Out[32]:

View a Decision Tree

In [33]:
from sklearn import tree
import graphviz
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-33-8a9d2c837516> in <module>
      1 from sklearn import tree
----> 2 import graphviz

ModuleNotFoundError: No module named 'graphviz'

Extract a single tree from a list of estimators.

In [106]:
control_rfc = roc_rf_model.estimators_[0]
print(f"There are {len(control_rfc.estimators_)} trees.")
There are 1000 trees.
In [109]:
selected_tree = control_rfc.estimators_[0]
graph = graphviz.Source(tree.export_graphviz(
    selected_tree, 
    feature_names=x_train.Gene.values,  
    class_names=["Not CONTROL", "CONTROL"],
    filled=True, 
    rounded=True,  
    special_characters=True))

graph
# graph.render('decision_tree', format="svg")  
Out[109]:
Tree 0 LOC_Os06g12370 ≤ 11.5 gini = 0.5 samples = 223 value = [176.576, 181.358] class = CONTROL 1 LOC_Os07g42280 ≤ 143.0 gini = 0.343 samples = 117 value = [50.552, 179.679] class = CONTROL 0->1 True 6 LOC_Os06g04070 ≤ 3685.0 gini = 0.026 samples = 106 value = [126.024, 1.679] class = Not CONTROL 0->6 False 2 LOC_Os01g12420 ≤ 59.5 gini = 0.249 samples = 98 value = [30.616, 179.679] class = CONTROL 1->2 5 gini = -0.0 samples = 19 value = [19.936, 0.0] class = Not CONTROL 1->5 3 gini = 0.196 samples = 89 value = [22.072, 178.0] class = CONTROL 2->3 4 gini = 0.275 samples = 9 value = [8.544, 1.679] class = Not CONTROL 2->4 7 gini = 0.0 samples = 97 value = [113.208, 0.0] class = Not CONTROL 6->7 8 LOC_Os05g49770 ≤ 281.0 gini = 0.205 samples = 9 value = [12.816, 1.679] class = Not CONTROL 6->8 9 gini = 0.0 samples = 1 value = [0.0, 1.679] class = CONTROL 8->9 10 gini = 0.0 samples = 8 value = [12.816, 0.0] class = Not CONTROL 8->10

View Genes Used in a Given Trees

In [127]:
counts.Gene.isel(Gene=np.argwhere(selected_tree.feature_importances_).flatten()).values
Out[127]:
array(['LOC_Os05g49770', 'LOC_Os07g42280', 'LOC_Os06g04070',
       'LOC_Os06g12370', 'LOC_Os01g12420'], dtype=object)
In [130]:
tree_subset = counts.isel(Gene=np.argwhere(selected_tree.feature_importances_).flatten())
hv.Raster(tree_subset.values).opts(logz=True)
Out[130]:


Right click to download this notebook from GitHub.