GeneSet Analysis¶

This notebook gives some simple examples for:

Comparing GeneSets to eachother via some machine-learning model score.
Ranking genes within a geneset based on some machine-learning model.

Setting up the notebook

In [1]:

import os
import numpy as np

import matplotlib.pyplot as plt
import holoviews as hv
hv.extension("bokeh")
%matplotlib inline

import GSForge as gsf

Declare paths used

In [2]:

# OS-independent path management.
from os import fspath, environ
from pathlib import Path

In [3]:

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
BORUTA_COLL_PATH = OSF_PATH.joinpath("osfstorage", "boruta_gene_sets")
assert AGEM_PATH.exists()

Load an AnnotatedGEM

In [4]:

agem = gsf.AnnotatedGEM(AGEM_PATH)
agem

Out[4]:

<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

In [5]:

gsc = gsf.GeneSetCollection.from_folder(gem=agem, 
                                        target_dir=BORUTA_COLL_PATH, 
                                        name="Boruta Results")
gsc

Out[5]:

<GSForge.GeneSetCollection>
Boruta Results
GeneSets (3 total): Support Count
    Boruta_Treatment: 681
    Boruta_Genotype: 661
    Boruta_Subspecies: 231

Scoring and Judging GeneSets¶

Comparing Genes within a GeneSet¶

By feature importance

In [6]:

from sklearn.ensemble import RandomForestClassifier

In [7]:

gene_rank_mdl = RandomForestClassifier(class_weight='balanced', 
                                       n_estimators=1000, n_jobs=-1)

In [8]:

%%time

treatment_feature_importance = gsf.operations.rank_genes_by_model(
    gsc,
    selected_gene_sets=["Boruta_Treatment"],
    annotation_variables="Treatment",
    model=gene_rank_mdl,
    n_iterations=3
)

CPU times: user 18 s, sys: 27.4 s, total: 45.5 s
Wall time: 38.5 s

In [9]:

treatment_feature_importance

Out[9]:

<xarray.Dataset>
Dimensions:                  (Gene: 681, feature_importance_iter: 3)
Coordinates:
  * Gene                     (Gene) object 'LOC_Os03g11550' ... 'LOC_Os10g36690'
  * feature_importance_iter  (feature_importance_iter) int64 0 1 2
    feature_importance_mean  (Gene) float64 0.0009284 0.0007136 ... 0.000881
    feature_importance_std   (Gene) float64 0.0002016 9.827e-05 ... 0.000195
Data variables:
    feature_importances      (feature_importance_iter, Gene) float64 0.001208 ... 0.0007206

Then you can optionally store such results to a GeneSet using xarray update().

In [10]:

gsc.gene_sets["Boruta_Treatment"].data = gsc.gene_sets["Boruta_Treatment"].data.update(treatment_feature_importance)
gsc.gene_sets["Boruta_Treatment"].data

Out[10]:

<xarray.Dataset>
Dimensions:                  (Gene: 55986, feature_importance_iter: 3)
Coordinates:
  * Gene                     (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
  * feature_importance_iter  (feature_importance_iter) int64 0 1 2
    feature_importance_mean  (Gene) float64 nan nan nan nan ... nan nan nan
    feature_importance_std   (Gene) float64 nan nan nan nan ... nan nan nan
Data variables:
    support                  (Gene) bool False False False ... False False False
    support_weak             (Gene) bool False False False ... False False False
    ranking                  (Gene) int64 10029 34446 34446 ... 34446 3373 34446
    feature_importances      (feature_importance_iter, Gene) float64 nan ... nan
Attributes:
    boruta_model:                   {"two_step": true, "max_iter": 50, "estim...
    ranking_model:                  {"bootstrap": true, "class_weight": "bala...
    selected_count_variable:        counts
    selected_annotation_variables:  Treatment
    __GSForge.GeneSet.params:       {"gene_index_name": "Gene", "name": "Boru...

Estimate False Discovery Rates¶

nFDR

In [11]:

%%time
treatment_nFDR = gsf.operations.analytics.nFDR(
    gsc,
    selected_gene_sets=["Boruta_Treatment"],
    annotation_variables="Treatment",
    model=gene_rank_mdl,
    n_iterations=3
)

CPU times: user 42 s, sys: 15.9 s, total: 57.9 s
Wall time: 39.5 s

In [12]:

treatment_nFDR

Out[12]:

<xarray.Dataset>
Dimensions:    (Gene: 681, nFDR_iter: 3)
Coordinates:
  * Gene       (Gene) object 'LOC_Os03g11550' ... 'LOC_Os10g36690'
  * nFDR_iter  (nFDR_iter) int64 0 1 2
    nFDR_mean  (Gene) float64 0.8003 0.9711 0.002447 ... 0.7572 0.2888 0.9745
    nFDR_std   (Gene) float64 0.2253 0.02212 0.003461 ... 0.2475 0.249 0.02381
Data variables:
    nFDR       (nFDR_iter, Gene) float64 0.4816 0.9413 0.007342 ... 0.232 0.9868

In [13]:

gsc.gene_sets["Boruta_Treatment"].data = gsc.gene_sets["Boruta_Treatment"].data.update(treatment_nFDR)

mProbes

In [14]:

%%time
treatment_mProbes = gsf.operations.analytics.mProbes(
    gsc,
    selected_gene_sets=["Boruta_Treatment"],
    annotation_variables="Treatment",
    model=gene_rank_mdl,
    n_iterations=3
)

CPU times: user 23.9 s, sys: 15.5 s, total: 39.4 s
Wall time: 29 s

In [15]:

treatment_mProbes

Out[15]:

<xarray.Dataset>
Dimensions:       (Gene: 681, mProbes_iter: 3)
Coordinates:
  * Gene          (Gene) object 'LOC_Os03g11550' ... 'LOC_Os10g36690'
  * mProbes_iter  (mProbes_iter) int64 0 1 2
    mProbes_mean  (Gene) float64 0.0 0.0 0.0 0.0 0.01566 ... 0.0 0.0 0.0 0.0 0.0
    mProbes_std   (Gene) float64 0.0 0.0 0.0 0.0 0.01816 ... 0.0 0.0 0.0 0.0 0.0
Data variables:
    mProbes NRD   (mProbes_iter, Gene) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0

In [16]:

gsc.gene_sets["Boruta_Treatment"].data = gsc.gene_sets["Boruta_Treatment"].data.update(treatment_mProbes)

Examine Rankings¶

In [17]:

# View the xarray dataset of a GeneSet:
# gsc.gene_sets["Boruta_Treatment"].data

In [18]:

# View the list of 'supported' genes.
len(gsc.gene_sets["Boruta_Treatment"].gene_support())

Out[18]:

Subset by supported genes

In [19]:

ds = gsc.gene_sets["Boruta_Treatment"].data.sel(Gene=gsc.gene_sets["Boruta_Treatment"].gene_support())
ds

Out[19]:

<xarray.Dataset>
Dimensions:                  (Gene: 681, feature_importance_iter: 3, mProbes_iter: 3, nFDR_iter: 3)
Coordinates:
  * Gene                     (Gene) object 'LOC_Os03g11550' ... 'LOC_Os10g36690'
  * feature_importance_iter  (feature_importance_iter) int64 0 1 2
    feature_importance_mean  (Gene) float64 0.0009284 0.0007136 ... 0.000881
    feature_importance_std   (Gene) float64 0.0002016 9.827e-05 ... 0.000195
  * nFDR_iter                (nFDR_iter) int64 0 1 2
    nFDR_mean                (Gene) float64 0.8003 0.9711 ... 0.2888 0.9745
    nFDR_std                 (Gene) float64 0.2253 0.02212 ... 0.249 0.02381
  * mProbes_iter             (mProbes_iter) int64 0 1 2
    mProbes_mean             (Gene) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    mProbes_std              (Gene) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Data variables:
    support                  (Gene) bool True True True True ... True True True
    support_weak             (Gene) bool False False False ... False False False
    ranking                  (Gene) int64 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
    feature_importances      (feature_importance_iter, Gene) float64 0.001208 ... 0.0007206
    nFDR                     (nFDR_iter, Gene) float64 0.4816 0.9413 ... 0.9868
    mProbes NRD              (mProbes_iter, Gene) float64 0.0 0.0 ... 0.0 0.0
Attributes:
    boruta_model:                   {"two_step": true, "max_iter": 50, "estim...
    ranking_model:                  {"bootstrap": true, "class_weight": "bala...
    selected_count_variable:        counts
    selected_annotation_variables:  Treatment
    __GSForge.GeneSet.params:       {"gene_index_name": "Gene", "name": "Boru...

In [20]:

# hv.Dataset(ds)
# hv.Scatter(hv.Dataset(ds), ["fwe_mean", "nrd_mean"])

In [21]:

df = ds[["feature_importance_mean", "nFDR_mean"]].to_dataframe()
plot = hv.Scatter(df, [("feature_importance_mean",  "Mean Feature Importance Rank"),
                ("nFDR_mean",  "False Discovery Rate")]
          ).opts(padding=0.1, width=400, height=400)
plot

Out[21]:

In [22]:

# hv.save(plot, "fdr_vs_importance.png")

In [23]:

df = ds[["feature_importance_mean", "mProbes_mean"]].to_dataframe()
hv.Scatter(df).opts(padding=0.1, width=400, height=400)

Out[23]:

Receiver Operating Characteristic (ROC)¶

Another way to evaluate a classification model is the ROC.

Viewing ROC curves for multi-label models is a bit indirect as we have to use a binary classifier for each unique target label, so we provide this walkthrough. Also examine the sklearn demo.

In [24]:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score, plot_roc_curve
from sklearn import preprocessing, model_selection

Get the count and annotation data from GSForge.

In [25]:

counts, treatment = gsf.get_data(gsc, annotation_variables="Treatment",selected_gene_sets=["Boruta_Treatment"])
classes = treatment.to_series().unique()
classes

Out[25]:

array(['CONTROL', 'HEAT', 'RECOV_HEAT', 'DROUGHT', 'RECOV_DROUGHT'],
      dtype=object)

Encode the annotation labels with a one hot encoder.

In [26]:

enc = preprocessing.OneHotEncoder().fit(treatment.values.reshape(-1, 1))
treatment_onehot = enc.transform(treatment.values.reshape(-1, 1)).toarray()
treatment_onehot

Out[26]:

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.]])

Split the data and encoded annotations into a train and test set.

In [27]:

x_train, x_test, y_train, y_test = model_selection.train_test_split(counts, treatment_onehot)

Fit the model with the training data.

In [28]:

%%time
roc_rf_model = OneVsRestClassifier(RandomForestClassifier(class_weight='balanced', max_depth=3,
                                   n_estimators=1000, n_jobs=-1))
roc_rf_model = roc_rf_model.fit(x_train, y_train)

CPU times: user 3.97 s, sys: 113 ms, total: 4.09 s
Wall time: 10.2 s

Now predict class probabilities for the test count data.

In [29]:

y_score = roc_rf_model.predict_proba(x_test)
y_score[:5]

Out[29]:

array([[0.87076133, 0.33215912, 0.00134682, 0.03518171, 0.00367744],
       [0.7983976 , 0.32986137, 0.01035656, 0.2528504 , 0.00860827],
       [0.78413877, 0.56685172, 0.02223655, 0.10846641, 0.00160146],
       [0.88358456, 0.30536646, 0.004605  , 0.0241869 , 0.00327239],
       [0.42219986, 0.60130967, 0.00482617, 0.22825346, 0.01285147]])

In [30]:

fpr = dict()
tpr = dict()
roc_auc = dict()
for i, class_ in enumerate(classes):
    fpr[class_], tpr[class_], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[class_] = auc(fpr[class_], tpr[class_])

In [31]:

roc_curves = {class_: hv.Curve((fpr[class_], tpr[class_]))
              for class_ in classes}

In [32]:

hv.NdOverlay(roc_curves).opts(padding=0.05, legend_position="top", width=400, height=450)

Out[32]:

View a Decision Tree¶

In [33]:

from sklearn import tree
import graphviz

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-33-8a9d2c837516> in <module>
      1 from sklearn import tree
----> 2 import graphviz

ModuleNotFoundError: No module named 'graphviz'

Extract a single tree from a list of estimators.

In [106]:

control_rfc = roc_rf_model.estimators_[0]
print(f"There are {len(control_rfc.estimators_)} trees.")

There are 1000 trees.

In [109]:

selected_tree = control_rfc.estimators_[0]
graph = graphviz.Source(tree.export_graphviz(
    selected_tree, 
    feature_names=x_train.Gene.values,  
    class_names=["Not CONTROL", "CONTROL"],
    filled=True, 
    rounded=True,  
    special_characters=True))

graph
# graph.render('decision_tree', format="svg")

Out[109]:

View Genes Used in a Given Trees¶

In [127]:

counts.Gene.isel(Gene=np.argwhere(selected_tree.feature_importances_).flatten()).values

Out[127]:

array(['LOC_Os05g49770', 'LOC_Os07g42280', 'LOC_Os06g04070',
       'LOC_Os06g12370', 'LOC_Os01g12420'], dtype=object)

In [130]:

tree_subset = counts.isel(Gene=np.argwhere(selected_tree.feature_importances_).flatten())
hv.Raster(tree_subset.values).opts(logz=True)

Out[130]:

Right click to download this notebook from GitHub.