Selecting Genes with Boruta and Random Forests

This notebook describes how to select genes (features) of importance using Random Forests through the Boruta algorithm.

A limitation of many random forest feature selection methods is that they attempt to find a minimal viable feature set. Here we are interested in all potentially informative features, and this is where the Boruta algorithm, a wrapper around the random forest, helps.
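The core idea behind Boruta can be sketched in a single round: each real feature competes against "shadow" copies whose values have been shuffled, and only features that beat the best shadow count as a hit. The snippet below is a simplified illustration on synthetic data, not the BorutaPy or GSForge implementation; the data, sizes and hyperparameters are assumptions. The full algorithm repeats this round many times and applies a statistical test to the accumulated hits.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 samples, 10 features; only features 0 and 1 carry signal.
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Shadow features: every column shuffled independently, which breaks any link to y.
X_shadow = np.column_stack([rng.permutation(col) for col in X.T])

# Fit one forest on the real and shadow features together.
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(np.hstack([X, X_shadow]), y)

importances = forest.feature_importances_
real_imp, shadow_imp = importances[:10], importances[10:]

# With perc=100 the threshold is the maximum shadow importance.
threshold = np.percentile(shadow_imp, 100)
hits = real_imp > threshold
print("Features that beat every shadow feature:", np.flatnonzero(hits))
```

Lowering the percentile makes the threshold less strict, which admits more features at the cost of more false positives.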

Background information:

- Random Forests
  - The sklearn model used
  - The excellent sklearn user guide
- Python Boruta algorithm

Setting up the notebook

# OS-independent path management.
from os import environ
from pathlib import Path

import numpy as np
import pandas as pd
import umap, umap.plot

from sklearn import preprocessing, model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
# plot_roc_curve was removed in scikit-learn 1.2 and is not used below.
from sklearn.metrics import roc_curve, auc, roc_auc_score

import matplotlib.pyplot as plt
import holoviews as hv
hv.extension("bokeh")
%matplotlib inline

import GSForge as gsf

Declaring used paths

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_normed.nc")

Loading a demo AnnotatedGEM

agem = gsf.AnnotatedGEM(NORMED_GEM_PATH)
agem

<GSForge.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Boruta Feature Selection

Here we try to find all relevant features for our sample annotations of interest. This takes a few minutes to run.

RUN = True

%%time

if RUN:
    boruta_gsc = gsf.GeneSetCollection(gem=agem)

    for target in ["treatment", "genotype"]:
        boruta_treatment_ds = gsf.operations.BorutaProspector(
            agem,
            estimator=rf_mdl_01,
            annotation_variables=target,
            perc=100,
            max_iter=1000)

        boruta_gsc[f"Boruta_{target}"] = gsf.GeneSet(boruta_treatment_ds, name=f"Boruta_{target}")

for key, geneset in boruta_gsc.gene_sets.items():
    print(f"""{key}
    
    n features selected: {geneset.data.support.sum().values}
    n potential features: {geneset.data.support_weak.sum().values}
    """)

boruta_gsc

sel_counts, labels = gsf.get_gem_data(
    boruta_gsc, 
    annotation_variables=['treatment', 'genotype', 'time'],
    selected_gene_sets=['Boruta_treatment', 'Boruta_genotype'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');

sel_counts, labels = gsf.get_gem_data(
    boruta_gsc, 
    annotation_variables=['treatment', 'genotype', 'time'],
    selected_gene_sets=['Boruta_treatment'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');

sel_counts, labels = gsf.get_gem_data(
    boruta_gsc, 
    annotation_variables=['treatment', 'genotype', 'time'],
    selected_gene_sets=['Boruta_genotype'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');

Further Analysis

While the above appears to work well for selecting features that inform our labels, it does not yet tell us which feature informs which annotation, nor does it give a measure of that effect.

The scikit-learn multi-output wrappers used below fit one classifier per label, so we first need to create one-hot encodings of our labels.
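As a small illustration of the encoding step, the sketch below one-hot encodes a handful of made-up treatment labels (the label values are hypothetical). It also shows that the columns of the encoded matrix follow the sorted order stored in the encoder's categories_ attribute, rather than the order in which labels first appear in the data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical treatment labels; in the notebook they come from `labels.treatment`.
toy_labels = np.array(["HEAT", "CONTROL", "DROUGHT", "CONTROL", "HEAT"])

# The encoder expects a 2-D array, hence the added axis.
encoder = OneHotEncoder().fit(toy_labels[:, np.newaxis])
onehot = encoder.transform(toy_labels[:, np.newaxis]).toarray()

# Column i of `onehot` corresponds to encoder.categories_[0][i].
column_labels = list(encoder.categories_[0])
print(column_labels)
print(onehot)
```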

sel_counts, labels = gsf.get_gem_data(
    boruta_gsc, 
    annotation_variables=['treatment', 'genotype', 'time'],
    selected_gene_sets=['Boruta_genotype', 'Boruta_treatment'],
)

treatment_enc = preprocessing.OneHotEncoder().fit(labels.treatment.values[:, np.newaxis])
treatment_onehot = treatment_enc.transform(labels.treatment.values[:, np.newaxis]).toarray()
# The one-hot columns follow the encoder's sorted categories, so take the label order from there.
treatment_labels = treatment_enc.categories_[0]

x_train, x_test, y_train, y_test = model_selection.train_test_split(
    sel_counts.values, treatment_onehot)

%%time
multi_out_cls = MultiOutputClassifier(rf_mdl_01).fit(x_train, y_train)
multi_out_cls_score = multi_out_cls.predict_proba(x_test)

fpr = dict()
tpr = dict()
roc_auc = dict()

for i, label in enumerate(treatment_labels):
    # Column 1 of each per-label array holds the probability of the positive class.
    fpr[label], tpr[label], _ = roc_curve(y_test[:, i], multi_out_cls_score[i][:, 1])
    roc_auc[label] = auc(fpr[label], tpr[label])

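# ROC curves plot the false positive rate on the x-axis and the true positive rate on the y-axis.
roc_curves = {class_: hv.Curve((fpr[class_], tpr[class_]),
                               kdims="False positive rate", vdims="True positive rate")
              for class_ in treatment_labels}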

hv.NdOverlay(roc_curves).opts(padding=0.05, legend_position="right", width=650, height=450)
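For reference, MultiOutputClassifier.predict_proba returns a list with one array per output column, and each array has one column per class of that output's estimator. The sketch below demonstrates this on synthetic data; the data and the train/test split are assumptions for illustration only. It computes a per-label ROC AUC from the positive-class column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in: 300 samples, 5 features, 3 mutually exclusive classes, one-hot encoded.
X = rng.normal(size=(300, 5))
classes = np.argmax(X[:, :3], axis=1)
Y = np.eye(3, dtype=int)[classes]

X_train, X_test, Y_train, Y_test = X[:200], X[200:], Y[:200], Y[200:]

base = RandomForestClassifier(n_estimators=100, random_state=0)
clf = MultiOutputClassifier(base).fit(X_train, Y_train)

# A list with one (n_samples, 2) array per output column.
proba = clf.predict_proba(X_test)

aucs = []
for i, p in enumerate(proba):
    # Each binary estimator has classes_ == [0, 1], so column 1 is P(label i is present).
    fpr, tpr, _ = roc_curve(Y_test[:, i], p[:, 1])
    aucs.append(auc(fpr, tpr))
    print(f"label {i}: AUC = {aucs[-1]:.3f}")
```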

rf_mdl_01

multi_out_cls.estimators_[0]
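Each fitted per-label estimator is an ordinary random forest, so its feature_importances_ attribute gives one rough measure of which features inform that label. The sketch below ranks features this way on synthetic data; the gene names and data are hypothetical. Impurity-based importances have known biases, so treat the ranking as a starting point rather than a final measure of effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Hypothetical gene names and synthetic counts; only gene_3 carries signal for the label.
gene_names = np.array([f"gene_{i}" for i in range(8)])
X = rng.normal(size=(300, 8))
y = (X[:, 3] > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Sort features from most to least important.
order = np.argsort(model.feature_importances_)[::-1]
for name, score in zip(gene_names[order], model.feature_importances_[order]):
    print(f"{name}: {score:.3f}")
```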

# treatment_nFDR = gsf.operations.nFDR(
#     boruta_gsc,
#     selected_gene_sets=['genotype', 'treatment'],
#     gene_set_mode="union",
#     annotation_variables=["treatment"],
#     model=multi_out_cls.estimators_[0],
#     n_iterations=5
# )

# treatment_feature_importance = gsf.operations.RankGenesByModel(
#     boruta_gsc,
#     selected_gene_sets=['genotype', 'treatment'],
#     gene_set_mode="union",
#     annotation_variables=["treatment"],
#     model=multi_out_cls.estimators_[0],
#     n_iterations=5
# )

# MultiOutputClassifier()

# ClassifierChain()

# OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
