Selecting Genes with Boruta and Random Forests

This notebook descbites howo to select genes (features) of importance using Random Forests through the Boruta algorithm.

An issue in using random forest algorithms is that many of them attempt to find a minimal viable feature set. In our use we are intrested in all potentially informative features, and this is where the Boruta algorithm or warpper helps.

Background information:


Setting up the notebook

# OS-independent path management.
from os import  environ
from pathlib import Path

import numpy as np
import pandas as pd
import umap, umap.plot

from sklearn import preprocessing, model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score, plot_roc_curve

import matplotlib.pyplot as plt
import holoviews as hv
hv.extension("bokeh")
%matplotlib inline

import GSForge as gsf

Declaring used paths

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_normed.nc")

Loading a demo AnnotatedGEM

agem = gsf.AnnotatedGEM(NORMED_GEM_PATH)
agem
<GSForge.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Model Setup and Parameterization

Consider the depth and number of decision trees for your model. The authors of the Boruta algorithm recommend a maximum node depth between 3 and 7.

rf_mdl_01 = RandomForestClassifier(
    class_weight='balanced',
    max_depth=3, 
    n_estimators=10, 
    n_jobs=-1)
rf_mdl_01
RandomForestClassifier(class_weight='balanced', max_depth=3, n_estimators=10,
                       n_jobs=-1)

Now we can run the feature selection.

%%time
boruta_result = gsf.operations.BorutaProspector(
    agem, 
    estimator=rf_mdl_01, 
    annotation_variables="treatment",
    max_iter=5,
    perc=95)
CPU times: user 1min 14s, sys: 17.3 s, total: 1min 32s
Wall time: 36.3 s

This produces an xarray.Dataset object.

boruta_result
<xarray.Dataset>
Dimensions:       (Gene: 55986)
Coordinates:
  * Gene          (Gene) object 'ChrSy.fgenesh.gene.1' ... 'LOC_Os12g44390'
Data variables:
    support       (Gene) bool False False False False ... False False False
    support_weak  (Gene) bool False False False False ... False False False
    ranking       (Gene) int64 26876 26876 26876 26876 ... 26876 26876 26876
Attributes:
    boruta_model:                   {"perc": 95, "alpha": 0.05, "estimator": ...
    ranking_model:                  {"bootstrap": true, "ccp_alpha": 0.0, "cl...
    selected_count_variable:        counts
    selected_annotation_variables:  ['treatment']

You can determine if the boruta algorithm needs more iterations by checking the contents of support_weak; increase the max_iter until all features have been resolved.

print(f"""
n features selected: {boruta_result.support.sum().values}
n potential features: {boruta_result.support_weak.sum().values}
""")
n features selected: 0
n potential features: 3362

Creating a GeneSet from the result

Consider any metadata that should be stored, and add that to the xarray.Dataset result using assign_attrs()

attrs = {"selection_model": str(selecting_model)}
boruta_result = boruta_result.assign_attrs(attrs)

You will need to convert nested dictionaries into json strings.

After the object is created you can easily see the number and percent of genes selected:

gsf.GeneSet(boruta_result, name="Boruta Treatment")
<GSForge.GeneSet>
Name: Boruta Treatment
    Supported Genes:  0

Using a GeneSetCollection

For running more than one boruta model in the notebook, we recommend creating them in a loop, and adding them to a collection, as demonstrated below.

Boruta Feature Selection

Here we try to find all relevant features for our sample annotations of interest. This takes a few minutes to run.

RUN = True
%%time

if RUN == True:
    boruta_gsc = gsf.GeneSetCollection(gem=agem)

    for target in ["treatment", "genotype"]:
        boruta_treatment_ds = gsf.operations.BorutaProspector(
            agem,
            estimator=rf_mdl_01,
            annotation_variables=target,
            perc=100,
            max_iter=1000)

        boruta_gsc[f"Boruta_{target}"] = gsf.GeneSet(boruta_treatment_ds, name=f"Boruta_{target}")
for key, geneset in boruta_gsc.gene_sets.items():
    print(f"""{key}
    
    n features selected: {geneset.data.support.sum().values}
    n potential features: {geneset.data.support_weak.sum().values}
    """)
boruta_gsc
sel_counts, labels = gsf.get_gem_data(
    boruta_gsc, 
    annotation_variables=['treatment', 'genotype', 'time'],
    selected_gene_sets=['treatment', 'genotype'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');
sel_counts, labels = gsf.get_gem_data(
    boruta_gsc, 
    annotation_variables=['treatment', 'genotype', 'time'],
    selected_gene_sets=['treatment'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');
sel_counts, labels = gsf.get_gem_data(
    boruta_gsc, 
    annotation_variables=['treatment', 'genotype', 'time'],
    selected_gene_sets=['genotype'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');

Further Analysis

While the above (appears) to work well for selecting features that inform our labels, we do not yet have any information as to which feature informs which annotation, or a measure of that effect.

For these wrappers we will need to create one-hot embeddings of our labels.

sel_counts, labels = gsf.get_gem_data(
    boruta_gsc, 
    annotation_variables=['treatment', 'genotype', 'time'],
    selected_gene_sets=['genotype', 'treatment'],
)
treatment_labels = labels.treatment.to_series().unique()
treatment_enc = preprocessing.OneHotEncoder().fit(labels.treatment.values[:, np.newaxis])
treatment_onehot = treatment_enc.transform(labels.treatment.values[:, np.newaxis]).toarray()
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    sel_counts.values, treatment_onehot)
%%time
multi_out_cls = MultiOutputClassifier(rf_mdl_01).fit(x_train, y_train)
multi_out_cls_score = multi_out_cls.predict_proba(x_test)
fpr = dict()
tpr = dict()
roc_auc = dict()

for i, label in enumerate(treatment_labels):
    fpr[label], tpr[label], _ = roc_curve(y_test[:, i], multi_out_cls_score[i][:, 0])
    roc_auc[label] = auc(fpr[label], tpr[label])

roc_curves = {class_: hv.Curve((tpr[class_], fpr[class_]))
              for class_ in treatment_labels}

hv.NdOverlay(roc_curves).opts(padding=0.05, legend_position="right", width=650, height=450)
rf_mdl_01
multi_out_cls.estimators_[0]
# treatment_nFDR = gsf.operations.nFDR(
#     boruta_gsc,
#     selected_gene_sets=['genotype', 'treatment'],
#     gene_set_mode="union",
#     annotation_variables=["treatment"],
#     model=multi_out_cls.estimators_[0],
#     n_iterations=5
# )
# treatment_feature_importance = gsf.operations.RankGenesByModel(
#     boruta_gsc,
#     selected_gene_sets=['genotype', 'treatment'],
#     gene_set_mode="union",
#     annotation_variables=["treatment"],
#     model=multi_out_cls.estimators_[0],
#     n_iterations=5
# )
# MultiOutputClassifier()

# ClassifierChain()
# OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier