Selecting Genes with Boruta and Random Forests¶
This notebook descbites howo to select genes (features) of importance using Random Forests through the Boruta algorithm.
An issue in using random forest algorithms is that many of them attempt to find a minimal viable feature set. In our use we are intrested in all potentially informative features, and this is where the Boruta algorithm or warpper helps.
Background information:
Random Forests
Setting up the notebook
# OS-independent path management.
from os import environ
from pathlib import Path
import numpy as np
import pandas as pd
import umap, umap.plot
from sklearn import preprocessing, model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score, plot_roc_curve
import matplotlib.pyplot as plt
import holoviews as hv
hv.extension("bokeh")
%matplotlib inline
import GSForge as gsf
Declaring used paths
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_normed.nc")
Loading a demo AnnotatedGEM
agem = gsf.AnnotatedGEM(NORMED_GEM_PATH)
agem
<GSForge.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
Gene 55986
Sample 475
Model Setup and Parameterization¶
Consider the depth and number of decision trees for your model. The authors of the Boruta algorithm recommend a maximum node depth between 3 and 7.
rf_mdl_01 = RandomForestClassifier(
class_weight='balanced',
max_depth=3,
n_estimators=10,
n_jobs=-1)
rf_mdl_01
RandomForestClassifier(class_weight='balanced', max_depth=3, n_estimators=10,
n_jobs=-1)
Now we can run the feature selection.
%%time
boruta_result = gsf.operations.BorutaProspector(
agem,
estimator=rf_mdl_01,
annotation_variables="treatment",
max_iter=5,
perc=95)
CPU times: user 1min 14s, sys: 17.3 s, total: 1min 32s
Wall time: 36.3 s
This produces an xarray.Dataset
object.
boruta_result
<xarray.Dataset> Dimensions: (Gene: 55986) Coordinates: * Gene (Gene) object 'ChrSy.fgenesh.gene.1' ... 'LOC_Os12g44390' Data variables: support (Gene) bool False False False False ... False False False support_weak (Gene) bool False False False False ... False False False ranking (Gene) int64 26876 26876 26876 26876 ... 26876 26876 26876 Attributes: boruta_model: {"perc": 95, "alpha": 0.05, "estimator": ... ranking_model: {"bootstrap": true, "ccp_alpha": 0.0, "cl... selected_count_variable: counts selected_annotation_variables: ['treatment']
- Gene: 55986
- Gene(Gene)object'ChrSy.fgenesh.gene.1' ... 'LOC_...
array(['ChrSy.fgenesh.gene.1', 'ChrSy.fgenesh.gene.10', 'ChrSy.fgenesh.gene.11', ..., 'LOC_Os12g44370', 'LOC_Os12g44380', 'LOC_Os12g44390'], dtype=object)
- support(Gene)boolFalse False False ... False False
array([False, False, False, ..., False, False, False])
- support_weak(Gene)boolFalse False False ... False False
array([False, False, False, ..., False, False, False])
- ranking(Gene)int6426876 26876 26876 ... 26876 26876
array([26876, 26876, 26876, ..., 26876, 26876, 26876])
- boruta_model :
- {"perc": 95, "alpha": 0.05, "estimator": "<not serializable>", "two_step": true, "n_estimators": 1000, "random_state": null, "max_iter": 5, "verbose": 0}
- ranking_model :
- {"bootstrap": true, "ccp_alpha": 0.0, "class_weight": "balanced", "criterion": "gini", "max_depth": 3, "max_features": "auto", "max_leaf_nodes": null, "max_samples": null, "min_impurity_decrease": 0.0, "min_impurity_split": null, "min_samples_leaf": 1, "min_samples_split": 2, "min_weight_fraction_leaf": 0.0, "n_estimators": 1000, "n_jobs": -1, "oob_score": false, "random_state": "<not serializable>", "verbose": 0, "warm_start": false}
- selected_count_variable :
- counts
- selected_annotation_variables :
- ['treatment']
You can determine if the boruta algorithm needs more iterations by checking the contents of support_weak
; increase the max_iter
until all features have been resolved.
print(f"""
n features selected: {boruta_result.support.sum().values}
n potential features: {boruta_result.support_weak.sum().values}
""")
n features selected: 0
n potential features: 3362
Creating a GeneSet from the result
Consider any metadata that should be stored, and add that to the xarray.Dataset
result using assign_attrs()
attrs = {"selection_model": str(selecting_model)}
boruta_result = boruta_result.assign_attrs(attrs)
You will need to convert nested dictionaries into json strings.
After the object is created you can easily see the number and percent of genes selected:
gsf.GeneSet(boruta_result, name="Boruta Treatment")
<GSForge.GeneSet>
Name: Boruta Treatment
Supported Genes: 0
Using a GeneSetCollection
For running more than one boruta model in the notebook, we recommend creating them in a loop, and adding them to a collection, as demonstrated below.
Boruta Feature Selection¶
Here we try to find all relevant features for our sample annotations of interest. This takes a few minutes to run.
RUN = True
%%time
if RUN == True:
boruta_gsc = gsf.GeneSetCollection(gem=agem)
for target in ["treatment", "genotype"]:
boruta_treatment_ds = gsf.operations.BorutaProspector(
agem,
estimator=rf_mdl_01,
annotation_variables=target,
perc=100,
max_iter=1000)
boruta_gsc[f"Boruta_{target}"] = gsf.GeneSet(boruta_treatment_ds, name=f"Boruta_{target}")
for key, geneset in boruta_gsc.gene_sets.items():
print(f"""{key}
n features selected: {geneset.data.support.sum().values}
n potential features: {geneset.data.support_weak.sum().values}
""")
boruta_gsc
sel_counts, labels = gsf.get_gem_data(
boruta_gsc,
annotation_variables=['treatment', 'genotype', 'time'],
selected_gene_sets=['treatment', 'genotype'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');
sel_counts, labels = gsf.get_gem_data(
boruta_gsc,
annotation_variables=['treatment', 'genotype', 'time'],
selected_gene_sets=['treatment'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');
sel_counts, labels = gsf.get_gem_data(
boruta_gsc,
annotation_variables=['treatment', 'genotype', 'time'],
selected_gene_sets=['genotype'],
)
mapper = umap.UMAP(densmap=True, random_state=50, metric='manhattan').fit(sel_counts.values)
fig, axes = plt.subplots(1, 3, figsize=(21, 7))
umap.plot.points(mapper, labels=labels['treatment'], background='black', ax=axes[0], color_key_cmap='Set1');
umap.plot.points(mapper, labels=labels['genotype'], background='black', ax=axes[1], color_key_cmap='Set2');
umap.plot.points(mapper, labels=labels['time'], background='black', ax=axes[2], color_key_cmap='plasma');
Further Analysis¶
While the above (appears) to work well for selecting features that inform our labels, we do not yet have any information as to which feature informs which annotation, or a measure of that effect.
For these wrappers we will need to create one-hot embeddings of our labels.
sel_counts, labels = gsf.get_gem_data(
boruta_gsc,
annotation_variables=['treatment', 'genotype', 'time'],
selected_gene_sets=['genotype', 'treatment'],
)
treatment_labels = labels.treatment.to_series().unique()
treatment_enc = preprocessing.OneHotEncoder().fit(labels.treatment.values[:, np.newaxis])
treatment_onehot = treatment_enc.transform(labels.treatment.values[:, np.newaxis]).toarray()
x_train, x_test, y_train, y_test = model_selection.train_test_split(
sel_counts.values, treatment_onehot)
%%time
multi_out_cls = MultiOutputClassifier(rf_mdl_01).fit(x_train, y_train)
multi_out_cls_score = multi_out_cls.predict_proba(x_test)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i, label in enumerate(treatment_labels):
fpr[label], tpr[label], _ = roc_curve(y_test[:, i], multi_out_cls_score[i][:, 0])
roc_auc[label] = auc(fpr[label], tpr[label])
roc_curves = {class_: hv.Curve((tpr[class_], fpr[class_]))
for class_ in treatment_labels}
hv.NdOverlay(roc_curves).opts(padding=0.05, legend_position="right", width=650, height=450)
rf_mdl_01
multi_out_cls.estimators_[0]
# treatment_nFDR = gsf.operations.nFDR(
# boruta_gsc,
# selected_gene_sets=['genotype', 'treatment'],
# gene_set_mode="union",
# annotation_variables=["treatment"],
# model=multi_out_cls.estimators_[0],
# n_iterations=5
# )
# treatment_feature_importance = gsf.operations.RankGenesByModel(
# boruta_gsc,
# selected_gene_sets=['genotype', 'treatment'],
# gene_set_mode="union",
# annotation_variables=["treatment"],
# model=multi_out_cls.estimators_[0],
# n_iterations=5
# )
# MultiOutputClassifier()
# ClassifierChain()
# OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier