Selecting Genes with Boruta¶
This notebook descbites howo to select genes (features) of importance using Random Forests through the Boruta algorithm.
Background information:
- Random Forests
- Python Boruta algorithm
Setting up the notebook
import numpy as np
import pandas as pd
import xarray as xr
import holoviews as hv
hv.extension("bokeh")
%matplotlib inline
import GSForge as gsf
Declaring used paths
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
BORUTA_COLL_PATH = OSF_PATH.joinpath("osfstorage", "boruta_gene_sets")
assert AGEM_PATH.exists()
Loading a demo AnnotatedGEM
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem
agem.data
Finding Genes¶
Selecting relevent genes of importance (something that requires further experimentation to confirm) is not simple.
But we can make it appear so, by demonstrating the boruta feature selection method to construct a GSForge.GeneSet
.
First we will need a model that, when trained, can rank or infer feature importance:
from sklearn.ensemble import RandomForestClassifier
selecting_model = RandomForestClassifier(
class_weight='balanced',
max_depth=4,
n_estimators=1000,
n_jobs=-1)
Now we can run the boruta feature selection method.
%%time
boruta_result = gsf.operations.boruta_prospector(
agem,
estimator=selecting_model,
annotation_variables="Treatment",
max_iter=2,
perc=95)
This produces an xarray.Dataset
object.
boruta_result
Creating a GeneSet from the result¶
Consider any metadata that should be stored, and add that to the xarray.Dataset
result using assign_attrs()
attrs = {"selection_model": str(selecting_model)}
boruta_result = boruta_result.assign_attrs(attrs)
You will need to convert nested dictionaries into json strings.
After the object is created you can easily see the number and percent of genes selected:
boruta_treatment_gs = gsf.GeneSet(boruta_result, name="Boruta Treatment")
boruta_treatment_gs
Using a GeneSetCollection
¶
For running more than one boruta model in the notebook, we recommend creating them in a loop, and adding them to a collection, as demonstrated below.
Although if you want to run many different models, you should strongl consider using the nextflow workflows.
from tqdm.notebook import tqdm
boruta_gsc = gsf.GeneSetCollection(gem=agem)
for target in tqdm(["Treatment", "Genotype", "Subspecies]):
boruta_treatment_ds = gsf.operations.boruta_prospector(
agem,
estimator=selecting_model,
annotation_variables=target,
max_iter=50)
boruta_gsc.gene_sets[target] = gsf.GeneSet(boruta_treatment_ds, name=f"Boruta_{target}")
# from tqdm.notebook import tqdm
# boruta_gsc = gsf.GeneSetCollection(gem=agem)
# for target in tqdm(["Treatment", "Genotype", "Subspecies"]):
# boruta_treatment_ds = gsf.operations.boruta_prospector(
# agem,
# estimator=selecting_model,
# annotation_variables=[target],
# max_iter=50)
# boruta_gsc.gene_sets[target] = gsf.GeneSet(boruta_treatment_ds, name=f"Boruta_{target}")
You can then view the collection and its gene sets:
# boruta_gsc
Then you can save the results to a directory. Note that this does not save a copy of your GEM.
# boruta_gsc.save(BORUTA_COLL_PATH)