Selecting Genes with Boruta

This notebook descbites howo to select genes (features) of importance using Random Forests through the Boruta algorithm.

Background information:


Setting up the notebook

In [1]:
import numpy as np
import pandas as pd
import xarray as xr

import holoviews as hv
hv.extension("bokeh")
%matplotlib inline

import GSForge as gsf

Declaring used paths

In [2]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
In [3]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
BORUTA_COLL_PATH = OSF_PATH.joinpath("osfstorage", "boruta_gene_sets")
assert AGEM_PATH.exists()

Loading a demo AnnotatedGEM

In [4]:
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem
Out[4]:
<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475
In [5]:
agem.data
Out[5]:
<xarray.Dataset>
Dimensions:     (Gene: 55986, Sample: 475)
Coordinates:
  * Gene        (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
  * Sample      (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
    quantile    float64 ...
Data variables:
    SampleSRR   (Sample) object ...
    Treatment   (Sample) object ...
    Time        (Sample) int64 ...
    Tissue      (Sample) object ...
    Genotype    (Sample) object ...
    Subspecies  (Sample) object ...
    counts      (Sample, Gene) int64 ...
    lengths     (Gene) float64 ...
    uq_counts   (Sample, Gene) float64 ...
Attributes:
    __GSForge.AnnotatedGEM.params:  {"count_array_name": "counts", "gene_inde...

Finding Genes

Selecting relevent genes of importance (something that requires further experimentation to confirm) is not simple. But we can make it appear so, by demonstrating the boruta feature selection method to construct a GSForge.GeneSet.

First we will need a model that, when trained, can rank or infer feature importance:

In [6]:
from sklearn.ensemble import RandomForestClassifier
In [7]:
selecting_model = RandomForestClassifier(
    class_weight='balanced',
    max_depth=4, 
    n_estimators=1000, 
    n_jobs=-1)

Now we can run the boruta feature selection method.

In [8]:
%%time
boruta_result = gsf.operations.boruta_prospector(
    agem, 
    estimator=selecting_model, 
    annotation_variables="Treatment",
    max_iter=2,
    perc=95)
CPU times: user 27.1 s, sys: 3.58 s, total: 30.7 s
Wall time: 12.1 s

This produces an xarray.Dataset object.

In [9]:
boruta_result
Out[9]:
<xarray.Dataset>
Dimensions:       (Gene: 55986)
Coordinates:
  * Gene          (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
Data variables:
    support       (Gene) bool False False False False ... False False False
    support_weak  (Gene) bool False False False False ... True False False False
    ranking       (Gene) int64 3 3 3 3 3 3 3 3 3 2 3 3 ... 2 2 3 3 2 3 2 2 3 3 3
Attributes:
    boruta_model:                   {"max_iter": 2, "n_estimators": 1000, "tw...
    ranking_model:                  {"bootstrap": true, "class_weight": "bala...
    selected_count_variable:        counts
    selected_annotation_variables:  ['Treatment']

Creating a GeneSet from the result

Consider any metadata that should be stored, and add that to the xarray.Dataset result using assign_attrs()

attrs = {"selection_model": str(selecting_model)}
boruta_result = boruta_result.assign_attrs(attrs)

You will need to convert nested dictionaries into json strings.

After the object is created you can easily see the number and percent of genes selected:

In [11]:
boruta_treatment_gs = gsf.GeneSet(boruta_result, name="Boruta Treatment")
boruta_treatment_gs
Out[11]:
<GSForge.GeneSet>
Name: Boruta Treatment
    Supported Genes:  1079, 1.93% of 55986

Using a GeneSetCollection

For running more than one boruta model in the notebook, we recommend creating them in a loop, and adding them to a collection, as demonstrated below.

Although if you want to run many different models, you should strongl consider using the nextflow workflows.

from tqdm.notebook import tqdm

boruta_gsc = gsf.GeneSetCollection(gem=agem)

for target in tqdm(["Treatment", "Genotype", "Subspecies]):
    boruta_treatment_ds = gsf.operations.boruta_prospector(
        agem,
        estimator=selecting_model,
        annotation_variables=target,
        max_iter=50)

    boruta_gsc.gene_sets[target] = gsf.GeneSet(boruta_treatment_ds, name=f"Boruta_{target}")
In [10]:
# from tqdm.notebook import tqdm

# boruta_gsc = gsf.GeneSetCollection(gem=agem)

# for target in tqdm(["Treatment", "Genotype", "Subspecies"]):
#     boruta_treatment_ds = gsf.operations.boruta_prospector(
#         agem,
#         estimator=selecting_model,
#         annotation_variables=[target],
#         max_iter=50)
    
#     boruta_gsc.gene_sets[target] = gsf.GeneSet(boruta_treatment_ds, name=f"Boruta_{target}")

You can then view the collection and its gene sets:

In [17]:
# boruta_gsc

Then you can save the results to a directory. Note that this does not save a copy of your GEM.

In [14]:
# boruta_gsc.save(BORUTA_COLL_PATH)
/home/tyler/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Treatment.nc
/home/tyler/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Genotype.nc
/home/tyler/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Subspecies.nc


Right click to download this notebook from GitHub.