Using Workflows¶

Why use a workflow?

Some feature selection methods -- like boruta -- do not produce stable output. Meaning the results for the same parameters can differ to some degree. We could fix the random_state to force the same results -- but more of interest is how well a chosen set of parameters performs. We could also increase the number of iterations that boruta is allowed to run, but this becomes memory intensive. A more facile solution is to repeat the same parameters with as many iterations as we can get away with.

We then want to explore parameters, with repeats, and do so in a memory intensive way. Enter nextflow, a program that will streamline this process.

An Example with `boruta_multiclass`¶

The workflows are named based on the organization of the y, or target variable. These workflows essentially manage calls to boruta_prospector.

The following must be provided to the workflow:

A saved AnnotatedGEM or otherwise compatible netcdf file.
A ranking model must be selected.
A target variable must be provided.
Any required to boruta and the ranking model.

An Example Configuration File¶

Consider this example nextflow.config file:

// Singular data input and selection.
params.gem_netcdf = "~/GSForge_demo_data/rice.nc"
params.x_label = "counts"
params.y_label = ["Treatment", "Genotype", "Subspecies"]

// Ranking model options.
params.ranking_model = "RandomForestClassifier"
params.ranking_model_opts.max_depth = [3, 4, 5, 6, 7]
params.ranking_model_opts.n_jobs = [-1]

// BorutaPy options.
params.boruta_opts.perc = [95, 100]
params.boruta_opts.max_iter = [200]

// How often to repeat each set of arguments.
params.repeats = 2

// Output directory.
params.out_dir = "~/GSForge_demo_data/boruta_workflow_gene_sets"

Running the Workflow¶

Save this file as nextflow.config in some directory which you would like nextflow to operate in. Navigate to that directory, then the workflow can be run via:

NEXTFLOW_SCRIPT="<path to installation>/GSForge/workflows/boruta_multiclass/main.nf"
nextflow -C nextflow.config run $NEXTFLOW_SCRIPT -profile standard,docker

And the resulting lineament files should be stored in the out_dir.

In [1]:

import os
import GSForge as gsf
from pathlib import Path
import holoviews as hv
hv.extension("bokeh")

In [2]:

import re
import collections

Declare used paths

In [3]:

# OS-independent path management.
from os import fspath, environ
from pathlib import Path

In [4]:

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
NFWF_PATH = OSF_PATH.joinpath("osfstorage", "boruta_workflow_gene_sets")

assert AGEM_PATH.exists()
assert NFWF_PATH.exists()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-46cce711c50a> in <module>
      4 
      5 assert AGEM_PATH.exists()
----> 6 assert NFWF_PATH.exists()

AssertionError:

Load an AnnotatedGEM

In [4]:

agem = gsf.AnnotatedGEM(AGEM_PATH)
agem

Out[4]:

<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Examine the workflow output directory¶

In [8]:

result_paths = list(NFWF_PATH.expanduser().resolve().glob("*.nc"))
result_paths[:5]

Out[8]:

[PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/019524_b48286.nc'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/b8df5c_fcadb6.nc'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/921936_8c7f65.nc'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/921936_7717b7.nc'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/4f9452_26da90.nc')]

This workflow names the files:

<argument_hash>_<nextflow_uuid>.nc

In [13]:

result_paths[0].name

Out[13]:

'019524_b48286.nc'

Load and Examine a Single Result¶

In [10]:

demo_result = gsf.GeneSet(result_paths[0])
demo_result

Out[10]:

<GSForge.GeneSet>
Name: GeneSet01321
    Supported Genes:  239, 0.43% of 55986

In [11]:

demo_result.data

Out[11]:

<xarray.Dataset>
Dimensions:       (Gene: 55986)
Coordinates:
  * Gene          (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
Data variables:
    support       (Gene) bool False False False False ... False False False
    support_weak  (Gene) bool ...
    ranking       (Gene) int64 ...
Attributes:
    boruta_model:                   {"two_step": true, "random_state": null, ...
    ranking_model:                  {"bootstrap": true, "ccp_alpha": 0.0, "cl...
    selected_count_variable:        counts
    selected_annotation_variables:  Subspecies
    arg_hash:                       019524

Combine by Argument Hash¶

In [32]:

replicates = set()

for path in result_paths:
    argument, uuid = path.name.split("_")
    replicates = set([argument]).union(replicates)
    
# replicates

In [30]:

workflow_collection = gsf.GeneSetCollection.from_folder(agem, NFWF_PATH)
workflow_collection

Out[30]:

<GSForge.GeneSetCollection>
GeneSetCollection01562
GeneSets (60 total): Support Count
    36b179_451cf8: 2414
    3a6380_737034: 2393
    3a6380_38175b: 2383
    36b179_f8be07: 2366
    644cfd_3c8807: 2321
    644cfd_4554ce: 2310
    21a38c_bbb811: 2146
    21a38c_c68f41: 2138
    f962d6_3da9fb: 1532
    f962d6_d77341: 1532

In [42]:

combined = collections.defaultdict(list)

for replicate in replicates:
    pattern = re.compile(replicate)
    for result in workflow_collection.gene_sets.keys():
        
        if pattern.match(result):
            combined[replicate].append(result)

In [54]:

label_colls = dict()

for label in ["Treatment", "Genotype", "Subspecies"]:
    combined = collections.defaultdict(list)

    for replicate in replicates:
        pattern = re.compile(replicate)
        for result, geneset in workflow_collection.gene_sets.items():
            if geneset.data.attrs["selected_annotation_variables"] == label:
                if pattern.match(result):
                    combined[replicate].append(result)
    label_colls[label] = combined

In [55]:

label_colls

Out[55]:

{'Treatment': defaultdict(list,
             {'e90ecb': ['e90ecb'],
              '3a6380': ['3a6380'],
              '36b179': ['36b179'],
              'a45d98': ['a45d98'],
              '26b197': ['26b197'],
              '644cfd': ['644cfd'],
              '21a38c': ['21a38c'],
              'eb6a41': ['eb6a41'],
              'f79407': ['f79407'],
              'f962d6': ['f962d6']}),
 'Genotype': defaultdict(list,
             {'8fa5fc': ['8fa5fc'],
              'b8df5c': ['b8df5c'],
              '9fda5b': ['9fda5b'],
              '488d45': ['488d45'],
              'd38fb9': ['d38fb9'],
              '75bfe3': ['75bfe3'],
              '4d726f': ['4d726f'],
              'a6130e': ['a6130e'],
              '891c83': ['891c83'],
              'dc6a48': ['dc6a48']}),
 'Subspecies': defaultdict(list,
             {'921936': ['921936'],
              '1ec403': ['1ec403'],
              'c3e5ee': ['c3e5ee'],
              '749d0e': ['749d0e'],
              'd75652': ['d75652'],
              '4f9452': ['4f9452'],
              '019524': ['019524'],
              '2e4ade': ['2e4ade'],
              'cfa346': ['cfa346'],
              '0a0991': ['0a0991']})}

In [57]:

collections = []

for label in ["Treatment", "Genotype", "Subspecies"]:
    coll = label_colls[label]
    
    combined_genesets = {}

    for argument_hash, keys in coll.items():
        gene_sets = [workflow_collection.gene_sets[key] for key in keys]
        combined_genesets[argument_hash] = gsf.GeneSet.from_GeneSets(gene_sets, agem.gene_index, name=argument_hash, attrs=gene_sets[0].data.attrs)

    collections.append(gsf.GeneSetCollection(gem=agem, gene_sets=combined_genesets, name=label))
    
treatment_coll, genotype_coll, subspecies_coll = collections

In [58]:

treatment_coll

Out[58]:

<GSForge.GeneSetCollection>
Treatment
GeneSets (10 total): Support Count
    3a6380: 2577
    36b179: 2576
    644cfd: 2504
    21a38c: 2289
    f962d6: 1636
    f79407: 1064
    e90ecb: 1039
    26b197: 965
    eb6a41: 876
    a45d98: 766

In [61]:

treatment_coll.gene_sets["3a6380"].data.attrs

Out[61]:

OrderedDict([('boruta_model',
              '{"estimator": "<not serializable>", "two_step": true, "verbose": 0, "max_iter": 200, "perc": 95, "alpha": 0.05, "n_estimators": 1000, "random_state": null}'),
             ('ranking_model',
              '{"bootstrap": true, "ccp_alpha": 0.0, "class_weight": null, "criterion": "gini", "max_depth": 6, "max_features": "auto", "max_leaf_nodes": null, "max_samples": null, "min_impurity_decrease": 0.0, "min_impurity_split": null, "min_samples_leaf": 1, "min_samples_split": 2, "min_weight_fraction_leaf": 0.0, "n_estimators": 1000, "n_jobs": -1, "oob_score": false, "random_state": "<not serializable>", "verbose": 0, "warm_start": false}'),
             ('selected_count_variable', 'counts'),
             ('selected_annotation_variables', 'Treatment'),
             ('arg_hash', '3a6380')])

In [ ]:

In [59]:

genotype_coll

Out[59]:

<GSForge.GeneSetCollection>
Genotype
GeneSets (10 total): Support Count
    75bfe3: 1139
    488d45: 1125
    a6130e: 1121
    891c83: 1077
    dc6a48: 910
    9fda5b: 759
    4d726f: 747
    d38fb9: 742
    8fa5fc: 729
    b8df5c: 687

In [60]:

subspecies_coll

Out[60]:

<GSForge.GeneSetCollection>
Subspecies
GeneSets (10 total): Support Count
    921936: 387
    d75652: 384
    2e4ade: 382
    1ec403: 374
    749d0e: 363
    4f9452: 279
    0a0991: 276
    c3e5ee: 273
    019524: 273
    cfa346: 273