Using Workflows

Why use a workflow?

Some feature selection methods -- like boruta -- do not produce stable output. Meaning the results for the same parameters can differ to some degree. We could fix the random_state to force the same results -- but more of interest is how well a chosen set of parameters performs. We could also increase the number of iterations that boruta is allowed to run, but this becomes memory intensive. A more facile solution is to repeat the same parameters with as many iterations as we can get away with.

We then want to explore parameters, with repeats, and do so in a memory intensive way. Enter nextflow, a program that will streamline this process.

An Example with boruta_multiclass

The workflows are named based on the organization of the y, or target variable. These workflows essentially manage calls to boruta_prospector.

The following must be provided to the workflow:

  • A saved AnnotatedGEM or otherwise compatible netcdf file.
  • A ranking model must be selected.
  • A target variable must be provided.
  • Any required to boruta and the ranking model.

An Example Configuration File

Consider this example nextflow.config file:

// Singular data input and selection.
params.gem_netcdf = "~/GSForge_demo_data/rice.nc"
params.x_label = "counts"
params.y_label = ["Treatment", "Genotype", "Subspecies"]

// Ranking model options.
params.ranking_model = "RandomForestClassifier"
params.ranking_model_opts.max_depth = [3, 4, 5, 6, 7]
params.ranking_model_opts.n_jobs = [-1]

// BorutaPy options.
params.boruta_opts.perc = [95, 100]
params.boruta_opts.max_iter = [200]

// How often to repeat each set of arguments.
params.repeats = 2

// Output directory.
params.out_dir = "~/GSForge_demo_data/boruta_workflow_gene_sets"

Running the Workflow

Save this file as nextflow.config in some directory which you would like nextflow to operate in. Navigate to that directory, then the workflow can be run via:

NEXTFLOW_SCRIPT="<path to installation>/GSForge/workflows/boruta_multiclass/main.nf"
nextflow -C nextflow.config run $NEXTFLOW_SCRIPT -profile standard,docker

And the resulting lineament files should be stored in the out_dir.


In [1]:
import os
import GSForge as gsf
from pathlib import Path
import holoviews as hv
hv.extension("bokeh")
In [2]:
import re
import collections

Declare used paths

In [3]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
In [4]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
NFWF_PATH = OSF_PATH.joinpath("osfstorage", "boruta_workflow_gene_sets")

assert AGEM_PATH.exists()
assert NFWF_PATH.exists()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-46cce711c50a> in <module>
      4 
      5 assert AGEM_PATH.exists()
----> 6 assert NFWF_PATH.exists()

AssertionError: 

Load an AnnotatedGEM

In [4]:
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem
Out[4]:
<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Examine the workflow output directory

In [8]:
result_paths = list(NFWF_PATH.expanduser().resolve().glob("*.nc"))
result_paths[:5]
Out[8]:
[PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/019524_b48286.nc'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/b8df5c_fcadb6.nc'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/921936_8c7f65.nc'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/921936_7717b7.nc'),
 PosixPath('/home/tyler/GSForge_demo_data/osfstorage/boruta_workflow_gene_sets/4f9452_26da90.nc')]

This workflow names the files:

<argument_hash>_<nextflow_uuid>.nc
In [13]:
result_paths[0].name
Out[13]:
'019524_b48286.nc'

Load and Examine a Single Result

In [10]:
demo_result = gsf.GeneSet(result_paths[0])
demo_result
Out[10]:
<GSForge.GeneSet>
Name: GeneSet01321
    Supported Genes:  239, 0.43% of 55986
In [11]:
demo_result.data
Out[11]:
<xarray.Dataset>
Dimensions:       (Gene: 55986)
Coordinates:
  * Gene          (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
Data variables:
    support       (Gene) bool False False False False ... False False False
    support_weak  (Gene) bool ...
    ranking       (Gene) int64 ...
Attributes:
    boruta_model:                   {"two_step": true, "random_state": null, ...
    ranking_model:                  {"bootstrap": true, "ccp_alpha": 0.0, "cl...
    selected_count_variable:        counts
    selected_annotation_variables:  Subspecies
    arg_hash:                       019524

Combine by Argument Hash

In [32]:
replicates = set()

for path in result_paths:
    argument, uuid = path.name.split("_")
    replicates = set([argument]).union(replicates)
    
# replicates
In [30]:
workflow_collection = gsf.GeneSetCollection.from_folder(agem, NFWF_PATH)
workflow_collection
Out[30]:
<GSForge.GeneSetCollection>
GeneSetCollection01562
GeneSets (60 total): Support Count
    36b179_451cf8: 2414
    3a6380_737034: 2393
    3a6380_38175b: 2383
    36b179_f8be07: 2366
    644cfd_3c8807: 2321
    644cfd_4554ce: 2310
    21a38c_bbb811: 2146
    21a38c_c68f41: 2138
    f962d6_3da9fb: 1532
    f962d6_d77341: 1532
In [42]:
combined = collections.defaultdict(list)

for replicate in replicates:
    pattern = re.compile(replicate)
    for result in workflow_collection.gene_sets.keys():
        
        if pattern.match(result):
            combined[replicate].append(result)
In [54]:
label_colls = dict()

for label in ["Treatment", "Genotype", "Subspecies"]:
    combined = collections.defaultdict(list)

    for replicate in replicates:
        pattern = re.compile(replicate)
        for result, geneset in workflow_collection.gene_sets.items():
            if geneset.data.attrs["selected_annotation_variables"] == label:
                if pattern.match(result):
                    combined[replicate].append(result)
    label_colls[label] = combined
In [55]:
label_colls
Out[55]:
{'Treatment': defaultdict(list,
             {'e90ecb': ['e90ecb'],
              '3a6380': ['3a6380'],
              '36b179': ['36b179'],
              'a45d98': ['a45d98'],
              '26b197': ['26b197'],
              '644cfd': ['644cfd'],
              '21a38c': ['21a38c'],
              'eb6a41': ['eb6a41'],
              'f79407': ['f79407'],
              'f962d6': ['f962d6']}),
 'Genotype': defaultdict(list,
             {'8fa5fc': ['8fa5fc'],
              'b8df5c': ['b8df5c'],
              '9fda5b': ['9fda5b'],
              '488d45': ['488d45'],
              'd38fb9': ['d38fb9'],
              '75bfe3': ['75bfe3'],
              '4d726f': ['4d726f'],
              'a6130e': ['a6130e'],
              '891c83': ['891c83'],
              'dc6a48': ['dc6a48']}),
 'Subspecies': defaultdict(list,
             {'921936': ['921936'],
              '1ec403': ['1ec403'],
              'c3e5ee': ['c3e5ee'],
              '749d0e': ['749d0e'],
              'd75652': ['d75652'],
              '4f9452': ['4f9452'],
              '019524': ['019524'],
              '2e4ade': ['2e4ade'],
              'cfa346': ['cfa346'],
              '0a0991': ['0a0991']})}
In [57]:
collections = []

for label in ["Treatment", "Genotype", "Subspecies"]:
    coll = label_colls[label]
    
    combined_genesets = {}

    for argument_hash, keys in coll.items():
        gene_sets = [workflow_collection.gene_sets[key] for key in keys]
        combined_genesets[argument_hash] = gsf.GeneSet.from_GeneSets(gene_sets, agem.gene_index, name=argument_hash, attrs=gene_sets[0].data.attrs)

    collections.append(gsf.GeneSetCollection(gem=agem, gene_sets=combined_genesets, name=label))
    
treatment_coll, genotype_coll, subspecies_coll = collections
In [58]:
treatment_coll
Out[58]:
<GSForge.GeneSetCollection>
Treatment
GeneSets (10 total): Support Count
    3a6380: 2577
    36b179: 2576
    644cfd: 2504
    21a38c: 2289
    f962d6: 1636
    f79407: 1064
    e90ecb: 1039
    26b197: 965
    eb6a41: 876
    a45d98: 766
In [61]:
treatment_coll.gene_sets["3a6380"].data.attrs
Out[61]:
OrderedDict([('boruta_model',
              '{"estimator": "<not serializable>", "two_step": true, "verbose": 0, "max_iter": 200, "perc": 95, "alpha": 0.05, "n_estimators": 1000, "random_state": null}'),
             ('ranking_model',
              '{"bootstrap": true, "ccp_alpha": 0.0, "class_weight": null, "criterion": "gini", "max_depth": 6, "max_features": "auto", "max_leaf_nodes": null, "max_samples": null, "min_impurity_decrease": 0.0, "min_impurity_split": null, "min_samples_leaf": 1, "min_samples_split": 2, "min_weight_fraction_leaf": 0.0, "n_estimators": 1000, "n_jobs": -1, "oob_score": false, "random_state": "<not serializable>", "verbose": 0, "warm_start": false}'),
             ('selected_count_variable', 'counts'),
             ('selected_annotation_variables', 'Treatment'),
             ('arg_hash', '3a6380')])
In [ ]:
 
In [59]:
genotype_coll
Out[59]:
<GSForge.GeneSetCollection>
Genotype
GeneSets (10 total): Support Count
    75bfe3: 1139
    488d45: 1125
    a6130e: 1121
    891c83: 1077
    dc6a48: 910
    9fda5b: 759
    4d726f: 747
    d38fb9: 742
    8fa5fc: 729
    b8df5c: 687
In [60]:
subspecies_coll
Out[60]:
<GSForge.GeneSetCollection>
Subspecies
GeneSets (10 total): Support Count
    921936: 387
    d75652: 384
    2e4ade: 382
    1ec403: 374
    749d0e: 363
    4f9452: 279
    0a0991: 276
    c3e5ee: 273
    019524: 273
    cfa346: 273
In [ ]:
 


Right click to download this notebook from GitHub.