Using Workflows
Why use a workflow?
Some feature selection methods, such as boruta, do not produce stable output: the results for the same parameters can differ to some degree between runs.
We could fix the random_state to force the same results, but what is of more interest is how well a chosen set of parameters performs.
We could also increase the number of iterations that boruta is allowed to run, but this becomes memory intensive.
A simpler solution is to repeat the same parameters with as many iterations as we can get away with.
We then want to explore parameters, with repeats, and do so in a memory-efficient way.
Enter nextflow, a program that will streamline this process.
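The idea of repeating one parameter set and comparing the outcomes can be sketched in plain Python. This is only an illustration: `toy_stochastic_selector` is a hypothetical stand-in for an unstable selector (it is not boruta), and the gene names are invented.

```python
import random
from collections import Counter


def toy_stochastic_selector(features, seed, keep_probability=0.7):
    """Hypothetical unstable selector: keeps each feature at random (not boruta)."""
    rng = random.Random(seed)
    return {feature for feature in features if rng.random() < keep_probability}


# Invented feature names, used only for this illustration.
features = [f"gene_{i}" for i in range(10)]

# Repeat the "same parameters" five times; only the random seed differs.
runs = [toy_stochastic_selector(features, seed) for seed in range(5)]

# Count in how many of the repeated runs each feature was selected.
selection_counts = Counter(feature for run in runs for feature in run)

# Features selected in every run form the most conservative consensus set.
consensus = {f for f, count in selection_counts.items() if count == len(runs)}
```

Features with a high selection count are the ones that survive the run-to-run variation, which is the information a single run with a fixed random_state would hide.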
An Example with boruta_multiclass
The workflows are named after the organization of y, the target variable.
These workflows essentially manage calls to boruta_prospector.
The following must be provided to the workflow:
- A saved AnnotatedGEM or otherwise compatible netcdf file.
- A ranking model.
- A target variable.
- Any required arguments to boruta and the ranking model.
An Example Configuration File
Consider this example nextflow.config file:
// Singular data input and selection.
params.gem_netcdf = "~/GSForge_demo_data/rice.nc"
params.x_label = "counts"
params.y_label = ["Treatment", "Genotype", "Subspecies"]
// Ranking model options.
params.ranking_model = "RandomForestClassifier"
params.ranking_model_opts.max_depth = [3, 4, 5, 6, 7]
params.ranking_model_opts.n_jobs = [-1]
// BorutaPy options.
params.boruta_opts.perc = [95, 100]
params.boruta_opts.max_iter = [200]
// How often to repeat each set of arguments.
params.repeats = 2
// Output directory.
params.out_dir = "~/GSForge_demo_data/boruta_workflow_gene_sets"
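To get a feel for how many runs such a configuration could launch, the sketch below expands the listed values into a grid. It assumes the workflow runs every combination of the listed values once per repeat; that behaviour is an assumption here, not something this page confirms. The dictionary keys are illustrative names, not workflow parameters.

```python
from itertools import product

# Values copied from the example configuration above.
# Assumption: the workflow runs every combination of these values.
grid = {
    "y_label": ["Treatment", "Genotype", "Subspecies"],
    "max_depth": [3, 4, 5, 6, 7],
    "n_jobs": [-1],
    "perc": [95, 100],
    "max_iter": [200],
}
repeats = 2

# One dictionary per unique set of arguments.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each set of arguments is run `repeats` times.
total_runs = len(combinations) * repeats
print(len(combinations), total_runs)
```

Under that assumption the example configuration gives 30 argument sets and 60 runs, so adding values to any one list multiplies the amount of work.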
Running the Workflow
Save this file as nextflow.config in the directory in which you would like nextflow to operate.
Navigate to that directory; the workflow can then be run via:
NEXTFLOW_SCRIPT="<path to installation>/GSForge/workflows/boruta_multiclass/main.nf"
nextflow -C nextflow.config run $NEXTFLOW_SCRIPT -profile standard,docker
The resulting gene set (.nc) files should then be stored in out_dir.
import os
import GSForge as gsf
from pathlib import Path
import holoviews as hv
hv.extension("bokeh")
import re
import collections
Declare used paths
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
NFWF_PATH = OSF_PATH.joinpath("osfstorage", "boruta_workflow_gene_sets")
assert AGEM_PATH.exists()
assert NFWF_PATH.exists()
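The path lookup above falls back to a default when the GSFORGE_DEMO_DATA environment variable is unset. A small sketch of that behaviour follows; `demo_data_path` is a hypothetical helper written for this illustration, and `/tmp/gsforge_data` is an invented override value.

```python
import os
from pathlib import Path


def demo_data_path(env=os.environ):
    """Hypothetical helper: resolve the demo data directory from a mapping."""
    # Use the environment value if present, otherwise the default location.
    return Path(env.get("GSFORGE_DEMO_DATA", "~/GSForge_demo_data")).expanduser()


# With no variable set, the default under the home directory is used.
default_path = demo_data_path(env={})

# With the variable set, its value takes precedence (invented path).
override_path = demo_data_path(env={"GSFORGE_DEMO_DATA": "/tmp/gsforge_data"})
```

Passing the mapping as an argument keeps the example independent of the variables set in the current shell.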
Load an AnnotatedGEM
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem
Examine the workflow output directory
result_paths = list(NFWF_PATH.expanduser().resolve().glob("*.nc"))
result_paths[:5]
This workflow names the files:
<argument_hash>_<nextflow_uuid>.nc
result_paths[0].name
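The naming scheme can be taken apart with a few lines of Python. This is a minimal sketch: `parse_result_name` is a hypothetical helper, and the example file name is invented to follow the `<argument_hash>_<nextflow_uuid>.nc` pattern.

```python
from pathlib import Path


def parse_result_name(path):
    """Hypothetical helper: split '<argument_hash>_<nextflow_uuid>.nc' in two."""
    # Path.stem drops the '.nc' suffix; split only on the first underscore.
    argument_hash, run_uuid = Path(path).stem.split("_", 1)
    return argument_hash, run_uuid


# Invented file name, for illustration only.
example_name = "3a6380_0f8c2d1e.nc"
argument_hash, run_uuid = parse_result_name(example_name)
```

Files that share an argument hash were produced with the same arguments, so the hash is what identifies replicates.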
Load and Examine a Single Result
demo_result = gsf.GeneSet(result_paths[0])
demo_result
demo_result.data
Combine by Argument Hash
# Collect the unique argument hashes; files sharing a hash are replicates.
replicates = set()
for path in result_paths:
    argument, uuid = path.name.split("_")
    replicates.add(argument)
# replicates
workflow_collection = gsf.GeneSetCollection.from_folder(agem, NFWF_PATH)
workflow_collection
# Group the gene set keys by the argument hash they begin with.
combined = collections.defaultdict(list)
for replicate in replicates:
    pattern = re.compile(replicate)
    for result in workflow_collection.gene_sets.keys():
        if pattern.match(result):
            combined[replicate].append(result)
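The regular-expression matching above groups keys by the argument hash they begin with. The same grouping can be sketched with plain string splitting; `group_by_argument_hash` is a hypothetical helper and the key names are invented, assuming only that each key starts with `<argument_hash>_`.

```python
from collections import defaultdict


def group_by_argument_hash(names):
    """Hypothetical helper: group names by the text before the first underscore."""
    grouped = defaultdict(list)
    for name in names:
        argument_hash = name.split("_", 1)[0]
        grouped[argument_hash].append(name)
    return dict(grouped)


# Invented keys, for illustration only.
names = ["3a6380_run-a", "3a6380_run-b", "9f1c2e_run-a"]
grouped = group_by_argument_hash(names)
```

Splitting on the underscore compares whole hashes, so a hash that happens to be a prefix of another hash cannot be matched by mistake, which a prefix match with a regular expression would allow.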
# Repeat the grouping separately for each target variable.
label_colls = dict()
for label in ["Treatment", "Genotype", "Subspecies"]:
    combined = collections.defaultdict(list)
    for replicate in replicates:
        pattern = re.compile(replicate)
        for result, geneset in workflow_collection.gene_sets.items():
            if geneset.data.attrs["selected_annotation_variables"] == label:
                if pattern.match(result):
                    combined[replicate].append(result)
    label_colls[label] = combined
label_colls
# Named label_collections so that the imported collections module is not shadowed.
label_collections = []
for label in ["Treatment", "Genotype", "Subspecies"]:
    coll = label_colls[label]
    combined_genesets = {}
    for argument_hash, keys in coll.items():
        gene_sets = [workflow_collection.gene_sets[key] for key in keys]
        combined_genesets[argument_hash] = gsf.GeneSet.from_GeneSets(
            gene_sets,
            agem.gene_index,
            name=argument_hash,
            attrs=gene_sets[0].data.attrs,
        )
    label_collections.append(gsf.GeneSetCollection(gem=agem, gene_sets=combined_genesets, name=label))
treatment_coll, genotype_coll, subspecies_coll = label_collections
treatment_coll
treatment_coll.gene_sets["3a6380"].data.attrs
genotype_coll
subspecies_coll
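One way to judge how stable a parameter set is, which is the question that motivated the repeats, is to measure how much its replicate gene lists overlap. The sketch below is not part of GSForge: it works on plain Python sets, `mean_pairwise_jaccard` is a hypothetical helper, and the gene names are invented.

```python
from itertools import combinations


def mean_pairwise_jaccard(replicate_supports):
    """Hypothetical helper: average Jaccard similarity over all replicate pairs."""
    pairs = list(combinations(replicate_supports, 2))
    if not pairs:
        # A single replicate trivially agrees with itself.
        return 1.0
    scores = []
    for first, second in pairs:
        union = first | second
        # Two empty selections are treated as identical.
        scores.append(len(first & second) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Invented gene selections from two replicates of one argument hash.
replicate_supports = [{"g1", "g2", "g3"}, {"g2", "g3", "g4"}]
stability = mean_pairwise_jaccard(replicate_supports)
```

A value near 1 means the replicates selected nearly the same genes, while a value near 0 means the selections barely overlap.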