DESeq2 GeneSets¶
This notebook covers how to run and load a basic DESeq2
DEG result as a GSForge.GeneSet
.
Notebook Preparation¶
Declare used paths
In [1]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
In [2]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
DEG_COLL_PATH = OSF_PATH.joinpath("osfstorage", "DEG_gene_sets")
assert AGEM_PATH.exists()
Import Python packages
In [3]:
import GSForge as gsf
R integration setup
In [4]:
import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)
Import R Packages
In [5]:
%%R
library("DESeq2")
Loading an AnnotatedGEM
In [6]:
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem
Out[6]:
Prepare input data for DESeq2¶
This requires us to drop genes that have counts of zero.
In [7]:
dropped_counts, labels = gsf.get_data(agem,
count_mask="dropped",
annotation_variables=["Treatment"])
These counts were made with Kallisto, so we must round them for use in DEseq2
.
In [8]:
dropped_counts
Out[8]:
Round counts to intergers
In [9]:
ri_dropped_counts = gsf.utils.R_interface.Py_counts_to_R(dropped_counts)
ri_dropped_counts = ri_dropped_counts.round()
ri_labels = labels.to_dataframe()
DESeq2
Runs¶
In [10]:
%%R -i ri_dropped_counts -i ri_labels -o deseq_df
dds <- DESeqDataSetFromMatrix(countData = ri_dropped_counts,
colData = ri_labels,
design= ~ treatment)
dds <- DESeq(dds)
deseq_results <- results(dds)
deseq_df = data.frame(deseq_results)
In [11]:
deseq2_treatment = gsf.GeneSet(deseq_df,
name="deseq2_treatment",
attrs={"DESeq2_formula": "~ Treatment"})
deseq2_treatment
Define Helper Functions¶
Some functions to help assign support to this GeneSet
.
In [13]:
def pvalue_filter(deseq_result_df, cutoff=0.05):
"""Returns a array of genes which have p-values above the specified cutoff."""
return deseq_result_df[deseq_result_df["padj"] < cutoff].index
def top_n_abs(dataframe, n=10, col="log2FoldChange", padj_cuttoff=0.05):
"""Returns the top n most (absolutely) differentially expressed genes from a deseq2 result.
This also filters by p-values."""
filtered_df = dataframe[dataframe["padj"] < padj_cuttof]
filtered_df = filtered_df.reindex(filtered_df["log2FoldChange"].abs().sort_values().index)
return filtered_df.tail(n).index
In [14]:
cutoff = 0.05
gene_count = len(pvalue_filter(deseq_df, cutoff=cutoff))
print(f"{gene_count} genes below P-value threshold of: {cutoff}")
In [15]:
deseq2_treatment.set_gene_support(pvalue_filter(deseq_df, cutoff=cutoff))
deseq2_treatment
Out[15]:
In [17]:
deseq2_treatment.save_as_netcdf(DEG_COLL_PATH)
Out[17]: