DESeq2 GeneSets

This notebook covers how to run and load a basic DESeq2 DEG result as a GSForge.GeneSet.


Notebook Preparation

Declare used paths

In [1]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
In [2]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
DEG_COLL_PATH = OSF_PATH.joinpath("osfstorage", "DEG_gene_sets")
assert AGEM_PATH.exists()

Import Python packages

In [3]:
import GSForge as gsf

R integration setup

In [4]:
import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

Import R Packages

In [5]:
%%R
library("DESeq2")

Loading an AnnotatedGEM

In [6]:
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem
Out[6]:
<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Prepare input data for DESeq2

This requires us to drop genes that have counts of zero.

In [7]:
dropped_counts, labels = gsf.get_data(agem, 
                                      count_mask="dropped",
                                      annotation_variables=["Treatment"])

These counts were made with Kallisto, so we must round them for use in DEseq2.

In [8]:
dropped_counts
Out[8]:
<xarray.DataArray 'counts' (Sample: 475, Gene: 10593)>
array([[200., 102., 118., ...,  24., 342., 637.],
       [ 41.,  55.,  29., ...,  10., 152., 186.],
       [197., 104.,  78., ...,  21., 336., 545.],
       ...,
       [158., 113., 225., ...,   9., 424., 411.],
       [151.,  98., 151., ...,  15., 274., 311.],
       [128., 110., 105., ...,  19., 335., 666.]])
Coordinates:
  * Gene      (Gene) object 'LOC_Os01g55490' ... 'LOC_Os03g20020'
  * Sample    (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
    quantile  float64 0.75

Round counts to intergers

In [9]:
ri_dropped_counts = gsf.utils.R_interface.Py_counts_to_R(dropped_counts)
ri_dropped_counts = ri_dropped_counts.round()

ri_labels = labels.to_dataframe()

DESeq2 Runs

In [10]:
%%R -i ri_dropped_counts -i ri_labels -o deseq_df

dds <- DESeqDataSetFromMatrix(countData = ri_dropped_counts,
                              colData = ri_labels,
                              design= ~ treatment)
dds <- DESeq(dds)
deseq_results <- results(dds)
deseq_df = data.frame(deseq_results)
Error in DESeqDataSet(se, design = design, ignoreRank) : 
  all variables in design formula must be columns in colData
Calls: <Anonymous> ... withVisible -> DESeqDataSetFromMatrix -> DESeqDataSet
In [11]:
deseq2_treatment = gsf.GeneSet(deseq_df, 
                               name="deseq2_treatment", 
                               attrs={"DESeq2_formula": "~ Treatment"})
deseq2_treatment
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-09626099362f> in <module>
----> 1 deseq2_treatment = gsf.GeneSet(deseq_df, 
      2                                name="deseq2_treatment",
      3                                attrs={"DESeq2_formula": "~ Treatment"})
      4 deseq2_treatment

NameError: name 'deseq_df' is not defined

Define Helper Functions

Some functions to help assign support to this GeneSet.

In [13]:
def pvalue_filter(deseq_result_df, cutoff=0.05):
    """Returns a array of genes which have p-values above the specified cutoff."""
    return deseq_result_df[deseq_result_df["padj"] < cutoff].index

def top_n_abs(dataframe, n=10, col="log2FoldChange", padj_cuttoff=0.05):
    """Returns the top n most (absolutely) differentially expressed genes from a deseq2 result.
    This also filters by p-values."""
    filtered_df = dataframe[dataframe["padj"] < padj_cuttof]
    filtered_df = filtered_df.reindex(filtered_df["log2FoldChange"].abs().sort_values().index)
    return filtered_df.tail(n).index
In [14]:
cutoff = 0.05
gene_count = len(pvalue_filter(deseq_df, cutoff=cutoff))

print(f"{gene_count} genes below P-value threshold of: {cutoff}")
117 genes below P-value threshold of: 0.05
In [15]:
deseq2_treatment.set_gene_support(pvalue_filter(deseq_df, cutoff=cutoff))
deseq2_treatment
Out[15]:
<GSForge.GeneSet>
Name: deseq2_treatment
    Supported Genes:  117, 0.75% of 15591
In [17]:
deseq2_treatment.save_as_netcdf(DEG_COLL_PATH)
Out[17]:
'/home/tyler/GSForge_demo_data/osfstorage/DEG_gene_sets/deseq2_treatment.nc'


Right click to download this notebook from GitHub.