DESeq2 GeneSets¶

This notebook covers how to run and load a basic DESeq2 DEG result as a GSForge.GeneSet.

Notebook Preparation¶

Declare used paths

In [1]:

# OS-independent path management.
from os import fspath, environ
from pathlib import Path

In [2]:

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
DEG_COLL_PATH = OSF_PATH.joinpath("osfstorage", "DEG_gene_sets")
assert AGEM_PATH.exists()

Import Python packages

In [3]:

import GSForge as gsf

R integration setup

In [4]:

import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

Import R Packages

In [5]:

%%R
library("DESeq2")

Loading an AnnotatedGEM

In [6]:

agem = gsf.AnnotatedGEM(AGEM_PATH)
agem

Out[6]:

<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Prepare input data for DESeq2¶

This requires us to drop genes that have counts of zero.

In [7]:

dropped_counts, labels = gsf.get_data(agem, 
                                      count_mask="dropped",
                                      annotation_variables=["Treatment"])

These counts were made with Kallisto, so we must round them for use in DEseq2.

In [8]:

dropped_counts

Out[8]:

<xarray.DataArray 'counts' (Sample: 475, Gene: 10593)>
array([[200., 102., 118., ...,  24., 342., 637.],
       [ 41.,  55.,  29., ...,  10., 152., 186.],
       [197., 104.,  78., ...,  21., 336., 545.],
       ...,
       [158., 113., 225., ...,   9., 424., 411.],
       [151.,  98., 151., ...,  15., 274., 311.],
       [128., 110., 105., ...,  19., 335., 666.]])
Coordinates:
  * Gene      (Gene) object 'LOC_Os01g55490' ... 'LOC_Os03g20020'
  * Sample    (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
    quantile  float64 0.75

Round counts to intergers

In [9]:

ri_dropped_counts = gsf.utils.R_interface.Py_counts_to_R(dropped_counts)
ri_dropped_counts = ri_dropped_counts.round()

ri_labels = labels.to_dataframe()

`DESeq2` Runs¶

In [10]:

%%R -i ri_dropped_counts -i ri_labels -o deseq_df

dds <- DESeqDataSetFromMatrix(countData = ri_dropped_counts,
                              colData = ri_labels,
                              design= ~ treatment)
dds <- DESeq(dds)
deseq_results <- results(dds)
deseq_df = data.frame(deseq_results)

Error in DESeqDataSet(se, design = design, ignoreRank) : 
  all variables in design formula must be columns in colData
Calls: <Anonymous> ... withVisible -> DESeqDataSetFromMatrix -> DESeqDataSet

In [11]:

deseq2_treatment = gsf.GeneSet(deseq_df, 
                               name="deseq2_treatment", 
                               attrs={"DESeq2_formula": "~ Treatment"})
deseq2_treatment

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-09626099362f> in <module>
----> 1 deseq2_treatment = gsf.GeneSet(deseq_df, 
      2                                name="deseq2_treatment",
      3                                attrs={"DESeq2_formula": "~ Treatment"})
      4 deseq2_treatment

NameError: name 'deseq_df' is not defined

Define Helper Functions¶

Some functions to help assign support to this GeneSet.

In [13]:

def pvalue_filter(deseq_result_df, cutoff=0.05):
    """Returns a array of genes which have p-values above the specified cutoff."""
    return deseq_result_df[deseq_result_df["padj"] < cutoff].index

def top_n_abs(dataframe, n=10, col="log2FoldChange", padj_cuttoff=0.05):
    """Returns the top n most (absolutely) differentially expressed genes from a deseq2 result.
    This also filters by p-values."""
    filtered_df = dataframe[dataframe["padj"] < padj_cuttof]
    filtered_df = filtered_df.reindex(filtered_df["log2FoldChange"].abs().sort_values().index)
    return filtered_df.tail(n).index

In [14]:

cutoff = 0.05
gene_count = len(pvalue_filter(deseq_df, cutoff=cutoff))

print(f"{gene_count} genes below P-value threshold of: {cutoff}")

117 genes below P-value threshold of: 0.05

In [15]:

deseq2_treatment.set_gene_support(pvalue_filter(deseq_df, cutoff=cutoff))
deseq2_treatment

Out[15]:

<GSForge.GeneSet>
Name: deseq2_treatment
    Supported Genes:  117, 0.75% of 15591

In [17]:

deseq2_treatment.save_as_netcdf(DEG_COLL_PATH)

Out[17]:

'/home/tyler/GSForge_demo_data/osfstorage/DEG_gene_sets/deseq2_treatment.nc'

Right click to download this notebook from GitHub.

DESeq2 GeneSets¶

Notebook Preparation¶

Prepare input data for DESeq2¶

DESeq2 Runs¶

Define Helper Functions¶

`DESeq2` Runs¶