Interface Guide

Set operations (intersections, unions, differences, and combinations thereof) answer many of the most basic questions. Moreover, we often want to examine subsets of the original GEM based on a selection determined by some set operation. The Interface provided by GSForge gives uniform API access to both the AnnotatedGEM and the GeneSetCollection objects for retrieving count values and sample annotations.
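As a minimal illustration of the set operations mentioned above, the sketch below uses plain Python sets. The gene identifiers are hypothetical placeholders and are not taken from the demo data; the sketch does not use the GSForge API.

```python
# Hypothetical gene identifiers, used only for illustration.
set_a = {"gene_1", "gene_2", "gene_3"}
set_b = {"gene_2", "gene_3", "gene_4"}

shared = set_a & set_b      # intersection: genes found in both sets
combined = set_a | set_b    # union: genes found in either set
only_in_a = set_a - set_b   # difference: genes found only in set_a

print(sorted(shared))
print(sorted(combined))
print(sorted(only_in_a))
```

A selection such as `shared` is the kind of gene subset one might then use to examine part of the original GEM.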

Notebook setup

import numpy as np
import pandas as pd
import xarray as xr
import GSForge as gsf
import holoviews as hv
hv.extension('bokeh')

# OS-independent path management.
from os import fspath, environ
from pathlib import Path

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
TOUR_DGE = OSF_PATH.joinpath("GeneSetCollections", "DEG_gene_sets")

Load Data

agem = gsf.AnnotatedGEM(GEM_PATH)
agem 
<GSForge.AnnotatedGEM>
Name: Oryza Sativa
Selected GEM Variable: 'counts'
    Gene   66338
    Sample 475
agem.data
<xarray.Dataset>
Dimensions:             (Sample: 475, Gene: 66338)
Coordinates:
  * Sample              (Sample) object 'SRX1423934' ... 'SRX1424408'
  * Gene                (Gene) object 'LOC_Os01g01010.1' ... 'ChrSy.fgenesh.m...
Data variables: (12/29)
    BioSample           (Sample) object 'SAMN04251848' ... 'SAMN04251607'
    LoadDate            (Sample) object '2015-11-20' ... '2015-11-19'
    MBases              (Sample) int64 4016 5202 4053 1166 ... 3098 3529 2922
    MBytes              (Sample) int64 2738 3652 2719 764 ... 1983 2370 1862
    Run                 (Sample) object 'SRR2931040' ... 'SRR2931514'
    SRA_Sample          (Sample) object 'SRS1156722' ... 'SRS1156251'
    ...                  ...
    Platform            (Sample) object 'ILLUMINA' 'ILLUMINA' ... 'ILLUMINA'
    ReleaseDate         (Sample) object '2016-01-04' ... '2016-01-04'
    SRA_Study           (Sample) object 'SRP065945' 'SRP065945' ... 'SRP065945'
    source_name         (Sample) object 'Rice leaf' 'Rice leaf' ... 'Rice leaf'
    tissue              (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf'
    counts              (Sample, Gene) float64 ...
Attributes:
    __GSForge.AnnotatedGEM.params:  {"count_array_name": "counts", "gene_inde...

Selecting Data using the Interface

Select data through the interface via the get_gem_data function.

The simplest possible call to this function returns the zero-filled count matrix as the first item of a two-item tuple.

counts, empty_labels = gsf.get_gem_data(agem)
counts
<xarray.DataArray 'counts' (Sample: 475, Gene: 66338)>
array([[  1.     ,   0.     ,   0.     , ..., 106.37   , 730.919  ,
        881.     ],
       [  0.     ,   0.     ,   0.     , ...,  45.0493 , 198.64   ,
        261.     ],
       [  0.     ,   0.     ,   0.     , ..., 201.085  , 826.987  ,
        873.     ],
       ...,
       [  0.     ,   4.5    ,   0.     , ...,  91.2347 , 625.153  ,
        698.     ],
       [  1.26659,   8.5    ,   0.     , ..., 106.927  , 569.145  ,
        720.     ],
       [  2.     ,   5.5    ,   0.     , ..., 146.451  , 740.039  ,
        570.     ]])
Coordinates:
  * Sample   (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
  * Gene     (Gene) object 'ChrSy.fgenesh.mRNA.1' ... 'LOC_Os12g44390.1'

When annotation_variables are not provided, the second item of the tuple is None.

empty_labels is None
True

Returning a two-item tuple in both cases keeps the number of output objects consistent whether or not annotation_variables are provided:

counts, label_ds = gsf.get_gem_data(agem, annotation_variables=['treatment'])

The counts remain the same, but this time the corresponding labels are also returned (here as an xarray.DataArray).

label_ds
<xarray.DataArray 'treatment' (Sample: 475)>
array(['CONTROL', 'CONTROL', 'CONTROL', ..., 'RECOV_DROUGHT', 'RECOV_DROUGHT',
       'RECOV_DROUGHT'], dtype=object)
Coordinates:
  * Sample   (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'

Masking or Dropping Samples or Genes

Count values can be returned in one of three ‘forms’ with respect to NaN or zero values:

  • ‘masked’: zero counts are replaced with NaN.

  • ‘dropped’: zero counts are dropped gene-wise.

  • ‘complete’: zero counts are included within the matrix. This is the default selection.

The form is selected with the count_mask argument.

counts, _ = gsf.get_gem_data(agem, count_mask='complete')
counts.shape
(475, 66338)
counts, _ = gsf.get_gem_data(agem, count_mask='dropped')
counts.shape
(475, 13992)
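To make the three forms concrete, here is a small sketch in plain xarray. It does not call GSForge; it assumes that ‘masked’ replaces zeros with NaN and that ‘dropped’ removes every gene with at least one zero count, which is one reading of the list above. The toy array and its sample and gene names are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical 3 x 3 count matrix; only gene g1 has no zero counts.
toy = xr.DataArray(
    np.array([[1.0, 0.0, 3.0],
              [2.0, 5.0, 0.0],
              [4.0, 6.0, 0.0]]),
    dims=("Sample", "Gene"),
    coords={"Sample": ["s1", "s2", "s3"], "Gene": ["g1", "g2", "g3"]},
)

# 'complete': the matrix as-is, zeros included.
complete = toy

# 'masked' (assumed behaviour): zeros become NaN, shape is unchanged.
masked = toy.where(toy > 0.0)

# 'dropped' (assumed behaviour): remove any gene that contains a masked value.
dropped = masked.dropna(dim="Gene", how="any")

print(complete.shape, masked.shape, dropped.shape)
```

Under these assumptions only the gene axis shrinks when dropping, which matches the shapes shown above, where the sample count stays at 475.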

For samples there are only two options:

  • Samples with a missing annotation are ‘dropped’.

  • Use the ‘complete’ set of samples.

The annotation_mask argument only has an effect when used with annotation_variables.

counts, label_ds = gsf.get_gem_data(agem, annotation_mask='complete', annotation_variables=['treatment'])
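The effect of dropping samples with a missing annotation can be sketched outside GSForge as follows. This is only an illustration of the idea under stated assumptions, not the library's implementation: the toy labels, counts, and sample names are hypothetical, and a missing annotation is assumed to be represented by None.

```python
import numpy as np
import pandas as pd
import xarray as xr

samples = ["s1", "s2", "s3"]

# Hypothetical annotation with one missing value (sample s2).
labels = xr.DataArray(
    np.array(["CONTROL", None, "DROUGHT"], dtype=object),
    dims=("Sample",),
    coords={"Sample": samples},
    name="treatment",
)

# Hypothetical count matrix sharing the same Sample coordinate.
toy_counts = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=("Sample", "Gene"),
    coords={"Sample": samples, "Gene": ["g1", "g2"]},
)

# Keep only the samples whose annotation is present.
keep = pd.notnull(labels.values)
kept_samples = labels["Sample"].values[keep]

labels_dropped = labels.sel(Sample=kept_samples)
counts_dropped = toy_counts.sel(Sample=kept_samples)

print(counts_dropped.shape)
```

Selecting both arrays with the same sample names keeps the counts and labels aligned row for row.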

Transforming the Count Matrix

A transform, such as a log transform, can be applied when selecting data. GSForge allows users to supply a function through the count_transform argument; this function operates only on the subset of count values returned.

counts, label_ds = gsf.get_gem_data(agem, annotation_mask='complete', annotation_variables=['treatment'],
                                    count_transform=lambda c: np.log(c + 1.0))
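The lambda above can also be written as a named function, which is easier to reuse and test. The sketch below checks such a function on a small hypothetical NumPy array and confirms that it agrees with np.log1p; the name log_plus_one is illustrative and is not part of GSForge.

```python
import numpy as np

def log_plus_one(counts):
    """Natural log of (counts + 1), so that zero counts map to zero."""
    return np.log(counts + 1.0)

# Hypothetical count values, used only to check the transform.
toy = np.array([0.0, 1.0, 9.0, 99.0])
transformed = log_plus_one(toy)

# np.log1p computes the same quantity.
print(np.allclose(transformed, np.log1p(toy)))
```

Since count_transform takes a function, a named function like this could presumably be passed in place of the lambda.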