GEM Normalization

This notebook is a how-to guide on normalizing gene expression matrices (GEMs) using GSForge. It does not cover how to decide which normalization should be performed.


Setting up the notebook

In [1]:
import os
import GSForge as gsf
from pathlib import Path
import numpy as np
import holoviews as hv

hv.extension("bokeh")

Declare paths used

In [2]:
# OS-independent path management.
from os import environ
from pathlib import Path
In [3]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
assert AGEM_PATH.exists()

Load an AnnotatedGEM

In [4]:
agem = gsf.AnnotatedGEM(AGEM_PATH)
agem
Out[4]:
<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Saving Normalizations to the AnnotatedGEM

If a normalization is expensive to compute, it can be worth saving the result to the AnnotatedGEM object.

In [5]:
uq_counts = gsf.operations.UpperQuartile(agem)
agem.data["uq_counts"] = uq_counts
agem.data
Out[5]:
<xarray.Dataset>
Dimensions:     (Gene: 55986, Sample: 475)
Coordinates:
  * Gene        (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
  * Sample      (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
Data variables:
    SampleSRR   (Sample) object ...
    Treatment   (Sample) object ...
    Time        (Sample) int64 ...
    Tissue      (Sample) object ...
    Genotype    (Sample) object ...
    Subspecies  (Sample) object ...
    counts      (Sample, Gene) int64 ...
    lengths     (Gene) float64 ...
    uq_counts   (Sample, Gene) float64 18.01 0.08958 0.08958 ... 728.9 0.1094
Attributes:
    __GSForge.AnnotatedGEM.params:  {"count_array_name": "counts", "gene_inde...
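
For orientation, upper-quartile normalization generally rescales each sample by the 75th percentile of its counts. The following is a minimal NumPy sketch of that idea on a toy array. It is not GSForge's implementation: the function name `upper_quartile_scale`, the choice to ignore zero counts, and the rescaling by the mean quartile are all assumptions made for illustration.

```python
import numpy as np


def upper_quartile_scale(counts):
    """Scale each sample (row) by the 75th percentile of its non-zero counts.

    Hypothetical helper for illustration; not part of GSForge.
    """
    counts = np.asarray(counts, dtype=float)
    # Mask zeros so unexpressed genes do not drag the quartile down.
    masked = np.where(counts > 0, counts, np.nan)
    quartiles = np.nanpercentile(masked, 75, axis=1)
    # Divide each row by its own quartile, then multiply by the mean quartile
    # so the values stay on a scale comparable to the original counts.
    return counts / quartiles[:, None] * quartiles.mean()


# Toy matrix laid out as (Sample, Gene), matching the 'counts' variable above.
# The second sample has exactly half the sequencing depth of the first.
toy_counts = np.array([
    [10, 0, 20, 40, 80],
    [5, 0, 10, 20, 40],
])
toy_uq = upper_quartile_scale(toy_counts)
```

In this toy case the two samples differ only by a constant factor, so both rows of `toy_uq` come out identical. A sample containing only zeros would produce `nan` values here and would need separate handling.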

Save the AnnotatedGEM as a netCDF (.nc) file

In [6]:
# agem.save(AGEM_PATH)
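
Because the commented-out call above would write to the same file the data was loaded from, it can be safer to save to a new file instead. The following is a minimal sketch, using only `pathlib`, of deriving a sibling path. The helper `normalized_copy_path`, the `_normalized` suffix, and the example directory are hypothetical, and passing the result to `agem.save` assumes that method accepts a path as in the cell above.

```python
from pathlib import Path


def normalized_copy_path(source, suffix="_normalized"):
    """Return a sibling path such as 'rice_normalized.nc' for 'rice.nc'.

    Hypothetical helper for illustration; not part of GSForge.
    """
    source = Path(source)
    return source.with_name(f"{source.stem}{suffix}{source.suffix}")


# Example with a made-up location; in the notebook this would be AGEM_PATH.
example_path = normalized_copy_path(Path("demo_data") / "osfstorage" / "rice.nc")
# agem.save(example_path)  # assumes save() takes a path, as in the cell above
```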

Viewing the effect of transforms and normalizations

In [7]:
gsf.plots.ScatterDistributionBase(agem, count_variable="uq_counts", datashade_=True).opts(
    hv.opts.Area(bgcolor="lightgrey", show_grid=True, show_legend=False, alpha=0.25),
    hv.opts.Area("dist_x", width=150),
    hv.opts.Area("dist_y", height=150),
    hv.opts.RGB(width=500, height=500, bgcolor="lightgrey", show_grid=True),
)
Out[7]:
In [8]:
gsf.plots.ScatterDistributionBase(agem, count_variable="counts", datashade_=True).opts(
    hv.opts.Area(bgcolor="lightgrey", show_grid=True, show_legend=False, alpha=0.25),
    hv.opts.Area("dist_x", width=150),
    hv.opts.Area("dist_y", height=150),
    hv.opts.RGB(width=500, height=500, bgcolor="lightgrey", show_grid=True),
)
Out[8]:
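
Alongside the plots, a quick numeric summary can indicate whether a normalization brought the samples onto a common scale. The sketch below assumes arrays laid out as (Sample, Gene), as in `agem.data`; the helper name `quartile_spread` and the toy values are illustrative only. In the notebook, the inputs could presumably be taken from `agem.data["counts"].values` and `agem.data["uq_counts"].values`.

```python
import numpy as np


def quartile_spread(matrix):
    """Coefficient of variation of the per-sample upper quartiles.

    Hypothetical helper for illustration; zeros are ignored when taking
    each quartile. Values near 0 mean the samples share a common scale.
    """
    matrix = np.asarray(matrix, dtype=float)
    masked = np.where(matrix > 0, matrix, np.nan)
    quartiles = np.nanpercentile(masked, 75, axis=1)
    return quartiles.std() / quartiles.mean()


# Toy "raw" counts: the second sample has half the depth of the first.
raw = np.array([[10, 0, 20, 40, 80], [5, 0, 10, 20, 40]], dtype=float)
# Toy "normalized" counts: both samples rescaled to the same upper quartile.
scaled = np.array([[7.5, 0, 15, 30, 60], [7.5, 0, 15, 30, 60]])

raw_spread = quartile_spread(raw)
scaled_spread = quartile_spread(scaled)
```

A spread that drops toward zero after normalization suggests the per-sample scale differences were removed. It says nothing about whether that normalization is appropriate for a given analysis.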

