GEM Normalization

This notebook is a how-to guide on normalizing gene expression matrices using GSForge. It does not cover how to choose which normalization to perform.

Setting up the notebook

In [1]:

import os
import GSForge as gsf
from pathlib import Path
import numpy as np
import holoviews as hv

hv.extension("bokeh")

Declare paths used

In [2]:

# OS-independent path management (Path is already imported above).
from os import environ

In [3]:

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
assert AGEM_PATH.exists()
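
The path lookup above can be sketched in isolation. This is a minimal illustration and not part of GSForge: `resolve_demo_dir` is a hypothetical helper, and it takes the environment as a plain mapping so the behaviour can be checked without changing the real environment.

```python
from pathlib import Path

DEFAULT_DIR = "~/GSForge_demo_data"  # same fallback as the cell above


def resolve_demo_dir(env):
    """Return the demo data directory described by a mapping of environment variables.

    Hypothetical helper for illustration; GSForge does not provide it.
    """
    # Use GSFORGE_DEMO_DATA when it is set, otherwise fall back to the default.
    raw = env.get("GSFORGE_DEMO_DATA", DEFAULT_DIR)
    # expanduser() replaces a leading "~" with the current user's home directory.
    return Path(raw).expanduser()


custom = resolve_demo_dir({"GSFORGE_DEMO_DATA": "/data/gsforge"})
fallback = resolve_demo_dir({})
```

Passing `os.environ` as `env` should reproduce the lookup used for `OSF_PATH`.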

Load an AnnotatedGEM

In [4]:

agem = gsf.AnnotatedGEM(AGEM_PATH)
agem

Out[4]:

<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Saving Normalizations to the AnnotatedGEM

If a normalization is expensive to compute, it can be worth saving the result to the AnnotatedGEM object.

In [5]:

uq_counts = gsf.operations.UpperQuartile(agem)
agem.data["uq_counts"] = uq_counts
agem.data

Out[5]:

<xarray.Dataset>
Dimensions:     (Gene: 55986, Sample: 475)
Coordinates:
  * Gene        (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
  * Sample      (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
Data variables:
    SampleSRR   (Sample) object ...
    Treatment   (Sample) object ...
    Time        (Sample) int64 ...
    Tissue      (Sample) object ...
    Genotype    (Sample) object ...
    Subspecies  (Sample) object ...
    counts      (Sample, Gene) int64 ...
    lengths     (Gene) float64 ...
    uq_counts   (Sample, Gene) float64 18.01 0.08958 0.08958 ... 728.9 0.1094
Attributes:
    __GSForge.AnnotatedGEM.params:  {"count_array_name": "counts", "gene_inde...
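
For intuition about what an upper-quartile normalization computes, the NumPy sketch below shows one common formulation: each sample's counts are divided by the 75th percentile of that sample's non-zero counts. This is an assumption made for illustration only. `upper_quartile_normalize` is a hypothetical function, and `gsf.operations.UpperQuartile` may differ in details such as how zeros are handled or whether the result is rescaled.

```python
import numpy as np


def upper_quartile_normalize(counts):
    """Divide each sample (row) by the 75th percentile of its non-zero counts.

    Hypothetical sketch; not necessarily what GSForge does internally.
    Expects an array laid out as (Sample, Gene), like the "counts" variable.
    """
    counts = np.asarray(counts, dtype=float)
    # Hide zero counts so they do not pull the percentile down.
    nonzero = np.where(counts > 0, counts, np.nan)
    # One upper-quartile value per sample, kept as a column for broadcasting.
    upper_quartiles = np.nanpercentile(nonzero, 75, axis=1, keepdims=True)
    return counts / upper_quartiles


# Two toy samples with the same expression profile but a 10x depth difference.
toy_counts = np.array([
    [0, 1, 2, 3, 4, 5],
    [0, 10, 20, 30, 40, 50],
])
normalized = upper_quartile_normalize(toy_counts)
```

In this toy case the two rows of `normalized` come out identical, which is the point of the scaling: differences in sequencing depth between samples are removed while the relative expression within each sample is kept.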

Save the AnnotatedGEM as a netCDF (.nc) file

In [6]:

# Uncomment to save; note that this overwrites the file loaded from AGEM_PATH.
# agem.save(AGEM_PATH)

Viewing the effect of transforms and normalizations

In [7]:

gsf.plots.ScatterDistributionBase(agem, count_variable="uq_counts", datashade_=True).opts(
    hv.opts.Area(bgcolor="lightgrey", show_grid=True, show_legend=False, alpha=0.25),
    hv.opts.Area("dist_x", width=150),
    hv.opts.Area("dist_y", height=150),
    hv.opts.RGB(width=500, height=500, bgcolor="lightgrey", show_grid=True),
)

Out[7]:

In [8]:

gsf.plots.ScatterDistributionBase(agem, count_variable="counts", datashade_=True).opts(
    hv.opts.Area(bgcolor="lightgrey", show_grid=True, show_legend=False, alpha=0.25),
    hv.opts.Area("dist_x", width=150),
    hv.opts.Area("dist_y", height=150),
    hv.opts.RGB(width=500, height=500, bgcolor="lightgrey", show_grid=True),
)

Out[8]:
