02 Saving Normalizations to AnnotatedGEMs¶

This notebook is a how-to guide on normalizing gene expression matrices using GSForge. It does not cover considerations as to which normalization should be preformed.

Recall that .netcdf files cannot be modified once written; meaning the choice to add a normalized count matrix should be considered carefully. Normalizations that are not ‘reversible’ are good candidates to save to an AnnotatedGEM object; as they may require other data (i.e. gene lengths). Many other normalization methods can be run ‘as-needed’.

Set up the notebook

import pandas as pd
import xarray as xr
import umap
import umap.plot
import sklearn.preprocessing
import GSForge as gsf
import holoviews as hv
hv.extension("bokeh")

Declare used paths

# OS-independent path management.
from os import  environ
from pathlib import Path

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
TPM_GEM_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.TPM.txt")

Declare an path to which the created .nc file will saved.

NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_normed.nc")
NORMED_GEM_PATH

PosixPath('/home/tyler/GSForge_demo_data/osfstorage/oryza_sativa/AnnotatedGEMs/oryza_sativa_hisat2_normed.nc')

Load an AnnotatedGEM

agem = gsf.AnnotatedGEM(GEM_PATH)
agem

<GSForge.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Adding Normalizations to an AnnotatedGEM¶

Here we demonstrate adding an externally generated (TPM) counts to an existing AnnotatedGEM object.

%%time
tpm_count_df = pd.read_csv(TPM_GEM_PATH, sep="\t", index_col=0)

CPU times: user 4.29 s, sys: 316 ms, total: 4.6 s
Wall time: 4.61 s

tpm_count_df.head()

	SRX1423934	SRX1423935	SRX1423936	SRX1423937	SRX1423938	SRX1423939	SRX1423940	SRX1423941	SRX1423942	SRX1423943	...	SRX1424399	SRX1424400	SRX1424401	SRX1424402	SRX1424403	SRX1424404	SRX1424405	SRX1424406	SRX1424407	SRX1424408
LOC_Os01g01010	10.557777	5.802739	9.664971	9.802068	8.468090	10.420499	6.704176	10.111569	10.711922	8.904408	...	7.307394	11.057385	7.095994	9.834905	13.636573	8.350987	10.928863	9.101974	8.468531	9.617651
LOC_Os01g01019	0.257810	0.000000	0.149549	0.000000	0.244159	0.153994	0.224923	0.427355	0.130201	0.320050	...	0.000000	0.000000	0.000000	0.000000	0.072867	0.000000	0.090800	0.000000	0.000000	0.000000
LOC_Os01g01030	0.653512	1.213873	0.864323	1.159157	1.562642	1.292170	1.056508	1.141737	1.543654	1.576418	...	0.784792	1.123123	1.069542	1.201486	1.233291	1.117724	1.197164	1.374664	0.842728	1.317539
LOC_Os01g01040	9.493268	6.200687	6.744709	7.150879	7.992403	8.539720	8.062660	8.530299	10.291631	6.111945	...	5.420519	8.218357	7.300565	9.891362	10.965607	9.908169	9.994303	7.261079	7.945092	8.197905
LOC_Os01g01050	8.646006	4.583269	6.917632	6.150803	7.886282	7.751430	6.607637	8.311967	7.300650	7.646735	...	6.132597	9.411695	5.969155	8.985373	10.236788	12.085556	12.390430	8.441613	6.919355	6.130914

5 rows × 475 columns

There is a pandas.DataFrame.to_xarray() function, but the coordinates are not quite what we want. Instead we can generate an xarray.DataArray quickly through the standard creation call.

Adding to the existing gem xarray.Dataset can be done via a call to update().

agem.count_array_names

['counts', 'TPM_counts']

%%time
uq_raw_counts = gsf.operations.UpperQuartile(agem)

CPU times: user 2.26 s, sys: 464 ms, total: 2.72 s
Wall time: 2.73 s

%%time
uq_tpm_counts = gsf.operations.UpperQuartile(agem, count_variable='TPM_counts')

CPU times: user 1.51 s, sys: 264 ms, total: 1.77 s
Wall time: 1.77 s

We can also use dictionary-like assignment.

agem.data["uq_raw_counts"] = uq_raw_counts
agem.data["uq_tpm_counts"] = uq_tpm_counts
agem.count_array_names

['counts', 'TPM_counts', 'uq_raw_counts', 'uq_tpm_counts']

Normalizations from Scikit-Learn¶

Select counts using get_gem_data()

counts, _ = gsf.get_gem_data(agem)

%%time
quantile_counts = sklearn.preprocessing.quantile_transform(counts, axis=1, output_distribution='normal', copy=True)
quantile_counts = xr.DataArray(quantile_counts, coords=counts.coords)

CPU times: user 10.1 s, sys: 136 ms, total: 10.3 s
Wall time: 10.2 s

agem.data["quantile_counts"] = quantile_counts
agem.count_array_names

['counts', 'TPM_counts', 'uq_raw_counts', 'uq_tpm_counts', 'quantile_counts']

Using R to Normalize GEMs¶

See TODO for more information and examples on interacting with R.

R integration setup

import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR) # Supresses verbose R output.

%%R
library("edgeR")

Select counts using get_gem_data()

counts, _ = gsf.get_gem_data(agem)

Prepare the counts for R

Notice the counts are transposed after this step to a form more common in R. (features by samples).

ri_counts = gsf.utils.R_interface.Py_counts_to_R(counts)
ri_counts.shape

(55986, 475)

Run the normalization within R

%%R -i ri_counts -o tmm_counts

dge_list <- DGEList(counts=ri_counts)
dge_list <- calcNormFactors(dge_list, method="TMM")
tmm_counts <- cpm(dge_list, normalized.lib.sizes=TRUE, log=FALSE)

tmm_counts = xr.DataArray(tmm_counts.T, coords=counts.coords, name='tmm_counts')
tmm_counts

<xarray.DataArray 'tmm_counts' (Sample: 475, Gene: 55986)>
array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.42048297e+01, 4.12840093e+01, 3.19315181e+01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.46012729e+01, 2.24112561e+01, 2.49013957e+01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.45942064e+01, 2.68799459e+01, 3.12621205e+01],
       ...,
       [0.00000000e+00, 1.43059783e-01, 0.00000000e+00, ...,
        9.10813950e+00, 2.54169547e+01, 2.74674783e+01],
       [4.57581900e-02, 0.00000000e+00, 0.00000000e+00, ...,
        8.73981430e+00, 2.24215131e+01, 3.05664709e+01],
       [0.00000000e+00, 1.43518254e-01, 0.00000000e+00, ...,
        1.64089203e+01, 6.14258126e+01, 2.20539716e+01]])
Coordinates:
  * Gene     (Gene) object 'ChrSy.fgenesh.gene.1' ... 'LOC_Os12g44390'
  * Sample   (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'

Add the counts to the GEM .data attribute.

agem.data['tmm_counts'] = tmm_counts

Save the AnnotatedGEM as a .netcdf file¶

Recall that .nc files cannot be overwritten, nor have variables added to them. So we either need to delete and save the file with the same name, or save as a new file.

if not NORMED_GEM_PATH.exists():
    agem.save(NORMED_GEM_PATH)