02 Saving Normalizations to AnnotatedGEMs

This notebook is a how-to guide on normalizing gene expression matrices using GSForge. It does not cover considerations as to which normalization should be preformed.

Recall that .netcdf files cannot be modified once written; meaning the choice to add a normalized count matrix should be considered carefully. Normalizations that are not ‘reversible’ are good candidates to save to an AnnotatedGEM object; as they may require other data (i.e. gene lengths). Many other normalization methods can be run ‘as-needed’.

Set up the notebook

import pandas as pd
import xarray as xr
import umap
import umap.plot
import sklearn.preprocessing
import GSForge as gsf
import holoviews as hv
hv.extension("bokeh")

Declare used paths

# OS-independent path management.
from os import  environ
from pathlib import Path
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
TPM_GEM_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.TPM.txt")

Declare an path to which the created .nc file will saved.

NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_normed.nc")
NORMED_GEM_PATH
PosixPath('/home/tyler/GSForge_demo_data/osfstorage/oryza_sativa/AnnotatedGEMs/oryza_sativa_hisat2_normed.nc')

Load an AnnotatedGEM

agem = gsf.AnnotatedGEM(GEM_PATH)
agem
<GSForge.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Adding Normalizations to an AnnotatedGEM

Here we demonstrate adding an externally generated (TPM) counts to an existing AnnotatedGEM object.

%%time
tpm_count_df = pd.read_csv(TPM_GEM_PATH, sep="\t", index_col=0)
CPU times: user 4.29 s, sys: 316 ms, total: 4.6 s
Wall time: 4.61 s
tpm_count_df.head()
SRX1423934 SRX1423935 SRX1423936 SRX1423937 SRX1423938 SRX1423939 SRX1423940 SRX1423941 SRX1423942 SRX1423943 ... SRX1424399 SRX1424400 SRX1424401 SRX1424402 SRX1424403 SRX1424404 SRX1424405 SRX1424406 SRX1424407 SRX1424408
LOC_Os01g01010 10.557777 5.802739 9.664971 9.802068 8.468090 10.420499 6.704176 10.111569 10.711922 8.904408 ... 7.307394 11.057385 7.095994 9.834905 13.636573 8.350987 10.928863 9.101974 8.468531 9.617651
LOC_Os01g01019 0.257810 0.000000 0.149549 0.000000 0.244159 0.153994 0.224923 0.427355 0.130201 0.320050 ... 0.000000 0.000000 0.000000 0.000000 0.072867 0.000000 0.090800 0.000000 0.000000 0.000000
LOC_Os01g01030 0.653512 1.213873 0.864323 1.159157 1.562642 1.292170 1.056508 1.141737 1.543654 1.576418 ... 0.784792 1.123123 1.069542 1.201486 1.233291 1.117724 1.197164 1.374664 0.842728 1.317539
LOC_Os01g01040 9.493268 6.200687 6.744709 7.150879 7.992403 8.539720 8.062660 8.530299 10.291631 6.111945 ... 5.420519 8.218357 7.300565 9.891362 10.965607 9.908169 9.994303 7.261079 7.945092 8.197905
LOC_Os01g01050 8.646006 4.583269 6.917632 6.150803 7.886282 7.751430 6.607637 8.311967 7.300650 7.646735 ... 6.132597 9.411695 5.969155 8.985373 10.236788 12.085556 12.390430 8.441613 6.919355 6.130914

5 rows × 475 columns

There is a pandas.DataFrame.to_xarray() function, but the coordinates are not quite what we want. Instead we can generate an xarray.DataArray quickly through the standard creation call.

tpm_counts = xr.Dataset(
    data_vars={"TPM_counts": (("Sample", "Gene"), tpm_count_df.values.transpose())},
    coords={
        "Sample": tpm_count_df.columns.values,
        "Gene": tpm_count_df.index.values
    }
)
tpm_counts
<xarray.Dataset>
Dimensions:     (Sample: 475, Gene: 55986)
Coordinates:
  * Sample      (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
  * Gene        (Gene) object 'LOC_Os01g01010' ... 'ChrUn.fgenesh.gene.66'
Data variables:
    TPM_counts  (Sample, Gene) float64 10.56 0.2578 0.6535 9.493 ... 0.0 0.0 0.0

Adding to the existing gem xarray.Dataset can be done via a call to update().

agem.data.update(tpm_counts)
<xarray.Dataset>
Dimensions:             (Gene: 55986, Sample: 475)
Coordinates:
  * Gene                (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
  * Sample              (Sample) object 'SRX1423934' ... 'SRX1424408'
Data variables: (12/30)
    BioSample           (Sample) object 'SAMN04251848' ... 'SAMN04251607'
    LoadDate            (Sample) object '2015-11-20' ... '2015-11-19'
    MBases              (Sample) int64 4016 5202 4053 1166 ... 3098 3529 2922
    MBytes              (Sample) int64 2738 3652 2719 764 ... 1983 2370 1862
    Run                 (Sample) object 'SRR2931040' ... 'SRR2931514'
    SRA_Sample          (Sample) object 'SRS1156722' ... 'SRS1156251'
    ...                  ...
    ReleaseDate         (Sample) object '2016-01-04' ... '2016-01-04'
    SRA_Study           (Sample) object 'SRP065945' 'SRP065945' ... 'SRP065945'
    source_name         (Sample) object 'Rice leaf' 'Rice leaf' ... 'Rice leaf'
    tissue              (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf'
    counts              (Sample, Gene) int64 ...
    TPM_counts          (Sample, Gene) float64 0.2341 0.0 0.0 ... 0.0 17.26 0.0
Attributes:
    __GSForge.AnnotatedGEM.params:  {"count_array_name": "counts", "gene_inde...
agem.count_array_names
['counts', 'TPM_counts']
%%time
uq_raw_counts = gsf.operations.UpperQuartile(agem)
CPU times: user 2.26 s, sys: 464 ms, total: 2.72 s
Wall time: 2.73 s
%%time
uq_tpm_counts = gsf.operations.UpperQuartile(agem, count_variable='TPM_counts')
CPU times: user 1.51 s, sys: 264 ms, total: 1.77 s
Wall time: 1.77 s

We can also use dictionary-like assignment.

agem.data["uq_raw_counts"] = uq_raw_counts
agem.data["uq_tpm_counts"] = uq_tpm_counts
agem.count_array_names
['counts', 'TPM_counts', 'uq_raw_counts', 'uq_tpm_counts']

Normalizations from Scikit-Learn

Select counts using get_gem_data()

counts, _ = gsf.get_gem_data(agem)
%%time
quantile_counts = sklearn.preprocessing.quantile_transform(counts, axis=1, output_distribution='normal', copy=True)
quantile_counts = xr.DataArray(quantile_counts, coords=counts.coords)
CPU times: user 10.1 s, sys: 136 ms, total: 10.3 s
Wall time: 10.2 s
agem.data["quantile_counts"] = quantile_counts
agem.count_array_names
['counts', 'TPM_counts', 'uq_raw_counts', 'uq_tpm_counts', 'quantile_counts']

Using R to Normalize GEMs

See TODO for more information and examples on interacting with R.

R integration setup

import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR) # Supresses verbose R output.
%%R
library("edgeR")

Select counts using get_gem_data()

counts, _ = gsf.get_gem_data(agem)

Prepare the counts for R

Notice the counts are transposed after this step to a form more common in R. (features by samples).

ri_counts = gsf.utils.R_interface.Py_counts_to_R(counts)
ri_counts.shape
(55986, 475)

Run the normalization within R

%%R -i ri_counts -o tmm_counts

dge_list <- DGEList(counts=ri_counts)
dge_list <- calcNormFactors(dge_list, method="TMM")
tmm_counts <- cpm(dge_list, normalized.lib.sizes=TRUE, log=FALSE)
tmm_counts = xr.DataArray(tmm_counts.T, coords=counts.coords, name='tmm_counts')
tmm_counts
<xarray.DataArray 'tmm_counts' (Sample: 475, Gene: 55986)>
array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.42048297e+01, 4.12840093e+01, 3.19315181e+01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.46012729e+01, 2.24112561e+01, 2.49013957e+01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.45942064e+01, 2.68799459e+01, 3.12621205e+01],
       ...,
       [0.00000000e+00, 1.43059783e-01, 0.00000000e+00, ...,
        9.10813950e+00, 2.54169547e+01, 2.74674783e+01],
       [4.57581900e-02, 0.00000000e+00, 0.00000000e+00, ...,
        8.73981430e+00, 2.24215131e+01, 3.05664709e+01],
       [0.00000000e+00, 1.43518254e-01, 0.00000000e+00, ...,
        1.64089203e+01, 6.14258126e+01, 2.20539716e+01]])
Coordinates:
  * Gene     (Gene) object 'ChrSy.fgenesh.gene.1' ... 'LOC_Os12g44390'
  * Sample   (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'

Add the counts to the GEM .data attribute.

agem.data['tmm_counts'] = tmm_counts

Save the AnnotatedGEM as a .netcdf file

Recall that .nc files cannot be overwritten, nor have variables added to them. So we either need to delete and save the file with the same name, or save as a new file.

if not NORMED_GEM_PATH.exists():
    agem.save(NORMED_GEM_PATH)