02 Saving Normalizations to AnnotatedGEMs¶
This notebook is a how-to guide on normalizing gene expression matrices using GSForge. It does not cover considerations as to which normalization should be preformed.
Recall that .netcdf files cannot be modified once written; meaning the choice to add a normalized count matrix
should be considered carefully. Normalizations that are not ‘reversible’ are good candidates to save to an
AnnotatedGEM
object; as they may require other data (i.e. gene lengths). Many other
normalization methods can be run ‘as-needed’.
Set up the notebook
import pandas as pd
import xarray as xr
import umap
import umap.plot
import sklearn.preprocessing
import GSForge as gsf
import holoviews as hv
hv.extension("bokeh")
Declare used paths
# OS-independent path management.
from os import environ
from pathlib import Path
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
TPM_GEM_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.TPM.txt")
Declare an path to which the created .nc
file will saved.
NORMED_GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_normed.nc")
NORMED_GEM_PATH
PosixPath('/home/tyler/GSForge_demo_data/osfstorage/oryza_sativa/AnnotatedGEMs/oryza_sativa_hisat2_normed.nc')
Load an AnnotatedGEM
agem = gsf.AnnotatedGEM(GEM_PATH)
agem
<GSForge.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
Gene 55986
Sample 475
Adding Normalizations to an AnnotatedGEM¶
Here we demonstrate adding an externally generated (TPM) counts to an existing AnnotatedGEM
object.
%%time
tpm_count_df = pd.read_csv(TPM_GEM_PATH, sep="\t", index_col=0)
CPU times: user 4.29 s, sys: 316 ms, total: 4.6 s
Wall time: 4.61 s
tpm_count_df.head()
SRX1423934 | SRX1423935 | SRX1423936 | SRX1423937 | SRX1423938 | SRX1423939 | SRX1423940 | SRX1423941 | SRX1423942 | SRX1423943 | ... | SRX1424399 | SRX1424400 | SRX1424401 | SRX1424402 | SRX1424403 | SRX1424404 | SRX1424405 | SRX1424406 | SRX1424407 | SRX1424408 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LOC_Os01g01010 | 10.557777 | 5.802739 | 9.664971 | 9.802068 | 8.468090 | 10.420499 | 6.704176 | 10.111569 | 10.711922 | 8.904408 | ... | 7.307394 | 11.057385 | 7.095994 | 9.834905 | 13.636573 | 8.350987 | 10.928863 | 9.101974 | 8.468531 | 9.617651 |
LOC_Os01g01019 | 0.257810 | 0.000000 | 0.149549 | 0.000000 | 0.244159 | 0.153994 | 0.224923 | 0.427355 | 0.130201 | 0.320050 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.072867 | 0.000000 | 0.090800 | 0.000000 | 0.000000 | 0.000000 |
LOC_Os01g01030 | 0.653512 | 1.213873 | 0.864323 | 1.159157 | 1.562642 | 1.292170 | 1.056508 | 1.141737 | 1.543654 | 1.576418 | ... | 0.784792 | 1.123123 | 1.069542 | 1.201486 | 1.233291 | 1.117724 | 1.197164 | 1.374664 | 0.842728 | 1.317539 |
LOC_Os01g01040 | 9.493268 | 6.200687 | 6.744709 | 7.150879 | 7.992403 | 8.539720 | 8.062660 | 8.530299 | 10.291631 | 6.111945 | ... | 5.420519 | 8.218357 | 7.300565 | 9.891362 | 10.965607 | 9.908169 | 9.994303 | 7.261079 | 7.945092 | 8.197905 |
LOC_Os01g01050 | 8.646006 | 4.583269 | 6.917632 | 6.150803 | 7.886282 | 7.751430 | 6.607637 | 8.311967 | 7.300650 | 7.646735 | ... | 6.132597 | 9.411695 | 5.969155 | 8.985373 | 10.236788 | 12.085556 | 12.390430 | 8.441613 | 6.919355 | 6.130914 |
5 rows × 475 columns
There is a pandas.DataFrame.to_xarray()
function, but the coordinates are not quite what we want.
Instead we can generate an xarray.DataArray
quickly through the standard creation call.
tpm_counts = xr.Dataset(
data_vars={"TPM_counts": (("Sample", "Gene"), tpm_count_df.values.transpose())},
coords={
"Sample": tpm_count_df.columns.values,
"Gene": tpm_count_df.index.values
}
)
tpm_counts
<xarray.Dataset> Dimensions: (Sample: 475, Gene: 55986) Coordinates: * Sample (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408' * Gene (Gene) object 'LOC_Os01g01010' ... 'ChrUn.fgenesh.gene.66' Data variables: TPM_counts (Sample, Gene) float64 10.56 0.2578 0.6535 9.493 ... 0.0 0.0 0.0
- Sample: 475
- Gene: 55986
- Sample(Sample)object'SRX1423934' ... 'SRX1424408'
array(['SRX1423934', 'SRX1423935', 'SRX1423936', ..., 'SRX1424406', 'SRX1424407', 'SRX1424408'], dtype=object)
- Gene(Gene)object'LOC_Os01g01010' ... 'ChrUn.fgen...
array(['LOC_Os01g01010', 'LOC_Os01g01019', 'LOC_Os01g01030', ..., 'ChrUn.fgenesh.gene.17', 'ChrUn.fgenesh.gene.21', 'ChrUn.fgenesh.gene.66'], dtype=object)
- TPM_counts(Sample, Gene)float6410.56 0.2578 0.6535 ... 0.0 0.0 0.0
array([[10.557777, 0.25781 , 0.653512, ..., 0. , 0. , 0. ], [ 5.802739, 0. , 1.213873, ..., 0. , 0. , 0. ], [ 9.664971, 0.149549, 0.864323, ..., 0. , 0. , 0. ], ..., [ 9.101974, 0. , 1.374664, ..., 0. , 0. , 0. ], [ 8.468531, 0. , 0.842728, ..., 0. , 0. , 0. ], [ 9.617651, 0. , 1.317539, ..., 0. , 0. , 0. ]])
Adding to the existing gem xarray.Dataset
can be done via a call to update()
.
agem.data.update(tpm_counts)
<xarray.Dataset> Dimensions: (Gene: 55986, Sample: 475) Coordinates: * Gene (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418' * Sample (Sample) object 'SRX1423934' ... 'SRX1424408' Data variables: (12/30) BioSample (Sample) object 'SAMN04251848' ... 'SAMN04251607' LoadDate (Sample) object '2015-11-20' ... '2015-11-19' MBases (Sample) int64 4016 5202 4053 1166 ... 3098 3529 2922 MBytes (Sample) int64 2738 3652 2719 764 ... 1983 2370 1862 Run (Sample) object 'SRR2931040' ... 'SRR2931514' SRA_Sample (Sample) object 'SRS1156722' ... 'SRS1156251' ... ... ReleaseDate (Sample) object '2016-01-04' ... '2016-01-04' SRA_Study (Sample) object 'SRP065945' 'SRP065945' ... 'SRP065945' source_name (Sample) object 'Rice leaf' 'Rice leaf' ... 'Rice leaf' tissue (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf' counts (Sample, Gene) int64 ... TPM_counts (Sample, Gene) float64 0.2341 0.0 0.0 ... 0.0 17.26 0.0 Attributes: __GSForge.AnnotatedGEM.params: {"count_array_name": "counts", "gene_inde...
- Gene: 55986
- Sample: 475
- Gene(Gene)object'LOC_Os06g05820' ... 'LOC_Os07g0...
array(['LOC_Os06g05820', 'LOC_Os10g27460', 'LOC_Os02g35980', ..., 'LOC_Os03g50190', 'LOC_Os03g20020', 'LOC_Os07g03418'], dtype=object)
- Sample(Sample)object'SRX1423934' ... 'SRX1424408'
array(['SRX1423934', 'SRX1423935', 'SRX1423936', ..., 'SRX1424406', 'SRX1424407', 'SRX1424408'], dtype=object)
- BioSample(Sample)object...
array(['SAMN04251848', 'SAMN04251849', 'SAMN04251850', ..., 'SAMN04251605', 'SAMN04251606', 'SAMN04251607'], dtype=object)
- LoadDate(Sample)object...
array(['2015-11-20', '2015-11-20', '2015-11-20', ..., '2015-11-19', '2015-11-19', '2015-11-19'], dtype=object)
- MBases(Sample)int64...
array([4016, 5202, 4053, ..., 3098, 3529, 2922])
- MBytes(Sample)int64...
array([2738, 3652, 2719, ..., 1983, 2370, 1862])
- Run(Sample)object...
array(['SRR2931040', 'SRR2931041', 'SRR2931042', ..., 'SRR2931512', 'SRR2931513', 'SRR2931514'], dtype=object)
- SRA_Sample(Sample)object...
array(['SRS1156722', 'SRS1156721', 'SRS1156718', ..., 'SRS1156254', 'SRS1156252', 'SRS1156251'], dtype=object)
- Sample_Name(Sample)object...
array(['GSM1933346', 'GSM1933347', 'GSM1933348', ..., 'GSM1933818', 'GSM1933819', 'GSM1933820'], dtype=object)
- genotype(Sample)object...
array(['Azuenca', 'Azuenca', 'Azuenca', ..., 'Tadukan', 'Tadukan', 'Tadukan'], dtype=object)
- time(Sample)int64...
array([ 15, 15, 30, ..., 270, 300, 300])
- treatment(Sample)object...
array(['CONTROL', 'CONTROL', 'CONTROL', ..., 'RECOV_DROUGHT', 'RECOV_DROUGHT', 'RECOV_DROUGHT'], dtype=object)
- Assay_Type(Sample)object...
array(['RNA-Seq', 'RNA-Seq', 'RNA-Seq', ..., 'RNA-Seq', 'RNA-Seq', 'RNA-Seq'], dtype=object)
- AvgSpotLen(Sample)int64...
array([102, 102, 102, ..., 102, 102, 102])
- BioProject(Sample)object...
array(['PRJNA301554', 'PRJNA301554', 'PRJNA301554', ..., 'PRJNA301554', 'PRJNA301554', 'PRJNA301554'], dtype=object)
- Center_Name(Sample)object...
array(['GEO', 'GEO', 'GEO', ..., 'GEO', 'GEO', 'GEO'], dtype=object)
- Consent(Sample)object...
array(['public', 'public', 'public', ..., 'public', 'public', 'public'], dtype=object)
- DATASTORE_filetype(Sample)object...
array(['sra', 'sra', 'sra', ..., 'sra', 'sra', 'sra'], dtype=object)
- DATASTORE_provider(Sample)object...
array(['ncbi', 'ncbi', 'ncbi', ..., 'ncbi', 'ncbi', 'ncbi'], dtype=object)
- InsertSize(Sample)int64...
array([0, 0, 0, ..., 0, 0, 0])
- Instrument(Sample)object...
array(['Illumina HiSeq 2000', 'Illumina HiSeq 2000', 'Illumina HiSeq 2000', ..., 'Illumina HiSeq 2000', 'Illumina HiSeq 2000', 'Illumina HiSeq 2000'], dtype=object)
- LibraryLayout(Sample)object...
array(['PAIRED', 'PAIRED', 'PAIRED', ..., 'PAIRED', 'PAIRED', 'PAIRED'], dtype=object)
- LibrarySelection(Sample)object...
array(['cDNA', 'cDNA', 'cDNA', ..., 'cDNA', 'cDNA', 'cDNA'], dtype=object)
- LibrarySource(Sample)object...
array(['TRANSCRIPTOMIC', 'TRANSCRIPTOMIC', 'TRANSCRIPTOMIC', ..., 'TRANSCRIPTOMIC', 'TRANSCRIPTOMIC', 'TRANSCRIPTOMIC'], dtype=object)
- Organism(Sample)object...
array(['Oryza sativa', 'Oryza sativa', 'Oryza sativa', ..., 'Oryza sativa', 'Oryza sativa', 'Oryza sativa'], dtype=object)
- Platform(Sample)object...
array(['ILLUMINA', 'ILLUMINA', 'ILLUMINA', ..., 'ILLUMINA', 'ILLUMINA', 'ILLUMINA'], dtype=object)
- ReleaseDate(Sample)object...
array(['2016-01-04', '2016-01-04', '2016-01-04', ..., '2016-01-04', '2016-01-04', '2016-01-04'], dtype=object)
- SRA_Study(Sample)object...
array(['SRP065945', 'SRP065945', 'SRP065945', ..., 'SRP065945', 'SRP065945', 'SRP065945'], dtype=object)
- source_name(Sample)object...
array(['Rice leaf', 'Rice leaf', 'Rice leaf', ..., 'Rice leaf', 'Rice leaf', 'Rice leaf'], dtype=object)
- tissue(Sample)object...
array(['leaf', 'leaf', 'leaf', ..., 'leaf', 'leaf', 'leaf'], dtype=object)
- counts(Sample, Gene)int64...
[26593350 values with dtype=int64]
- TPM_counts(Sample, Gene)float640.2341 0.0 0.0 ... 0.0 17.26 0.0
array([[ 0.2341 , 0. , 0. , ..., 0. , 14.616986, 0. ], [ 0.044422, 0. , 0. , ..., 0. , 8.403654, 0. ], [ 0.221362, 0. , 0. , ..., 0. , 10.900209, 0. ], ..., [ 0.085393, 0. , 0. , ..., 0. , 9.491459, 0. ], [ 0.087561, 0. , 0. , ..., 0. , 7.073131, 0. ], [ 0.270327, 0. , 0. , ..., 0. , 17.256233, 0. ]])
- __GSForge.AnnotatedGEM.params :
- {"count_array_name": "counts", "gene_index_name": "Gene", "name": "Oryza sativa", "sample_index_name": "Sample"}
agem.count_array_names
['counts', 'TPM_counts']
%%time
uq_raw_counts = gsf.operations.UpperQuartile(agem)
CPU times: user 2.26 s, sys: 464 ms, total: 2.72 s
Wall time: 2.73 s
%%time
uq_tpm_counts = gsf.operations.UpperQuartile(agem, count_variable='TPM_counts')
CPU times: user 1.51 s, sys: 264 ms, total: 1.77 s
Wall time: 1.77 s
We can also use dictionary-like assignment.
agem.data["uq_raw_counts"] = uq_raw_counts
agem.data["uq_tpm_counts"] = uq_tpm_counts
agem.count_array_names
['counts', 'TPM_counts', 'uq_raw_counts', 'uq_tpm_counts']
Normalizations from Scikit-Learn¶
Select counts using get_gem_data()
counts, _ = gsf.get_gem_data(agem)
%%time
quantile_counts = sklearn.preprocessing.quantile_transform(counts, axis=1, output_distribution='normal', copy=True)
quantile_counts = xr.DataArray(quantile_counts, coords=counts.coords)
CPU times: user 10.1 s, sys: 136 ms, total: 10.3 s
Wall time: 10.2 s
agem.data["quantile_counts"] = quantile_counts
agem.count_array_names
['counts', 'TPM_counts', 'uq_raw_counts', 'uq_tpm_counts', 'quantile_counts']
Using R to Normalize GEMs¶
See TODO for more information and examples on interacting with R.
R integration setup
import rpy2.rinterface_lib.callbacks
import logging
from rpy2.robjects import pandas2ri
%load_ext rpy2.ipython
pandas2ri.activate()
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR) # Supresses verbose R output.
%%R
library("edgeR")
Select counts using get_gem_data()
counts, _ = gsf.get_gem_data(agem)
Prepare the counts for R
Notice the counts are transposed after this step to a form more common in R. (features by samples).
ri_counts = gsf.utils.R_interface.Py_counts_to_R(counts)
ri_counts.shape
(55986, 475)
Run the normalization within R
%%R -i ri_counts -o tmm_counts
dge_list <- DGEList(counts=ri_counts)
dge_list <- calcNormFactors(dge_list, method="TMM")
tmm_counts <- cpm(dge_list, normalized.lib.sizes=TRUE, log=FALSE)
tmm_counts = xr.DataArray(tmm_counts.T, coords=counts.coords, name='tmm_counts')
tmm_counts
<xarray.DataArray 'tmm_counts' (Sample: 475, Gene: 55986)> array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ..., 1.42048297e+01, 4.12840093e+01, 3.19315181e+01], [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ..., 1.46012729e+01, 2.24112561e+01, 2.49013957e+01], [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ..., 1.45942064e+01, 2.68799459e+01, 3.12621205e+01], ..., [0.00000000e+00, 1.43059783e-01, 0.00000000e+00, ..., 9.10813950e+00, 2.54169547e+01, 2.74674783e+01], [4.57581900e-02, 0.00000000e+00, 0.00000000e+00, ..., 8.73981430e+00, 2.24215131e+01, 3.05664709e+01], [0.00000000e+00, 1.43518254e-01, 0.00000000e+00, ..., 1.64089203e+01, 6.14258126e+01, 2.20539716e+01]]) Coordinates: * Gene (Gene) object 'ChrSy.fgenesh.gene.1' ... 'LOC_Os12g44390' * Sample (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
- Sample: 475
- Gene: 55986
- 0.0 0.0 0.0 0.0 0.1174 0.0 0.0 ... 6.698 0.0 23.25 16.41 61.43 22.05
array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ..., 1.42048297e+01, 4.12840093e+01, 3.19315181e+01], [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ..., 1.46012729e+01, 2.24112561e+01, 2.49013957e+01], [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ..., 1.45942064e+01, 2.68799459e+01, 3.12621205e+01], ..., [0.00000000e+00, 1.43059783e-01, 0.00000000e+00, ..., 9.10813950e+00, 2.54169547e+01, 2.74674783e+01], [4.57581900e-02, 0.00000000e+00, 0.00000000e+00, ..., 8.73981430e+00, 2.24215131e+01, 3.05664709e+01], [0.00000000e+00, 1.43518254e-01, 0.00000000e+00, ..., 1.64089203e+01, 6.14258126e+01, 2.20539716e+01]])
- Gene(Gene)object'ChrSy.fgenesh.gene.1' ... 'LOC_...
array(['ChrSy.fgenesh.gene.1', 'ChrSy.fgenesh.gene.10', 'ChrSy.fgenesh.gene.11', ..., 'LOC_Os12g44370', 'LOC_Os12g44380', 'LOC_Os12g44390'], dtype=object)
- Sample(Sample)object'SRX1423934' ... 'SRX1424408'
array(['SRX1423934', 'SRX1423935', 'SRX1423936', ..., 'SRX1424406', 'SRX1424407', 'SRX1424408'], dtype=object)
Add the counts to the GEM .data attribute.
agem.data['tmm_counts'] = tmm_counts
Save the AnnotatedGEM as a .netcdf file¶
Recall that .nc
files cannot be overwritten, nor have variables added to them.
So we either need to delete and save the file with the same name, or save as a new file.
if not NORMED_GEM_PATH.exists():
agem.save(NORMED_GEM_PATH)