Interface Guide¶
Set operations, intersections, unions and differences and combinations thereof answer many of the most basic questions.
Moreover, we often desires to examine subsets of the original GEM based a selection determined by some set operation.
The Interface
provided by GSForge
provides a uniform API access to both the AnnotatedGEM
and the
GeneSetCollection
objects for retrieving count values and sample annotations.
Notebook setup
import numpy as np
import pandas as pd
import xarray as xr
import GSForge as gsf
import holoviews as hv
hv.extension('bokeh')
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
TOUR_DGE = OSF_PATH.joinpath("GeneSetCollections", "DEG_gene_sets")
Load Data¶
agem = gsf.AnnotatedGEM(GEM_PATH)
agem
<GSForge.AnnotatedGEM>
Name: Oryza Sativa
Selected GEM Variable: 'counts'
Gene 66338
Sample 475
agem.data
<xarray.Dataset> Dimensions: (Sample: 475, Gene: 66338) Coordinates: * Sample (Sample) object 'SRX1423934' ... 'SRX1424408' * Gene (Gene) object 'LOC_Os01g01010.1' ... 'ChrSy.fgenesh.m... Data variables: (12/29) BioSample (Sample) object 'SAMN04251848' ... 'SAMN04251607' LoadDate (Sample) object '2015-11-20' ... '2015-11-19' MBases (Sample) int64 4016 5202 4053 1166 ... 3098 3529 2922 MBytes (Sample) int64 2738 3652 2719 764 ... 1983 2370 1862 Run (Sample) object 'SRR2931040' ... 'SRR2931514' SRA_Sample (Sample) object 'SRS1156722' ... 'SRS1156251' ... ... Platform (Sample) object 'ILLUMINA' 'ILLUMINA' ... 'ILLUMINA' ReleaseDate (Sample) object '2016-01-04' ... '2016-01-04' SRA_Study (Sample) object 'SRP065945' 'SRP065945' ... 'SRP065945' source_name (Sample) object 'Rice leaf' 'Rice leaf' ... 'Rice leaf' tissue (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf' counts (Sample, Gene) float64 ... Attributes: __GSForge.AnnotatedGEM.params: {"count_array_name": "counts", "gene_inde...
- Sample: 475
- Gene: 66338
- Sample(Sample)object'SRX1423934' ... 'SRX1424408'
array(['SRX1423934', 'SRX1423935', 'SRX1423936', ..., 'SRX1424406', 'SRX1424407', 'SRX1424408'], dtype=object)
- Gene(Gene)object'LOC_Os01g01010.1' ... 'ChrSy.fg...
array(['LOC_Os01g01010.1', 'LOC_Os01g01010.2', 'LOC_Os01g01019.1', ..., 'ChrSy.fgenesh.mRNA.87', 'ChrSy.fgenesh.mRNA.88', 'ChrSy.fgenesh.mRNA.89'], dtype=object)
- BioSample(Sample)object...
array(['SAMN04251848', 'SAMN04251849', 'SAMN04251850', ..., 'SAMN04251605', 'SAMN04251606', 'SAMN04251607'], dtype=object)
- LoadDate(Sample)object...
array(['2015-11-20', '2015-11-20', '2015-11-20', ..., '2015-11-19', '2015-11-19', '2015-11-19'], dtype=object)
- MBases(Sample)int64...
array([4016, 5202, 4053, ..., 3098, 3529, 2922])
- MBytes(Sample)int64...
array([2738, 3652, 2719, ..., 1983, 2370, 1862])
- Run(Sample)object...
array(['SRR2931040', 'SRR2931041', 'SRR2931042', ..., 'SRR2931512', 'SRR2931513', 'SRR2931514'], dtype=object)
- SRA_Sample(Sample)object...
array(['SRS1156722', 'SRS1156721', 'SRS1156718', ..., 'SRS1156254', 'SRS1156252', 'SRS1156251'], dtype=object)
- Sample_Name(Sample)object...
array(['GSM1933346', 'GSM1933347', 'GSM1933348', ..., 'GSM1933818', 'GSM1933819', 'GSM1933820'], dtype=object)
- genotype(Sample)object...
array(['Azuenca', 'Azuenca', 'Azuenca', ..., 'Tadukan', 'Tadukan', 'Tadukan'], dtype=object)
- time(Sample)int64...
array([ 15, 15, 30, ..., 270, 300, 300])
- treatment(Sample)object...
array(['CONTROL', 'CONTROL', 'CONTROL', ..., 'RECOV_DROUGHT', 'RECOV_DROUGHT', 'RECOV_DROUGHT'], dtype=object)
- Assay_Type(Sample)object...
array(['RNA-Seq', 'RNA-Seq', 'RNA-Seq', ..., 'RNA-Seq', 'RNA-Seq', 'RNA-Seq'], dtype=object)
- AvgSpotLen(Sample)int64...
array([102, 102, 102, ..., 102, 102, 102])
- BioProject(Sample)object...
array(['PRJNA301554', 'PRJNA301554', 'PRJNA301554', ..., 'PRJNA301554', 'PRJNA301554', 'PRJNA301554'], dtype=object)
- Center_Name(Sample)object...
array(['GEO', 'GEO', 'GEO', ..., 'GEO', 'GEO', 'GEO'], dtype=object)
- Consent(Sample)object...
array(['public', 'public', 'public', ..., 'public', 'public', 'public'], dtype=object)
- DATASTORE_filetype(Sample)object...
array(['sra', 'sra', 'sra', ..., 'sra', 'sra', 'sra'], dtype=object)
- DATASTORE_provider(Sample)object...
array(['ncbi', 'ncbi', 'ncbi', ..., 'ncbi', 'ncbi', 'ncbi'], dtype=object)
- InsertSize(Sample)int64...
array([0, 0, 0, ..., 0, 0, 0])
- Instrument(Sample)object...
array(['Illumina HiSeq 2000', 'Illumina HiSeq 2000', 'Illumina HiSeq 2000', ..., 'Illumina HiSeq 2000', 'Illumina HiSeq 2000', 'Illumina HiSeq 2000'], dtype=object)
- LibraryLayout(Sample)object...
array(['PAIRED', 'PAIRED', 'PAIRED', ..., 'PAIRED', 'PAIRED', 'PAIRED'], dtype=object)
- LibrarySelection(Sample)object...
array(['cDNA', 'cDNA', 'cDNA', ..., 'cDNA', 'cDNA', 'cDNA'], dtype=object)
- LibrarySource(Sample)object...
array(['TRANSCRIPTOMIC', 'TRANSCRIPTOMIC', 'TRANSCRIPTOMIC', ..., 'TRANSCRIPTOMIC', 'TRANSCRIPTOMIC', 'TRANSCRIPTOMIC'], dtype=object)
- Organism(Sample)object...
array(['Oryza sativa', 'Oryza sativa', 'Oryza sativa', ..., 'Oryza sativa', 'Oryza sativa', 'Oryza sativa'], dtype=object)
- Platform(Sample)object...
array(['ILLUMINA', 'ILLUMINA', 'ILLUMINA', ..., 'ILLUMINA', 'ILLUMINA', 'ILLUMINA'], dtype=object)
- ReleaseDate(Sample)object...
array(['2016-01-04', '2016-01-04', '2016-01-04', ..., '2016-01-04', '2016-01-04', '2016-01-04'], dtype=object)
- SRA_Study(Sample)object...
array(['SRP065945', 'SRP065945', 'SRP065945', ..., 'SRP065945', 'SRP065945', 'SRP065945'], dtype=object)
- source_name(Sample)object...
array(['Rice leaf', 'Rice leaf', 'Rice leaf', ..., 'Rice leaf', 'Rice leaf', 'Rice leaf'], dtype=object)
- tissue(Sample)object...
array(['leaf', 'leaf', 'leaf', ..., 'leaf', 'leaf', 'leaf'], dtype=object)
- counts(Sample, Gene)float64...
[31510550 values with dtype=float64]
- __GSForge.AnnotatedGEM.params :
- {"count_array_name": "counts", "gene_index_name": "Gene", "name": "Oryza Sativa", "sample_index_name": "Sample"}
Selecting Data using the Interface¶
Select data through the interface via the get_gem_data
function.
The simplest possible call to this function returns the zero filled count matrix within a two-item tuple.
counts, empty_labels = gsf.get_gem_data(agem)
counts
<xarray.DataArray 'counts' (Sample: 475, Gene: 66338)> array([[ 1. , 0. , 0. , ..., 106.37 , 730.919 , 881. ], [ 0. , 0. , 0. , ..., 45.0493 , 198.64 , 261. ], [ 0. , 0. , 0. , ..., 201.085 , 826.987 , 873. ], ..., [ 0. , 4.5 , 0. , ..., 91.2347 , 625.153 , 698. ], [ 1.26659, 8.5 , 0. , ..., 106.927 , 569.145 , 720. ], [ 2. , 5.5 , 0. , ..., 146.451 , 740.039 , 570. ]]) Coordinates: * Sample (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408' * Gene (Gene) object 'ChrSy.fgenesh.mRNA.1' ... 'LOC_Os12g44390.1'
- Sample: 475
- Gene: 66338
- 1.0 0.0 0.0 0.0 0.164 2.0 0.0 ... 64.12 288.0 414.5 146.5 740.0 570.0
array([[ 1. , 0. , 0. , ..., 106.37 , 730.919 , 881. ], [ 0. , 0. , 0. , ..., 45.0493 , 198.64 , 261. ], [ 0. , 0. , 0. , ..., 201.085 , 826.987 , 873. ], ..., [ 0. , 4.5 , 0. , ..., 91.2347 , 625.153 , 698. ], [ 1.26659, 8.5 , 0. , ..., 106.927 , 569.145 , 720. ], [ 2. , 5.5 , 0. , ..., 146.451 , 740.039 , 570. ]])
- Sample(Sample)object'SRX1423934' ... 'SRX1424408'
array(['SRX1423934', 'SRX1423935', 'SRX1423936', ..., 'SRX1424406', 'SRX1424407', 'SRX1424408'], dtype=object)
- Gene(Gene)object'ChrSy.fgenesh.mRNA.1' ... 'LOC_...
array(['ChrSy.fgenesh.mRNA.1', 'ChrSy.fgenesh.mRNA.10', 'ChrSy.fgenesh.mRNA.11', ..., 'LOC_Os12g44380.2', 'LOC_Os12g44380.3', 'LOC_Os12g44390.1'], dtype=object)
When annotation_variables
are not provided, the second item is the None
singleton.
empty_labels == None
True
This is for the sake of consistency in handling the number of expected output objects when annotation_variables
are provided:
counts, label_ds = gsf.get_gem_data(agem, annotation_variables=['treatment'])
counts
remain the same, but this time a corresponding label dataset is also returned.
label_ds
<xarray.DataArray 'treatment' (Sample: 475)> array(['CONTROL', 'CONTROL', 'CONTROL', ..., 'RECOV_DROUGHT', 'RECOV_DROUGHT', 'RECOV_DROUGHT'], dtype=object) Coordinates: * Sample (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
- Sample: 475
- 'CONTROL' 'CONTROL' 'CONTROL' ... 'RECOV_DROUGHT' 'RECOV_DROUGHT'
array(['CONTROL', 'CONTROL', 'CONTROL', ..., 'RECOV_DROUGHT', 'RECOV_DROUGHT', 'RECOV_DROUGHT'], dtype=object)
- Sample(Sample)object'SRX1423934' ... 'SRX1424408'
array(['SRX1423934', 'SRX1423935', 'SRX1423936', ..., 'SRX1424406', 'SRX1424407', 'SRX1424408'], dtype=object)
Masking or Dropping Samples or Genes¶
Count values can be returned in one of three ‘forms’ with respect to NaN or zero values:
zero counts ‘masked’ as
NaN
.zero counts ‘dropped’ gene-wise.
zero counts ‘complete’, or included within the matrix, this is the default selection.
This is done using the count_mask
argument.
counts, _ = gsf.get_gem_data(agem, count_mask='complete')
counts.shape
(475, 66338)
counts, _ = gsf.get_gem_data(agem, count_mask='dropped')
counts.shape
(475, 13992)
For samples there are only two options:
Samples with a missing annotation are ‘dropped’.
Use the ‘complete’ set of samples.
This only has an effect when used with annotation_variables
.
counts, label_ds = gsf.get_gem_data(agem, annotation_mask='complete', annotation_variables=['treatment'])
Transforming the Count Matrix¶
A transform can be applied when selecting data, such as a log transform. GSFoge
allows users to supply a function
to transform the subset of counts returned, this function only operates on the count values returned.
counts, label_ds = gsf.get_gem_data(agem, annotation_mask='complete', annotation_variables=['treatment'],
count_transform=lambda c: np.log(c + 1.0))