---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.10.2
kernelspec:
  display_name: gsfenv
  language: python
  name: gsfenv
---

# Interface Guide

Set operations, intersections, unions and differences and combinations thereof answer many of the most basic questions.
Moreover, we often desires to examine subsets of the original GEM based a selection determined by some set operation.
The ``Interface`` provided by ``GSForge`` provides a uniform API access to both the ``AnnotatedGEM`` and the
``GeneSetCollection`` objects for retrieving count values and sample annotations.


***Notebook setup***

```{code-cell}
import numpy as np
import pandas as pd
import xarray as xr
import GSForge as gsf
import holoviews as hv
hv.extension('bokeh')

# OS-independent path management.
from os import fspath, environ
from pathlib import Path

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
TOUR_DGE = OSF_PATH.joinpath("GeneSetCollections", "DEG_gene_sets")
```

## Load Data

```{code-cell}
agem = gsf.AnnotatedGEM(GEM_PATH)
agem 
```

```{code-cell}
agem.data
```

## Selecting Data using the Interface

Select data through the interface via the `get_gem_data` function.

The simplest possible call to this function returns the zero filled count matrix within a two-item tuple.

```{code-cell}
counts, empty_labels = gsf.get_gem_data(agem)
```

```{code-cell}
counts
```

When `annotation_variables` are not provided, the second item is the `None` singleton.

```{code-cell}
empty_labels == None
```

This is for the sake of consistency in handling the number of expected output objects when `annotation_variables` 
are provided:

```{code-cell}
counts, label_ds = gsf.get_gem_data(agem, annotation_variables=['treatment'])
```

`counts` remain the same, but this time a corresponding label dataset is also returned.

```{code-cell}
label_ds
```

### Masking or Dropping Samples or Genes

Count values can be returned in one of three 'forms' with respect to NaN or zero values:
+ zero counts 'masked' as `NaN`.
+ zero counts 'dropped' gene-wise.
+ zero counts 'complete', or included within the matrix, this is the default selection.

This is done using the `count_mask` argument.

```{code-cell}
counts, _ = gsf.get_gem_data(agem, count_mask='complete')
counts.shape
```

```{code-cell}
counts, _ = gsf.get_gem_data(agem, count_mask='dropped')
counts.shape
```

For samples there are only two options:
+ Samples with a missing annotation are 'dropped'.
+ Use the 'complete' set of samples.

This only has an effect when used with `annotation_variables`.

```{code-cell}
counts, label_ds = gsf.get_gem_data(agem, annotation_mask='complete', annotation_variables=['treatment'])
```

### Transforming the Count Matrix

A transform can be applied when selecting data, such as a log transform. ``GSFoge`` allows users to supply a function 
to transform the subset of counts returned, this function only operates on the count values returned.

```{code-cell}
counts, label_ds = gsf.get_gem_data(agem, annotation_mask='complete', annotation_variables=['treatment'],
                                    count_transform=lambda c: np.log(c + 1.0))
```