Creating an Annotated Expression Matrix
An AnnotatedGEM is one of the three core data structures provided by GSForge. It contains:
a gene expression matrix (GEM), with named coordinates for genes and samples.
sample annotations / descriptors.
gene annotations / descriptors.
More than one transform of the GEM can be stored, as long as each transform shares the same coordinates.
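To make the layout above concrete, here is a minimal sketch that uses plain xarray rather than GSForge itself. The sample names, gene names, and the `log2_counts` and `treatment` variables are made up for illustration, and the way GSForge arranges its data internally may differ. The sketch only shows how two transforms of one matrix and a sample annotation can share the same `Gene` and `Sample` coordinates.

```python
import numpy as np
import xarray as xr

# Toy data: 3 samples x 4 genes. All names here are hypothetical.
samples = ["s1", "s2", "s3"]
genes = ["g1", "g2", "g3", "g4"]
counts = np.array([[0, 5, 10, 3],
                   [2, 0, 7, 1],
                   [4, 8, 0, 6]])

ds = xr.Dataset(
    data_vars={
        # The raw matrix and a transform of it share both dimensions.
        "counts": (("Sample", "Gene"), counts),
        "log2_counts": (("Sample", "Gene"), np.log2(counts + 1)),
        # A sample annotation is indexed by the Sample coordinate only.
        "treatment": (("Sample",), ["control", "heat", "drought"]),
    },
    coords={"Sample": samples, "Gene": genes},
)

# Label-based selection works the same way on either transform.
raw_value = int(ds["counts"].sel(Sample="s1", Gene="g3"))
log_value = float(ds["log2_counts"].sel(Sample="s1", Gene="g3"))
```

Because both matrices are indexed by the same labels, selecting a subset of samples or genes on the dataset applies to every variable at once.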
Here we demonstrate the most common ways to load expression data into GSForge.
Notebook setup
from os import environ
from pathlib import Path
import numpy as np
import pandas as pd
import GSForge as gsf
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
COUNT_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.raw.txt")
LABEL_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "raw_annotation_data", "PRJNA301554.hydroponic.annotations.txt")
# Output path.
GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")
From Text Files
If your count and annotation files have matching sample indices you can create an AnnotatedGEM in a single step:
agem = gsf.AnnotatedGEM.from_files(
    count_path=COUNT_PATH,
    label_path=LABEL_PATH,
    # These are the default values used by from_files; each dict is
    # passed to the corresponding call to `pandas.read_csv`.
    count_kwargs=dict(index_col=0, sep="\t"),
    label_kwargs=dict(index_col=1, sep="\t"),
)
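Before relying on the single-step loader, it can help to confirm that the sample identifiers really do match. The following is a minimal sketch using pandas only, with hypothetical toy tables. It assumes the count table has genes as rows and samples as columns, and that the annotation table is indexed by sample; adjust it if your files are oriented differently.

```python
import pandas as pd

# Hypothetical count table: genes as rows, samples as columns.
count_df = pd.DataFrame(
    {"SRR001": [10, 0], "SRR002": [3, 7], "SRR003": [5, 1]},
    index=["gene_a", "gene_b"],
)
# Hypothetical annotation table: one row per sample.
label_df = pd.DataFrame(
    {"treatment": ["control", "heat", "heat"]},
    index=["SRR001", "SRR002", "SRR004"],
)

count_samples = set(count_df.columns)
label_samples = set(label_df.index)

# Samples with counts but no annotation, and the reverse.
missing_labels = sorted(count_samples - label_samples)
missing_counts = sorted(label_samples - count_samples)
# Samples present in both tables.
shared = sorted(count_samples & label_samples)

print("No annotation for:", missing_labels)
print("No counts for:", missing_counts)
print("Shared samples:", shared)
```

If either of the first two lists is non-empty, the sample names need wrangling before the two tables can be combined.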
It is not uncommon to have to wrangle sample or gene names to some degree.
Once that is done, you can supply the data as a pair of pandas.DataFrame objects or as a single xarray.Dataset object.
count_df = pd.read_csv(COUNT_PATH, sep="\t", index_col=0)
# Wrangle data here...
label_df = pd.read_csv(LABEL_PATH, index_col=1, sep="\t")
label_df['genotype'] = label_df['genotype'].str.split(" ", expand=True).iloc[:, 0]
label_df['time'] = label_df['time'].str.split(' ', expand=True).iloc[:, 0].astype(int)
# Perhaps even more wrangling...
# Then provide them to GSForge:
del agem
agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Oryza sativa")
if not GEM_PATH.exists():
    agem.save(GEM_PATH)
agem
<GSForge.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
Gene 55986
Sample 475
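The `str.split` wrangling used on the annotation columns above keeps only the first whitespace-separated token of each value. Here is a minimal sketch of the same pattern on a toy table. The values are hypothetical and chosen only to show the effect; the real annotation file may hold different strings.

```python
import pandas as pd

# Hypothetical annotation values with trailing text to strip.
label_df = pd.DataFrame(
    {
        "genotype": ["GenoA (tolerant)", "GenoB (sensitive)"],
        "time": ["15 minutes", "240 minutes"],
    },
    index=["sample_1", "sample_2"],
)

# Split on spaces, keep the first token of each value.
label_df["genotype"] = label_df["genotype"].str.split(" ", expand=True).iloc[:, 0]
# Same pattern, then convert the remaining token to an integer.
label_df["time"] = label_df["time"].str.split(" ", expand=True).iloc[:, 0].astype(int)

print(label_df)
```

Converting `time` to an integer lets it be treated as a numeric variable rather than as a set of text labels.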