--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.10.2 kernelspec: display_name: gsfenv language: python name: gsfenv --- # Creating an Annotated Expression Matrix An ``AnnotatedGEM`` is one of the three core data structures provided by ``GSForge``, it contains: + a gene expression matrices (GEM), with named coordinates for genes and samples. + sample annotations / descriptors. + gene annotation / descriptors. More than one transform of the GEM can be stored, so long as it shares the same coordinates. Here we demonstrate the most common ways to load expression data into `GSForge`. ***Notebook setup*** ```{code-cell} ipython3 from os import environ from pathlib import Path import numpy as np import pandas as pd import GSForge as gsf OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa") COUNT_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.raw.txt") LABEL_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "raw_annotation_data", "PRJNA301554.hydroponic.annotations.txt") # Output path. GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc") ``` ## From Text Files If your count and annotation files have matching sample indices you can create an AnnotatedGEM in a single step: ```{code-cell} ipython3 agem = gsf.AnnotatedGEM.from_files( count_path=COUNT_PATH, label_path=LABEL_PATH, # These are the default arguments passed to from_files, # to the individual calls to `pandas.read_csv`. count_kwargs=dict(index_col=0, sep="\t"), label_kwargs=dict(index_col=1, sep="\t"), ) ``` It is not uncommon to have to wrangle sample or gene names to some degree. Once complete you may supply the data as a pair of `pandas.Dataframe` or a single `xarray.DataSet` object. ```{code-cell} ipython3 count_df = pd.read_csv(COUNT_PATH, sep="\t", index_col=0) # Wrangle data here... label_df = pd.read_csv(LABEL_PATH, index_col=1, sep="\t") label_df['genotype'] = label_df['genotype'].str.split(" ", expand=True).iloc[:, 0] label_df['time'] = label_df['time'].str.split(' ', expand=True).iloc[:, 0].astype(int) # Perhaps even more wrangling... # Then provide them to GSForge: del agem agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Oryza sativa") if not GEM_PATH.exists(): agem.save(GEM_PATH) agem ```