01 Loading data into an AnnotatedGEM

This notebook describes how to create and save an AnnotatedGEM object from separate count and label text files.

Download the demo data

An expression matrix and accompanying annotation text files are available in a public OSF project. You can download them by:

  • Navigating to the data repository on OSF and downloading them manually, or

  • Installing the OSF CLI utility and cloning the project to a directory:

    Linux

    # Install the osfclient.
    pip install osfclient
    
    # To clone the entire osf project:
    osf -p rbhfz clone ~/GSForge_demo_data
    
    # To pull the minimum number of files to complete the examples:
    osf 
    

Set up the notebook

# OS-independent path management.
from os import environ
from pathlib import Path
import pandas as pd
import GSForge as gsf

Declare used paths

Declare the OSF project directory path. This is the root directory of the data files used in this notebook.

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data/")).expanduser().joinpath("osfstorage", "oryza_sativa")
RAW_COUNT_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "rice_heat_drought.GEM.raw.txt")
HYDRO_LABEL_PATH = OSF_PATH.joinpath("GEMmakerGEMs", "raw_annotation_data", "PRJNA301554.hydroponic.annotations.txt")

Ensure these files exist.

assert RAW_COUNT_PATH.exists()
assert HYDRO_LABEL_PATH.exists()
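
A bare assert does not say which path is missing. As a minimal sketch (the helpers `find_missing` and `require_files` are hypothetical and not part of GSForge), the check could report every missing file by name:

```python
from pathlib import Path


def find_missing(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


def require_files(paths):
    """Raise FileNotFoundError naming every missing path."""
    missing = find_missing(paths)
    if missing:
        listing = "\n".join(f"  {p}" for p in missing)
        raise FileNotFoundError(f"Missing input files:\n{listing}")
```

In this notebook it would be called as require_files([RAW_COUNT_PATH, HYDRO_LABEL_PATH]).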

Finally, declare a path to save the created .nc file.

GEM_PATH = OSF_PATH.joinpath("AnnotatedGEMs", "oryza_sativa_hisat2_raw.nc")

Loading data with pandas

Loading the count matrix

%%time
count_df = pd.read_csv(RAW_COUNT_PATH, sep="\t", index_col=0)
CPU times: user 2.18 s, sys: 245 ms, total: 2.42 s
Wall time: 2.45 s
print(count_df.shape)
count_df.head()
(55986, 475)
SRX1423934 SRX1423935 SRX1423936 SRX1423937 SRX1423938 SRX1423939 SRX1423940 SRX1423941 SRX1423942 SRX1423943 ... SRX1424399 SRX1424400 SRX1424401 SRX1424402 SRX1424403 SRX1424404 SRX1424405 SRX1424406 SRX1424407 SRX1424408
LOC_Os06g05820 20 2 22 11 23 39 24 34 33 20 ... 5 20 20 38 35 43 25 8 8 21
LOC_Os10g27460 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
LOC_Os02g35980 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
LOC_Os09g23260 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
LOC_Os01g41670 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 2 0 0 0 0 0 0

5 rows × 475 columns
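
To make the read_csv arguments concrete, here is a small sketch on a made-up matrix. The gene and sample names are hypothetical, and the empty first header cell is an assumption about the layout, so adjust it to match your own file:

```python
import io

import pandas as pd

# Hypothetical three-gene, two-sample matrix: a header row of sample IDs
# (with an empty first cell), then one tab-separated row per gene.
toy_text = (
    "\tsample_A\tsample_B\n"
    "gene_1\t20\t2\n"
    "gene_2\t0\t0\n"
    "gene_3\t5\t7\n"
)

# index_col=0 turns the first column (gene IDs) into the row index.
toy_counts = pd.read_csv(io.StringIO(toy_text), sep="\t", index_col=0)

# Rows are genes and columns are samples.
n_genes, n_samples = toy_counts.shape

# Cheap sanity checks that can also be run on the full matrix.
has_duplicate_genes = toy_counts.index.duplicated().any()
all_zero_genes = (toy_counts == 0).all(axis=1).sum()
```

The same two checks can be applied to count_df before building the AnnotatedGEM.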

Loading the annotation table

%%time
label_df = pd.read_csv(HYDRO_LABEL_PATH, index_col=1, sep="\t")
label_df['genotype'] = label_df['genotype'].str.split(" ", expand=True).iloc[:, 0]
label_df['time'] = label_df['time'].str.split(' ', expand=True).iloc[:, 0].astype(int)
CPU times: user 18.1 ms, sys: 0 ns, total: 18.1 ms
Wall time: 17.7 ms
label_df.head(2)
BioSample LoadDate MBases MBytes Run SRA_Sample Sample_Name genotype time treatment ... Instrument LibraryLayout LibrarySelection LibrarySource Organism Platform ReleaseDate SRA_Study source_name tissue
Experiment
SRX1423937 SAMN04251851 2015-11-20 1166 764 SRR2931043 SRS1156717 GSM1933349 Azuenca 30 CONTROL ... Illumina HiSeq 2000 PAIRED cDNA TRANSCRIPTOMIC Oryza sativa ILLUMINA 2016-01-04 SRP065945 Rice leaf leaf
SRX1423938 SAMN04251852 2015-11-20 4005 2500 SRR2931044 SRS1156720 GSM1933350 Azuenca 45 CONTROL ... Illumina HiSeq 2000 PAIRED cDNA TRANSCRIPTOMIC Oryza sativa ILLUMINA 2016-01-04 SRP065945 Rice leaf leaf

2 rows × 28 columns
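
The two cleaning lines above keep only the first space-separated token of each genotype and time entry, and cast time to an integer. A minimal sketch of that step on made-up values (the raw strings here are hypothetical; the real annotation file may look different):

```python
import pandas as pd

# Hypothetical raw label values, indexed by a made-up sample ID.
toy_labels = pd.DataFrame(
    {
        "genotype": ["GenoA (GA)", "GenoB (GB)"],
        "time": ["30 min", "45 min"],
    },
    index=pd.Index(["sample_A", "sample_B"], name="Experiment"),
)

# Split on spaces into one column per token, then keep the first token.
toy_labels["genotype"] = toy_labels["genotype"].str.split(" ", expand=True).iloc[:, 0]

# Same idea for time, followed by a cast from string to integer.
toy_labels["time"] = toy_labels["time"].str.split(" ", expand=True).iloc[:, 0].astype(int)
```

After this, genotype holds only the leading name and time is numeric, which makes both easier to group and sort by.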

Combine the dataframes into an AnnotatedGEM

AnnotatedGEM.from_pandas does a bit of data wrangling and loads the counts and labels into a single xarray.Dataset.

agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Oryza sativa")
agem
<GSForge.AnnotatedGEM>
Name: Oryza sativa
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475
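
This notebook does not describe how from_pandas treats samples that appear in only one of the two tables, so it may be worth comparing the sample IDs yourself first. A minimal sketch, assuming samples are the columns of the count table and the index of the label table (the helper `compare_sample_ids` and the toy data are hypothetical):

```python
import pandas as pd


def compare_sample_ids(count_df, label_df):
    """Report sample IDs shared by, or unique to, the two dataframes.

    Assumes samples are the columns of `count_df` and the index of `label_df`.
    """
    count_ids = pd.Index(count_df.columns)
    label_ids = pd.Index(label_df.index)
    return {
        "shared": count_ids.intersection(label_ids).tolist(),
        "counts_only": count_ids.difference(label_ids).tolist(),
        "labels_only": label_ids.difference(count_ids).tolist(),
    }


# Toy data with one unmatched sample on each side.
toy_counts = pd.DataFrame({"s1": [1, 0], "s2": [3, 4], "s3": [0, 0]}, index=["g1", "g2"])
toy_labels = pd.DataFrame({"treatment": ["CONTROL", "HEAT", "HEAT"]}, index=["s2", "s3", "s4"])

report = compare_sample_ids(toy_counts, toy_labels)
```

Calling compare_sample_ids(count_df, label_df) on the demo data should show whether every sample has both counts and labels.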

Examine the data

agem.data
<xarray.Dataset>
Dimensions:             (Sample: 475, Gene: 55986)
Coordinates:
  * Sample              (Sample) object 'SRX1423934' ... 'SRX1424408'
  * Gene                (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
Data variables: (12/29)
    BioSample           (Sample) object 'SAMN04251848' ... 'SAMN04251607'
    LoadDate            (Sample) object '2015-11-20' ... '2015-11-19'
    MBases              (Sample) int64 4016 5202 4053 1166 ... 3098 3529 2922
    MBytes              (Sample) int64 2738 3652 2719 764 ... 1983 2370 1862
    Run                 (Sample) object 'SRR2931040' ... 'SRR2931514'
    SRA_Sample          (Sample) object 'SRS1156722' ... 'SRS1156251'
    ...                  ...
    Platform            (Sample) object 'ILLUMINA' 'ILLUMINA' ... 'ILLUMINA'
    ReleaseDate         (Sample) object '2016-01-04' ... '2016-01-04'
    SRA_Study           (Sample) object 'SRP065945' 'SRP065945' ... 'SRP065945'
    source_name         (Sample) object 'Rice leaf' 'Rice leaf' ... 'Rice leaf'
    tissue              (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf'
    counts              (Sample, Gene) int64 20 0 0 0 0 0 ... 0 52 335 0 666 0

Save the AnnotatedGEM

if not GEM_PATH.exists():
    agem.save(GEM_PATH)
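
Saving may fail if the target directory does not exist yet, for example when the data were downloaded manually rather than cloned. A small sketch of a guard using only pathlib (the helper `ensure_parent_dir` is hypothetical, not a GSForge function):

```python
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return `path` as a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

It could be used as agem.save(ensure_parent_dir(GEM_PATH)).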

Creating an AnnotatedGEM from files

If your data is consistently formatted (as in the example above), you can load it directly from the files into an AnnotatedGEM.

If you do not provide a sep argument in the count_kwargs or label_kwargs dictionaries, GSForge will attempt to infer it by reading the first line of each file.

del agem

agem = gsf.AnnotatedGEM.from_files(
    count_path=RAW_COUNT_PATH,
    label_path=HYDRO_LABEL_PATH,
    # These are the default keyword arguments that from_files
    # passes to its individual `pandas.read_csv` calls.
    count_kwargs=dict(index_col=0, sep="\t"),
    label_kwargs=dict(index_col=1, sep="\t"),
)
agem
<GSForge.AnnotatedGEM>
Name: AnnotatedGEM00194
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475
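
The separator inference mentioned above works from the first line of each file. GSForge's actual logic is not shown in this notebook; the following is only an illustrative sketch of the general idea, using hypothetical helpers that pick whichever candidate separator occurs most often in the first line:

```python
def infer_sep(first_line, candidates=("\t", ",", ";", " ")):
    """Return the candidate separator occurring most often in `first_line`.

    Returns None when no candidate occurs. Ties go to the earlier candidate.
    """
    counts = {sep: first_line.count(sep) for sep in candidates}
    best = max(candidates, key=counts.get)
    return best if counts[best] > 0 else None


def infer_sep_from_file(path, **kwargs):
    """Read only the first line of `path` and infer its separator."""
    with open(path, "r") as handle:
        return infer_sep(handle.readline().rstrip("\n"), **kwargs)
```

A guess like this can be wrong when a header contains several candidate characters, which is one reason to pass sep explicitly in count_kwargs and label_kwargs when you know it.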