AnnotatedGEM from pandas¶

This notebook describes how to create and save an AnnotatedGEM object from separate count and label text files.

A count matrix and an annotation table are often created as separate text files. The count matrix is often formatted with samples as columns and genes as rows due to the way counts are calculated. An annotation file must have a matching 'sample' index to the count file.

Downloading the demo data

A demo gene expression matrix and accompanying annotation text files are stored in a public OSF project. You can download them by:

Navigating to the data repository on osf and manually download them.

Installing the OSF CLI utility and clone to a directory:
```
osf -p rbhfz clone ~/GSForge_demo_data
```

The paths used in this example assume the second method was used.

Declaring used paths

In [1]:

# OS-independent path management.
from os import fspath, environ
from pathlib import Path

Declare the OSF project directory path.

In [2]:

OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
OSF_PATH

Out[2]:

PosixPath('/home/travis/GSForge_demo_data')

View the files within:

In [3]:

list(OSF_PATH.glob("**/*"))

Out[3]:

[PosixPath('/home/travis/GSForge_demo_data/osfstorage'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/rice.nc'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/srx_sample_annots.txt'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/rice_heat_drought.GEM.raw.txt'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/boruta_gene_sets'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/DEG_gene_sets'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/all.gff3'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Treatment.nc'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Genotype.nc'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Subspecies.nc'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/DEG_gene_sets/deseq2_treatment.nc')]

Declare the paths to the count and label files.

In [4]:

COUNT_PATH = OSF_PATH.joinpath("osfstorage", "rice_heat_drought.GEM.raw.txt")
LABEL_PATH = OSF_PATH.joinpath("osfstorage", "srx_sample_annots.txt")
GFF3_PATH = OSF_PATH.joinpath("osfstorage", "all.gff3")
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")

Ensure these files exsist.

In [5]:

assert COUNT_PATH.exists()
assert LABEL_PATH.exists()
assert GFF3_PATH.exists()

Preparing the notebook

In [6]:

import pandas as pd
import GSForge as gsf

Loading data with `pandas`¶

Loading the count matrix

In [7]:

%%time
count_df = pd.read_csv(COUNT_PATH, sep="\t", index_col=0)

CPU times: user 1.5 s, sys: 324 ms, total: 1.82 s
Wall time: 1.82 s

In [8]:

count_df.head()

Out[8]:

	SRX1423934	SRX1423935	SRX1423936	SRX1423937	SRX1423938	SRX1423939	SRX1423940	SRX1423941	SRX1423942	SRX1423943	...	SRX1424399	SRX1424400	SRX1424401	SRX1424402	SRX1424403	SRX1424404	SRX1424405	SRX1424406	SRX1424407	SRX1424408
LOC_Os06g05820	20	2	22	11	23	39	24	34	33	20	...	5	20	20	38	35	43	25	8	8	21
LOC_Os10g27460	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
LOC_Os02g35980	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
LOC_Os09g23260	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
LOC_Os01g41670	0	0	0	0	0	0	0	0	0	0	...	0	0	0	2	0	0	0	0	0	0

5 rows × 475 columns

Loading the annotation table

In [9]:

%%time
label_df = pd.read_csv(LABEL_PATH, index_col=0)

CPU times: user 4.01 ms, sys: 0 ns, total: 4.01 ms
Wall time: 3.61 ms

In [10]:

label_df.head()

Out[10]:

	SampleSRR	Treatment	Time	Tissue	Genotype	Subspecies
Sample
SRX1423934	SRR2931040	CONTROL	15	leaf	AZ	Japonica
SRX1423935	SRR2931041	CONTROL	15	leaf	AZ	Japonica
SRX1423936	SRR2931042	CONTROL	30	leaf	AZ	Japonica
SRX1423937	SRR2931043	CONTROL	30	leaf	AZ	Japonica
SRX1423938	SRR2931044	CONTROL	45	leaf	AZ	Japonica

Ensure sample indexes overlap

Check that the number of samples is the same in both files, and that their overlap is that same length.

In [11]:

assert len(count_df.columns) == len(label_df.index) == len(label_df.index.intersection(count_df.columns))

Combine the dataframes into an AnnotatedGEM:¶

AnnotatedGEM.from_pandas does a bit of data wrangling, and loads the data into a single xarray.Dataset.

In [12]:

agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Rice")
agem

Out[12]:

<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Examine the data

In [13]:

agem.data

Out[13]:

<xarray.Dataset>
Dimensions:     (Gene: 55986, Sample: 475)
Coordinates:
  * Sample      (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
  * Gene        (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
Data variables:
    SampleSRR   (Sample) object 'SRR2931040' 'SRR2931041' ... 'SRR2931514'
    Treatment   (Sample) object 'CONTROL' 'CONTROL' ... 'RECOV_DROUGHT'
    Time        (Sample) int64 15 15 30 30 45 45 60 ... 240 240 270 270 300 300
    Tissue      (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf' 'leaf'
    Genotype    (Sample) object 'AZ' 'AZ' 'AZ' 'AZ' 'AZ' ... 'TD' 'TD' 'TD' 'TD'
    Subspecies  (Sample) object 'Japonica' 'Japonica' ... 'Indica' 'Indica'
    counts      (Sample, Gene) int64 20 0 0 0 0 0 200 ... 19 0 52 335 0 666 0

Add gene annotations

In [14]:

pd.read_csv(GFF3_PATH, sep="\t", comment="#",
             names=['seqname', 'source', 'feature', 'start', 'end',
                    'score', 'strand', 'frame', 'attribute']).head(2)

Out[14]:

	seqname	source	feature	start	end	score	strand	frame	attribute
0	Chr1	MSU_osa1r7	gene	2903	10817	.	+	.	ID=LOC_Os01g01010;Name=LOC_Os01g01010;Note=TBC...
1	Chr1	MSU_osa1r7	mRNA	2903	10817	.	+	.	ID=LOC_Os01g01010.1;Name=LOC_Os01g01010.1;Pare...

In [15]:

def extract_gff3_gene_lengths(gff3_file):
    """A custom function to extract gene lengths."""
    df = pd.read_csv(gff3_file, sep="\t", comment="#",
                     names=['seqname', 'source', 'feature', 'start', 'end',
                            'score', 'strand', 'frame', 'attribute'])
    gene_ids = df["attribute"].str.extract(r"ID=(\w+)")
    df = df[gene_ids.notna().values]
    df['Gene'] = gene_ids
    df = df.drop_duplicates("Gene")
    df = df.set_index("Gene")
    return df["end"] - df["start"]

Because gene_lengths is already (hopefully) indexed correctly, it is trivial to incorporate into our AnnotatedGEM.

In [16]:

gene_lengths = extract_gff3_gene_lengths(GFF3_PATH)

In [17]:

agem.data["lengths"] = gene_lengths

Save the `AnnotatedGEM`¶

In [18]:

agem.save(AGEM_PATH)

Out[18]:

PosixPath('/home/travis/GSForge_demo_data/osfstorage/rice.nc')

Creating an AnnotatedGEM from files¶

If you are fortunate enough to have consistenly formatted data (like the above example) you can directly load your data into an AnnotatedGEM.

If you do not provide a sep argument in the count_kwargs or label_kwargs dictionaries, GEMprospector will attempt to infer it by reading the first line of each file.

In [19]:

agem = gsf.AnnotatedGEM.from_files(
    count_path=COUNT_PATH,
    label_path=LABEL_PATH,
    # These are the default arguments passed to from_files,
    # to the individual calls to `pandas.read_csv`.
    count_kwargs=dict(index_col=0),
    label_kwargs=dict(index_col=0),
)
agem

Out[19]:

<GSForge.AnnotatedGEM>
Name: AnnotatedGEM01384
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Right click to download this notebook from GitHub.

AnnotatedGEM from pandas¶

Loading data with pandas¶

Combine the dataframes into an AnnotatedGEM:¶

Save the AnnotatedGEM¶

Creating an AnnotatedGEM from files¶

Loading data with `pandas`¶

Save the `AnnotatedGEM`¶