AnnotatedGEM from pandas

This notebook describes how to create and save an AnnotatedGEM object from separate count and label text files.

A count matrix and an annotation table are often created as separate text files. The count matrix is often formatted with samples as columns and genes as rows due to the way counts are calculated. An annotation file must have a matching 'sample' index to the count file.

Downloading the demo data

A demo gene expression matrix and accompanying annotation text files are stored in a public OSF project. You can download them by:

or

  • Installing the OSF CLI utility and clone to a directory:
    osf -p rbhfz clone ~/GSForge_demo_data
    

The paths used in this example assume the second method was used.

Declaring used paths

In [1]:
# OS-independent path management.
from os import fspath, environ
from pathlib import Path

Declare the OSF project directory path.

In [2]:
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
OSF_PATH
Out[2]:
PosixPath('/home/travis/GSForge_demo_data')

View the files within:

In [3]:
list(OSF_PATH.glob("**/*"))
Out[3]:
[PosixPath('/home/travis/GSForge_demo_data/osfstorage'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/rice.nc'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/srx_sample_annots.txt'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/rice_heat_drought.GEM.raw.txt'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/boruta_gene_sets'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/DEG_gene_sets'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/all.gff3'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Treatment.nc'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Genotype.nc'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/boruta_gene_sets/Boruta_Subspecies.nc'),
 PosixPath('/home/travis/GSForge_demo_data/osfstorage/DEG_gene_sets/deseq2_treatment.nc')]

Declare the paths to the count and label files.

In [4]:
COUNT_PATH = OSF_PATH.joinpath("osfstorage", "rice_heat_drought.GEM.raw.txt")
LABEL_PATH = OSF_PATH.joinpath("osfstorage", "srx_sample_annots.txt")
GFF3_PATH = OSF_PATH.joinpath("osfstorage", "all.gff3")
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")

Ensure these files exsist.

In [5]:
assert COUNT_PATH.exists()
assert LABEL_PATH.exists()
assert GFF3_PATH.exists()

Preparing the notebook

In [6]:
import pandas as pd
import GSForge as gsf

Loading data with pandas

Loading the count matrix

In [7]:
%%time
count_df = pd.read_csv(COUNT_PATH, sep="\t", index_col=0)
CPU times: user 1.5 s, sys: 324 ms, total: 1.82 s
Wall time: 1.82 s
In [8]:
count_df.head()
Out[8]:
SRX1423934 SRX1423935 SRX1423936 SRX1423937 SRX1423938 SRX1423939 SRX1423940 SRX1423941 SRX1423942 SRX1423943 ... SRX1424399 SRX1424400 SRX1424401 SRX1424402 SRX1424403 SRX1424404 SRX1424405 SRX1424406 SRX1424407 SRX1424408
LOC_Os06g05820 20 2 22 11 23 39 24 34 33 20 ... 5 20 20 38 35 43 25 8 8 21
LOC_Os10g27460 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
LOC_Os02g35980 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
LOC_Os09g23260 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
LOC_Os01g41670 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 2 0 0 0 0 0 0

5 rows × 475 columns

Loading the annotation table

In [9]:
%%time
label_df = pd.read_csv(LABEL_PATH, index_col=0)
CPU times: user 4.01 ms, sys: 0 ns, total: 4.01 ms
Wall time: 3.61 ms
In [10]:
label_df.head()
Out[10]:
SampleSRR Treatment Time Tissue Genotype Subspecies
Sample
SRX1423934 SRR2931040 CONTROL 15 leaf AZ Japonica
SRX1423935 SRR2931041 CONTROL 15 leaf AZ Japonica
SRX1423936 SRR2931042 CONTROL 30 leaf AZ Japonica
SRX1423937 SRR2931043 CONTROL 30 leaf AZ Japonica
SRX1423938 SRR2931044 CONTROL 45 leaf AZ Japonica

Ensure sample indexes overlap

Check that the number of samples is the same in both files, and that their overlap is that same length.

In [11]:
assert len(count_df.columns) == len(label_df.index) == len(label_df.index.intersection(count_df.columns))

Combine the dataframes into an AnnotatedGEM:

AnnotatedGEM.from_pandas does a bit of data wrangling, and loads the data into a single xarray.Dataset.

In [12]:
agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Rice")
agem
Out[12]:
<GSForge.AnnotatedGEM>
Name: Rice
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475

Examine the data

In [13]:
agem.data
Out[13]:
<xarray.Dataset>
Dimensions:     (Gene: 55986, Sample: 475)
Coordinates:
  * Sample      (Sample) object 'SRX1423934' 'SRX1423935' ... 'SRX1424408'
  * Gene        (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'
Data variables:
    SampleSRR   (Sample) object 'SRR2931040' 'SRR2931041' ... 'SRR2931514'
    Treatment   (Sample) object 'CONTROL' 'CONTROL' ... 'RECOV_DROUGHT'
    Time        (Sample) int64 15 15 30 30 45 45 60 ... 240 240 270 270 300 300
    Tissue      (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf' 'leaf'
    Genotype    (Sample) object 'AZ' 'AZ' 'AZ' 'AZ' 'AZ' ... 'TD' 'TD' 'TD' 'TD'
    Subspecies  (Sample) object 'Japonica' 'Japonica' ... 'Indica' 'Indica'
    counts      (Sample, Gene) int64 20 0 0 0 0 0 200 ... 19 0 52 335 0 666 0

Add gene annotations

In [14]:
pd.read_csv(GFF3_PATH, sep="\t", comment="#",
             names=['seqname', 'source', 'feature', 'start', 'end',
                    'score', 'strand', 'frame', 'attribute']).head(2)
Out[14]:
seqname source feature start end score strand frame attribute
0 Chr1 MSU_osa1r7 gene 2903 10817 . + . ID=LOC_Os01g01010;Name=LOC_Os01g01010;Note=TBC...
1 Chr1 MSU_osa1r7 mRNA 2903 10817 . + . ID=LOC_Os01g01010.1;Name=LOC_Os01g01010.1;Pare...
In [15]:
def extract_gff3_gene_lengths(gff3_file):
    """A custom function to extract gene lengths."""
    df = pd.read_csv(gff3_file, sep="\t", comment="#",
                     names=['seqname', 'source', 'feature', 'start', 'end',
                            'score', 'strand', 'frame', 'attribute'])
    gene_ids = df["attribute"].str.extract(r"ID=(\w+)")
    df = df[gene_ids.notna().values]
    df['Gene'] = gene_ids
    df = df.drop_duplicates("Gene")
    df = df.set_index("Gene")
    return df["end"] - df["start"]

Because gene_lengths is already (hopefully) indexed correctly, it is trivial to incorporate into our AnnotatedGEM.

In [16]:
gene_lengths = extract_gff3_gene_lengths(GFF3_PATH)
In [17]:
agem.data["lengths"] = gene_lengths

Save the AnnotatedGEM

In [18]:
agem.save(AGEM_PATH)
Out[18]:
PosixPath('/home/travis/GSForge_demo_data/osfstorage/rice.nc')

Creating an AnnotatedGEM from files

If you are fortunate enough to have consistenly formatted data (like the above example) you can directly load your data into an AnnotatedGEM.

If you do not provide a sep argument in the count_kwargs or label_kwargs dictionaries, GEMprospector will attempt to infer it by reading the first line of each file.

In [19]:
agem = gsf.AnnotatedGEM.from_files(
    count_path=COUNT_PATH,
    label_path=LABEL_PATH,
    # These are the default arguments passed to from_files,
    # to the individual calls to `pandas.read_csv`.
    count_kwargs=dict(index_col=0),
    label_kwargs=dict(index_col=0),
)
agem
Out[19]:
<GSForge.AnnotatedGEM>
Name: AnnotatedGEM01384
Selected GEM Variable: 'counts'
    Gene   55986
    Sample 475


Right click to download this notebook from GitHub.