AnnotatedGEM from pandas
This notebook describes how to create and save an AnnotatedGEM object from separate count and label text files.
A count matrix and an annotation table are often created as separate text files. The count matrix is usually formatted with genes as rows and samples as columns, because of the way counts are calculated. The annotation file must be indexed by the same sample names that appear as columns in the count file.
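As a minimal sketch of the layout described above, the toy frames below show a count matrix with genes as rows and samples as columns next to an annotation table indexed by the same sample names. The gene names, sample names and values are made up for illustration and are not taken from the demo data.

```python
import pandas as pd

# Hypothetical toy data: two genes measured in three samples.
toy_counts = pd.DataFrame(
    {"S1": [10, 0], "S2": [12, 3], "S3": [7, 1]},
    index=pd.Index(["geneA", "geneB"], name="Gene"),
)

# One row of annotations per sample, indexed by the sample name.
toy_labels = pd.DataFrame(
    {"treatment": ["control", "heat", "drought"]},
    index=pd.Index(["S1", "S2", "S3"], name="Sample"),
)

# The count columns and the annotation index must name the same samples.
samples_match = set(toy_counts.columns) == set(toy_labels.index)
print(samples_match)
```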
Downloading the demo data
A demo gene expression matrix and accompanying annotation text files are stored in a public OSF project. You can download them by:
- Navigating to the data repository on OSF and downloading them manually,
or
- Installing the OSF CLI utility and cloning the project to a directory:
osf -p rbhfz clone ~/GSForge_demo_data
The paths used in this example assume the second method was used.
Declaring used paths
# OS-independent path management.
from os import fspath, environ
from pathlib import Path
Declare the OSF project directory path.
OSF_PATH = Path(environ.get("GSFORGE_DEMO_DATA", default="~/GSForge_demo_data")).expanduser()
OSF_PATH
View the files within:
list(OSF_PATH.glob("**/*"))
Declare the paths to the count and label files.
COUNT_PATH = OSF_PATH.joinpath("osfstorage", "rice_heat_drought.GEM.raw.txt")
LABEL_PATH = OSF_PATH.joinpath("osfstorage", "srx_sample_annots.txt")
GFF3_PATH = OSF_PATH.joinpath("osfstorage", "all.gff3")
AGEM_PATH = OSF_PATH.joinpath("osfstorage", "rice.nc")
Ensure these files exist.
assert COUNT_PATH.exists()
assert LABEL_PATH.exists()
assert GFF3_PATH.exists()
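A bare assert does not say which file is absent. As a small sketch, the hypothetical helper `missing_paths` below (it is not part of GSForge) lists the paths that do not exist, so the download step can be repeated for just those files.

```python
from pathlib import Path


def missing_paths(*paths):
    """Return the given paths that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


# Hypothetical usage with a file name that is assumed not to exist.
absent = missing_paths(Path("definitely_not_a_real_file.txt"))
print(absent)
```

In this notebook the call would be `missing_paths(COUNT_PATH, LABEL_PATH, GFF3_PATH)`, and an empty list means all three files were found.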
Preparing the notebook
import pandas as pd
import GSForge as gsf
Loading data with pandas
Loading the count matrix
%%time
count_df = pd.read_csv(COUNT_PATH, sep="\t", index_col=0)
count_df.head()
Loading the annotation table
%%time
label_df = pd.read_csv(LABEL_PATH, index_col=0)
label_df.head()
Ensure sample indexes overlap
Check that the number of samples is the same in both files, and that their overlap has the same length.
assert len(count_df.columns) == len(label_df.index) == len(label_df.index.intersection(count_df.columns))
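When the assertion above fails, it helps to know which sample names are unmatched. The sketch below uses a hypothetical helper, `compare_sample_ids`, that accepts any two iterables of names (lists, or a pandas index) and reports the names unique to each side.

```python
def compare_sample_ids(count_samples, label_samples):
    """Return the sample names unique to each side, sorted for stable output."""
    count_set, label_set = set(count_samples), set(label_samples)
    return {
        "only_in_counts": sorted(count_set - label_set),
        "only_in_labels": sorted(label_set - count_set),
    }


# Toy names for illustration; with the notebook's frames the call would be
# compare_sample_ids(count_df.columns, label_df.index).
report = compare_sample_ids(["S1", "S2", "S3"], ["S2", "S3", "S4"])
print(report)
```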
Combine the dataframes into an AnnotatedGEM:
AnnotatedGEM.from_pandas does a bit of data wrangling and loads the data into a single xarray.Dataset.
agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name="Rice")
agem
Examine the data
agem.data
Add gene annotations
pd.read_csv(GFF3_PATH, sep="\t", comment="#",
            names=['seqname', 'source', 'feature', 'start', 'end',
                   'score', 'strand', 'frame', 'attribute']).head(2)
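def extract_gff3_gene_lengths(gff3_file):
    """A custom function to extract gene lengths."""
    df = pd.read_csv(gff3_file, sep="\t", comment="#",
                     names=['seqname', 'source', 'feature', 'start', 'end',
                            'score', 'strand', 'frame', 'attribute'])
    # expand=False returns a Series for the single capture group.
    gene_ids = df["attribute"].str.extract(r"ID=(\w+)", expand=False)
    # Keep only rows that carry an ID; copy so the new column is not set on a view.
    df = df[gene_ids.notna()].copy()
    df["Gene"] = gene_ids
    # Keep the first row seen for each extracted ID.
    df = df.drop_duplicates("Gene")
    df = df.set_index("Gene")
    # Span between start and end. GFF3 coordinates are 1-based and inclusive,
    # so add 1 if an exact base count is needed.
    return df["end"] - df["start"]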
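To see what the `ID=(\w+)` pattern captures, the sketch below applies it with the standard `re` module to three attribute strings. The strings are written in GFF3 style for illustration and are not copied from all.gff3. Because `\w` matches only letters, digits and underscores, the match stops at the first `.`, so an ID with a `.1` suffix yields the same value as its parent gene, and a row without an `ID=` field yields no match.

```python
import re

# Same pattern as in extract_gff3_gene_lengths.
ID_PATTERN = re.compile(r"ID=(\w+)")

# Illustrative attribute strings (assumed, not taken from the demo file).
gene_attr = "ID=LOC_Os01g01010;Name=LOC_Os01g01010"
mrna_attr = "ID=LOC_Os01g01010.1;Parent=LOC_Os01g01010"
no_id_attr = "Parent=LOC_Os01g01010.1"

gene_id = ID_PATTERN.search(gene_attr).group(1)
mrna_id = ID_PATTERN.search(mrna_attr).group(1)
no_id = ID_PATTERN.search(no_id_attr)  # None: there is no ID= field

print(gene_id, mrna_id, no_id)
```

If the file lists features in that way, dropping duplicates on the extracted ID keeps whichever row appears first, so the resulting length depends on the ordering of the file.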
Because gene_lengths is indexed by gene ID, which should match the gene index of our AnnotatedGEM, incorporating it takes a single assignment.
gene_lengths = extract_gff3_gene_lengths(GFF3_PATH)
agem.data["lengths"] = gene_lengths
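The assignment relies on the labels of gene_lengths matching the genes in the count matrix. One way to check this beforehand, sketched here with pandas alone and made-up gene names, is to reindex the lengths to the gene index of the counts and count the entries left empty.

```python
import pandas as pd

# Hypothetical stand-ins for the gene index of the count matrix and for
# the Series returned by extract_gff3_gene_lengths.
count_genes = pd.Index(["geneA", "geneB", "geneC"], name="Gene")
toy_lengths = pd.Series({"geneA": 1200, "geneC": 800, "geneX": 50})

# Reindex to the genes of the count matrix; genes without a length become NaN.
aligned = toy_lengths.reindex(count_genes)
n_missing = int(aligned.isna().sum())

print(aligned)
print(f"{n_missing} of {len(count_genes)} genes have no length")
```

With the notebook's objects the equivalent check would be `gene_lengths.reindex(count_df.index).isna().sum()`.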
Save the AnnotatedGEM
agem.save(AGEM_PATH)
Creating an AnnotatedGEM from files
If you are fortunate enough to have consistently formatted data (like the above example), you can load your data directly into an AnnotatedGEM.
If you do not provide a sep argument in the count_kwargs or label_kwargs dictionaries, GSForge will attempt to infer it by reading the first line of each file.
agem = gsf.AnnotatedGEM.from_files(
    count_path=COUNT_PATH,
    label_path=LABEL_PATH,
    # These are the default keyword arguments that from_files passes
    # to its individual calls to `pandas.read_csv`.
    count_kwargs=dict(index_col=0),
    label_kwargs=dict(index_col=0),
)
agem
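How the separator is inferred internally is not shown in this notebook. As a rough sketch of the general idea only, the hypothetical function `infer_separator` below guesses a delimiter by counting candidate characters in the first line of a file. It is an assumption for illustration and not the library's implementation.

```python
def infer_separator(first_line, candidates=("\t", ",", ";", " ")):
    """Guess a delimiter by counting candidate characters in one line."""
    counts = {sep: first_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    # Return None when no candidate appears, so the caller can decide.
    return best if counts[best] > 0 else None


# Toy header lines for illustration.
print(repr(infer_separator("gene\tS1\tS2\tS3")))
print(repr(infer_separator("Sample,treatment,genotype")))
```

If the guess is wrong for a given file, the separator can be set explicitly instead, for example `count_kwargs=dict(index_col=0, sep="\t")`.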