{ "cells": [ { "cell_type": "markdown", "id": "08631867", "metadata": {}, "source": [ "# 01 Loading data into an AnnotatedGEM\n", "\n", "This notebook describes how to create and save an `AnnotatedGEM` object from separate count and label text files.\n", "\n", "***Download the demo data***\n", "\n", "An expression matrix and accompanying annotation text files are available in a public [OSF](https://osf.io) project.\n", "You can download them by:\n", "+ Navigating to the [data repository on OSF](https://osf.io/t3xpw/) and downloading them manually,\n", "or\n", "+ Installing the [OSF CLI utility](https://osfclient.readthedocs.io/en/latest/index.html) and cloning the project to a directory:\n", " \n", " **Linux**\n", " ```bash\n", " # Install the osfclient.\n", " pip install osfclient\n", " \n", " # To clone the entire OSF project:\n", " osf -p rbhfz clone ~/GSForge_demo_data\n", " \n", " # To pull the minimum number of files to complete the examples:\n", " osf \n", " ```\n", "\n", "***Set up the notebook***" ] }, { "cell_type": "code", "execution_count": 1, "id": "013c3b81", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:05:49.791345Z", "iopub.status.busy": "2021-08-06T21:05:49.790501Z", "iopub.status.idle": "2021-08-06T21:05:58.059572Z", "shell.execute_reply": "2021-08-06T21:05:58.060154Z" } }, "outputs": [], "source": [ "# Environment lookup and OS-independent path management.\n", "from os import environ\n", "from pathlib import Path\n", "import pandas as pd\n", "import GSForge as gsf" ] }, { "cell_type": "markdown", "id": "8dc4c712", "metadata": {}, "source": [ "***Declare the paths used***\n", "\n", "Declare the OSF project directory path. This is the root directory of the data files used in this notebook." 
] }, { "cell_type": "code", "execution_count": 2, "id": "a204dd0b", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:05:58.068374Z", "iopub.status.busy": "2021-08-06T21:05:58.067400Z", "iopub.status.idle": "2021-08-06T21:05:58.069609Z", "shell.execute_reply": "2021-08-06T21:05:58.070305Z" } }, "outputs": [], "source": [ "OSF_PATH = Path(environ.get(\"GSFORGE_DEMO_DATA\", default=\"~/GSForge_demo_data/\")).expanduser().joinpath(\"osfstorage\", \"oryza_sativa\")\n", "RAW_COUNT_PATH = OSF_PATH.joinpath(\"GEMmakerGEMs\", \"rice_heat_drought.GEM.raw.txt\")\n", "HYDRO_LABEL_PATH = OSF_PATH.joinpath(\"GEMmakerGEMs\", \"raw_annotation_data\", \"PRJNA301554.hydroponic.annotations.txt\")" ] }, { "cell_type": "markdown", "id": "3aa0f927", "metadata": {}, "source": [ "Ensure these files exist." ] }, { "cell_type": "code", "execution_count": 3, "id": "06ea372d", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:05:58.076132Z", "iopub.status.busy": "2021-08-06T21:05:58.075241Z", "iopub.status.idle": "2021-08-06T21:05:58.078058Z", "shell.execute_reply": "2021-08-06T21:05:58.078764Z" } }, "outputs": [], "source": [ "assert RAW_COUNT_PATH.exists()\n", "assert HYDRO_LABEL_PATH.exists()" ] }, { "cell_type": "markdown", "id": "74af1f6b", "metadata": {}, "source": [ "Finally, declare a path to save the created `.nc` file." 
] }, { "cell_type": "code", "execution_count": 4, "id": "e8671a53", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:05:58.085257Z", "iopub.status.busy": "2021-08-06T21:05:58.084322Z", "iopub.status.idle": "2021-08-06T21:05:58.088426Z", "shell.execute_reply": "2021-08-06T21:05:58.087687Z" } }, "outputs": [], "source": [ "GEM_PATH = OSF_PATH.joinpath(\"AnnotatedGEMs\", \"oryza_sativa_hisat2_raw.nc\")" ] }, { "cell_type": "markdown", "id": "c7710e56", "metadata": {}, "source": [ "## Loading data with `pandas`\n", "\n", "***Loading the count matrix***" ] }, { "cell_type": "code", "execution_count": 5, "id": "ad9a8010", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:05:58.096753Z", "iopub.status.busy": "2021-08-06T21:05:58.095957Z", "iopub.status.idle": "2021-08-06T21:06:00.548243Z", "shell.execute_reply": "2021-08-06T21:06:00.548789Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.18 s, sys: 245 ms, total: 2.42 s\n", "Wall time: 2.45 s\n" ] } ], "source": [ "%%time\n", "count_df = pd.read_csv(RAW_COUNT_PATH, sep=\"\\t\", index_col=0)" ] }, { "cell_type": "code", "execution_count": 6, "id": "f0926b5e", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:06:00.569618Z", "iopub.status.busy": "2021-08-06T21:06:00.568851Z", "iopub.status.idle": "2021-08-06T21:06:00.580494Z", "shell.execute_reply": "2021-08-06T21:06:00.581449Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(55986, 475)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>SRX1423934</th><th>SRX1423935</th><th>SRX1423936</th><th>SRX1423937</th><th>SRX1423938</th><th>SRX1423939</th><th>SRX1423940</th><th>SRX1423941</th><th>SRX1423942</th><th>SRX1423943</th><th>...</th><th>SRX1424399</th><th>SRX1424400</th><th>SRX1424401</th><th>SRX1424402</th><th>SRX1424403</th><th>SRX1424404</th><th>SRX1424405</th><th>SRX1424406</th><th>SRX1424407</th><th>SRX1424408</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>LOC_Os06g05820</th><td>20</td><td>2</td><td>22</td><td>11</td><td>23</td><td>39</td><td>24</td><td>34</td><td>33</td><td>20</td><td>...</td><td>5</td><td>20</td><td>20</td><td>38</td><td>35</td><td>43</td><td>25</td><td>8</td><td>8</td><td>21</td></tr>\n",
"<tr><th>LOC_Os10g27460</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>LOC_Os02g35980</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>LOC_Os09g23260</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>LOC_Os01g41670</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 475 columns</p>\n",
"</div>
" ], "text/plain": [ " SRX1423934 SRX1423935 SRX1423936 SRX1423937 SRX1423938 \\\n", "LOC_Os06g05820 20 2 22 11 23 \n", "LOC_Os10g27460 0 0 0 0 0 \n", "LOC_Os02g35980 0 0 0 0 0 \n", "LOC_Os09g23260 0 0 0 0 0 \n", "LOC_Os01g41670 0 0 0 0 0 \n", "\n", " SRX1423939 SRX1423940 SRX1423941 SRX1423942 SRX1423943 \\\n", "LOC_Os06g05820 39 24 34 33 20 \n", "LOC_Os10g27460 0 0 0 0 0 \n", "LOC_Os02g35980 0 0 0 0 0 \n", "LOC_Os09g23260 0 0 0 0 0 \n", "LOC_Os01g41670 0 0 0 0 0 \n", "\n", " ... SRX1424399 SRX1424400 SRX1424401 SRX1424402 \\\n", "LOC_Os06g05820 ... 5 20 20 38 \n", "LOC_Os10g27460 ... 0 0 0 0 \n", "LOC_Os02g35980 ... 0 0 0 0 \n", "LOC_Os09g23260 ... 0 0 0 0 \n", "LOC_Os01g41670 ... 0 0 0 2 \n", "\n", " SRX1424403 SRX1424404 SRX1424405 SRX1424406 SRX1424407 \\\n", "LOC_Os06g05820 35 43 25 8 8 \n", "LOC_Os10g27460 0 0 0 0 0 \n", "LOC_Os02g35980 0 0 0 0 0 \n", "LOC_Os09g23260 0 0 0 0 0 \n", "LOC_Os01g41670 0 0 0 0 0 \n", "\n", " SRX1424408 \n", "LOC_Os06g05820 21 \n", "LOC_Os10g27460 0 \n", "LOC_Os02g35980 0 \n", "LOC_Os09g23260 0 \n", "LOC_Os01g41670 0 \n", "\n", "[5 rows x 475 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(count_df.shape)\n", "count_df.head()" ] }, { "cell_type": "markdown", "id": "965321ab", "metadata": {}, "source": [ "***Loading the annotation table***" ] }, { "cell_type": "code", "execution_count": 7, "id": "29195143", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:06:00.589299Z", "iopub.status.busy": "2021-08-06T21:06:00.588574Z", "iopub.status.idle": "2021-08-06T21:06:00.609114Z", "shell.execute_reply": "2021-08-06T21:06:00.609818Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 18.1 ms, sys: 0 ns, total: 18.1 ms\n", "Wall time: 17.7 ms\n" ] } ], "source": [ "%%time\n", "label_df = pd.read_csv(HYDRO_LABEL_PATH, index_col=1, sep=\"\\t\")\n", "label_df['genotype'] = label_df['genotype'].str.split(\" \", 
expand=True).iloc[:, 0]\n", "label_df['time'] = label_df['time'].str.split(' ', expand=True).iloc[:, 0].astype(int)" ] }, { "cell_type": "code", "execution_count": 8, "id": "20e5393c", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:06:00.642122Z", "iopub.status.busy": "2021-08-06T21:06:00.641367Z", "iopub.status.idle": "2021-08-06T21:06:00.645084Z", "shell.execute_reply": "2021-08-06T21:06:00.644486Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>BioSample</th><th>LoadDate</th><th>MBases</th><th>MBytes</th><th>Run</th><th>SRA_Sample</th><th>Sample_Name</th><th>genotype</th><th>time</th><th>treatment</th><th>...</th><th>Instrument</th><th>LibraryLayout</th><th>LibrarySelection</th><th>LibrarySource</th><th>Organism</th><th>Platform</th><th>ReleaseDate</th><th>SRA_Study</th><th>source_name</th><th>tissue</th></tr>\n",
"<tr><th>Experiment</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>SRX1423937</th><td>SAMN04251851</td><td>2015-11-20</td><td>1166</td><td>764</td><td>SRR2931043</td><td>SRS1156717</td><td>GSM1933349</td><td>Azuenca</td><td>30</td><td>CONTROL</td><td>...</td><td>Illumina HiSeq 2000</td><td>PAIRED</td><td>cDNA</td><td>TRANSCRIPTOMIC</td><td>Oryza sativa</td><td>ILLUMINA</td><td>2016-01-04</td><td>SRP065945</td><td>Rice leaf</td><td>leaf</td></tr>\n",
"<tr><th>SRX1423938</th><td>SAMN04251852</td><td>2015-11-20</td><td>4005</td><td>2500</td><td>SRR2931044</td><td>SRS1156720</td><td>GSM1933350</td><td>Azuenca</td><td>45</td><td>CONTROL</td><td>...</td><td>Illumina HiSeq 2000</td><td>PAIRED</td><td>cDNA</td><td>TRANSCRIPTOMIC</td><td>Oryza sativa</td><td>ILLUMINA</td><td>2016-01-04</td><td>SRP065945</td><td>Rice leaf</td><td>leaf</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>2 rows × 28 columns</p>\n",
"</div>
" ], "text/plain": [ " BioSample LoadDate MBases MBytes Run SRA_Sample \\\n", "Experiment \n", "SRX1423937 SAMN04251851 2015-11-20 1166 764 SRR2931043 SRS1156717 \n", "SRX1423938 SAMN04251852 2015-11-20 4005 2500 SRR2931044 SRS1156720 \n", "\n", " Sample_Name genotype time treatment ... Instrument \\\n", "Experiment ... \n", "SRX1423937 GSM1933349 Azuenca 30 CONTROL ... Illumina HiSeq 2000 \n", "SRX1423938 GSM1933350 Azuenca 45 CONTROL ... Illumina HiSeq 2000 \n", "\n", " LibraryLayout LibrarySelection LibrarySource Organism \\\n", "Experiment \n", "SRX1423937 PAIRED cDNA TRANSCRIPTOMIC Oryza sativa \n", "SRX1423938 PAIRED cDNA TRANSCRIPTOMIC Oryza sativa \n", "\n", " Platform ReleaseDate SRA_Study source_name tissue \n", "Experiment \n", "SRX1423937 ILLUMINA 2016-01-04 SRP065945 Rice leaf leaf \n", "SRX1423938 ILLUMINA 2016-01-04 SRP065945 Rice leaf leaf \n", "\n", "[2 rows x 28 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "label_df.head(2)" ] }, { "cell_type": "markdown", "id": "e271b5da", "metadata": {}, "source": [ "## Combine the dataframes into an `AnnotatedGEM`\n", "\n", "`AnnotatedGEM.from_pandas` does some data wrangling and loads the count matrix and the sample annotations into a single `xarray.Dataset`." 
] }, { "cell_type": "code", "execution_count": 9, "id": "d87ae56f", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:06:00.717950Z", "iopub.status.busy": "2021-08-06T21:06:00.684799Z", "iopub.status.idle": "2021-08-06T21:06:00.721876Z", "shell.execute_reply": "2021-08-06T21:06:00.721259Z" } }, "outputs": [ { "data": { "text/plain": [ "\n", "Name: Oryza sativa\n", "Selected GEM Variable: 'counts'\n", " Gene 55986\n", " Sample 475" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agem = gsf.AnnotatedGEM.from_pandas(count_df=count_df, label_df=label_df, name=\"Oryza sativa\")\n", "agem" ] }, { "cell_type": "markdown", "id": "e659415e", "metadata": {}, "source": [ "***Examine the data***" ] }, { "cell_type": "code", "execution_count": 10, "id": "2548970a", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:06:00.740642Z", "iopub.status.busy": "2021-08-06T21:06:00.739850Z", "iopub.status.idle": "2021-08-06T21:06:00.830907Z", "shell.execute_reply": "2021-08-06T21:06:00.831499Z" } }, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset>\n",
       "Dimensions:             (Sample: 475, Gene: 55986)\n",
       "Coordinates:\n",
       "  * Sample              (Sample) object 'SRX1423934' ... 'SRX1424408'\n",
       "  * Gene                (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'\n",
       "Data variables: (12/29)\n",
       "    BioSample           (Sample) object 'SAMN04251848' ... 'SAMN04251607'\n",
       "    LoadDate            (Sample) object '2015-11-20' ... '2015-11-19'\n",
       "    MBases              (Sample) int64 4016 5202 4053 1166 ... 3098 3529 2922\n",
       "    MBytes              (Sample) int64 2738 3652 2719 764 ... 1983 2370 1862\n",
       "    Run                 (Sample) object 'SRR2931040' ... 'SRR2931514'\n",
       "    SRA_Sample          (Sample) object 'SRS1156722' ... 'SRS1156251'\n",
       "    ...                  ...\n",
       "    Platform            (Sample) object 'ILLUMINA' 'ILLUMINA' ... 'ILLUMINA'\n",
       "    ReleaseDate         (Sample) object '2016-01-04' ... '2016-01-04'\n",
       "    SRA_Study           (Sample) object 'SRP065945' 'SRP065945' ... 'SRP065945'\n",
       "    source_name         (Sample) object 'Rice leaf' 'Rice leaf' ... 'Rice leaf'\n",
       "    tissue              (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf'\n",
       "    counts              (Sample, Gene) int64 20 0 0 0 0 0 ... 0 52 335 0 666 0
" ], "text/plain": [ "\n", "Dimensions: (Sample: 475, Gene: 55986)\n", "Coordinates:\n", " * Sample (Sample) object 'SRX1423934' ... 'SRX1424408'\n", " * Gene (Gene) object 'LOC_Os06g05820' ... 'LOC_Os07g03418'\n", "Data variables: (12/29)\n", " BioSample (Sample) object 'SAMN04251848' ... 'SAMN04251607'\n", " LoadDate (Sample) object '2015-11-20' ... '2015-11-19'\n", " MBases (Sample) int64 4016 5202 4053 1166 ... 3098 3529 2922\n", " MBytes (Sample) int64 2738 3652 2719 764 ... 1983 2370 1862\n", " Run (Sample) object 'SRR2931040' ... 'SRR2931514'\n", " SRA_Sample (Sample) object 'SRS1156722' ... 'SRS1156251'\n", " ... ...\n", " Platform (Sample) object 'ILLUMINA' 'ILLUMINA' ... 'ILLUMINA'\n", " ReleaseDate (Sample) object '2016-01-04' ... '2016-01-04'\n", " SRA_Study (Sample) object 'SRP065945' 'SRP065945' ... 'SRP065945'\n", " source_name (Sample) object 'Rice leaf' 'Rice leaf' ... 'Rice leaf'\n", " tissue (Sample) object 'leaf' 'leaf' 'leaf' ... 'leaf' 'leaf'\n", " counts (Sample, Gene) int64 20 0 0 0 0 0 ... 
0 52 335 0 666 0" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agem.data" ] }, { "cell_type": "markdown", "id": "2fa8104b", "metadata": {}, "source": [ "## Save the `AnnotatedGEM`" ] }, { "cell_type": "code", "execution_count": 11, "id": "7a33fc57", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:06:00.837461Z", "iopub.status.busy": "2021-08-06T21:06:00.836553Z", "iopub.status.idle": "2021-08-06T21:06:00.839157Z", "shell.execute_reply": "2021-08-06T21:06:00.839828Z" } }, "outputs": [], "source": [ "if not GEM_PATH.exists():\n", " agem.save(GEM_PATH)" ] }, { "cell_type": "markdown", "id": "339a50f6", "metadata": {}, "source": [ "## Creating an `AnnotatedGEM` from files\n", "\n", "If your data are consistently formatted (as in the example above), you can load the files directly\n", "into an `AnnotatedGEM`.\n", "\n", "If you do not provide a `sep` argument in the `count_kwargs` or `label_kwargs` dictionaries, `GSForge`\n", "will attempt to infer it by reading the first line of each file." 
] }, { "cell_type": "code", "execution_count": 12, "id": "cec06ef3", "metadata": { "execution": { "iopub.execute_input": "2021-08-06T21:06:00.846175Z", "iopub.status.busy": "2021-08-06T21:06:00.845326Z", "iopub.status.idle": "2021-08-06T21:06:03.338073Z", "shell.execute_reply": "2021-08-06T21:06:03.337445Z" } }, "outputs": [ { "data": { "text/plain": [ "\n", "Name: AnnotatedGEM00194\n", "Selected GEM Variable: 'counts'\n", " Gene 55986\n", " Sample 475" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del agem\n", "\n", "agem = gsf.AnnotatedGEM.from_files(\n", " count_path=RAW_COUNT_PATH,\n", " label_path=HYDRO_LABEL_PATH,\n", " # These are the default keyword arguments used by from_files;\n", " # each dict is forwarded to its own `pandas.read_csv` call.\n", " count_kwargs=dict(index_col=0, sep=\"\\t\"),\n", " label_kwargs=dict(index_col=1, sep=\"\\t\"),\n", ")\n", "agem" ] } ], "metadata": { "kernelspec": { "display_name": "gsfenv", "language": "python", "name": "gsfenv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }