.. _contribute-meta-recipe: Creating a ggd meta-recipe ========================== [:ref:`Click here to return to the home page `] This page is specific to creating a ggd **meta-recipe**. If you are looking to create a normal ggd recipe see :ref:`Creating a ggd recipe ` The following steps outline how to create, check, and add a ggd data meta-recipe. 1. Update local forked repo --------------------------- You will need to update the forked ggd-recipes repo on your local machine before you add a recipe to it. * Navigate to the forked ggd-recipes repo on your local machine * Once in the directory run the following commands :: $ git checkout master $ git pull upstream master $ git push origin master 2. Writing the curation script(s) --------------------------------- A meta-recipe script should be quite a bit more detailed then a general ggd recipe script. Additionally, it is common to consider using multiple scripts for a meta-recipe, where you have a single main bash script which is used to control the process of all other scripts. .. note:: Whether using a single or multiple scripts, the main script must be a bash script Things to consider when building a meta-recipe (This list is by far NOT a comprehensive list of things to think about while creating a meta-recipe): * What types of identifiers do I need in order to download the data from the database of choice? * Are there different types of identifiers? If so, how do I handle them? * Does the database have and FTP site, a SQL database, etc? (How and where am I going to get the data from?) * Is there a way to check if the ID and/or the data exists in the database? * How do I handle a bad ID or the absence of data? * Is there some hash value, like an md5sum hash value, I can use to validate that the contents of the data downloaded is correct and there wasn't an error during downloading? * What data should be downloaded, and where is it coming from? * Is there additional processing that needs to happen once the data is downloaded? * With an ID, and potentially the downloaded data, what information can I get and used to update the ID specific recipe metadata? * How am I going to handle errors? * etc. Creating the main bash script +++++++++++++++++++++++++++++ This script should handle the ID. Data download, curation, etc. can be handled by this script or can be passed to a different supporting script. GGD provides 4 environment variables to use during meta-recipe installation in order to help the process. They are: +----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_ID** | This is the ID provided during installation (Example: GSE123 for the GEO meta-recipe) | +----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ | **SCRIPTS_PATH** | The directory path to where the additional scripts are stored. (This path is required in order to run any supporting scripts) | +----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_ENV_VAR_FILE** | This is the file path to store the ID specific updates to the meta-recipe. (More on this later) | +----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_FINAL_COMMANDS_FILE** | This is the file path to a bash script which is used to store the final/actual commands used to install the ID specific data. (More on this later) | +----------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+ In addition to providing these four environment variables these variables are passed into the main bash script as the first four parameters as follows: (Strict Order) .. code-block:: bash id=$1 # 1st argument is GGD_METARECIPE_ID script_path=$2 # 2nd argument is SCRIPTS_PATH env_var_file=$3 # 3rd argument is GGD_METARECIPE_ENV_VAR_FILE commands_file=$4 # 4rth argument is GGD_METARECIPE_FINAL_COMMANDS_FILE The **GGD_METARECIPE_ID** will match exactly what was entered in the install command. GGD will not change case or order. A) ID (GGD_METARECIPE_ID): The ID should be used to identify and download the data that is associated with that ID. If the ID doesn't exists or there is no data for that ID then the bash script should print a warning/error message and exit. B) Script Path (SCRIPTS_PATH): In order to use a supporting script, the script path must be used. For example, if you have a supporting script named "get_id_metadata.py" which you run from within the main bash script you would do: .. code-block:: bash script_path=$2 # 2nd argument is SCRIPTS_PATH python $script_path/get_id_metadata.py or .. code-block:: bash python $SCRIPTS_PATH/get_id_metadata.py where are the arguments needed for the "get_id_metadata.py" script. C) Updating ID specific metadata (GGD_METARECIPE_ENV_VAR_FILE, GGD_METARECIPE_FINAL_COMMANDS_FILE): One of the main advantages of meta-recipes is the ability to update a recipes metadata based on information about the specific ID supplied. That is, based on the ID what information is there that should be added to the metadata. Although it is not required to updated the metadata, it is highly recommended that you do. Otherwise, the metadata will consist of the general information of the meta-recipe without any ID specific info. GGD provides two environment variables to use for this purpose. **GGD_METARECIPE_FINAL_COMMANDS_FILE**: This represents a bash file that should store the commands used for installing and processing the data files specific to the ID. The main meta-recipe script will being doing a lot of work. This file should capture the essential pieces for determining where and how the data was installed and processed. Other information should be kept out. This file acts as a place holder for what will be updated in the ID specific meta-recipe metadata. That is, after the meta-recipe is installed and the metadata has been updated, a user will be able to access these commands through the :code:`ggd pkg-info` command. This helps to support reproducibility and transparency. Again, although it is not required it is highly recommended that this step is taken. **GGD_METARECIPE_ENV_VAR_FILE**: This file represents different "environment variables" that can be set in order to update the metadata of an ID specific meta-recipe. This file is a ``.json`` file. This means that the meta-recipe needs to save the contents of the file as a .json file, otherwise GGD will not be able to use the updated environment variables. The json file should act as a dictionary/map with environment variable to change as keys and the content changes as values. The available keys are: +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_SUMMARY** | (string) A summary of the installed data | +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_SPECIES** | (string) The species of the installed data | +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_GENOME_BUILD** | (string) The genome build of the installed data | +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_VERSION** | (string) The version of the data installed | +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_KEYWORDS** | (list) A list of key words to add to the metadata | +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_DATA_PROVIDER** | (string) The data provider of the recipe. (Should already exists. Should not be used) | +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_FILE_TYPE** | (list) A list of file types for the files installed by the package | +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ | **GGD_METARECIPE_GENOMIC_COORDINATE_BASE** | (string) A string that represented the coordinate base of the installed files | +---------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+ Not all keys are required to be set. It is recommended that the **GGD_METARECIPE_SUMMARY** be updated, the **GGD_METARECIPE_SPECIES** and **GGD_METARECIPE_GENOME_BUILD** be updated if data is available to update them, the **GGD_METARECIPE_VERSION** be updated, and the **GGD_METARECIPE_KEYWORDS** be updated. The remaining keys/environment variable names can be used if data is available to update them. .. note:: The data provider can be updated, but it is recommended that the data provider is not updated. If the data provider needs to be updated we suggest that a different recipe be created for that data provider specifically. After an ID specific meta-recipe is installed, GGD will check to see if any of the two files exists. If they do, GGD will update the metadata of the ID specific meta-recipe. These updates are available via the :code:`ggd pkg-info` command. Please try to update the metadata whenever possible. The meta-recipe main bash script should also clean up any extra files or other processes that were needed during the installation process. Creating the supporting script(s) +++++++++++++++++++++++++++++++++ Supporting scripts are not needed if everything can be done easily in the main bash script without them. However, supporting scripts can be helpful in defining the updated metadata for the ID specific recipe installed, or for other tasks that aren't done easily in bash. Supporting scripts need to be accessible through the main bash script, and any arguments needed for the supporting scripts needs to be accessible and/or generated within the main bash script. There is not requirement for language of supporting scripts. However, if a supporting script is written in another language other then bash, the language needs to be added to the dependencies list when making a ggd meta-recipe to ensure that the language is available when installing the meta-recipe It is recommended that the json file used for updating the metadata be created from a supporting script because creating json files from a bash script is not as straight forward as it is in some other languages. For example, if you are using a python script to create the json file, a simple example would be: .. code-block:: python import json import os import sys json_outfile = sys.argv[1] ## This file path should be the GGD_METARECIPE_ENV_VAR_FILE passed in from the main bash script ## Create dictionary metadata_dict = dict() ## Add updated info to the dictionary metadata_dict["GGD_METARECIPE_SUMMARY"] = . . . #save data as json file to the GGD_METARECIPE_ENV_VAR_FILE location json.dump(metadata_dict, open(json_outfile, "w")) .. note:: The json file needs to be formatted as a dictionary: {"GGD_METARECIPE_SUMMARY": "An Updated Summary", "GGD_METARECIPE_SPECIES": "ID specific species", ...} Supporting scripts can be as simple or complicated as needs be. We recommend you stay on the side of simple as much as possible as to help provide transparency with what is going on. **An example of the GEO meta-recipe scripts are provided below at number 6** 3. Creating a ggd meta-recipe using the ggd cli ----------------------------------------------- The ggd command line interface (cli) contains tools to create and test a data meta-recipe. If it has not been installed, install the ggd cli following the steps outlined in :ref:`Using GGD `. With the ggd cli installed you can now transform your meta-recipe script(s) created in the previous step into a ggd meta-recipe. To do this you will use the :code:`ggd make-meta-recipe` command. See the :ref:`make-meta-recipe ` docs page for more information on the command . .. note:: The :code:`make-meta-recipe` command is different then the :code:`make-recipe` command. The first creates a meta-recipe while the second creates a normal ggd recipe. It is important that the summary of the meta-recipe provides enough information about what the meta-recipe is and what it does, as well as what it expects in terms of an ID, so that a user can simply identify which meta-recipe they would like to use and how to use it. None of the information added during the :code:`make-meta-recipe` stage should include ID specific information other then the summary stating how to use IDs. A meta-recipe requires the following fields to be field out: - species: **"meta-recipe"** - genome build: **"meta-recipe"** - data version: **"meta-recipe"** (Not required, but suggested so that the version can be updated based on the installation of a specific ID recipe) - data provider: The data provider where the meta-recipe will pull data from - summary: A detailed summary of the meta-recipe - author: Who created the meta-recipe - package version: The version of the meta-recipe/package (Usually "1" for the first version of a meta-recipe) - keywords: Keywords that will help to distinguish the meta-recipe - coordinate base: **"NA"** unless otherwise known. (Can be updated by the meta-recipe during an ID specific recipe installation) - name: A defining name to use for the meta-recipe - script: The main bash script for the meta-recipe - extra scripts: A space separated list of all extra/supporting scripts that are used by the meta-recipe - dependency: Any software or ggd data dependencies required by the main or supporting scripts of the meta-recipe Example of making a meta-recipe: .. code-block:: bash $ ggd make-meta-recipe \ --authors mjc \ --package-version 1 \ --data-provider GEO \ --data-version "meta-recipe" \ --species "meta-recipe" \ --genome-build "meta-recipe" \ --cb "NA" \ --summary "A meta-recipe for the Gene Expression Omnibus (GEO) database from NCBI. ... " \ --extra-scripts parse_geo_header.py \ -k Gene-Expression-Omnibus \ -k GEO \ -k GEO-Accession-ID \ -k GEO-meta-recipe \ --name geo-accession \ geo_meta_recipe_script.sh This will create a new ggd meta-recipe named **meta-recipe-geo-accession-geo-v1** *meta-recipe-geo-accession-geo-v1* is a directory with the following files in it: - checksums_file.txt - meta.yaml - metarecipe.sh - parse_geo_header.py - post-link.sh - recipe.sh 4. Checking/Testing the new meta-recipe --------------------------------------- The new meta-recipe needs to be tested. GGD provides an easy to use tool to do this. The tool will check if the meta-recipe can be built into a data-package, if it can be installed, along with other aspects of the recipe that are pertinent for successful data meta-recipes. This tool is :code:`ggd check-recipe`. :code:`check-recipe` is used to test both a normal ggd data recipe along with a ggd data meta-recipe. One major difference from the user side is that for a meta-recipe the :code:`--id` parameter is required while it is ignored during a normal recipe check. This means that ggd will not only check that a meta-recipe works properly on its own, but also that it fulfills its requirements of installing ID specific data. Using the meta-recipe created in the previous step, you would run the following command in order to test the new meta-recipe: .. code-block:: bash ggd check-recipe meta-recipe-geo-accession-geo-v1 --id GSE123 The ID can be any one of the IDs that can be used with the meta-recipe, :code:`check-recipe` just requires that a proper ID be used for testing. .. note:: :code:`check-recipe` will fail for a meta-recipe if no :code:`--id` is provided. Additionally, the meta-recipe should be able to handle the occurrence of a bad ID. If :code:`check-recipe` fails there will be information on why it failed. Fix the problems and continue to test the meta-recipe until it passes. Once the meta-recipe has passed the tests it can be added to GGD. 5. Submit the new ggd meta-recipe to the original ggd-recipes repo ------------------------------------------------------------------- Once the new ggd meta-recipe you created passes the previous step you are ready to add it to the original ggd-recipes repo. To do this you will need to create a **pull request**. From your local machine, add the new data meta-recipe you created to the forked ggd-recipes repo. You will add it to the ``recipes/`` directory. If you do not put it in the right directory it will be rejected. The recipes file convention is as follows: * All recipes are stored within the **ggd-recipes/recipes** directory * The recipes directory has the following format:: //recipes//// * :code:`` is the path to the forked ggd-recipes repo on your local machine. * :code:`recipes` is the **recipes** directory. * :code:`` is the ggd channel that recipe should go in. This depends on the type of data you are adding. for a meta-recipe you should add it to: //recipes//**meta-recipe**/**meta-recipe**/ For the meta-recipe-geo-accession-geo-v1 meta-recipe created above you would use the following commands: :: $ mv meta-recipe-geo-accession-geo-v1 //recipes/genomics/meta-recipe/meta-recipe/ Once the meta-recipe is there you will need to add it to your forked ggd-recipe repo. Navigate to the forked ggd-recipe directory and use the following commands: * Add the met-recipe to the git repo: :: $ git add /recipes/genomics/meta-recipe/meta-recipe/meta-recipe-geo-accession-geo-v1/ * Commit the addition to the repo (The vim text editor will open up. Add a comment about the new meta-recipe and save it): :: $ git commit * Push the commit to your fork repo on github (You will be asked to fill out your github credentials): :: $ git push origin * Go to the ggd-recipes github page for your username (https://github.com//ggd-recipes/). * Under the green "Clone or download" button click on **Pull request**. * Where it says **base fork:** make sure it is on **gogetdata/ggd-recipes**. And where it says **base:** make sure it is on **master**. * Click the green **Create pull request** button. * Add some comments and complete the pull request. You have now created a pull request with your new data meta-recipe. The meta-recipe will go through a continuous integration step where the recipe will be tested. If it passes, the recipe will be added to the gogetdata/ggd-recipes repo and anyone using the ggd tool will be able to access it. If it does not pass, you will be informed by the ggd team, and they will work with you on getting it working. .. note:: Because of the ID required by meta-recipes, there are additional steps that need to be taken during the continuous integration process. In the pull request comments make sure to indicate the test ID you would like used during the testing phase. The GGD team will work with you during this process to make sure that the process is done correctly. 6. Example of the Gene Expression Omnibus (GEO) main bash script and supporting python script ---------------------------------------------------------------------------------------------- Below is an example of a the main bash script and a supporting python script used to create a meta-recipe for the GEO database. This stands as one example of how to create a meta-recipe, but does not indicate how every meta-recipe should be created. As with all ggd recipes, the recipe scripts should be created in order to correctly install and process the data the recipe is created for. A) Main bash: .. code-block:: bash ## GEO accession number geo_acc_id=$1 ## Script Location: The file path the script script_path=$2 ## Json File name json_outfile=$3 ## file path for the subsetted commands used to download the data commands_outfile=$4 ## Force Upper Case #geo_acc_id=$(echo ${geo_acc_id^^}) Requires bash >= 4.2 (macOSX bash version == < 4) geo_acc_id="$(echo $geo_acc_id | tr '[:lower:]' '[:upper:]')" echo -e "\n Checking GEO for $geo_acc_id" echo -e " ================================\n" ## Get the GEO number excluding the prefix geo_digit="${geo_acc_id//[^[:digit:]]/}" ## Get GEO URL stub based on the number of digits if [[ "${#geo_digit}" -ge 3 ]] then stub=$(echo "$geo_acc_id" | sed 's/...$/nnn/') elif [[ "${#geo_digit}" -eq 2 ]] then stub=$(echo "$geo_acc_id" | sed 's/..$/nnn/') elif [[ "${#geo_digit}" -eq 1 ]] then stub=$(echo "$geo_acc_id" | sed 's/.$/nnn/') fi ## URL vars prefix="" soft_url="" matrix_url="" annot_url="" gsm_url="" sup_url="" ## Check accession number prefix if [[ $geo_acc_id == "GDS"* ]] then ## Set PREFIX prefix="GDS" ## Get the soft file from the dataset soft_url="https://ftp.ncbi.nlm.nih.gov/geo/datasets/$stub/$geo_acc_id/soft/$geo_acc_id.soft.gz" ## Supplemental URL sup_url="https://ftp.ncbi.nlm.nih.gov/geo/datasets/$stub/$geo_acc_id/suppl/" elif [[ $geo_acc_id == "GSE"* ]] then ## Set PREFIX prefix="GSE" ## Get the soft file for the series soft_url="https://ftp.ncbi.nlm.nih.gov/geo/series/$stub/$geo_acc_id/soft/$geo_acc_id""_family.soft.gz" ## Get the matrix file for the series matrix_url="https://ftp.ncbi.nlm.nih.gov/geo/series/$stub/$geo_acc_id/matrix/$geo_acc_id""_series_matrix.txt.gz" ## Supplemental URL sup_url="https://ftp.ncbi.nlm.nih.gov/geo/series/$stub/$geo_acc_id/suppl/" elif [[ $geo_acc_id == "GPL"* ]] then ## Set PREFIX prefix="GPL" ## Get the soft file for the platform soft_url="https://ftp.ncbi.nlm.nih.gov/geo/platforms/$stub/$geo_acc_id/soft/$geo_acc_id""_family.soft.gz" ## Get the annot file for the platform annot_url="https://ftp.ncbi.nlm.nih.gov/geo/platforms/$stub/$geo_acc_id/annot/$geo_acc_id.annot.gz" ## Supplemental URL sup_url="https://ftp.ncbi.nlm.nih.gov/geo/platforms/$stub/$geo_acc_id/suppl/" elif [[ $geo_acc_id == "GSM"* ]] then ## Set PREFIX prefix="GSM" ## Get the Table file from the CGI GEO Query site gsm_url="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=$geo_acc_id&form=text&view=full" ## Supplemental URL sup_url="https://ftp.ncbi.nlm.nih.gov/geo/samples/$stub/$geo_acc_id/suppl/" else ## Bad accession prefix echo -e "\n!!ERROR!! GEO does not recognized the supplied accession id: '$geo_acc_id'." 1>&2 echo -e " Acceptable accession prefix include: \n\t- GDSxxx \n\t- GPLxxx \n\t- GSMxxx \n\t- GSExxx\n" 1>&2 exit 1 fi ## Check if accession id exists message=$(xmllint --xpath "string(//WarningList)" <(curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=$geo_acc_id" --silent)) if [[ $message == *"No items found"* ]] then ## If accession ID not found echo -e "!!ERROR!! Accession ID $geo_acc_id not found in GEO\n" 1>&2 exit 1 else echo -e "Found Accession ID in GEO: $geo_acc_id\n" fi ## Get the Accession URL for the GEO Accession page geo_acc_url="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=$geo_acc_id" echo -e "Main GEO page for $geo_acc_id: $geo_acc_url\n" echo -e "Checking $geo_acc_id for available files" echo -e "-------------------------------------\n" final_commands="" ## Check for SOFT URL if [[ ! -z $soft_url ]] then ## Check if soft url file exists if curl --output /dev/null --silent --head --fail "$soft_url" then echo -e "\tDownloading SOFT file: $soft_url\n" ## Download file ## GEOxxx.soft.gz file ## or ## GEOxxx_family.soft.gz file curl "$soft_url" -O -J --silent final_commands="""$final_commands curl \"$soft_url\" -O -J --silent """ fi fi ## Check for MATRIX URL if [[ ! -z $matrix_url ]] then ## Check if matrix url file exists if curl --output /dev/null --silent --head --fail "$matrix_url" then echo -e "\tDownloading MATRIX file: $matrix_url\n" ## Download file ## GEOxxx_series_matrix.txt.gz file curl "$matrix_url" -O -J --silent final_commands="""$final_commands curl \"$matrix_url\" -O -J --silent """ fi fi ## Check for ANNOT URL if [[ ! -z $annot_url ]] then ## Check if annot url file exists if curl --output /dev/null --silent --head --fail "$annot_url" then echo -e "\tDownloading ANNOT file: $annot_url\n" ## Download file ## GEOxxx.annot.gz file curl "$annot_url" -O -J --silent final_commands="""$final_commands curl \"$annot_url\" -O -J --silent """ fi fi ## Check for GSM URL if [[ ! -z $gsm_url ]] then ## Check if gsm url file exists if curl --output /dev/null --silent --head --fail "$gsm_url" then echo -e "\tDownloading table: $gsm_url\n" ## Download file ## GEOxxx.txt file curl "$gsm_url" -O -J --silent final_commands="""$final_commands curl \"$gsm_url\" -O -J --silent """ fi fi ## Check for Supplemental URL if [[ ! -z $sup_url ]] then ## Check if sup url exists if curl --output /dev/null --silent --head --fail "$sup_url" then ## Iterate over all GEO Accession ID specific files in sup url for file in $(curl "$sup_url" --silent | grep -oE ".+?<\/a>" | cut -f 2 -d '"' | grep "^$geo_acc_id") do ## Build sup file url sup_file_url="$sup_url$file" ## Check if it exists if curl --output /dev/null --silent --head --fail "$sup_file_url" then ## Download file ## GEOxxx sup file echo -e "\tDownloading Sup. File: $sup_file_url\n" curl "$sup_file_url" -O -J --silent final_commands="""$final_commands curl \"$sup_file_url\" -O -J --silent """ ## Check for tar file if [[ "$file" == *".tar"* ]] then echo -e "\t\tExtracting TAR File $file" ## Extract TAR file if [[ "$file" == *".tar" ]] then tar -xf $file final_commands="""$final_commands tar -xf $file """ elif [[ "$file" == *".tar.gz" ]] then tar -xzf $file final_commands="""$final_commands tar -xzf $file """ elif [[ "$file" == *".tar.bz2" ]] then tar -xjf $file final_commands="""$final_commands tar -xjf $file """ else echo -e "!!ERROR!! Unable to extract tar file" 1>&2 exit 1 fi ## remove the tar file rm $file fi fi done fi fi ## Commands used to download the data files echo "$final_commands" > $commands_outfile ## Get the main file to parse the header from ### For GDS, GPL, and GSE the .soft file should be used ### For GSM, the .txt file should be used main_file="" submain_file="" for file in $(pwd)/* do if [[ $prefix == "GSM" ]] then if [[ "$file" == *".txt" ]] then main_file=$file fi else if [[ "$file" == *".soft"* ]] then main_file=$file elif [[ "$file" == *"matrix"* ]] then submain_file=$file fi fi done ## If GSE and soft file does not exists, use the matrix file if [[ $main_file == "" ]] then main_file=$submain_file fi ## Update ID Specific meta-recipe python $script_path/parse_geo_header.py --geo-acc $geo_acc_id --geo-file $main_file --geo-url $geo_acc_url --geo-prefix $prefix --geo-files-dir $(pwd) --json-out $json_outfile echo -e "DONE\n" B) Supporting python script named "parse_geo_header.py" .. code-block:: python from __future__ import print_function import argparse import datetime import gzip import io import json import os import sys from collections import defaultdict # --------------------------------------------------------------------------------------------------------------------------------- ## Argument Parser # --------------------------------------------------------------------------------------------------------------------------------- def arguments(): """Argument method """ p = argparse.ArgumentParser( description="Parse GEO file header and update recipe meta-data" ) req = p.add_argument_group("Required Arguments") req.add_argument( "--geo-acc", metavar="GEO Accession ID", required=True, help="The GEO accession ID", ) req.add_argument( "--geo-file", metavar="GEO file", required=True, help="The GEO file to parse" ) req.add_argument( "--geo-url", metavar="GEO Accession URL", required=True, help="The GEO Accession ID specific home page URL", ) req.add_argument( "--geo-prefix", metavar="GEO Accession prefix", required=True, choices=["GDS", "GPL", "GSM", "GSE"], help="The GEO Accession id Prefix. (GDS, GPL, GSM, GSE)", ) req.add_argument( "--geo-files-dir", metavar="GEO downloaded files", required=True, help="The directory path to where the files were downloaded", ) req.add_argument( "--json-out", metavar="JSON out file", required=True, help="The name of the json output file to create that will contain the ggd meta-recipe environment variables", ) return p.parse_args() # --------------------------------------------------------------------------------------------------------------------------------- ## Main # --------------------------------------------------------------------------------------------------------------------------------- def main(): args = arguments() ## Open GEO File try: fh = ( gzip.open(args.geo_file, "rt", encoding="utf-8", errors="ignore") if args.geo_file.endswith(".gz") else io.open(args.geo_file, "rt", encoding="utf-8", errors="ignore") ) except IOError as e: print("\n!!ERROR!! Unable to read the GEO File: '{}'".format(args.geo_file)) print(str(e)) sys.exit(1) print("\nParsing GEO header for file: {}".format(args.geo_file)) metadata_dict = defaultdict(list) for i, line in enumerate(fh): line = line.strip() if not line: continue ## Check if line is a header if line[0] == "!": line_list = line.strip().split("=") if len(line_list) > 1: metadata_dict[line_list[0].replace(" ", "")].append( line_list[1].strip() ) fh.close() geo_key = ( "dataset" if args.geo_prefix == "GDS" else "Platform" if args.geo_prefix == "GPL" else "Sample" if args.geo_prefix == "GSM" else "Series" if args.geo_prefix == "GSE" else None ) title = ", ".join(metadata_dict["!{}_title".format(geo_key)]) summary = ", ".join(metadata_dict["!{}_summary".format(geo_key)]) description = ", ".join(metadata_dict["!{}_description".format(geo_key)]) etype = ", ".join(metadata_dict["!{}_type".format(geo_key)]) status = ", ".join(metadata_dict["!{}_status".format(geo_key)]) submission_date = ", ".join(metadata_dict["!{}_submission_date".format(geo_key)]) last_update_date = ", ".join(metadata_dict["!{}_last_update_date".format(geo_key)]) organism = set( [", ".join(list(set(y))) for x, y in metadata_dict.items() if "organism" in x] ) pubmed_id = set( [", ".join(list(set(y))) for x, y in metadata_dict.items() if "pubmed_id" in x] ) link = ", ".join(metadata_dict["!{}_web_link".format(geo_key)]) ## Set summary environment variable env_vars = defaultdict(str) ## UPDATE META RECIPE SUMMARY new_summary = ( "GEO Accession ID: {}. Title: {}. GEO Accession site url: {} (See the url for additional information about {}). ".format( args.geo_acc, title, args.geo_url, args.geo_acc ) + "Summary: " + summary + description ) if etype: new_summary += " Type: {}".format(etype) env_vars["GGD_METARECIPE_SUMMARY"] = new_summary ## Update META RECIPE VERSION date_string = "Submission date: {}. Status: {}. Last Update Date: {}. Download Date: {}".format( submission_date, status, last_update_date, datetime.datetime.now().strftime("%m-%d-%Y"), ) env_vars["GGD_METARECIPE_VERSION"] = date_string ## Update META RECIPE Keywords keywords = [ args.geo_acc, args.geo_url, etype, "PubMed id: {}".format(", ".join(sorted(list(pubmed_id)))) if pubmed_id else "", "WEB LINK: {}".format(link) if link else "", ] env_vars["GGD_METARECIPE_KEYWORDS"] = ", ".join(keywords) ## Update META RECIPE SPECIES env_vars["GGD_METARECIPE_SPECIES"] = ", ".join(sorted(list(organism))) print("\nCreating environment variable json file: {}".format(args.json_out)) json.dump(dict(env_vars), open(args.json_out, "w")) if __name__ == "__main__": sys.exit(main() or 0)