Creating a ggd meta-recipe


This page is specific to creating a ggd meta-recipe. If you are looking to create a normal ggd recipe, see Creating a ggd recipe.

The following steps outline how to create, check, and add a ggd data meta-recipe.

1. Update local forked repo

You will need to update the forked ggd-recipes repo on your local machine before you add a recipe to it.

  • Navigate to the forked ggd-recipes repo on your local machine

  • Once in the directory, run the following commands:

$ git checkout master
$ git pull upstream master
$ git push origin master

2. Writing the curation script(s)

A meta-recipe script should be considerably more detailed than a general ggd recipe script.

Additionally, it is common to use multiple scripts for a meta-recipe, with a single main bash script that controls the process and calls all of the other scripts.

Note

Whether you use a single script or multiple scripts, the main script must be a bash script.

Things to consider when building a meta-recipe (this list is by no means comprehensive):

  • What types of identifiers do I need in order to download the data from the database of choice?

  • Are there different types of identifiers? If so, how do I handle them?

  • Does the database have an FTP site, a SQL database, etc.? (How and where am I going to get the data from?)

  • Is there a way to check if the ID and/or the data exists in the database?

  • How do I handle a bad ID or the absence of data?

  • Is there some hash value, like an md5sum hash value, I can use to validate that the contents of the data downloaded is correct and there wasn’t an error during downloading?

  • What data should be downloaded, and where is it coming from?

  • Is there additional processing that needs to happen once the data is downloaded?

  • With an ID, and potentially the downloaded data, what information can I get and use to update the ID specific recipe metadata?

  • How am I going to handle errors?

  • etc.
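
One item in the list above asks whether a hash value can be used to validate a download. The following is a minimal sketch of such a check, assuming the database publishes an md5 checksum for each file. The function names md5_of_file and matches_expected_md5 are hypothetical and are not part of GGD.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected_md5(path, expected_md5):
    """Compare a downloaded file against a published md5 value (hypothetical helper)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A supporting script could call matches_expected_md5 after each download and exit with a non-zero status when it returns False, so that the main bash script can stop the installation.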

Creating the main bash script

This script should handle the ID. Data download, curation, etc. can be handled by this script or can be passed to a different supporting script.

GGD provides 4 environment variables to use during meta-recipe installation in order to help the process. They are:

GGD_METARECIPE_ID

This is the ID provided during installation (Example: GSE123 for the GEO meta-recipe)

SCRIPTS_PATH

The directory path to where the additional scripts are stored. (This path is required in order to run any supporting scripts)

GGD_METARECIPE_ENV_VAR_FILE

This is the file path to store the ID specific updates to the meta-recipe. (More on this later)

GGD_METARECIPE_FINAL_COMMANDS_FILE

This is the file path to a bash script which is used to store the final/actual commands used to install the ID specific data. (More on this later)

In addition to being set as environment variables, these four values are passed to the main bash script as its first four positional parameters, in this strict order:

id=$1            # 1st argument is GGD_METARECIPE_ID

script_path=$2   # 2nd argument is SCRIPTS_PATH

env_var_file=$3  # 3rd argument is GGD_METARECIPE_ENV_VAR_FILE

commands_file=$4 # 4th argument is GGD_METARECIPE_FINAL_COMMANDS_FILE

The GGD_METARECIPE_ID will match exactly what was entered in the install command. GGD will not change case or order.
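
The main script must be bash, but a supporting script may need the same four values. The sketch below shows one way a Python supporting script might resolve them, preferring positional arguments forwarded by the main bash script and falling back to the environment variables. It assumes the variables are exported to child processes, which this page does not state explicitly. The function name resolve_inputs is hypothetical.

```python
import os

# The four names documented above, in the strict positional order
ENV_NAMES = (
    "GGD_METARECIPE_ID",
    "SCRIPTS_PATH",
    "GGD_METARECIPE_ENV_VAR_FILE",
    "GGD_METARECIPE_FINAL_COMMANDS_FILE",
)


def resolve_inputs(argv, environ=None):
    """Map each documented name to a value (hypothetical helper).

    argv holds the positional arguments forwarded by the main bash script.
    Any missing position falls back to the environment variable of the same name.
    """
    environ = os.environ if environ is None else environ
    values = {}
    for position, name in enumerate(ENV_NAMES):
        if position < len(argv) and argv[position]:
            values[name] = argv[position]
        else:
            values[name] = environ.get(name, "")

    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError("Missing meta-recipe inputs: " + ", ".join(missing))
    return values
```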

  1. ID (GGD_METARECIPE_ID):

    The ID should be used to identify and download the data that is associated with that ID. If the ID doesn’t exist or there is no data for that ID, then the bash script should print a warning/error message and exit.

  2. Script Path (SCRIPTS_PATH):

    In order to use a supporting script, the script path must be used. For example, if you have a supporting script named “get_id_metadata.py” which you run from within the main bash script you would do:

    script_path=$2   # 2nd argument is SCRIPTS_PATH
    
    python $script_path/get_id_metadata.py <other required arguments>
    

    or

    python $SCRIPTS_PATH/get_id_metadata.py <other required arguments>
    

    where <other required arguments> are the arguments needed for the “get_id_metadata.py” script.

  3. Updating ID specific metadata (GGD_METARECIPE_ENV_VAR_FILE, GGD_METARECIPE_FINAL_COMMANDS_FILE):

    One of the main advantages of meta-recipes is the ability to update a recipe’s metadata based on information about the specific ID supplied. That is, based on the ID, what information is available that should be added to the metadata?

    Although it is not required to update the metadata, it is highly recommended that you do. Otherwise, the metadata will consist of the general information of the meta-recipe without any ID specific info.

    GGD provides two environment variables to use for this purpose.

    GGD_METARECIPE_FINAL_COMMANDS_FILE: This represents a bash file that should store the commands used for installing and processing the data files specific to the ID.

    The main meta-recipe script will be doing a lot of work. This file should capture the essential pieces for determining where and how the data was installed and processed. Other information should be kept out.

    This file acts as a placeholder for what will be updated in the ID specific meta-recipe metadata. That is, after the meta-recipe is installed and the metadata has been updated, a user will be able to access these commands through the ggd pkg-info command. This helps to support reproducibility and transparency.

    Again, although it is not required, it is highly recommended that you take this step.

    GGD_METARECIPE_ENV_VAR_FILE: This file represents different “environment variables” that can be set in order to update the metadata of an ID specific meta-recipe. This file is a .json file. This means that the meta-recipe needs to save the contents of the file as json, otherwise GGD will not be able to use the updated environment variables. The json file should act as a dictionary/map with the environment variables to change as keys and the new content as values.

    The available keys are:

    GGD_METARECIPE_SUMMARY

    (string) A summary of the installed data

    GGD_METARECIPE_SPECIES

    (string) The species of the installed data

    GGD_METARECIPE_GENOME_BUILD

    (string) The genome build of the installed data

    GGD_METARECIPE_VERSION

    (string) The version of the data installed

    GGD_METARECIPE_KEYWORDS

    (list) A list of key words to add to the metadata

    GGD_METARECIPE_DATA_PROVIDER

    (string) The data provider of the recipe. (This should already exist and should generally not be changed)

    GGD_METARECIPE_FILE_TYPE

    (list) A list of file types for the files installed by the package

    GGD_METARECIPE_GENOMIC_COORDINATE_BASE

    (string) A string that represents the coordinate base of the installed files

    Not all keys are required to be set. It is recommended that the GGD_METARECIPE_SUMMARY be updated, the GGD_METARECIPE_SPECIES and GGD_METARECIPE_GENOME_BUILD be updated if data is available to update them, the GGD_METARECIPE_VERSION be updated, and the GGD_METARECIPE_KEYWORDS be updated.

    The remaining keys/environment variable names can be used if data is available to update them.

    Note

    The data provider can be updated, but it is recommended that the data provider is not updated. If the data provider needs to be updated we suggest that a different recipe be created for that data provider specifically.

    After an ID specific meta-recipe is installed, GGD will check to see if either of the two files exists. If they do, GGD will update the metadata of the ID specific meta-recipe. These updates are available via the ggd pkg-info command.

    Please try to update the metadata whenever possible.

    The meta-recipe main bash script should also clean up any extra files or other processes that were needed during the installation process.
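
As a sketch of how the GGD_METARECIPE_ENV_VAR_FILE contents could be checked before they are written, the helper below accepts only the keys listed above and applies the (string)/(list) types from that list. The function name write_env_var_file is hypothetical. How strictly GGD itself enforces these types is not stated on this page; the GEO example in section 6 writes its keywords as a single comma-separated string, so GGD may be more lenient than this check.

```python
import json

# Keys documented above as (string) values
STRING_KEYS = {
    "GGD_METARECIPE_SUMMARY",
    "GGD_METARECIPE_SPECIES",
    "GGD_METARECIPE_GENOME_BUILD",
    "GGD_METARECIPE_VERSION",
    "GGD_METARECIPE_DATA_PROVIDER",
    "GGD_METARECIPE_GENOMIC_COORDINATE_BASE",
}

# Keys documented above as (list) values
LIST_KEYS = {"GGD_METARECIPE_KEYWORDS", "GGD_METARECIPE_FILE_TYPE"}


def write_env_var_file(updates, json_path):
    """Validate the metadata updates and save them as a json dictionary."""
    for key, value in updates.items():
        if key in STRING_KEYS:
            if not isinstance(value, str):
                raise TypeError("{} must be a string".format(key))
        elif key in LIST_KEYS:
            if not isinstance(value, list):
                raise TypeError("{} must be a list".format(key))
        else:
            raise KeyError("Unknown meta-recipe key: {}".format(key))

    # json_path should be the GGD_METARECIPE_ENV_VAR_FILE location
    with open(json_path, "w") as handle:
        json.dump(updates, handle)
```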

Creating the supporting script(s)

Supporting scripts are not needed if everything can be done easily in the main bash script without them. However, supporting scripts can be helpful in defining the updated metadata for the ID specific recipe installed, or for other tasks that aren’t done easily in bash.

Supporting scripts need to be accessible through the main bash script, and any arguments needed for the supporting scripts need to be accessible and/or generated within the main bash script.

There is no requirement on the language of supporting scripts. However, if a supporting script is written in a language other than bash, that language needs to be added to the dependencies list when making a ggd meta-recipe to ensure that the language is available when installing the meta-recipe.

It is recommended that the json file used for updating the metadata be created from a supporting script, because creating json files from a bash script is not as straightforward as it is in some other languages. For example, if you are using a python script to create the json file, a simple example would be:

import json
import sys

## This file path should be the GGD_METARECIPE_ENV_VAR_FILE passed in from the main bash script
json_outfile = sys.argv[1]

## Create dictionary
metadata_dict = dict()

## Add updated info to the dictionary
## (replace the placeholder string with the ID specific summary)
metadata_dict["GGD_METARECIPE_SUMMARY"] = "<updated summary>"

## ... add any other GGD_METARECIPE_* keys here ...

## Save data as a json file to the GGD_METARECIPE_ENV_VAR_FILE location
with open(json_outfile, "w") as out:
    json.dump(metadata_dict, out)

Note

The json file needs to be formatted as a dictionary: {"GGD_METARECIPE_SUMMARY": "An Updated Summary", "GGD_METARECIPE_SPECIES": "ID specific species", …}

Supporting scripts can be as simple or as complicated as they need to be. We recommend keeping them as simple as possible to help provide transparency about what is going on.

Examples of the GEO meta-recipe scripts are provided below in section 6.
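
If a supporting script performs downloads or processing itself, the commands it runs still belong in the GGD_METARECIPE_FINAL_COMMANDS_FILE. The sketch below shows one possible way to run a command and record it in the same step. It assumes the main bash script passes the commands file path to the supporting script; run_and_record is a hypothetical helper, and shlex.join requires Python 3.8 or later.

```python
import shlex
import subprocess


def run_and_record(command, commands_file):
    """Run a command (given as a list of arguments) and append it to the commands file.

    The command is recorded only after it succeeds, so the file lists
    just the steps that actually contributed to the installed data.
    """
    # check=True raises CalledProcessError if the command exits non-zero
    subprocess.run(command, check=True)

    # Append a shell-quoted, copy-pasteable version of the command
    with open(commands_file, "a") as handle:
        handle.write(shlex.join(command) + "\n")
```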

3. Creating a ggd meta-recipe using the ggd cli

The ggd command line interface (cli) contains tools to create and test a data meta-recipe.

If it has not been installed, install the ggd cli following the steps outlined in Using GGD.

With the ggd cli installed you can now transform your meta-recipe script(s) created in the previous step into a ggd meta-recipe.

To do this you will use the ggd make-meta-recipe command. See the make-meta-recipe docs page for more information on the command.

Note

The make-meta-recipe command is different than the make-recipe command. The first creates a meta-recipe while the second creates a normal ggd recipe.

It is important that the summary of the meta-recipe provides enough information about what the meta-recipe is and what it does, as well as what it expects in terms of an ID, so that a user can easily identify which meta-recipe they would like to use and how to use it.

None of the information added during the make-meta-recipe stage should include ID specific information, other than the summary stating how to use IDs.

A meta-recipe requires the following fields to be filled out:

  • species: “meta-recipe”

  • genome build: “meta-recipe”

  • data version: “meta-recipe” (Not required, but suggested so that the version can be updated based on the installation of a specific ID recipe)

  • data provider: The data provider where the meta-recipe will pull data from

  • summary: A detailed summary of the meta-recipe

  • author: Who created the meta-recipe

  • package version: The version of the meta-recipe/package (Usually “1” for the first version of a meta-recipe)

  • keywords: Keywords that will help to distinguish the meta-recipe

  • coordinate base: “NA” unless otherwise known. (Can be updated by the meta-recipe during an ID specific recipe installation)

  • name: A defining name to use for the meta-recipe

  • script: The main bash script for the meta-recipe

  • extra scripts: A space separated list of all extra/supporting scripts that are used by the meta-recipe

  • dependency: Any software or ggd data dependencies required by the main or supporting scripts of the meta-recipe

Example of making a meta-recipe:

$ ggd make-meta-recipe \
      --authors mjc \
      --package-version 1 \
      --data-provider GEO \
      --data-version "meta-recipe" \
      --species "meta-recipe" \
      --genome-build "meta-recipe" \
      --cb "NA" \
      --summary "A meta-recipe for the Gene Expression Omnibus (GEO) database from NCBI. ... " \
      --extra-scripts parse_geo_header.py \
      -k Gene-Expression-Omnibus \
      -k GEO \
      -k GEO-Accession-ID \
      -k GEO-meta-recipe \
      --name geo-accession \
      geo_meta_recipe_script.sh

This will create a new ggd meta-recipe named meta-recipe-geo-accession-geo-v1

meta-recipe-geo-accession-geo-v1 is a directory with the following files in it:

  • checksums_file.txt

  • meta.yaml

  • metarecipe.sh

  • parse_geo_header.py

  • post-link.sh

  • recipe.sh
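
Before moving on to testing, it can be useful to confirm that the generated directory contains the files listed above. This is a minimal sketch and not a GGD tool. The base file names are taken from the GEO example on this page, and missing_recipe_files is a hypothetical helper.

```python
import os

# Files listed above for the GEO example, excluding the extra/supporting scripts
BASE_FILES = [
    "checksums_file.txt",
    "meta.yaml",
    "metarecipe.sh",
    "post-link.sh",
    "recipe.sh",
]


def missing_recipe_files(recipe_dir, extra_scripts=()):
    """Return the sorted names of expected files that are absent from recipe_dir."""
    expected = BASE_FILES + list(extra_scripts)
    return sorted(
        name
        for name in expected
        if not os.path.isfile(os.path.join(recipe_dir, name))
    )
```

For the example above, missing_recipe_files("meta-recipe-geo-accession-geo-v1", ["parse_geo_header.py"]) should return an empty list if every file was created.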

4. Checking/Testing the new meta-recipe

The new meta-recipe needs to be tested. GGD provides an easy to use tool to do this. The tool will check if the meta-recipe can be built into a data-package, if it can be installed, along with other aspects of the recipe that are pertinent for successful data meta-recipes.

This tool is ggd check-recipe. check-recipe is used to test both normal ggd data recipes and ggd data meta-recipes. One major difference from the user side is that for a meta-recipe the --id parameter is required, while it is ignored during a normal recipe check.

This means that ggd will not only check that a meta-recipe works properly on its own, but also that it fulfills its requirements of installing ID specific data.

Using the meta-recipe created in the previous step, you would run the following command in order to test the new meta-recipe:

ggd check-recipe meta-recipe-geo-accession-geo-v1 --id GSE123

The ID can be any one of the IDs that can be used with the meta-recipe. check-recipe just requires that a proper ID be used for testing.

Note

check-recipe will fail for a meta-recipe if no --id is provided.

Additionally, the meta-recipe should be able to handle the occurrence of a bad ID.

If check-recipe fails there will be information on why it failed. Fix the problems and continue to test the meta-recipe until it passes.

Once the meta-recipe has passed the tests it can be added to GGD.
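
If you test several meta-recipes or IDs, the check can be scripted. The sketch below wraps the ggd check-recipe command shown above using only the --id flag from this page. It assumes that check-recipe exits with a non-zero status when a check fails, which is not stated here. Both function names are hypothetical.

```python
import subprocess


def build_check_command(recipe_dir, test_id):
    """Build the check-recipe command for a meta-recipe directory and a test ID."""
    if not test_id:
        # A meta-recipe check cannot run without an ID
        raise ValueError("check-recipe needs an --id for a meta-recipe")
    return ["ggd", "check-recipe", recipe_dir, "--id", test_id]


def check_recipe(recipe_dir, test_id):
    """Run check-recipe and report success (assumes non-zero exit on failure)."""
    result = subprocess.run(build_check_command(recipe_dir, test_id))
    return result.returncode == 0
```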

5. Submit the new ggd meta-recipe to the original ggd-recipes repo

Once the new ggd meta-recipe you created passes the previous step you are ready to add it to the original ggd-recipes repo.

To do this you will need to create a pull request.

From your local machine, add the new data meta-recipe you created to the forked ggd-recipes repo. You will add it to the recipes/ directory. If you do not put it in the right directory it will be rejected. The recipes file convention is as follows:

  • All recipes are stored within the ggd-recipes/recipes directory

  • The recipes directory has the following format:

    /<path to forked ggd-recipes repo>/recipes/<ggd channel>/<species>/<genome-build>/
    
    • <path to forked ggd-recipes repo> is the path to the forked ggd-recipes repo on your local machine.

    • recipes is the recipes directory.

    • <ggd channel> is the ggd channel that the recipe should go in. This depends on the type of data you are adding.

      For a meta-recipe, the species and genome-build directories are both named meta-recipe, so you should add it to:

      /<path to forked ggd-recipes repo>/recipes/<ggd channel>/meta-recipe/meta-recipe/

For the meta-recipe-geo-accession-geo-v1 meta-recipe created above you would use the following commands:

$ mv meta-recipe-geo-accession-geo-v1 /<forked ggd-recipes>/recipes/genomics/meta-recipe/meta-recipe/
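
The directory convention above can be expressed as a small path helper. This is only an illustration of the layout described on this page; meta_recipe_destination is a hypothetical name and the repo path in the usage note is made up.

```python
from pathlib import Path


def meta_recipe_destination(repo_root, channel, recipe_name):
    """Return where a meta-recipe directory belongs inside a forked ggd-recipes repo.

    Layout: <repo>/recipes/<ggd channel>/meta-recipe/meta-recipe/<recipe name>
    (for meta-recipes, both the species and genome-build levels are "meta-recipe").
    """
    return (
        Path(repo_root)
        / "recipes"
        / channel
        / "meta-recipe"
        / "meta-recipe"
        / recipe_name
    )
```

For example, meta_recipe_destination("/home/me/ggd-recipes", "genomics", "meta-recipe-geo-accession-geo-v1") gives the same destination as the mv command above, with /home/me/ggd-recipes standing in for the path to your fork.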

Once the meta-recipe is there you will need to add it to your forked ggd-recipes repo. Navigate to the forked ggd-recipes directory and use the following commands:

  • Add the meta-recipe to the git repo:

$ git add recipes/genomics/meta-recipe/meta-recipe/meta-recipe-geo-accession-geo-v1/

  • Commit the addition to the repo (The vim text editor will open up. Add a comment about the new meta-recipe and save it):

$ git commit

  • Push the commit to your forked repo on github (You will be asked to fill out your github credentials):

$ git push origin

  • Go to the ggd-recipes github page for your username (https://github.com/<USERNAME>/ggd-recipes/).

  • Under the green “Clone or download” button click on Pull request.

  • Where it says base fork: make sure it is on gogetdata/ggd-recipes. And where it says base: make sure it is on master.

  • Click the green Create pull request button.

  • Add some comments and complete the pull request.

You have now created a pull request with your new data meta-recipe. The meta-recipe will go through a continuous integration step where the recipe will be tested.

If it passes, the recipe will be added to the gogetdata/ggd-recipes repo and anyone using the ggd tool will be able to access it.

If it does not pass, you will be informed by the ggd team, and they will work with you on getting it working.

Note

Because of the ID required by meta-recipes, there are additional steps that need to be taken during the continuous integration process. In the pull request comments make sure to indicate the test ID you would like used during the testing phase. The GGD team will work with you during this process to make sure that the process is done correctly.

6. Example of the Gene Expression Omnibus (GEO) main bash script and supporting python script

Below is an example of the main bash script and a supporting python script used to create a meta-recipe for the GEO database. This stands as one example of how to create a meta-recipe, but does not indicate how every meta-recipe should be created. As with all ggd recipes, the recipe scripts should be created in order to correctly install and process the data the recipe is created for.

  1. Main bash:

    ## GEO accession number
    geo_acc_id=$1
    
    ## Script Location: The directory path to the supporting scripts
    script_path=$2
    
    ## Json File name
    json_outfile=$3
    
    ## file path for the subsetted commands used to download the data
    commands_outfile=$4
    
    ## Force Upper Case
    #geo_acc_id=$(echo ${geo_acc_id^^}) Requires bash >= 4.2 (macOSX bash version == < 4)
    geo_acc_id="$(echo $geo_acc_id | tr '[:lower:]' '[:upper:]')"
    
    echo -e "\n    Checking GEO for $geo_acc_id"
    echo -e "  ================================\n"
    
    ## Get the GEO number excluding the prefix
    geo_digit="${geo_acc_id//[^[:digit:]]/}"
    
    ## Get GEO URL stub based on the number of digits
    if [[ "${#geo_digit}" -ge 3 ]]
    then
        stub=$(echo "$geo_acc_id" | sed 's/...$/nnn/')
    
    elif [[ "${#geo_digit}" -eq 2 ]]
    then
        stub=$(echo "$geo_acc_id" | sed 's/..$/nnn/')
    
    elif [[ "${#geo_digit}" -eq 1 ]]
    then
        stub=$(echo "$geo_acc_id" | sed 's/.$/nnn/')
    
    fi
    
    ## URL vars
    prefix=""
    soft_url=""
    matrix_url=""
    annot_url=""
    gsm_url=""
    sup_url=""
    
    ## Check accession number prefix
    if [[ $geo_acc_id == "GDS"* ]]
    then
    
        ## Set PREFIX
        prefix="GDS"
    
        ## Get the soft file from the dataset
        soft_url="https://ftp.ncbi.nlm.nih.gov/geo/datasets/$stub/$geo_acc_id/soft/$geo_acc_id.soft.gz"
    
        ## Supplemental URL
        sup_url="https://ftp.ncbi.nlm.nih.gov/geo/datasets/$stub/$geo_acc_id/suppl/"
    
    
    elif [[ $geo_acc_id == "GSE"* ]]
    then
    
        ## Set PREFIX
        prefix="GSE"
    
        ## Get the soft file for the series
        soft_url="https://ftp.ncbi.nlm.nih.gov/geo/series/$stub/$geo_acc_id/soft/$geo_acc_id""_family.soft.gz"
        ## Get the matrix file for the series
        matrix_url="https://ftp.ncbi.nlm.nih.gov/geo/series/$stub/$geo_acc_id/matrix/$geo_acc_id""_series_matrix.txt.gz"
    
        ## Supplemental URL
        sup_url="https://ftp.ncbi.nlm.nih.gov/geo/series/$stub/$geo_acc_id/suppl/"
    
    elif [[ $geo_acc_id == "GPL"* ]]
    then
    
        ## Set PREFIX
        prefix="GPL"
    
        ## Get the soft file for the platform
        soft_url="https://ftp.ncbi.nlm.nih.gov/geo/platforms/$stub/$geo_acc_id/soft/$geo_acc_id""_family.soft.gz"
        ## Get the annot file for the platform
        annot_url="https://ftp.ncbi.nlm.nih.gov/geo/platforms/$stub/$geo_acc_id/annot/$geo_acc_id.annot.gz"
        ## Supplemental URL
        sup_url="https://ftp.ncbi.nlm.nih.gov/geo/platforms/$stub/$geo_acc_id/suppl/"
    
    elif [[ $geo_acc_id == "GSM"* ]]
    then
    
        ## Set PREFIX
        prefix="GSM"
    
        ## Get the Table file from the CGI GEO Query site
        gsm_url="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=$geo_acc_id&form=text&view=full"
        ## Supplemental URL
        sup_url="https://ftp.ncbi.nlm.nih.gov/geo/samples/$stub/$geo_acc_id/suppl/"
    
    else ## Bad accession prefix
        echo -e "\n!!ERROR!! GEO does not recognize the supplied accession id: '$geo_acc_id'." 1>&2
        echo -e "  Acceptable accession prefix include: \n\t- GDSxxx  \n\t- GPLxxx  \n\t- GSMxxx \n\t- GSExxx\n" 1>&2
        exit 1
    
    fi
    
    
    ## Check if accession id exists
    message=$(xmllint --xpath "string(//WarningList)" <(curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=$geo_acc_id" --silent))
    
    if [[ $message == *"No items found"* ]]
    then
        ## If accession ID not found
        echo -e "!!ERROR!! Accession ID $geo_acc_id not found in GEO\n" 1>&2
        exit 1
    else
        echo -e "Found Accession ID in GEO: $geo_acc_id\n"
    fi
    
    
    ## Get the Accession URL for the GEO Accession page
    geo_acc_url="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=$geo_acc_id"
    
    echo -e "Main GEO page for $geo_acc_id: $geo_acc_url\n"
    
    echo -e "Checking $geo_acc_id for available files"
    echo -e "-------------------------------------\n"
    
    final_commands=""
    ## Check for SOFT URL
    if [[ ! -z $soft_url ]]
    then
        ## Check if soft url file exists
        if curl --output /dev/null --silent --head --fail "$soft_url"
        then
            echo -e "\tDownloading SOFT file: $soft_url\n"
            ## Download file
            ## GEOxxx.soft.gz file
            ##     or
            ## GEOxxx_family.soft.gz file
            curl "$soft_url" -O -J --silent
            final_commands="""$final_commands
    curl \"$soft_url\" -O -J --silent
    """
        fi
    fi
    
    ## Check for MATRIX URL
    if [[ ! -z $matrix_url ]]
    then
        ## Check if matrix url file exists
        if curl --output /dev/null --silent --head --fail "$matrix_url"
        then
            echo -e "\tDownloading MATRIX file: $matrix_url\n"
            ## Download file
            ## GEOxxx_series_matrix.txt.gz file
            curl "$matrix_url" -O -J --silent
            final_commands="""$final_commands
    curl \"$matrix_url\" -O -J --silent
    """
        fi
    fi
    
    ## Check for ANNOT URL
    if [[ ! -z $annot_url ]]
    then
        ## Check if annot url file exists
        if curl --output /dev/null --silent --head --fail "$annot_url"
        then
            echo -e "\tDownloading ANNOT file: $annot_url\n"
            ## Download file
            ## GEOxxx.annot.gz file
            curl "$annot_url" -O -J --silent
            final_commands="""$final_commands
    curl \"$annot_url\" -O -J --silent
    """
        fi
    fi
    
    ## Check for GSM URL
    if [[ ! -z $gsm_url ]]
    then
        ## Check if gsm url file exists
        if curl --output /dev/null --silent --head --fail "$gsm_url"
        then
            echo -e "\tDownloading table: $gsm_url\n"
            ## Download file
            ## GEOxxx.txt file
            curl "$gsm_url" -O -J --silent
            final_commands="""$final_commands
    curl \"$gsm_url\" -O -J --silent
    """
        fi
    fi
    
    
    ## Check for Supplemental URL
    if [[ ! -z $sup_url ]]
    then
        ## Check if sup url exists
        if curl --output /dev/null --silent --head --fail "$sup_url"
        then
            ## Iterate over all GEO Accession ID specific files in sup url
            for file in $(curl "$sup_url" --silent | grep -oE  "<a href=".+?">.+?<\/a>" | cut -f 2 -d '"' | grep  "^$geo_acc_id")
            do
                ## Build sup file url
                sup_file_url="$sup_url$file"
    
                ## Check if it exists
                if curl --output /dev/null --silent --head --fail "$sup_file_url"
                then
                    ## Download file
                    ## GEOxxx sup file
                    echo -e "\tDownloading Sup. File: $sup_file_url\n"
                    curl "$sup_file_url" -O -J --silent
                    final_commands="""$final_commands
    curl \"$sup_file_url\" -O -J --silent
    """
    
                    ## Check for tar file
                    if [[ "$file" == *".tar"* ]]
                    then
                        echo -e "\t\tExtracting TAR File $file"
    
                        ## Extract TAR file
                        if [[ "$file" == *".tar" ]]
                        then
                            tar -xf $file
                            final_commands="""$final_commands
    tar -xf $file
    """
                        elif [[ "$file" == *".tar.gz" ]]
                        then
                            tar -xzf $file
                            final_commands="""$final_commands
    tar -xzf $file
    """
                        elif [[ "$file" == *".tar.bz2" ]]
                        then
                            tar -xjf $file
                            final_commands="""$final_commands
    tar -xjf $file
    """
                        else
                            echo -e "!!ERROR!! Unable to extract tar file" 1>&2
                            exit 1
                        fi
    
                        ## remove the tar file
                        rm $file
                    fi
                fi
            done
        fi
    fi
    
    
    ## Commands used to download the data files
    echo "$final_commands" > $commands_outfile
    
    ## Get the main file to parse the header from
    ### For GDS, GPL, and GSE the .soft file should be used
    ### For GSM, the .txt file should be used
    main_file=""
    submain_file=""
    for file in $(pwd)/*
    do
        if [[ $prefix == "GSM" ]]
        then
    
            if [[ "$file" == *".txt" ]]
            then
                main_file=$file
            fi
    
        else
            if [[ "$file" == *".soft"* ]]
            then
                main_file=$file
            elif [[ "$file" == *"matrix"* ]]
            then
                submain_file=$file
            fi
        fi
    done
    
    ## If GSE and soft file does not exist, use the matrix file
    if [[ $main_file == "" ]]
    then
        main_file=$submain_file
    fi
    
    ## Update ID Specific meta-recipe
    python $script_path/parse_geo_header.py --geo-acc $geo_acc_id --geo-file $main_file  --geo-url $geo_acc_url --geo-prefix $prefix --geo-files-dir $(pwd) --json-out $json_outfile
    
    echo -e "DONE\n"
    
  2. Supporting python script named “parse_geo_header.py”

    from __future__ import print_function
    
    import argparse
    import datetime
    import gzip
    import io
    import json
    import os
    import sys
    from collections import defaultdict
    
    
    # ---------------------------------------------------------------------------------------------------------------------------------
    ## Argument Parser
    # ---------------------------------------------------------------------------------------------------------------------------------
    def arguments():
        """Argument method  """
    
        p = argparse.ArgumentParser(
            description="Parse GEO file header and update recipe meta-data"
        )
    
        req = p.add_argument_group("Required Arguments")
    
        req.add_argument(
            "--geo-acc",
            metavar="GEO Accession ID",
            required=True,
            help="The GEO accession ID",
        )
    
        req.add_argument(
            "--geo-file", metavar="GEO file", required=True, help="The GEO file to parse"
        )
    
        req.add_argument(
            "--geo-url",
            metavar="GEO Accession URL",
            required=True,
            help="The GEO Accession ID specific home page URL",
        )
    
        req.add_argument(
            "--geo-prefix",
            metavar="GEO Accession prefix",
            required=True,
            choices=["GDS", "GPL", "GSM", "GSE"],
            help="The GEO Accession id Prefix. (GDS, GPL, GSM, GSE)",
        )
    
        req.add_argument(
            "--geo-files-dir",
            metavar="GEO downloaded files",
            required=True,
            help="The directory path to where the files were downloaded",
        )
    
        req.add_argument(
            "--json-out",
            metavar="JSON out file",
            required=True,
            help="The name of the json output file to create that will contain the ggd meta-recipe environment variables",
        )
    
        return p.parse_args()
    
    
    # ---------------------------------------------------------------------------------------------------------------------------------
    ## Main
    # ---------------------------------------------------------------------------------------------------------------------------------
    
    
    def main():
    
        args = arguments()
    
        ## Open GEO File
        try:
            fh = (
                gzip.open(args.geo_file, "rt", encoding="utf-8", errors="ignore")
                if args.geo_file.endswith(".gz")
                else io.open(args.geo_file, "rt", encoding="utf-8", errors="ignore")
            )
        except IOError as e:
            print("\n!!ERROR!! Unable to read the GEO File: '{}'".format(args.geo_file))
            print(str(e))
            sys.exit(1)
    
        print("\nParsing GEO header for file: {}".format(args.geo_file))
    
        metadata_dict = defaultdict(list)
    
        for i, line in enumerate(fh):
    
            line = line.strip()
    
            if not line:
                continue
    
            ## Check if line is a header
            if line[0] == "!":
    
                line_list = line.strip().split("=")
    
                if len(line_list) > 1:
                    metadata_dict[line_list[0].replace(" ", "")].append(
                        line_list[1].strip()
                    )
    
        fh.close()
    
        geo_key = (
            "dataset"
            if args.geo_prefix == "GDS"
            else "Platform"
            if args.geo_prefix == "GPL"
            else "Sample"
            if args.geo_prefix == "GSM"
            else "Series"
            if args.geo_prefix == "GSE"
            else None
        )
    
        title = ", ".join(metadata_dict["!{}_title".format(geo_key)])
    
        summary = ", ".join(metadata_dict["!{}_summary".format(geo_key)])
    
        description = ", ".join(metadata_dict["!{}_description".format(geo_key)])
    
        etype = ", ".join(metadata_dict["!{}_type".format(geo_key)])
    
        status = ", ".join(metadata_dict["!{}_status".format(geo_key)])
    
        submission_date = ", ".join(metadata_dict["!{}_submission_date".format(geo_key)])
    
        last_update_date = ", ".join(metadata_dict["!{}_last_update_date".format(geo_key)])
    
        organism = set(
            [", ".join(list(set(y))) for x, y in metadata_dict.items() if "organism" in x]
        )
    
        pubmed_id = set(
            [", ".join(list(set(y))) for x, y in metadata_dict.items() if "pubmed_id" in x]
        )
    
        link = ", ".join(metadata_dict["!{}_web_link".format(geo_key)])
    
        ## Set summary environment variable
        env_vars = defaultdict(str)
    
        ## UPDATE META RECIPE SUMMARY
        new_summary = (
            "GEO Accession ID: {}. Title: {}. GEO Accession site url: {} (See the url for additional information about {}). ".format(
                args.geo_acc, title, args.geo_url, args.geo_acc
            )
            + "Summary: "
            + summary
            + description
        )
        if etype:
            new_summary += " Type: {}".format(etype)
    
        env_vars["GGD_METARECIPE_SUMMARY"] = new_summary
    
        ## Update META RECIPE VERSION
        date_string = "Submission date: {}. Status: {}. Last Update Date: {}. Download Date: {}".format(
            submission_date,
            status,
            last_update_date,
            datetime.datetime.now().strftime("%m-%d-%Y"),
        )
        env_vars["GGD_METARECIPE_VERSION"] = date_string
    
        ## Update META RECIPE Keywords
        keywords = [
            args.geo_acc,
            args.geo_url,
            etype,
            "PubMed id: {}".format(", ".join(sorted(list(pubmed_id)))) if pubmed_id else "",
            "WEB LINK: {}".format(link) if link else "",
        ]
        env_vars["GGD_METARECIPE_KEYWORDS"] = ", ".join(keywords)
    
        ## Update META RECIPE SPECIES
        env_vars["GGD_METARECIPE_SPECIES"] = ", ".join(sorted(list(organism)))
    
        print("\nCreating environment variable json file: {}".format(args.json_out))
        json.dump(dict(env_vars), open(args.json_out, "w"))
    
    
    if __name__ == "__main__":
        sys.exit(main() or 0)
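
The URL stub step in the main bash script above (replace up to the last three digits of the accession ID with "nnn") can be hard to read in its sed form. The following is a Python restatement of that step for illustration only. It is not part of the GEO meta-recipe; geo_stub is a hypothetical name, and the strict "prefix followed by digits" validation is an added assumption that is stricter than the bash script.

```python
import re


def geo_stub(accession):
    """Return the GEO directory stub for an accession ID.

    Mirrors the bash logic: upper-case the ID, then replace the last
    one to three digits with "nnn".
    """
    accession = accession.upper()

    # Assumption: a valid ID is one of the four prefixes followed only by digits
    if re.fullmatch(r"(GDS|GSE|GPL|GSM)\d+", accession) is None:
        raise ValueError("Unrecognized GEO accession ID: {}".format(accession))

    # Replace the trailing one to three digits, as the sed commands do
    return re.sub(r"\d{1,3}$", "nnn", accession)
```

Under these assumptions, geo_stub("GSE123") and geo_stub("GSE5") both give "GSEnnn", while geo_stub("gse12345") gives "GSE12nnn".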