Using GGD
To see and/or search for data packages available through GGD, see: Available data packages
For a brief introduction to how ggd works and to start using ggd see: GGD Quick Start
To request a new data recipe please fill out the GGD Recipe Request Form.
Important
If you use GGD, please cite the Nature Communications GGD paper
1. Install conda
ggd requires the conda package management system to be installed on your system. Loading conda from a module is not sufficient, because data packages are stored in the conda root. Please install Anaconda or Miniconda. We recommend Miniconda, specifically the Python 3 version.
Warning
After December 31, 2020, GGD will no longer maintain Python 2 compatibility. Python 2 may still work, but maintenance will focus on Python 3. This decision follows the end of life of Python 2 on January 1, 2020; GGD maintains Python 2 compatibility for one year from that date.
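Because ggd stores data packages under the conda root, it can help to confirm that `conda` is actually available in your shell and to see which root it reports. The following is a minimal sketch, not part of ggd: the function name `conda_root` is hypothetical, and it relies only on `conda info --base`, which prints the base directory of the conda installation.

```shell
# Print the conda root directory, or explain what is missing.
# conda_root is a hypothetical helper name, not a ggd or conda command.
conda_root() {
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH; install Miniconda or Anaconda first" >&2
        return 1
    fi
    conda info --base
}

# Example call; prints a notice instead of failing when conda is absent.
conda_root || echo "conda is not available in this shell"
```

If the reported directory belongs to a shared module rather than your own installation, ggd may not be able to write data packages there.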
2. Configure the conda channels
ggd data packages are stored in the Anaconda cloud, and ggd also depends on software tools distributed through other conda channels. A ggd conda channel and the other required channels must therefore be added to your conda configuration. You can add as many of the available ggd channels as you like, but only one is required. As ggd becomes more widely used, additional channels will be created to support different areas of research.
Available ggd channels:
ggd-genomics
Run the following commands, adding in additional ggd channels as desired:
$ conda config --add channels defaults
$ conda config --add channels ggd-genomics
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
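To check that the channels were registered, you can inspect the output of `conda config --show channels`. The sketch below is an illustration under stated assumptions: `missing_channels` is a hypothetical helper name, and it assumes the usual output layout of one `  - <channel>` entry per line. Note that each `--add` places the channel at the top of the list, so after the four commands above `conda-forge` has the highest priority and `defaults` the lowest.

```shell
# Read `conda config --show channels` output on stdin and print every
# required channel that is missing. missing_channels is a hypothetical name.
missing_channels() {
    configured=$(cat)
    for ch in defaults ggd-genomics bioconda conda-forge; do
        printf '%s\n' "$configured" | grep -qx "[[:space:]]*- $ch" \
            || printf '%s\n' "$ch"
    done
}

# Example call, skipped when conda is not installed.
if command -v conda >/dev/null 2>&1; then
    conda config --show channels | missing_channels
fi
```

An empty result means all four channels are configured; any name printed still needs a `conda config --add channels` call.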
3. Install ggd
Note
Step 2 above is required before installing ggd. If Step 2 has not been completed, the ggd installation will fail.
ggd must be installed on your system before you can use it. Run the following command to install the ggd cli:
$ conda install -c bioconda ggd
4. ggd tools
The ggd command line tool (cli) installed in step 3 has built-in tools for accessing and managing data packages. These tools include:
$ ggd search
: Search for a ggd data package
$ ggd predict-path
: Predict the file path of a data package that has not been installed yet (useful for workflows such as Snakemake)
$ ggd install
: Install ggd data package(s)
$ ggd uninstall
: Uninstall ggd data package(s)
$ ggd list
: List the installed data packages
$ ggd get-files
: Get the files for an installed ggd package
$ ggd pkg-info
: Show a specific ggd package’s info
$ ggd show-env
: Show the ggd specific environment variables
$ ggd make-recipe
: Create a ggd recipe from a bash script
$ ggd make-meta-recipe
: Create a ggd meta-recipe
$ ggd check-recipe
: Check/test a ggd recipe
For information about specific tools see: GGD-CLI
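Several of the tools listed above are typically used together. As a rough sketch, the wrapper below installs a package, lists what is installed, and prints the installed file paths. `fetch_package` is a hypothetical name, and the sketch uses only the subcommand forms that appear on this page.

```shell
# Install and locate one data package using only the subcommands
# listed above. fetch_package is a hypothetical wrapper name.
fetch_package() {
    pkg=$1
    if ! command -v ggd >/dev/null 2>&1; then
        echo "ggd not found; complete steps 1-3 first" >&2
        return 1
    fi
    ggd install "$pkg" || return 1   # install the data package
    ggd list                          # show the installed data packages
    ggd get-files "$pkg"              # print the installed file paths
}

# Example call (commented out because it downloads several gigabytes):
# fetch_package grch38-reference-genome-ensembl-v1
```

Stopping at the first failure keeps a failed install from being followed by misleading `list` or `get-files` output.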
5. Contributing to ggd
We intend for ggd to become a widely used data management system for genomics and other research areas. ggd provides support for reproducibility through conda’s naming, version tracking, and dependency handling structure. One major function of the ggd cli tools is to provide an easy way to add data packages to the data repository.
We welcome and encourage everyone to contribute to the data repository hosted by ggd.
Instructions on how to create a data package and add it to ggd can be found on the Contribute documentation pages.
ggd Use Case
Suppose you need to align sequences to the human reference genome for an analysis. Without ggd, you would need to find and download the reference genome from one of the sites that host it, make sure it is the correct genome build, and then sort and index it before you could use it.
ggd simplifies this process by allowing you to search and install available processed genomic data packages using the ggd tool.
Search for a reference genome
$ ggd search reference genome
----------------------------------------------------------------------------------------------------
grch37-reference-genome-ensembl-v1
==================================
Summary: The GRCh37 unmasked genomic DNA seqeunce reference genome from Ensembl-Release 75. Includes all sequence regions EXCLUDING haplotypes and patches. 'Primary Assembly file'
Species: Homo_sapiens
Genome Build: GRCh37
Keywords: Primary-Assembly, Release-75, ref, reference, Ensembl-ref, DNA-Seqeunce, Fasta-Seqeunce, fasta-file
Data Provider: Ensembl
Data Version: release-75_2-3-14
File type(s): fa
Data file coordinate base: NA
Included Data Files:
grch37-reference-genome-ensembl-v1.fa
grch37-reference-genome-ensembl-v1.fa.fai
Approximate Data File Sizes:
grch37-reference-genome-ensembl-v1.fa: 3.15G
grch37-reference-genome-ensembl-v1.fa.fai: 2.74K
To install run:
ggd install grch37-reference-genome-ensembl-v1
----------------------------------------------------------------------------------------------------
grch38-reference-genome-ensembl-v1
==================================
Summary: The GRCh38 unmasked genomic DNA sequence reference genome from Ensembl-Release 99. Includes all sequence regions EXCLUDING haplotypes and patches. 'Primary Assembly file'
Species: Homo_sapiens
Genome Build: GRCh38
Keywords: Primary-Assembly, Release-99, ref, reference, Ensembl-ref, DNA-Sequence, Fasta-Sequence, fasta-file
Data Provider: Ensembl
Data Version: release-99_11-18-19
File type(s): fa
Data file coordinate base: NA
Included Data Files:
grch38-reference-genome-ensembl-v1.fa
grch38-reference-genome-ensembl-v1.fa.fai
Approximate Data File Sizes:
grch38-reference-genome-ensembl-v1.fa: 3.15G
grch38-reference-genome-ensembl-v1.fa.fai: 6.41K
To install run:
ggd install grch38-reference-genome-ensembl-v1
----------------------------------------------------------------------------------------------------
. . .
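When a search returns many results, it can be convenient to reduce the output to package names only. The sketch below assumes that each result ends with a `ggd install <name>` hint, as in the output above. `search_hits` is a hypothetical helper name and not a ggd command.

```shell
# Read `ggd search` output on stdin and print one package name per line,
# taken from the "ggd install <name>" hint shown under each result.
# search_hits is a hypothetical helper name.
search_hits() {
    sed -n 's/^[[:space:]]*ggd install \([^[:space:]]*\)[[:space:]]*$/\1/p'
}

# Example call, skipped when ggd is not installed.
if command -v ggd >/dev/null 2>&1; then
    ggd search reference genome | search_hits
fi
```

If a future ggd release changes the layout of the search output, the pattern would need to be adjusted.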
Install the grch38 reference genome
$ ggd install grch38-reference-genome-ensembl-v1
:ggd:install: Looking for grch38-reference-genome-ensembl-v1 in the 'ggd-genomics' channel
:ggd:install: grch38-reference-genome-ensembl-v1 exists in the ggd-genomics channel
:ggd:install: grch38-reference-genome-ensembl-v1 version 1 is not installed on your system
:ggd:install: grch38-reference-genome-ensembl-v1 has not been installed by conda
:ggd:install: The grch38-reference-genome-ensembl-v1 package is uploaded to an aws S3 bucket. To reduce processing time the package will be downloaded from an aws S3 bucket
:ggd:install: Attempting to install the following cached package(s):
grch38-reference-genome-ensembl-v1
:ggd:utils:bypass: Installing grch38-reference-genome-ensembl-v1 from the ggd-genomics conda channel
Collecting package metadata: done
Processing data: done
## Package Plan ##
environment location: <conda-root>
added / updated specs:
- grch38-reference-genome-ensembl-v1
The following packages will be downloaded:
package | build
---------------------------|-----------------
grch38-reference-genome-ensembl-v1-1| 3 7 KB ggd-genomics
------------------------------------------------------------
Total: 7 KB
The following NEW packages will be INSTALLED:
grch38-reference-~ ggd-genomics/noarch::grch38-reference-genome-ensembl-v1-1-0
Downloading and Extracting Packages
grch38-reference-gen | 7 KB | ############################################################################################################################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
:ggd:install: Updating installed package list
:ggd:install: Initiating data file content validation using checksum
:ggd:install: Checksum for grch38-reference-genome-ensembl-v1
:ggd:checksum: installed file checksum: grch38-reference-genome-ensembl-v1.fa.fai checksum: d527f3eb6b664020cf4d882b5820056f
:ggd:checksum: metadata checksum record: grch38-reference-genome-ensembl-v1.fa.fai checksum: d527f3eb6b664020cf4d882b5820056f
:ggd:checksum: installed file checksum: grch38-reference-genome-ensembl-v1.fa checksum: 9e6b9465dc708d92bf6d67e9c9fa9389
:ggd:checksum: metadata checksum record: grch38-reference-genome-ensembl-v1.fa checksum: 9e6b9465dc708d92bf6d67e9c9fa9389
:ggd:install: ** Successful Checksum **
:ggd:install: Install Complete
:ggd:install: Installed file locations
======================================================================================================================
GGD Package Environment Variable(s)
----------------------------------------------------------------------------------------------------
-> grch38-reference-genome-ensembl-v1 $ggd_grch38_reference_genome_ensembl_v1_dir
$ggd_grch38_reference_genome_ensembl_v1_file
Install Path: <conda-root>/share/ggd/Homo_sapiens/GRCh38/grch38-reference-genome-ensembl-v1/1
----------------------------------------------------------------------------------------------------
:ggd:install: To activate environment variables run `source activate base` in the environmnet the packages were installed in
:ggd:install: NOTE: These environment variables are specific to the <conda-root> conda environment and can only be accessed from within that environmnet
======================================================================================================================
:ggd:install: Environment Variables
*****************************
Inactive or out-of-date environment variables:
> $ggd_grch38_reference_genome_ensembl_v1_dir
> $ggd_grch38_reference_genome_ensembl_v1_file
To activate inactive or out-of-date vars, run:
source activate base
*****************************
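The install output suggests a naming pattern for the environment variables: the package name with `-` replaced by `_`, prefixed with `ggd_` and suffixed with `_dir` or `_file`. The sketch below encodes that pattern so a script can check whether a variable is active. The rule is inferred from the example output above, not taken from ggd's source. Both function names are hypothetical, and the check assumes the variables are exported.

```shell
# Derive the environment variable name ggd appears to use for a package:
# "ggd_" + package name with "-" replaced by "_" + "_dir" or "_file".
# This rule is inferred from the install output, not from ggd itself.
ggd_var_name() {
    printf 'ggd_%s_%s\n' "$(printf '%s' "$1" | sed 's/-/_/g')" "$2"
}

# Succeed only if that variable is exported in the current environment.
ggd_var_is_active() {
    printenv "$(ggd_var_name "$1" "$2")" >/dev/null
}

ggd_var_name grch38-reference-genome-ensembl-v1 dir
# prints: ggd_grch38_reference_genome_ensembl_v1_dir

if ! ggd_var_is_active grch38-reference-genome-ensembl-v1 dir; then
    echo "variable inactive; run 'source activate base' as the install output suggests"
fi
```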
Identify the data environment variable or the file location
$ ggd show-env
***************************
Active environment variables:
> $ggd_grch38_reference_genome_ensembl_v1_dir
> $ggd_grch38_reference_genome_ensembl_v1_file
***************************
$ ggd get-files grch38-reference-genome-ensembl-v1
<conda-root>/share/ggd/Homo_sapiens/GRCh38/grch38-reference-genome-ensembl-v1/1/grch38-reference-genome-ensembl-v1.fa
<conda-root>/share/ggd/Homo_sapiens/GRCh38/grch38-reference-genome-ensembl-v1/1/grch38-reference-genome-ensembl-v1.fa.fai
Use the files
For additional information and examples on how to use the installed data files see: Using installed data.
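As one possible way to pass the installed reference to a downstream tool, the sketch below selects the FASTA path from the `ggd get-files` output and stores it in a shell variable. `first_fasta` is a hypothetical helper name. The sketch assumes one path per line, as in the output shown above.

```shell
# Read `ggd get-files` output on stdin and print the first path ending in ".fa".
# first_fasta is a hypothetical helper name.
first_fasta() {
    grep '\.fa$' | head -n 1
}

# Example call, skipped when ggd is not installed.
if command -v ggd >/dev/null 2>&1; then
    ref=$(ggd get-files grch38-reference-genome-ensembl-v1 | first_fasta)
    echo "reference FASTA: $ref"
fi
```

Because the `.fai` index is installed in the same directory as the `.fa` file, tools that look for the index next to the FASTA should be able to find it.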