Guide to Uploading Processed Results

Summary

4DN members may want to submit results of their analyses to the data portal for a variety of reasons. These may include:

  • Sharing of preliminary results with other 4DN members while collaborating on projects or preparing manuscripts.
  • Providing results to 4DN members for datasets for which the 4DN-DCIC has not yet developed a standardized data processing pipeline.

Quick Start

  1. If you don’t already have an assigned data submitter for your group, write to us at support@4dnucleome.org so that we can give them write access to the portal.
  2. Prepare an Excel worksheet with minimal metadata - see below for details.
  3. E-mail your worksheet to your data wrangler (or support@4dnucleome.org).
  4. Install the Submit4DN Python package (pip install Submit4DN) in a location that has access to your files (more info).
  5. Generate access keys for authentication following these instructions and copy them to a file in your home directory named keypairs.json.
  6. Submit your metadata and upload the files to the portal (more info).

    To validate your spreadsheet:

    import_data <metadata.xlsx>

    To initiate submission and file upload:

    import_data --update <metadata.xlsx>
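
The two commands above can be chained so that submission only starts after the validation output has been reviewed. The following is a minimal sketch, not part of Submit4DN: the helper names build_commands and validate_then_submit are hypothetical, and it assumes that import_data is on your PATH and exits with a non-zero status when it fails to run.

```python
import subprocess


def build_commands(workbook):
    """Return the validation command and the submission command, in that order."""
    validate = ["import_data", workbook]
    submit = ["import_data", "--update", workbook]
    return validate, submit


def validate_then_submit(workbook):
    """Run the validation pass, then submit only after manual confirmation."""
    validate, submit = build_commands(workbook)
    # check=True raises CalledProcessError on a non-zero exit status.
    # Assumption: validation problems may only show up in the printed output,
    # so the output should still be read before confirming.
    subprocess.run(validate, check=True)
    answer = input("Does the validation output look correct? Type 'yes' to submit: ")
    if answer.strip().lower() == "yes":
        subprocess.run(submit, check=True)
```

Calling validate_then_submit("metadata.xlsx") would run the same two commands shown in step 6, with a pause in between.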

Metadata Preparation

Minimal metadata is required for each file that is to be uploaded. To prepare this metadata enter it into a FileProcessed worksheet - available for download here.

Each row represents one file.

The fields are:

  • aliases -- enter your own identifier that you can use to reference this file in the future and on the accompanying sheet. An alias must take the form pi-name-lab:identifier_here, e.g. bing-ren-lab:tad_calls_1
  • description -- a brief description of the file, e.g. TAD calls for H1 cells by TopDom
  • file_format -- e.g. bam, bedGraph, bigWig; see below for a list of current formats and their designations. New formats can be added on request as needed.
  • file_type -- the type of file based on the information it contains, e.g. alignments; see below for some suggested file_type values
  • genome_assembly -- the assembly upon which the analysis was done - valid options are GRCh38, GRCm38, dm6, or galGal5.
  • filename -- this field must contain the full local path to the file on your system.

NOTE: file names must end with standard extensions. Some files are expected to be compressed using the gzip program and have a .gz extension included. See below for allowable extensions.

Filling in this field is what will trigger the file upload when using the Submit4DN import_data program.

  • produced_from -- (optional) fill in this field with one or more aliases for files that are directly used to generate the file described in this row.

For example, list the aliases for the fastq files that were aligned to generate the bam file in this row. This information can be used to generate a provenance graph showing how files are produced in a processing pipeline (see https://data.4dnucleome.org/experiment-set-replicates/4DNESRJ8KV4Q/#graph-section).

  • availability -- ‘public’ or ‘internal’: should the results be made available to the public or only within the 4DN Network?
  • linked_datasets -- the 4DN accessions of the experiment set(s) or publications with which these files should be associated, e.g. 4DNES2M5JIGV (the accession for the Dekker lab in situ Hi-C on H1 cells).
  • comments -- any other information for the 4DN-DCIC. For example, if a file is more appropriately linked to an existing portal page like the Joint Analysis page, you can indicate that here.

Note that the values in the last 3 columns will not be directly submitted but used by the DCIC to make appropriate links and set access permissions for the submitted files.
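
Some of the field rules above can be checked locally before the worksheet is sent. The following is a minimal sketch under stated assumptions: check_row is a hypothetical helper (not part of Submit4DN), each row is assumed to be available as a dictionary keyed by column name, and the alias pattern is only a guess at the pi-name-lab:identifier_here form; the portal's actual rule may be stricter or looser.

```python
import re
from pathlib import Path

# Allowed values as listed in the field descriptions above.
VALID_ASSEMBLIES = {"GRCh38", "GRCm38", "dm6", "galGal5"}
VALID_AVAILABILITY = {"public", "internal"}

# Assumed pattern for "pi-name-lab:identifier_here"; not the portal's official rule.
ALIAS_PATTERN = re.compile(r"^[A-Za-z0-9-]+-lab:\S+$")


def check_row(row, check_file_exists=True):
    """Return a list of human-readable problems found in one worksheet row."""
    problems = []

    alias = row.get("aliases", "")
    if not ALIAS_PATTERN.match(alias):
        problems.append(f"alias {alias!r} is not of the form pi-name-lab:identifier")

    assembly = row.get("genome_assembly")
    if assembly not in VALID_ASSEMBLIES:
        problems.append(f"genome_assembly {assembly!r} is not a listed assembly")

    # availability is only checked when a value was entered.
    availability = row.get("availability")
    if availability is not None and availability not in VALID_AVAILABILITY:
        problems.append(f"availability {availability!r} must be 'public' or 'internal'")

    filename = row.get("filename", "")
    if not filename:
        problems.append("filename is empty, so no upload would be triggered")
    elif check_file_exists and not Path(filename).is_file():
        problems.append(f"file not found at {filename!r}")

    return problems
```

An empty list means none of these particular checks failed; it does not replace the validation performed by import_data.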

Additional information

Data Processing Standards

Required: All data processing should be based on the following genome assemblies.

  • Human: GRCh38
  • Mouse: GRCm38
  • Fruitfly: dm6
  • Chicken: galGal5

Recommended: standard resolutions of 1kb, 5kb, 10kb, 25kb, 50kb and 100kb.

Supported File formats

HiGlass (Visualization) compatible file formats:

  • Bed (sorted gzipped)
  • Bedgraph (sorted gzipped)
  • Bigwig
  • Bigbed
  • Mcool

Other currently supported file formats.

Please contact us if you would like to submit a file in a format that is not listed above. We can also work with you in converting other 2-way or multi-way contact lists to a cooler contact matrix file.

Filename extensions

Filename extensions are standardized, though variations are allowed in some common cases.

If a file is compressed with gzip, the filename should end with .gz after the usual extension. These cases include:

  • bed - bed.gz
  • bedpe - bedpe.gz
  • bedGraph - bedGraph.gz
  • clusters - cluster.gz
  • compressed_fasta - fasta.gz
  • compressed text - txt.gz
  • normvector_juicerformat - normvector.juicerformat.gz

Other standard and allowable extensions can be found here.
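
The gzip cases listed above can be checked before submission with a small lookup. This is a minimal sketch, not part of Submit4DN: has_expected_extension is a hypothetical helper, and the mapping is copied from the list above (the compressed text entry is left out because its exact file_format designation is not given here). Since some variations are allowed, a False result is better treated as a warning than as an error.

```python
# file_format -> expected gzip extension, copied from the list above.
GZIP_EXTENSIONS = {
    "bed": "bed.gz",
    "bedpe": "bedpe.gz",
    "bedGraph": "bedGraph.gz",
    "clusters": "cluster.gz",
    "compressed_fasta": "fasta.gz",
    "normvector_juicerformat": "normvector.juicerformat.gz",
}


def has_expected_extension(filename, file_format):
    """Return True or False for listed formats, None if the format is not in the table."""
    expected = GZIP_EXTENSIONS.get(file_format)
    if expected is None:
        return None
    # Require a dot before the extension so "mybed.gz" does not pass for "bed".
    return filename.endswith("." + expected)
```

For example, has_expected_extension("/data/tads.bed", "bed") flags an uncompressed bed file, while a format missing from the table returns None so that it can be looked up in the full extension list.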

File types

The following file types already exist in the portal. Please use the same or a similar short descriptive title for your files:

  • read pairs
  • alignments
  • unfiltered alignments
  • contact list
  • contact list-replicate
  • contact list-combined
  • contact matrix
  • normalized contact matrix
  • long range chromatin interactions
  • intensity values
  • peaks
  • image
  • locus distances submitter format
  • dot calls
  • compartments
  • insulation score - diamond
  • insulation score - potential
  • domain calls
  • boundaries