FAQ

  Downloading and using data from the 4DN data portal

Do I need an account to download files?

File download from the 4DN data portal now requires authentication, even if the file is public. Accounts can be created by anyone, including those not part of the 4DN Network. Use your GitHub or Google account, or create a new one as explained in the Account Creation User Guide.

How do I download files?

To download a single file, log in (if you haven't already), go to the File page, and click the Download button. The file will start downloading immediately.

To download many files (bulk download), select and filter the file(s) from the browse Experiment Sets page, the search Files page, or the page of an Experiment Set containing the desired files. Then click the Download button to generate a metadata.tsv file with download URLs for the selected files. The listed files can then be downloaded from the command line using cURL (include your access key, as explained here).
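For readers who prefer scripting the bulk download, the following Python sketch shows one way to pull the URLs out of a metadata.tsv file and fetch each file with an access key. It makes several assumptions that should be checked against your own file and the access key instructions: comment lines start with `#`, the download URL is the first tab-separated field of each data row, and the access key (an ID and a secret) is sent with HTTP Basic authentication. The function names are illustrative and not part of any 4DN tool.

```python
import base64
import urllib.request


def extract_urls(tsv_text):
    """Return first-column values that look like URLs, skipping '#' comment lines."""
    urls = []
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        first_field = line.split("\t")[0].strip()
        if first_field.startswith(("http://", "https://")):
            urls.append(first_field)  # header rows are dropped by this check
    return urls


def download(url, key_id, key_secret, dest):
    """Fetch one URL with HTTP Basic auth (assumed scheme) and save it to dest."""
    token = base64.b64encode(f"{key_id}:{key_secret}".encode()).decode()
    request = urllib.request.Request(url)
    # Unredirected header: credentials are not forwarded if the server redirects.
    request.add_unredirected_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        while chunk := response.read(1 << 20):
            out.write(chunk)
```

A typical use would be to read metadata.tsv, call `extract_urls` on its contents, and then call `download` once per URL, naming each output file after the last path component of the URL.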

Can I use unpublished 4DN data sets in my publication?

Unpublished data sets are generated by the 4DN Network and made freely available to the scientific community. If you are intending to use these data for a publication, we ask that you please contact the data generating lab to discuss possible coordinated publication. In your manuscript, please cite the 4DN White Paper (doi:10.1038/nature23884) and the 4DN Data Portal paper (doi:10.1038/s41467-022-29697-4), and acknowledge the 4DN lab which generated the data. Please direct any questions to the Data Coordination and Integration Center at support@4dnucleome.org.

Should I cite the 4DN data portal if I use it in my research?

If you have used the 4DN data portal or datasets obtained from us during your research, we urge you to cite us in your published work.

https://www.nature.com/articles/s41467-022-29697-4

Reiff, S.B., et al. (2022) The 4D Nucleome Data Portal as a resource for searching and visualizing curated nucleomics data. Nature Commun. 13(1):2365.
doi:10.1038/s41467-022-29697-4

  Data visualization, exploration and analysis

Can I access the database programmatically via REST API?

Yes, we do have a REST API. You can also use our utility functions to access, create and edit metadata on the 4DN data portal.
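As a rough illustration of programmatic access, the sketch below builds a search URL and retrieves the result as JSON using only the Python standard library. It assumes that the portal returns JSON when `format=json` is added to the query string and that access keys are sent with HTTP Basic authentication. The base URL, the `search` path, the `type` parameter and the function names are illustrative assumptions; consult the REST API documentation for the actual endpoints.

```python
import base64
import json
import urllib.parse
import urllib.request

BASE_URL = "https://data.4dnucleome.org"  # assumed portal address


def build_search_url(item_type, **filters):
    """Compose a search URL that asks for JSON output (assumed query scheme)."""
    params = {"type": item_type, **filters, "format": "json"}
    return f"{BASE_URL}/search/?{urllib.parse.urlencode(params)}"


def fetch_json(url, key_id=None, key_secret=None):
    """GET a URL and decode the JSON body, optionally sending an access key."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    if key_id and key_secret:
        token = base64.b64encode(f"{key_id}:{key_secret}".encode()).decode()
        # Unredirected header: credentials are not forwarded on redirects.
        request.add_unredirected_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `fetch_json(build_search_url("ExperimentSetReplicate", limit=5))` would request a small page of results, assuming that item type and the `limit` filter exist on the portal.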

Can I visualize 1D and 2D genomic data interactively?

Yes, using HiGlass in the 4DN Visualization Workspace you can easily explore and compare data sets by creating interactive views, adding 1D genomic tracks and 2D contact matrices, saving and sharing your own displays. We also have a short tutorial video available on how to use this feature.

Can I analyze 4DN data without downloading it?

Yes, the integrated 4DN JupyterHub allows you to search the database and perform lightweight, small-scale custom analyses on 4DN data, without having to download any files to a local machine. Example notebooks are provided to help you get started.

Can I explore raw microscopy image data without downloading it?

Microscopy images are hosted on an OMERO server and are publicly accessible for exploratory purposes.

  Hi-C pipeline

What is the difference between Juicer and Cooler?

Juicer is a full Hi-C pipeline, developed at the Lieberman Aiden lab. The 4DN Hi-C pipeline uses juicer_tools pre, a part of Juicer that takes in a pairs file and creates an interaction matrix file in the hic format. Cooler (developed at the Mirny lab at MIT) takes in a pairs file and creates an interaction matrix file in the cool (or mcool) format. The 4DN pipeline uses Cooler to create an mcool file from the same pairs file that is used to create the hic file. The hic file can be visualized using Juicebox, whereas the mcool file can be visualized using HiGlass. The hic format is a compressed binary format, whereas the cool/mcool formats are based on HDF5.
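Because the two matrix formats use different containers, they can usually be told apart from their first few bytes. The sketch below is a minimal illustration, not a validator: it assumes a standard HDF5 file without a user block (so the 8-byte HDF5 signature sits at offset 0) and assumes that a hic file begins with the null-terminated magic string `HIC`. The function name is hypothetical.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of a standard HDF5 file
HIC_MAGIC = b"HIC\x00"                 # magic string assumed at the start of a hic file


def sniff_matrix_format(path):
    """Guess the container type of a contact matrix file from its leading bytes."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(HDF5_SIGNATURE):
        return "hdf5"  # candidate cool/mcool file
    if head.startswith(HIC_MAGIC):
        return "hic"
    return "unknown"
```

A result of "hdf5" only indicates the container; confirming that the file is a valid cool or mcool file requires opening it with a tool such as Cooler.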

How do I run the Hi-C pipeline on my own data?

The different steps of the Hi-C pipeline can be run using the CWL files and the 4DN Hi-C Docker image, using cwltool (locally) or Tibanna (in the cloud).

Alternatively, one can run the actual commands inside the 4DN Hi-C Docker container. Example commands can be found at: https://github.com/4dn-dcic/docker-4dn-hic/blob/master/HiCPipeline.md

How do I run the Hi-C pipeline without Docker support?

The software programs and versions used can be downloaded by running the script https://github.com/4dn-dcic/docker-4dn-hic/blob/master/downloads.sh, which was used to create the 4DN Hi-C Docker image.

After installing the required software, one can run the actual commands as in https://github.com/4dn-dcic/docker-4dn-hic/blob/master/HiCPipeline.md

Do you have a TAD caller?

No, we currently do not have a TAD caller.

  Data submission

When should I start my submission?

As early as possible. Your data will remain private until we get your approval to release it. Submitting data to the 4DN data portal makes it easier to share it (e.g. with other 4DN network members). Additionally, you can benefit from the standardized 4DN data processing pipelines and use the uniformly processed results for your downstream analyses.

What is the first step for data submission?

Contact us at support@4dnucleome.org so that we can grant you submission privileges. We will also explain the process in more detail and guide you through it. Learn more about the Data Model and the Submission Process.

What sequencing files should I submit?

For sequencing data sets, raw data files are generally required. We will then run the official 4DN data processing pipeline to provide uniformly processed results. In certain cases, e.g. if a 4DN pipeline is not available, you can (and are encouraged to) also upload your processed results.

What microscopy files should I submit?

For microscopy data sets, sharing processed data is often much more useful to collaborators and outside users than sharing raw images. We therefore generally require submission of the complete set of processed data, together with a subset of raw images.

Can you submit my 4DN data set to GEO/SRA automatically?

Yes, we can do that automatically, which saves you from submitting the data twice. Note that a GEO submission requires processed files.

Can I submit data if I am not affiliated with 4DN?

While the 4DN data portal hosts primarily data produced by 4DN Network members, we also imported some external nucleomics data sets from landmark publications. If you have produced relevant data sets, but are not formally affiliated with a lab or project currently funded by the NIH Common Fund 4D Nucleome program, we might still be able to import them once they are submitted to external public repositories such as GEO/SRA. Contact us at support@4dnucleome.org for more details.

  Submitting Human Data - Controlled Access Data

Can I submit protected human genomic data sets?

Yes. If your datasets include data that is protected under the NIH Genomic Data Sharing Policy, you should submit the metadata and data to the DCIC. If a processing pipeline exists for the data type, it will be run to generate uniformly processed analysis results. When ready, metadata and analysis results that do not contain potentially identifying sequence data will be released for these controlled access datasets and made available in the data portal. Raw read files (fastq) and alignment files (bam) will be restricted, meaning that they cannot be downloaded and only the metadata associated with these files is available in the portal.

How do I know if the data from my experiments with human tissues and cells should be protected?

Generally, any genomic data generated from human tissue or cell lines would be considered controlled access data. Guidelines and information regarding sharing of human genomic data, the required consent for public sharing, and related FAQs can be found at this site. If you have any questions, you should consult with your NIH program officers or contact Ian Fingerman, the NIH officer who can assist 4DN investigators with these issues.

Can you submit our data to dbGaP for us?

No. The data generating lab is responsible for registering their data sets with dbGaP and submitting them to that resource at the appropriate time. The 4DN-DCIC will add appropriate links and database cross references on the portal to facilitate discovery of these datasets.

  Data release policy

Will you release my unpublished data without my permission?

No. Although the 4DN policies encourage early sharing of data, the DCIC waits for the submitter's consent and approval before releasing unpublished data. For this reason we encourage early submission of datasets.

When do you release my published data (peer-reviewed or pre-print)?

If the datasets are reported in a manuscript - either a pre-print or a peer-reviewed journal article - we will release the data within 7 days of completion of submission and data processing. Contact us before that deadline if you believe that releasing your data at that time is not appropriate.

No, we currently do not support this feature. According to official 4DN policies, a pre-print of the manuscript should generally be posted on a public server (such as bioRxiv) and data shared by the time the manuscript is submitted to a journal.

  Data model

What is the difference between Biosource and Biosample?

Biosource describes the type of biological material, e.g. HFFc6 cell line, or mouse liver tissue. Biosample is a specific preparation of a Biosource, e.g. one plate of HFFc6 cells grown in a given lab on a given day.

What are Biological Replicates? What are Technical Replicates?

Biological Replicates are experiments that use identical experimental protocols and are performed on separate preparations (Biosamples) of the same biological material (Biosource). Technical Replicates are experiments that use identical experimental protocols and are performed on the same Biosample preparation. All sequencing runs of the same DNA library preparation are associated with the same experimental replicate.
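The replicate definitions above can be expressed as a small decision rule. The sketch below is only an illustration of that rule: the class and field names are hypothetical and do not reflect the actual 4DN metadata schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Biosample:
    """A specific preparation of a Biosource (hypothetical minimal model)."""
    name: str
    biosource: str


@dataclass(frozen=True)
class Experiment:
    """One replicate, performed with a given protocol on a given Biosample."""
    name: str
    protocol: str
    biosample: Biosample


def replicate_relationship(a, b):
    """Classify two experiments as technical or biological replicates, or neither."""
    same_protocol = a.protocol == b.protocol
    same_biosource = a.biosample.biosource == b.biosample.biosource
    if not (same_protocol and same_biosource):
        return "not replicates"
    if a.biosample == b.biosample:
        return "technical"   # same preparation
    return "biological"      # separate preparations of the same material
```

Under this rule, two experiments on the same plate of HFFc6 cells with the same protocol are technical replicates, while the same protocol applied to two separately grown plates of HFFc6 cells gives biological replicates.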

What is the difference between Experiment and Experiment Set?

An Experiment is a single replicate. All biological and technical replicates of an experiment are grouped in an Experiment Set.

What is the difference between Processed Files and Supplementary Files?

Processed Files are generally produced by official 4DN data processing pipelines, run by the 4DN-DCIC. All data sets from the same experiment type are processed uniformly, and provenance and data processing are documented extensively. Read more on Reproducible Data Analysis.

Supplementary Files are generally provided by the data submitter (either the lab who performed the experiment, or a different lab who analyzed or re-analyzed the results). These files include results for experiment types for which a standardized data processing pipeline has not yet been developed or approved, as well as preliminary results only shared with other 4DN members.

  Tiered cells

What are Tier 1 and Tier 2 cell lines?

Tier 1 and Tier 2 cell lines are a group of cell lines on which 4DN Network scientists agreed to perform coordinated analyses, with the goal of delivering data that is more directly comparable, even across different assays. 4DN Cell Lines are cultured according to 4DN SOPs and are either obtained from a 4DN batch (Tier 1) or from the recommended vendor (Tier 2).

What is the difference between "GM12878 (Tier 1)" and "GM12878" cell line?

Tier 1 cell lines, such as "GM12878 (Tier 1)", have been obtained exclusively from the 4DN stock. Cells that do not meet this requirement, such as cells of a different lot, lack the Tier classification, e.g. "GM12878".

Note that, in the case of Tier 2 cell lines, there is no 4DN stock. The "Tier 2" classification is reported if the cell line has been obtained from the recommended vendor.

  FAIR data and tools

Does the 4DN data portal adhere to the FAIR (findability, accessibility, interoperability, and reusability) guiding principles?

We have evaluated the "FAIRness" of digital objects and tools available from the 4DN data portal using the FAIRshake assessment tool available at fairshake.cloud. The 4DNucleome assessment project can be found here. Each digital object is assessed using rubrics which evaluate specific aspects of the FAIR principles - an example can be found here. All current assessments and overall analytics for the 4DNucleome FAIRshake project are available to browse.

What ontologies are used in the 4DN data portal?

EFO - the Experimental Factor Ontology - is mapped to Experiment Types.

UBERON and OBI - the Ontology for Biomedical Investigations - are used for anatomy, tissues and cell lines.

SO - a limited subset of Sequence Ontology terms is used to describe types of biological features used as targets or sequence regions of interest.

We also use a small in-house controlled vocabulary for terms that have not yet been incorporated into one of the existing ontologies.