The 4DN Hi-C data processing pipeline includes alignment, filtering, and matrix aggregation and normalization steps. Feature calling steps, including insulation scores, compartments, and enriched contacts will be provided in the next version of the pipeline.
bashbwa mem -SP5M -t<nthreads> <genome_index> <fastq1> <fastq2>
-SPoption is used to ensure the results are equivalent to that obtained by running
bwa memon each mate separately, while retaining the right formatting for paired-end reads. This option skips a step in
bwa memthat forces alignment of a poorly aligned read given an alignment of its mate with the assumption that the two mates are part of a single genomic segment.
-5option is used to report the 5' portion of chimeric alignments as the primary alignment. In Hi-C experiments, when a mate has chimeric alignments, typically, the 5' portion is the position of interest, while the 3' portion represents the same fragment as the mate. For chimeric alignments,
bwa memreports two alignments: one of them is annotated as primary and soft-clipped, retaining the full-length of the original sequence. The other end is annotated as hard-clipped and marked as either 'supplementary' or 'secondary'. The
-5option forces the 5'end to be always annotated as primary.
-Moption is used to annotate the secondary/supplementary clipped reads as secondary rather than supplementary, for compatibility with some public software tools such as
-toption is used for multi-threading and should not affect the result.
For filtering valid Hi-C alignments, we use
pairtools (previously called
pairsamtools). Specifically, we use version
0.2.2. The filtering workflow outputs a pairs file containing a list of valid contacts.
This filtering workflow applies the following criteria:
- Reads marked as duplicates are removed.
- Full-length alignments that are unique are kept.
- An unmapped portion shorter than 20bp is ignored; and the rest of the alignment is still considered as full-length.
- In addition, clipped (chimeric) alignments are kept, if they are valid Hi-C contacts. This means one mate is clipped and the other is full-length and the 3'end of the clipped alignment is mapped with 2kb of the full-length alignment in the orientation that the two 3'ends are pointing toward each other.
- As with full-length alignments, any unmapped portion shorter than 20bp is ignored.
One of the design choices we have made is to include a lossless bam file as an output of the data processing. This output file, containing all the sequences in the original fastq files, the alignment results, and pairtools-provided flags for read filtering, is provided as a resource. To be able to produce this output file, the contents of the bam file is carried forward in the filtering workflow in intermediate
pairsam files. Users who are only interested in the valid contact lists may run the same analysis with more light-weight intermediate files.
Specifically, the filtering workflow consists of the following steps:
- A bam file is read in, and a pairsam file is written out.
- The pairsam file is a pairs file, listing one read pair per line, with additional columns to track the sam-file lines, and a pairtools read classification.
- These classifications include information on whether the read aligned to 0, 1, or multiple places in the genome and whether it aligned end-to-end or if it was clipped.
- This tool also upper-triangularizes the reads, i.e. if the coordinate of second read is higher than the first, the reads are flipped.
For more details, see pairsamtools doc
pairs) file is read in, and a
pairsamfile is written out.
- The rows are sorted in chr1-chr2-pos1-pos2 order.
Note that the flipping order and sort order of chromosomes is not identical. See the docs for more details.
- One or more
pairs) files are read in, and a
pairsamfile is written out.
The files are merged, preserving the sorted order.
pairtools dedup --mark-dups
- (equivalent to
- Duplicate alignments that share the same pair of 5'end coordinates +/- 3 bps are marked as identified.
An arbitrary one is retained with the original classification, while others get a duplicate classification.
- Only the reads with pairtools classification
UCare retained and output to a
Matrix aggregation and normalization
Depending on the experiment type, the Hi-C pipeline was used either as it is or without restriction enzyme-based intra-fragment contact filtering.
- In situ Hi-C and dilution Hi-C data have been processed with the Hi-C pipeline as it is.
- Data from similar experiment types were processed using the Hi-C pipeline either without restriction enzyme filtering or without balancing.
|Experiment type||Restriction site file|
|in situ Hi-C||yes||False|
4DN DCIC provides a Hi-C matrix in two different formats:
.mcool format and
.hic format. The two files are generated from the same
pairs file as input filtered contact list. Both files contain multiple resolutions.
.hicfile is produced by Juicertools (version 1.8.9-cuda8) and can be visualized using Juicebox
- The matrix is normalized using the VC, VC_SQRT, KR methods, after filtering out intra-fragment contacts (contacts that fall in the same restriction fragment).
.mcoolfile is produced by Cooler (version 0.8.3) and can be visualized using HiGlass.
- The matrix is normalized using the ICE (iterative correction and eigenvalue decomposition) matrix balancing algorithm, after the matrix-level removal of the diagonal and the rows/columns with a low value.
.mcoolfile also contains the normalization vectors generated by Juicertools (same as in a
.hicfile generated from the same
hic files contain the following resolutions.
- 1kb, 2kb, 5kb, 10kb, 25kb, 50kb, 100kb, 250kb, 500kb, 1Mb, 2.5Mb, 5Mb, 10Mb
Replicates are merged in two steps before producing a matrix.
- Sequencing replicates (replicates within an experiment)
- Sequencing replicates are merged after the alignment but before duplicate removal step (
pairsam dedup), since reads resulting from a single PCR duplication event may exist in both replicates.
- Merging is performed on (intermediate)
- DCIC provides a merged output as a
- Biological replicates (replicates within an experiment set)
- Biological replicates are merged after duplicate removal step, since PCR duplication event happened independently in these replicates.
- Merging is performed on
pairsfiles using run-merge-pairs.sh.
- DCIC provides a merged output as a merged
The pipeline components are pre-installed in a publicly available Docker image (
duplexa/4dn-hic:v43) on Docker Hub. The source code for the Docker image and pipeline description in Common Workflow Language (CWL) can be found on GitHub.
- Latest runs
0.2.7 can be considered (nearly) identical.
- CWL : https://github.com/4dn-dcic/docker-4dn-hic/tree/v43/cwl
- Docker : https://github.com/4dn-dcic/docker-4dn-hic/tree/v43
- CWL : https://github.com/4dn-dcic/docker-4dn-hic/tree/v42.2/cwl
- Docker : https://github.com/4dn-dcic/docker-4dn-hic/tree/v42.2
- Old runs
- CWL : https://github.com/4dn-dcic/pipelines-cwl/tree/0.2.0/cwl_awsem/
- Docker : https://github.com/4dn-dcic/docker-4dn-hic/tree/v40
Example set of commands that were actually run as part of the pipeline can be found at https://github.com/4dn-dcic/docker-4dn-hic/blob/v43/HiCPipeline.md
Restriction Enzyme Check
For a bam file (produced in the alignment step), this process calculates the percentage of clipped sites with a restriction enzyme motif which matches the motif of the reported restriction enzyme. The result is available on the portal as the percent of clipped sites with RE motif field, located in the details section of the assessed bam file. Expected percentages for correctly reported restriction enzymes are in the neighborhood of 80%, while incorrect enzymes tend to produce values less than 10%.