Hi-C Processing Pipeline

Overview

The 4DN Hi-C data processing pipeline includes alignment, filtering, and matrix aggregation and normalization steps. Feature calling steps, including insulation scores, compartments, and enriched contacts will be provided in the next version of the pipeline.

Alignment

Hi-C reads are mapped to the GRCh38 (human) or mm10 (mouse) reference genome using bwa version 0.7.17. Specifically, we run:

bwa mem -SP5M -t<nthreads> <genome_index> <fastq1> <fastq2>
  • The -SP option is used to ensure the results are equivalent to that obtained by running bwa mem on each mate separately, while retaining the right formatting for paired-end reads. This option skips a step in bwa mem that forces alignment of a poorly aligned read given an alignment of its mate with the assumption that the two mates are part of a single genomic segment.
  • The -5 option is used to report the 5' portion of chimeric alignments as the primary alignment. In Hi-C experiments, when a mate has chimeric alignments, typically, the 5' portion is the position of interest, while the 3' portion represents the same fragment as the mate. For chimeric alignments, bwa mem reports two alignments: one of them is annotated as primary and soft-clipped, retaining the full-length of the original sequence. The other end is annotated as hard-clipped and marked as either 'supplementary' or 'secondary'. The -5 option forces the 5'end to be always annotated as primary.
  • The -M option is used to annotate the secondary/supplementary clipped reads as secondary rather than supplementary, for compatibility with some public software tools such as picard MarkDuplicates.
  • The -t option is used for multi-threading and should not affect the result.

Filtering

For filtering valid Hi-C alignments, we use pairtools (previously called pairsamtools). Specifically, we use version 0.2.2. The filtering workflow outputs a pairs file containing a list of valid contacts.

This filtering workflow applies the following criteria:

  • Reads marked as duplicates are removed.
  • Full-length alignments that are unique are kept.
    • An unmapped portion shorter than 20bp is ignored; and the rest of the alignment is still considered as full-length.
  • In addition, clipped (chimeric) alignments are kept, if they are valid Hi-C contacts. This means one mate is clipped and the other is full-length and the 3'end of the clipped alignment is mapped with 2kb of the full-length alignment in the orientation that the two 3'ends are pointing toward each other.
    • As with full-length alignments, any unmapped portion shorter than 20bp is ignored.

One of the design choices we have made is to include a lossless bam file as an output of the data processing. This output file, containing all the sequences in the original fastq files, the alignment results, and pairtools-provided flags for read filtering, is provided as a resource. To be able to produce this output file, the contents of the bam file is carried forward in the filtering workflow in intermediate pairsam files. Users who are only interested in the valid contact lists may run the same analysis with more light-weight intermediate files.

Specifically, the filtering workflow consists of the following steps:

  • pairtools parse

    • A bam file is read in, and a pairsam file is written out.
    • The pairsam file is a pairs file, listing one read pair per line, with additional columns to track the sam-file lines, and a pairtools read classification.
    • These classifications include information on whether the read aligned to 0, 1, or multiple places in the genome and whether it aligned end-to-end or if it was clipped.
    • This tool also upper-triangularizes the reads, i.e. if the coordinate of second read is higher than the first, the reads are flipped.
    • For more details, see pairsamtools doc

  • pairtools sort

    • A pairsam (or generically pairs) file is read in, and a pairsam file is written out.
    • The rows are sorted in chr1-chr2-pos1-pos2 order.
    • Note that the flipping order and sort order of chromosomes is not identical. See the docs for more details.

  • pairtools merge

    • One or more pairsam (or generically pairs) files are read in, and a pairsam file is written out.
    • The files are merged, preserving the sorted order.

  • pairtools dedup --mark-dups

    • (equivalent to pairtools markasdup)
    • Duplicate alignments that share the same pair of 5'end coordinates +/- 3 bps are marked as identified.
    • An arbitrary one is retained with the original classification, while others get a duplicate classification.

  • pairtools select

    • Only the reads with pairtools classification UU and UC are retained and output to a pairs file.

Matrix aggregation and normalization

Depending on the experiment type, the Hi-C pipeline was used either as it is or without restriction enzyme-based intra-fragment contact filtering.

  • In situ Hi-C and dilution Hi-C data have been processed with the Hi-C pipeline as it is.
  • Data from similar experiment types were processed using the Hi-C pipeline either without restriction enzyme filtering or without balancing.

    Experiment typeRestriction site fileno_balance
    in situ Hi-CyesFalse
    dilution Hi-CyesFalse
    DNase Hi-CnoFalse
    Micro-CnoFalse
    Capture Hi-CyesTrue
    ChIA-PETnoTrue

4DN DCIC provides a Hi-C matrix in two different formats: .mcool format and .hic format. The two files are generated from the same pairs file as input filtered contact list. Both files contain multiple resolutions.

  • .hic format
    • A .hic file is produced by Juicertools (version 1.8.9-cuda8) and can be visualized using Juicebox
    • The matrix is normalized using the VC, VC_SQRT, KR methods, after filtering out intra-fragment contacts (contacts that fall in the same restriction fragment).
  • .mcool format
    • An .mcool file is produced by Cooler (version 0.8.3) and can be visualized using HiGlass.
    • The matrix is normalized using the ICE (iterative correction and eigenvalue decomposition) matrix balancing algorithm, after the matrix-level removal of the diagonal and the rows/columns with a low value.
    • The .mcool file also contains the normalization vectors generated by Juicertools (same as in a .hic file generated from the same pairs file)

Resolutions

Both mcool and hic files contain the following resolutions.

  • 1kb, 2kb, 5kb, 10kb, 25kb, 50kb, 100kb, 250kb, 500kb, 1Mb, 2.5Mb, 5Mb, 10Mb

Merging replicates

Replicates are merged in two steps before producing a matrix.

  • Sequencing replicates (replicates within an experiment)
    • Sequencing replicates are merged after the alignment but before duplicate removal step (pairsam dedup), since reads resulting from a single PCR duplication event may exist in both replicates.
    • Merging is performed on (intermediate) pairsam files using pairsam merge.
    • DCIC provides a merged output as a pairs file.
  • Biological replicates (replicates within an experiment set)
    • Biological replicates are merged after duplicate removal step, since PCR duplication event happened independently in these replicates.
    • Merging is performed on pairs files using run-merge-pairs.sh.
    • DCIC provides a merged output as a merged pairs file.

Source files

The pipeline components are pre-installed in a publicly available Docker image (duplexa/4dn-hic:v43) on Docker Hub. The source code for the Docker image and pipeline description in Common Workflow Language (CWL) can be found on GitHub.

Example set of commands that were actually run as part of the pipeline can be found at https://github.com/4dn-dcic/docker-4dn-hic/blob/v42.2/HiCPipeline.md

Restriction Enzyme Check

For a bam file (produced in the alignment step), this process calculates the percentage of clipped sites with a restriction enzyme motif which matches the motif of the reported restriction enzyme. The result is available on the portal as the percent of clipped sites with RE motif field, located in the details section of the assessed bam file. Expected percentages for correctly reported restriction enzymes are in the neighborhood of 80%, while incorrect enzymes tend to produce values less than 10%.

The Docker image created for this step (4dndcic/4dn-re-checker/v1.1) is available on Docker Hub. All source files can be found in the GitHub re-checker v1.1 repository.