Insulation Scores and Boundaries

Released November 24th, 2020 at 3:45pm

Methods

The workflow uses the cooltools software to compute diamond insulation scores with a 100kb window size and a bin size of either 5kb or 10kb, for experiments using a 4-cutter or a 6-cutter restriction enzyme, respectively. A bin size of 5kb is also used for MNase- and DNase-based assays. Boundaries are defined as local minima of the log2(insulation score) track whose chromosome-wide topographic prominence is above a 0.2 threshold. Please note that insulation score calls are not provided for datasets with fewer than 100M filtered reads, since results on low-resolution datasets were found to be less reproducible and reliable.
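The boundary definition above can be illustrated with a minimal sketch in plain Python. This is not the cooltools implementation: the function names `minima_prominences` and `call_boundaries` are hypothetical, the track is assumed to cover a single chromosome with no missing values, only strict local minima are considered, and a prominence equal to the threshold is assumed to pass.

```python
def minima_prominences(track):
    """Return {bin_index: prominence} for every strict local minimum of `track`.

    The prominence of a minimum is the smaller of the two heights one must
    climb, to the left and to the right, before reaching a lower value
    (or the end of the chromosome).
    """
    n = len(track)
    result = {}
    for i in range(1, n - 1):
        v = track[i]
        if not (track[i - 1] > v and track[i + 1] > v):
            continue  # not a strict local minimum
        # Walk left until a lower value or the edge, tracking the highest point.
        left_max = v
        j = i - 1
        while j >= 0 and track[j] >= v:
            left_max = max(left_max, track[j])
            j -= 1
        # Same walk to the right.
        right_max = v
        j = i + 1
        while j < n and track[j] >= v:
            right_max = max(right_max, track[j])
            j += 1
        result[i] = min(left_max, right_max) - v
    return result


def call_boundaries(log2_insulation, bin_size, threshold=0.2):
    """Return (start_position, prominence) for minima passing the threshold."""
    prominences = minima_prominences(log2_insulation)
    return [
        (i * bin_size, p)
        for i, p in sorted(prominences.items())
        if p >= threshold
    ]


# Toy log2(insulation score) track with 5kb bins (made-up values).
toy_track = [0.0, -0.1, 0.0, 0.3, -0.6, 0.2, 0.1, 0.15, 0.4]
toy_boundaries = call_boundaries(toy_track, bin_size=5000)
```

In this toy track, the shallow dips at bins 1 and 6 have a prominence of about 0.1 and are discarded, while the deep minimum at bin 4 is retained.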

The insulation score and boundary caller workflow components are pre-installed in a publicly available Docker image (4dndcic/4dn-insulation-scores-and-boundaries-caller:v1) on Docker Hub. The source code for the Docker image and the workflow description in Common Workflow Language (CWL) can be found in the 4DN-DCIC GitHub repo.

Boundary Score Assessment

In the absence of gold standard truth information, it is not possible to assign a statistical score to the boundary calls presented. Based on the following assessment, boundaries are qualified as weak (0.2<=prominence<0.5) or strong (prominence>=0.5), which yields two separate boundary lists to choose from depending on whether high sensitivity or high specificity is desired.
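As a small illustration of the two categories, the sketch below splits a list of boundary calls by prominence. The function names and the `(position, prominence)` tuple layout are assumptions made for this example, not the format of the released files.

```python
WEAK_MIN = 0.2    # lower bound for weak boundaries
STRONG_MIN = 0.5  # lower bound for strong boundaries


def classify_boundary(prominence):
    """Return 'strong', 'weak', or None (not a boundary) for a prominence value."""
    if prominence >= STRONG_MIN:
        return "strong"
    if prominence >= WEAK_MIN:
        return "weak"
    return None


def split_boundaries(boundaries):
    """Split (position, prominence) pairs into weak and strong position lists."""
    lists = {"weak": [], "strong": []}
    for position, prominence in boundaries:
        label = classify_boundary(prominence)
        if label is not None:
            lists[label].append(position)
    return lists


# Made-up calls: positions in bp with their prominence scores.
example = [(20000, 0.9), (55000, 0.31), (80000, 0.12), (120000, 0.5)]
boundary_lists = split_boundaries(example)
```

Under these assumptions, using the strong list alone favors specificity, while combining the weak and strong lists favors sensitivity.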

The Micro-C dataset 4DNES21D8SP8 and the Hi-C dataset 4DNES2M5JIGV were used as references for assessing boundary thresholds. These datasets are based on 4DN Tier H1-ESC cells grown with standard protocols and constitute some of the highest resolution genome-wide 3C maps. Here, we vary the threshold on the prominence score to define boundary sets and assess their overlap with CTCF peaks. Boundaries within 5kb of CTCF ChIP-seq peaks (obtained from ENCODE) are considered as overlapping a CTCF site.

The top plots show the number of boundaries retained when a given minimum boundary strength score is applied as the threshold. The bottom plots show the proportion of those boundaries that lie within 5kb of a CTCF region at the same threshold. 0.2 and 0.5 were selected as the weak and strong boundary thresholds, respectively.
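The threshold sweep described above can be sketched as follows. This is an illustrative simplification rather than the code used for the assessment: it assumes a single chromosome, treats each CTCF peak as a single position (for example a peak midpoint) instead of an interval, counts a distance of exactly 5kb as overlapping, and uses the hypothetical names `near_any` and `sweep_thresholds`.

```python
from bisect import bisect_left


def near_any(position, sorted_sites, max_dist):
    """True if any site in `sorted_sites` lies within `max_dist` of `position`."""
    i = bisect_left(sorted_sites, position)
    # Only the nearest neighbors on each side of the insertion point matter.
    for j in (i - 1, i):
        if 0 <= j < len(sorted_sites) and abs(sorted_sites[j] - position) <= max_dist:
            return True
    return False


def sweep_thresholds(boundaries, ctcf_sites, thresholds, max_dist=5000):
    """For each threshold, return (threshold, boundary_count, ctcf_fraction).

    `boundaries` is a list of (position, prominence) pairs.
    """
    sites = sorted(ctcf_sites)
    rows = []
    for t in thresholds:
        kept = [pos for pos, prom in boundaries if prom >= t]
        hits = sum(near_any(pos, sites, max_dist) for pos in kept)
        fraction = hits / len(kept) if kept else float("nan")
        rows.append((t, len(kept), fraction))
    return rows


# Made-up boundaries and CTCF positions, for illustration only.
toy_calls = [(10000, 0.25), (50000, 0.6), (90000, 0.3)]
toy_ctcf = [12000, 70000]
rows = sweep_thresholds(toy_calls, toy_ctcf, thresholds=[0.2, 0.5])
```

Each returned row corresponds to one point on the top plot (boundary count) and one on the bottom plot (fraction near CTCF) for the chosen threshold.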

To further assess the reliability of insulation score calls in datasets of different protocols and sequencing depths, we present a comparison of boundary calls between different datasets. We take the boundary calls with a score above 0.2 from the Micro-C dataset, the dataset with the highest resolution, as the most reliable "true" set of boundaries. We compare this set to two Hi-C datasets of different sequencing depths from the same cell line. Boundaries in two datasets are considered to overlap if they are within 10kb of each other.
The top plots show the boundary counts for the Micro-C set, the Hi-C set and their overlap (within a 10kb distance), and the bottom plots show the proportion of the overlap set relative to the number of boundaries called. The left plots are the results for a dataset with 2.5 billion reads (4DNES2M5JIGV) and the right plots are the results for a dataset with 415 million reads (4DNESRJ8KV4Q).
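A minimal sketch of this comparison is given below, under stated assumptions: boundaries are single positions on one chromosome, a distance of exactly 10kb counts as overlapping, and the proportion is taken relative to the number of Hi-C boundaries called (the text does not specify the denominator, so this is a guess). The names `count_overlaps` and `overlap_summary` are hypothetical.

```python
def count_overlaps(query, reference, max_dist=10_000):
    """Count positions in `query` that lie within `max_dist` of any `reference` position."""
    query = sorted(query)
    reference = sorted(reference)
    k = 0
    shared = 0
    for q in query:
        # Skip reference positions too far to the left of this query position.
        while k < len(reference) and reference[k] < q - max_dist:
            k += 1
        # The first remaining reference position is the leftmost candidate.
        if k < len(reference) and reference[k] <= q + max_dist:
            shared += 1
    return shared


def overlap_summary(hic, microc, max_dist=10_000):
    """Summarize boundary counts and the overlapping share of the Hi-C calls."""
    shared = count_overlaps(hic, microc, max_dist)
    return {
        "microc": len(microc),
        "hic": len(hic),
        "overlap": shared,
        "proportion": shared / len(hic) if hic else float("nan"),
    }


# Made-up boundary positions, for illustration only.
toy_microc = [100_000, 250_000, 400_000]
toy_hic = [105_000, 260_000, 330_000, 700_000]
summary = overlap_summary(toy_hic, toy_microc)
```

Because both lists are sorted once and scanned with a single moving pointer, the comparison runs in time linear in the number of boundaries after sorting.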