{"lab": {"@id": "/labs/4dn-dcic-lab/", "correspondence": [{"contact_email": "cGV0ZXJfcGFya0BobXMuaGFydmFyZC5lZHU=", "@id": "/users/fb287a31-e765-41c5-8c1d-665f8e9f025b/", "display_title": "Peter Park"}], "title": "4DN DCIC, HMS", "uuid": "828cd4fe-ebb0-4b36-a94a-d2e3a36cc989", "@type": ["Lab", "Item"], "status": "current", "display_title": "4DN DCIC, HMS", "pi": {"error": "no view permissions"}, "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin", "role.lab_submitter", "submits_for.828cd4fe-ebb0-4b36-a94a-d2e3a36cc989"]}}, "body": "## Methods\n\n\nThe workflow uses the <a href=\"https://cooltools.readthedocs.io/en/latest/cooltools.html#module-cooltools.insulation\">cooltools</a> software to call diamond insulation scores using a 100kb window size and either a 5kb or 10kb bin size, for 4-cutter or 6-cutter restriction enzyme respectively. A bin size of 5kb is also used for MNase and DNase based assays. Local minima of the chromosome-wide topographic prominence track for log2(insulation score) above a 0.2 threshold are defined as boundaries. Please note that **insulation score calls are not provided on datasets with less than 100M filtered reads**, since results on low resolution data sets are found to be less reproducible and reliable.\n\nThe insulation score and boundary caller workflow components are pre-installed in a publicly available Docker image (`4dndcic/4dn-insulation-scores-and-boundaries-caller:v1`) on <a href=\"https://hub.docker.com/r/4dndcic/4dn-insulation-scores-and-boundaries-caller/\">Docker Hub</a>. The source code for the Docker image and the workflow description in Common Workflow Language (CWL) can be found on 4DN-DCIC <a href=\"https://github.com/4dn-dcic/docker-4dn-insulation-scores-and-boundaries-caller/tree/v1\">GitHub repo</a>.\n\n\n## Boundary Score Assessment\n\nIn the absence of gold standard truth information, it is not possible to assign a statistical score to the boundary calls presented. 
Based on the following assessment, it has been deemed useful to qualify boundaries as weak (0.2<=prominence<0.5) and strong (prominence>=0.5) to create two separate boundary lists for when high sensitivity vs. high specificity is desired. \n\nThe Micro-C dataset <a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES21D8SP8/\">4DNES21D8SP8</a> and the Hi-C dataset <a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES2M5JIGV/\">4DNES2M5JIGV</a> were used as references for assessing boundary thresholds. These datasets are based on 4DN Tier H1-ESC cells grown with standard protocols and constitute some of the highest resolution genome-wide 3C maps. Here, we vary the threshold on prominence score to define boundary sets and assess their overlap with CTCF peaks. Boundaries within 5kb of CTCF ChIP-seq peaks (obtained from <a href=\"https://www.encodeproject.org/search/?type=Experiment&status=released&target.label=CTCF&assembly=GRCh38&files.file_type=bed+narrowPeak\">ENCODE</a>) are considered as overlapping a CTCF site.\n\n <div> <img style=\"width: 1200px;\" src=\"https://s3.amazonaws.com/4dn-dcic-public/static-pages/ISC_analysis_plots/microc_insitu_ctcf_count_prop_v3.png\"/>\n\n <em>\n The top plots represent the number of boundaries obtained after a minimum boundary strength score is chosen as the threshold. The bottom plots represent the proportion of boundaries within 5kb distance from a CTCF region after a minimum boundary strength score is chosen as the threshold. 0.2 and 0.5 were selected as the weak and strong boundary thresholds, respectively.</em>\n\n\n<br/><br/>\nTo further assess the reliability of insulation score calls in data sets of different protocols and sequencing depths, we present a comparison of boundary calls between different data sets. We present the boundary calls with a score above 0.2 from the Micro-C dataset, the dataset with the highest resolution, as the most reliable \"true\" set of boundaries. 
We compare that to two Hi-C datasets of different sequencing depths on the same cell line. Boundaries in two data sets are considered to overlap if they are within 10kb of each other.\n\n\n <div> <img style=\"width: 1400px;\" src=\"https://s3.amazonaws.com/4dn-dcic-public/static-pages/ISC_analysis_plots/insitu_microc_comparison_v4.png\"/>\n\n <em>\n\n The top plots represent the boundary count for Micro-C set, Hi-C set and their overlap (within a 10kb distance) and the bottom represents the proportion of the overlap set to the number of boundaries called. The left plots are the results for a dataset with 2.5 billion reads (<a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES2M5JIGV/\">4DNES2M5JIGV</a>) and the right plots are the results for dataset with 415 million reads (<a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNESRJ8KV4Q/\">4DNESRJ8KV4Q</a>).</em>\n", "name": "resources.data-analysis.insulation_scores_and_boundaries_page_all", "award": {"display_title": "4D NUCLEOME NETWORK DATA COORDINATION AND INTEGRATION CENTER - PHASE I", "status": "current", "center_title": "DCIC - DCIC", "description": "DCIC: The goals of the 4D Nucleome (4DN) Data Coordination and Integration Center (DCIC) are to collect, store, curate, display, and analyze data generated in the 4DN Network. We have assembled a team of investigators with a strong track record in analysis of chromatin interaction data, image processing and three-dimensional data visualization, integrative analysis of genomic and epigenomic data, data portal development, large-scale computing, and development of secure and flexible cloud technologies. In Aim 1, we will develop efficient submission pipelines for data and metadata from 4DN data production groups. We will define data/metadata requirements and quality metrics in conjunction with the production groups and ensure that high-quality, well- annotated data become available to the wider scientific community in a timely manner. 
In Aim 2, we will develop a user-friendly data portal for the broad scientific community. This portal will provide an easy-to-navigate interface for accessing raw and intermediate data files, allow for programmatic access via APIs, and will incorporate novel analysis and visualization tools developed by DCIC as well as other Network members. For computing and storage scalability and cost-effectiveness, significant efforts will be devoted to development and deployment of cloud-based technology. We will conduct tutorials and workshops to facilitate the use of 4DN data and tools by external investigators. In Aim 3, we will coordinate and assist in conducting integrative analysis of the multiple data types. These efforts will examine key questions in higher-order chromatin organization using both sequence and image data, and the tools and algorithms developed here will be incorporated into the data portal for use by other investigators. These three aims will ensure that the data generated in 4DN will have maximal impact for the scientific community.", "@type": ["Award", "Item"], "@id": "/awards/1U01CA200059-01/", "project": "4DN", "name": "1U01CA200059-01", "uuid": "b0b9c607-f8b4-4f02-93f4-9895b461334b", "pi": {"error": "no view permissions"}, "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}, "title": "Insulation Scores and Boundaries", "status": "released", "aliases": ["4dn-dcic-lab:4dn-dcic-lab:ISC_analysis_static_section_all"], "options": {"filetype": "md", "collapsible": true, "default_open": true}, "date_created": "2020-11-24T15:45:04.677066+00:00", "section_type": "Page Section", "submitted_by": {"error": "no view permissions"}, "last_modified": {"modified_by": {"error": "no view permissions"}, "date_modified": "2020-12-17T06:51:19.487612+00:00"}, "schema_version": "2", "@id": "/static-sections/93bf54b9-a6f2-4ece-8937-9f4e503bd2d8/", "@type": ["StaticSection", "UserContent", "Item"], "uuid": "93bf54b9-a6f2-4ece-8937-9f4e503bd2d8", 
"principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin", "role.owner", "userid.56c9c683-bb11-471b-b590-c656f7dc03c1"]}, "display_title": "Insulation Scores and Boundaries", "external_references": [], "content": "## Methods\n\n\nThe workflow uses the <a href=\"https://cooltools.readthedocs.io/en/latest/cooltools.html#module-cooltools.insulation\">cooltools</a> software to call diamond insulation scores using a 100kb window size and either a 5kb or 10kb bin size, for 4-cutter or 6-cutter restriction enzyme respectively. A bin size of 5kb is also used for MNase and DNase based assays. Local minima of the chromosome-wide topographic prominence track for log2(insulation score) above a 0.2 threshold are defined as boundaries. Please note that **insulation score calls are not provided on datasets with less than 100M filtered reads**, since results on low resolution data sets are found to be less reproducible and reliable.\n\nThe insulation score and boundary caller workflow components are pre-installed in a publicly available Docker image (`4dndcic/4dn-insulation-scores-and-boundaries-caller:v1`) on <a href=\"https://hub.docker.com/r/4dndcic/4dn-insulation-scores-and-boundaries-caller/\">Docker Hub</a>. The source code for the Docker image and the workflow description in Common Workflow Language (CWL) can be found on 4DN-DCIC <a href=\"https://github.com/4dn-dcic/docker-4dn-insulation-scores-and-boundaries-caller/tree/v1\">GitHub repo</a>.\n\n\n## Boundary Score Assessment\n\nIn the absence of gold standard truth information, it is not possible to assign a statistical score to the boundary calls presented. Based on the following assessment, it has been deemed useful to qualify boundaries as weak (0.2<=prominence<0.5) and strong (prominence>=0.5) to create two separate boundary lists for when high sensitivity vs. high specificity is desired. 
\n\nThe Micro-C dataset <a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES21D8SP8/\">4DNES21D8SP8</a> and the Hi-C dataset <a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES2M5JIGV/\">4DNES2M5JIGV</a> were used as references for assessing boundary thresholds. These datasets are based on 4DN Tier H1-ESC cells grown with standard protocols and constitute some of the highest resolution genome-wide 3C maps. Here, we vary the threshold on prominence score to define boundary sets and assess their overlap with CTCF peaks. Boundaries within 5kb of CTCF ChIP-seq peaks (obtained from <a href=\"https://www.encodeproject.org/search/?type=Experiment&status=released&target.label=CTCF&assembly=GRCh38&files.file_type=bed+narrowPeak\">ENCODE</a>) are considered as overlapping a CTCF site.\n\n <div> <img style=\"width: 1200px;\" src=\"https://s3.amazonaws.com/4dn-dcic-public/static-pages/ISC_analysis_plots/microc_insitu_ctcf_count_prop_v3.png\"/>\n\n <em>\n The top plots represent the number of boundaries obtained after a minimum boundary strength score is chosen as the threshold. The bottom plots represent the proportion of boundaries within 5kb distance from a CTCF region after a minimum boundary strength score is chosen as the threshold. 0.2 and 0.5 were selected as the weak and strong boundary thresholds, respectively.</em>\n\n\n<br/><br/>\nTo further assess the reliability of insulation score calls in data sets of different protocols and sequencing depths, we present a comparison of boundary calls between different data sets. We present the boundary calls with a score above 0.2 from the Micro-C dataset, the dataset with the highest resolution, as the most reliable \"true\" set of boundaries. We compare that to two Hi-C datasets of different sequencing depths on the same cell line. 
Boundaries in two data sets are considered to overlap if they are within 10kb of each other.\n\n\n <div> <img style=\"width: 1400px;\" src=\"https://s3.amazonaws.com/4dn-dcic-public/static-pages/ISC_analysis_plots/insitu_microc_comparison_v4.png\"/>\n\n <em>\n\n The top plots represent the boundary count for Micro-C set, Hi-C set and their overlap (within a 10kb distance) and the bottom represents the proportion of the overlap set to the number of boundaries called. The left plots are the results for a dataset with 2.5 billion reads (<a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES2M5JIGV/\">4DNES2M5JIGV</a>) and the right plots are the results for dataset with 415 million reads (<a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNESRJ8KV4Q/\">4DNESRJ8KV4Q</a>).</em>\n", "filetype": "md", "content_as_html": "<div class=\"markdown-container\"><h2>Methods</h2>\n<p>The workflow uses the <a href=\"https://cooltools.readthedocs.io/en/latest/cooltools.html#module-cooltools.insulation\" rel=\"noopener noreferrer\" target=\"_blank\">cooltools</a> software to call diamond insulation scores using a 100kb window size and either a 5kb or 10kb bin size, for 4-cutter or 6-cutter restriction enzyme respectively. A bin size of 5kb is also used for MNase and DNase based assays. Local minima of the chromosome-wide topographic prominence track for log2(insulation score) above a 0.2 threshold are defined as boundaries. 
Please note that <strong>insulation score calls are not provided on datasets with less than 100M filtered reads</strong>, since results on low resolution data sets are found to be less reproducible and reliable.</p>\n<p>The insulation score and boundary caller workflow components are pre-installed in a publicly available Docker image (<code>4dndcic/4dn-insulation-scores-and-boundaries-caller:v1</code>) on <a href=\"https://hub.docker.com/r/4dndcic/4dn-insulation-scores-and-boundaries-caller/\" rel=\"noopener noreferrer\" target=\"_blank\">Docker Hub</a>. The source code for the Docker image and the workflow description in Common Workflow Language (CWL) can be found on 4DN-DCIC <a href=\"https://github.com/4dn-dcic/docker-4dn-insulation-scores-and-boundaries-caller/tree/v1\" rel=\"noopener noreferrer\" target=\"_blank\">GitHub repo</a>.</p>\n<h2>Boundary Score Assessment</h2>\n<p>In the absence of gold standard truth information, it is not possible to assign a statistical score to the boundary calls presented. Based on the following assessment, it has been deemed useful to qualify boundaries as weak (0.2&lt;=prominence&lt;0.5) and strong (prominence&gt;=0.5) to create two separate boundary lists for when high sensitivity vs. high specificity is desired. </p>\n<p>The Micro-C dataset <a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES21D8SP8/\" rel=\"noopener noreferrer\" target=\"_blank\">4DNES21D8SP8</a> and the Hi-C dataset <a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES2M5JIGV/\" rel=\"noopener noreferrer\" target=\"_blank\">4DNES2M5JIGV</a> were used as references for assessing boundary thresholds. These datasets are based on 4DN Tier H1-ESC cells grown with standard protocols and constitute some of the highest resolution genome-wide 3C maps. Here, we vary the threshold on prominence score to define boundary sets and assess their overlap with CTCF peaks. 
Boundaries within 5kb of CTCF ChIP-seq peaks (obtained from <a href=\"https://www.encodeproject.org/search/?type=Experiment&amp;status=released&amp;target.label=CTCF&amp;assembly=GRCh38&amp;files.file_type=bed+narrowPeak\" rel=\"noopener noreferrer\" target=\"_blank\">ENCODE</a>) are considered as overlapping a CTCF site.</p>\n<div> <img src=\"https://s3.amazonaws.com/4dn-dcic-public/static-pages/ISC_analysis_plots/microc_insitu_ctcf_count_prop_v3.png\" style=\"width: 1200px;\"/>\n<em>\n The top plots represent the number of boundaries obtained after a minimum boundary strength score is chosen as the threshold. The bottom plots represent the proportion of boundaries within 5kb distance from a CTCF region after a minimum boundary strength score is chosen as the threshold. 0.2 and 0.5 were selected as the weak and strong boundary thresholds, respectively.</em>\n<br/><br/>\nTo further assess the reliability of insulation score calls in data sets of different protocols and sequencing depths, we present a comparison of boundary calls between different data sets. We present the boundary calls with a score above 0.2 from the Micro-C dataset, the dataset with the highest resolution, as the most reliable \"true\" set of boundaries. We compare that to two Hi-C datasets of different sequencing depths on the same cell line. Boundaries in two data sets are considered to overlap if they are within 10kb of each other.\n\n\n <div> <img src=\"https://s3.amazonaws.com/4dn-dcic-public/static-pages/ISC_analysis_plots/insitu_microc_comparison_v4.png\" style=\"width: 1400px;\"/>\n<em>\n\n The top plots represent the boundary count for Micro-C set, Hi-C set and their overlap (within a 10kb distance) and the bottom represents the proportion of the overlap set to the number of boundaries called. 
The left plots are the results for a dataset with 2.5 billion reads (<a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNES2M5JIGV/\" rel=\"noopener noreferrer\" target=\"_blank\">4DNES2M5JIGV</a>) and the right plots are the results for dataset with 415 million reads (<a href=\"https://data.4dnucleome.org/experiment-set-replicates/4DNESRJ8KV4Q/\" rel=\"noopener noreferrer\" target=\"_blank\">4DNESRJ8KV4Q</a>).</em></div></div></div>", "@context": "/terms/", "aggregated-items": {}, "validation-errors": []}