Hi-C数据的分辨率(resolution)通常指接触矩阵(contact matrix)的bin大小(如1kb、5kb或10kb),分辨率的高低决定了下游可以做哪些分析。
分辨率范围 | 典型分析 | 应用研究问题 |
---|---|---|
1kb–10kb | 染色质环、增强子-启动子互作 | 基因精细调控、非编码区域功能 |
25kb–100kb | TAD结构、A/B区室 | 染色质结构与表达、疾病状态比较 |
100kb–500kb | 染色质区室全局特征 | 全基因组结构比较、发育差异 |
500kb–2Mb | 染色体区域互作、染色体领地 | 核内染色体排布、进化差异 |
Hi-C数据有效分辨率定义为,最小的bin大小,其中大部分(通常80%)基因组bin具有足够的接触数(contacts),比如说至少1000个unique contacts,以确保矩阵不稀疏。 (Rao et al., 2014, Cell)。Hi-C数据的分辨率由测序reads的数量决定,reads越多,能支持的分辨率越高(bin大小越小。
Juicer提供了一个脚本用于估计分辨率,也就是juicer-1.6/misc/calculate_map_resolution.sh
,脚本先算 50 bp 端点覆盖,再用二分法找到“让 ≥ 80 % 的基因组 bin 都累积到 ≥ 1 000 条接触”的最小 bin 大小;这个尺寸就是 Hi‑C 数据能可靠解析的图谱分辨率。
使用人类数据SRR1658573进行测试(32分钟),估算是17.8K的分辨率,对应25K分辨率的矩阵。
time bash ~/software/juicer-1.6/misc/calculate_map_resolution.sh merged_nodups.txt cov50.txt
# The map resolution is 17800
# 1438.80s user 39.90s system 76% cpu 32:15.42 total
如果你觉得速度太慢,可以用我写的Rust重构版本 https://github.com/xuzhougeng/hic_resolution_rs,速度可以快5倍左右。
也可以直接用juicer输出给的hic矩阵里分析,看看不同分辨率下的矩阵是否满足需求。
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
pip install -i https://pypi.org/simple/ hic-straw
Estimate effective Hi-C resolution per chromosome.
Usage:
python calc_hic_resolution.py <hic_file> <chromosome> [--thr 1000] [--pct 0.8]
"""
import argparse
import sys
from hicstraw import HiCFile, straw
import numpy as np
# ----------------------------------------------------------------------
# Helper
# ----------------------------------------------------------------------
def find_matching_chr(requested, available):
"""在 available 里寻找与 requested 对应的染色体名称(大小写 / chr 前缀不敏感)。"""
req = requested.lower().lstrip("chr")
for name in available:
if name.lower() == requested.lower() or name.lower().lstrip("chr") == req:
return name
return None
def coverage_at_bin(hic_path, chrom, res, thr):
"""计算给定染色体 & bin size 时,≥thr contacts 的 bin 占比。"""
records = straw("observed", "NONE", hic_path, chrom, chrom, "BP", res)
counts = {}
for rec in records: # contactRecord 对象
counts[rec.binX] = counts.get(rec.binX, 0) + rec.counts
counts[rec.binY] = counts.get(rec.binY, 0) + rec.counts
arr = np.fromiter(counts.values(), dtype=np.int64)
return (arr >= thr).mean() if arr.size else 0.0
# ----------------------------------------------------------------------
# Main
# ----------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(
description="Estimate Hi-C effective resolution on a given chromosome",
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("hic_file", help=".hic file produced by Juicer/HiC-Pro")
parser.add_argument("chromosome", help="Chromosome name, e.g. 1 / chr1 / X")
parser.add_argument("--thr", type=int, default=1000,
help="Minimum contacts per bin to count as covered")
parser.add_argument("--pct", type=float, default=0.8,
help="Coverage fraction defining effective resolution (0–1)")
args = parser.parse_args()
# 打开 .hic 文件并解析元数据
hic = HiCFile(args.hic_file)
chroms = [c.name for c in hic.getChromosomes() if c.name.lower() != "all"]
chrom = find_matching_chr(args.chromosome, chroms)
if chrom is None:
sys.stderr.write(
f"[ERROR] 未找到染色体 '{args.chromosome}',"
f"可选值: {', '.join(chroms)}\n")
sys.exit(1)
print(f"# File: {args.hic_file}")
print(f"# Chromosome: {chrom}")
print(f"# Threshold per bin: {args.thr} contacts")
print(f"# Required coverage: {args.pct*100:.1f}% bins\n")
print("resolution_bp\tcoverage")
resolutions = sorted(hic.getResolutions())
eff_res = None
for res in resolutions:
cov = coverage_at_bin(args.hic_file, chrom, res, args.thr)
print(f"{res}\t{cov:.3f}")
if eff_res is None and cov >= args.pct:
eff_res = res
if eff_res:
print(f"\nEffective resolution on {chrom}: {eff_res:,} bp "
f"(≥{args.pct*100:.0f}% bins ≥ {args.thr} contacts)")
else:
print(f"\nNo resolution met the {args.pct*100:.0f}% / {args.thr} contacts criterion.")
if __name__ == "__main__":
main()
结可以发现,10K太高,25K挺好的
5000 0.000
10000 0.219
25000 0.940
50000 0.963
100000 0.975
250000 0.991
500000 0.993
1000000 0.996
2500000 1.000
这跟https://github.com/ClaireMarchal/HiCRes 预测结果24744 bp是一致的