Hi-C估算分辨率

Hi-C数据的分辨率(resolution)通常指接触矩阵(contact matrix)的bin大小(如1kb、5kb或10kb),分辨率的高低决定了下游可以做哪些分析。

分辨率范围 典型分析 应用研究问题
1kb–10kb 染色质环、增强子-启动子互作 基因精细调控、非编码区域功能
25kb–100kb TAD结构、A/B区室 染色质结构与表达、疾病状态比较
100kb–500kb 染色质区室全局特征 全基因组结构比较、发育差异
500kb–2Mb 染色体区域互作、染色体领地 核内染色体排布、进化差异

Hi-C数据有效分辨率定义为,最小的bin大小,其中大部分(通常80%)基因组bin具有足够的接触数(contacts),比如说至少1000个unique contacts,以确保矩阵不稀疏。 (Rao et al., 2014, Cell)。Hi-C数据的分辨率由测序reads的数量决定,reads越多,能支持的分辨率越高(bin大小越小。

Juicer提供了一个脚本用于估计分辨率,也就是juicer-1.6/misc/calculate_map_resolution.sh,脚本先算 50 bp 端点覆盖,再用二分法找到“让 ≥ 80 % 的基因组 bin 都累积到 ≥ 1 000 条接触”的最小 bin 大小;这个尺寸就是 Hi‑C 数据能可靠解析的图谱分辨率。

使用人类数据SRR1658573进行测试(32分钟),估算是17.8K的分辨率,对应25K分辨率的矩阵。

time bash ~/software/juicer-1.6/misc/calculate_map_resolution.sh merged_nodups.txt cov50.txt
# The map resolution is 17800
# 1438.80s user 39.90s system 76% cpu 32:15.42 total

如果你觉得速度太慢,可以用我写的Rust重构版本 https://github.com/xuzhougeng/hic_resolution_rs,速度可以快5倍左右。

也可以直接用juicer输出给的hic矩阵里分析,看看不同分辨率下的矩阵是否满足需求。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
pip install -i https://pypi.org/simple/ hic-straw

Estimate effective Hi-C resolution per chromosome.
Usage:
    python calc_hic_resolution.py <hic_file> <chromosome> [--thr 1000] [--pct 0.8]
"""
import argparse
import sys
from hicstraw import HiCFile, straw
import numpy as np

# ----------------------------------------------------------------------
# Helper
# ----------------------------------------------------------------------
def find_matching_chr(requested, available):
    """在 available 里寻找与 requested 对应的染色体名称(大小写 / chr 前缀不敏感)。"""
    req = requested.lower().lstrip("chr")
    for name in available:
        if name.lower() == requested.lower() or name.lower().lstrip("chr") == req:
            return name
    return None

def coverage_at_bin(hic_path, chrom, res, thr):
    """计算给定染色体 & bin size 时,≥thr contacts 的 bin 占比。"""
    records = straw("observed", "NONE", hic_path, chrom, chrom, "BP", res)
    counts = {}
    for rec in records:                               # contactRecord 对象
        counts[rec.binX] = counts.get(rec.binX, 0) + rec.counts
        counts[rec.binY] = counts.get(rec.binY, 0) + rec.counts
    arr = np.fromiter(counts.values(), dtype=np.int64)
    return (arr >= thr).mean() if arr.size else 0.0

# ----------------------------------------------------------------------
# Main
# ----------------------------------------------------------------------
def main():
    parser = argparse.ArgumentParser(
        description="Estimate Hi-C effective resolution on a given chromosome",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("hic_file", help=".hic file produced by Juicer/HiC-Pro")
    parser.add_argument("chromosome", help="Chromosome name, e.g. 1 / chr1 / X")
    parser.add_argument("--thr", type=int, default=1000,
                        help="Minimum contacts per bin to count as covered")
    parser.add_argument("--pct", type=float, default=0.8,
                        help="Coverage fraction defining effective resolution (0–1)")
    args = parser.parse_args()

    # 打开 .hic 文件并解析元数据
    hic = HiCFile(args.hic_file)
    chroms = [c.name for c in hic.getChromosomes() if c.name.lower() != "all"]

    chrom = find_matching_chr(args.chromosome, chroms)
    if chrom is None:
        sys.stderr.write(
            f"[ERROR] 未找到染色体 '{args.chromosome}',"
            f"可选值: {', '.join(chroms)}\n")
        sys.exit(1)

    print(f"# File: {args.hic_file}")
    print(f"# Chromosome: {chrom}")
    print(f"# Threshold per bin: {args.thr} contacts")
    print(f"# Required coverage: {args.pct*100:.1f}% bins\n")
    print("resolution_bp\tcoverage")

    resolutions = sorted(hic.getResolutions())
    eff_res = None
    for res in resolutions:
        cov = coverage_at_bin(args.hic_file, chrom, res, args.thr)
        print(f"{res}\t{cov:.3f}")
        if eff_res is None and cov >= args.pct:
            eff_res = res

    if eff_res:
        print(f"\nEffective resolution on {chrom}: {eff_res:,} bp "
              f"(≥{args.pct*100:.0f}% bins ≥ {args.thr} contacts)")
    else:
        print(f"\nNo resolution met the {args.pct*100:.0f}% / {args.thr} contacts criterion.")

if __name__ == "__main__":
    main()


结可以发现,10K太高,25K挺好的

    5000        0.000
   10000        0.219
   25000        0.940
   50000        0.963
  100000        0.975
  250000        0.991
  500000        0.993
 1000000        0.996
 2500000        1.000

这跟https://github.com/ClaireMarchal/HiCRes 预测结果24744 bp是一致的

# Hi-C 

评论

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×