动物如何准备pyscenic所需的cistarget数据库

思路: https://resources.aertslab.org/cistarget/ 提供了全面的非冗余的motif数据库,以及人类、动物和昆虫的TF与motif对应表格,因此,如果研究动物,则基于同源序列相似性,将基因编号与人类基因对应,研究昆虫,则与果蝇对应。

数据库中,Mouse和Chicken就是基于同源基因跟人类基因对应得到。

首先,配置create_cisTarget_databases环境

# Clone git repo.
git clone https://github.com/aertslab/create_cisTarget_databases
~~
~~cd create_cisTarget_databases

# Display to which value ${create_cistarget_databases_dir} variable should be set.
echo "create_cistarget_databases_dir='""${PWD}""'"

# Create conda environment.
conda create -n create_cistarget_databases \
    'python=3.10' \
    'numpy=1.21' \
    'pandas>=1.4.1' \
    'pyarrow>=7.0.0' \
    'numba>=0.55.1' \
    'python-flatbuffers'
    
# Activate conda environment.
conda activate create_cistarget_databases

cd "${CONDA_PREFIX}/bin"

# Download precompiled Cluster-Buster binary.
wget https://resources.aertslab.org/cistarget/programs/cbust
# Download liftOver.
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
# Download bigWigAverageOverBed.
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigAverageOverBed
# Make downloaded binaries executable.
chmod a+x cbust  liftOver bigWigAverageOverBed

# 安装一些库
pip install pandas==1.3.5 pyranges  pyfaidx

下一步,是启动环境,并配置好create_cistarget_databases_dir的路径,例如我是家目录下,那么路径应该写成

# Activate conda environment.
conda activate create_cistarget_databases

# Set ${create_cistarget_databases_dir} variable to path where the repo was cloned to.
create_cistarget_databases_dir=$HOME/create_cisTarget_databases

我们以Pig为例,创建文件夹,并从ENSEMBLE上下载相关的文件

mkdir -p GRN_DB
cd GRN_DB
# 下载基因组序列和GTF文件
wget https://ftp.ensembl.org/pub/release-112/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
wget https://ftp.ensembl.org/pub/release-112/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.112.gtf.gz
pigz -d Sus_scrofa.Sscrofa11.1.112.gtf.gz

此外,下载CISTARGET提供的现成的motif数据。

wget https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/v10nr_clust_public.zip
unzip v10nr_clust_public.zip

从ENSEMBLE上下载的GTF文件中可能包含gene_name这一记录,一般对应的是人类的同源基因,因而可以得到当前物种的基因与人类基因的对应关系。我编写了一个脚本用于提取,

import re

def parse_gtf(gtf_file):
    gene_info = {}
    with open(gtf_file, 'r') as file:
        for line in file:
            if line.startswith('#'):
                continue  # Skip header lines
            fields = line.strip().split('\t')
            if fields[2] == 'gene':
                attributes = fields[8]
                gene_id_match = re.search('gene_id "([^"]+)"', attributes)
                gene_name_match = re.search('gene_name "([^"]+)"', attributes)
                
                if gene_id_match:
                    gene_id = gene_id_match.group(1)
                    gene_name = gene_name_match.group(1) if gene_name_match else gene_id
                    gene_info[gene_id] = gene_name
    
    return gene_info

def save_gene_info(gene_info, output_file):
    with open(output_file, 'w') as file:
        for gene_id, gene_name in gene_info.items():
            file.write(f'{gene_id}\t{gene_name}\n')

# Path to your GTF file
import sys
gtf_file_path = sys.argv[1]

# Output file path
output_file_path = sys.argv[2]

# Parse the GTF file
gene_info = parse_gtf(gtf_file_path)

# Save the gene info to a file
save_gene_info(gene_info, output_file_path)

print("Gene ID to Gene Name mapping has been saved to", output_file_path)

将其保存到一个gene2symbol.py 文件中,并运行,得到gene2symbol.txt

python3 gene2symbol.py Sus_scrofa.Sscrofa11.1.112.gtf gene2symbol.txt

然后,基于gene2symbol.txt,对v10nr_clust_public/snapshots/motifs-v10-nr.hgnc-m0.00001-o0.0.tbl 进行过滤,只保留存在同源基因tbl

import pandas as pd
import sys

# 路径可能需要根据你的文件系统进行调整
motifs_file = sys.argv[1]
gene2symbol_file = sys.argv[2]

    
# 读取 gene2symbol 文件并建立基因名集合
gene_names = set()
with open(gene2symbol_file, 'r') as file:
    for line in file:
        parts = line.strip().split()
        if len(parts) > 1:  # 确保行有两个以上的部分
            gene_names.add(parts[1])  # 假设第二列是基因名

# 读取 motifs 文件并直接输出符合条件的行
with open(motifs_file, 'r') as file:
    headers = next(file).strip()  # 读取头部行
    print(headers)  # 输出头部行
    for line in file:
        if line.startswith('#'):
            print(line.strip())
        else:
          parts = line.strip().split('\t')
          if len(parts) > 5 and parts[5] in gene_names:
              print(line.strip())  # 输出符合条件的行


存在一个名为filter_tbl.py的脚本中,然后运行

python3 filter_tbl.py gene2symbol.txt v10nr_clust_public/snapshots/motifs-v10-nr.hgnc-m0.00001-o0.0.tbl > motifs-v10-nr.Sus.tbl

然后,我们从 motifs-v10-nr.Sus.tbl提取出所有的motif,并过滤掉,不存在与singletons的motif

grep -v '^#' motifs-v10-nr.Sus.tbl|cut -f 1 |sort -u > motif_all.txt
ls v10nr_clust_public/singletons | cut -d '.' -f 1 > motif_nr.txt
grep -Ff motif_all.txt motif_nr.txt > motifs_sus.txt

此外,还需要基于基因组序列和注释文件,提取序列

from pyfaidx import Fasta
import pyranges as pr
import argparse

def read_fasta(filename):
    """Read a FASTA file using pyfaidx, which provides efficient access to sequences."""
    return Fasta(filename)  # This will index the file if not already indexed

def write_fasta(fasta_dict, output_filename):
    """Write a dictionary to a FASTA file."""
    with open(output_filename, 'w') as file:
        for header, sequence in fasta_dict.items():
            file.write(f">" + header + "\n")
            file.write(sequence + "\n")

def extract_sequence(fasta, chrom, start, end):
    """Extract sequence from a FASTA file using pyfaidx."""
    sequence = fasta[chrom][start:end].seq
    return sequence

def main():
    parser = argparse.ArgumentParser(description="Extract sequences around gene start positions considering gene strand orientation.")
    parser.add_argument("genomic_fasta_filename", help="Genomic FASTA file.")
    parser.add_argument("gtf_filename", help="GTF file containing gene annotations.")
    parser.add_argument("output_fasta_filename", help="Output FASTA file for extracted sequences.")
    parser.add_argument("upstream_bp", type=int, help="Number of base pairs to extract upstream of each gene start considering strand.")
    parser.add_argument("downstream_bp", type=int, help="Number of base pairs to extract downstream of each gene start considering strand.")
    parser.add_argument("-n", "--name_field", default="gene_name", help="Field in GTF to use for naming sequences. If not found, an error will be raised.")

    args = parser.parse_args()

    # Load the FASTA file using pyfaidx
    fasta = read_fasta(args.genomic_fasta_filename)

    # Load GTF file using PyRanges
    gr = pr.read_gtf(args.gtf_filename)
    genes = gr[gr.Feature == "gene"]

    # Prepare the new FASTA entries
    new_fasta = {}

    for idx, row in genes.df.iterrows():
        chrom, start, end, strand = row['Chromosome'], row['Start'], row['End'], row['Strand']
        gene_name = row[args.name_field]

        if not isinstance(gene_name, str):
            continue

        if strand == '+':
            upstream_start = max(0, start - args.upstream_bp)
            downstream_end = start + args.downstream_bp
        else:
            upstream_start = end - args.downstream_bp
            downstream_end = end + args.upstream_bp

        # Ensure downstream end does not exceed chromosome length
        chrom_length = len(fasta[chrom])
        downstream_end = min(downstream_end, chrom_length)

        # Extract and save the sequence
        extracted_sequence = extract_sequence(fasta, chrom, upstream_start, downstream_end)
        new_fasta[gene_name] = extracted_sequence

    # Write the new fasta file
    write_fasta(new_fasta, args.output_fasta_filename)

if __name__ == "__main__":
    main()


提取以gene的起始的上游1kb,下游0bp, 用gene_name作为序列命名

python3 extract_fasta.py -n gene_name Sus_scrofa.Sscrofa11.1.dna.toplevel.fa Sus_scrofa.Sscrofa11.1.112.gtf gene_up1kb_down0kb.fasta 1000 0

最后运行create_cistarget_motif_databases进行构建

# FASTA file with sequences per region IDs / gene IDs.
fasta_filename=gene_up1kb_down0kb.fasta
# Directory with motifs in Cluster-Buster format.
motifs_dir=v10nr_clust_public/singletons
# File with motif IDs (base name of motif file in ${motifs_dir}).
motifs_list_filename=motifs_sus.txt
# cisTarget motif database output prefix.
db_prefix=Sus_scrofa_up1kb_down0kb

nbr_threads=24

"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"

最终,GRN所需要的文件如下

  • tbl: motifs-v10-nr.Sus.tbl
  • motif_name: motifs_sus.txt
  • feather: 有三个, 按需使用
    • Sus_scrofa_up1kb_down0kb.motifs_vs_regions.scores.feather
    • Sus_scrofa_up1kb_down0kb.regions_vs_motifs.rankings.feather
    • Sus_scrofa_up1kb_down0kb.regions_vs_motifs.scores.feather

评论

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×