# ENCODE4 Integrated Regulation Track (wgEncodeReg4) for hg38
# Redmine #34923
# Lou Nassar, 2026-03-12 (updated 2026-04-10)

# This track converts the ENCODE V4 Regulation hub into a native UCSC browser
# supertrack containing organ-averaged multiWig signal tracks, TF rPeak clusters,
# and individual experiment composites (bigComposite/faceted format) for
# epigenetics, RNA-seq, and TF ChIP-seq.

# The original hub was prepared by Mingshi Gao (Weng lab, UMass Chan Medical School):
#   https://users.wenglab.org/gaomingshi/ENCODE_Reg/hub.txt
# It was cloned locally for processing:
#   /hive/data/outside/encode4/ccre/ENCODE_V4_Regulation/hub.txt

# Scripts are located at:
#   kent/src/hg/makeDb/scripts/encode4regulation/

##############################################################################
# Step 1: Clone the hub data locally
##############################################################################

# The hub was cloned with hubClone -download:
mkdir -p /hive/data/outside/encode4/ccre
cd /hive/data/outside/encode4/ccre
hubClone -download https://users.wenglab.org/gaomingshi/ENCODE_Reg/hub.txt \
    ENCODE_V4_Regulation

# Total data: ~5.6 TB across ~7,000 files (bigWig + bigBed)

##############################################################################
# Step 2: Create gbdb symlinks for organ-averaged multiWig files
##############################################################################

# Only organ-averaged multiWig files and TF rPeak files need gbdb symlinks.
# Individual experiment tracks use S3 URLs directly.

mkdir -p /gbdb/hg38/encode4/regulation/organAve
cd /hive/data/outside/encode4/ccre/ENCODE_V4_Regulation

# Symlinks were created using rename_symlinks.py which maps hub filenames
# to UCSC convention (camelCase, .bw extension):
#   ave.adipose.H3K27ac.bigWig -> adiposeH3K27ac.bw  (all-biosamples)
#   tp.blood.H3K27ac.bigWig -> tpBloodH3K27ac.bw     (tissue-only)
#   adrenal_gland.minus.bigWig -> adrenalGlandMinus.bw (RNA strand)
#   tp.blood.plus.bigWig -> tpBloodPlus.bw            (RNA tp strand)

# TF rPeak clusters (from existing ENCODEv4TFrPeaks bed)
mkdir -p /gbdb/hg38/encode4/regulation/tfRpeak
ln -s /hive/data/genomes/hg38/bed/ENCODEv4TFrPeaks/no_trim.TF_name.rPeaks.bb \
    /gbdb/hg38/encode4/regulation/tfRpeak/TFrPeakClusters.bb
ln -s /hive/data/genomes/hg38/bed/ENCODEv4TFrPeaks/no_trim.TF_name.decorator.bb \
    /gbdb/hg38/encode4/regulation/tfRpeak/TFrPeakClustersDecorator.bb

# Symlink summary:
#   organAve/  363 symlinks (113 tissue-only + 250 all-biosamples/RNA)
#   tfRpeak/     2 symlinks
#   Total:     365 symlinks + 5 metadata/JSON files + 1 nose DNase = 371 files

##############################################################################
# Step 2b: Download tissue-only (tp) organ-averaged files (2026-04-08)
##############################################################################

# The initial hub had a bug where tissue-only and all-biosamples variants pointed
# to the same file. The Weng lab fixed this in April 2026 with proper separate
# tp.{organ}.{Assay}.bigWig files. These were downloaded and symlinked:

cd /hive/users/lrnassar/claude/RM34923
bash download_tp_files.sh    # downloads 83 tp bigWig files (~733 GB)
bash create_tp_symlinks.sh   # creates 83 symlinks: tp{Organ}{Assay}.bw

# Also downloaded 1 new file: ave.nose.DNase.bigWig (nose organ added to hub)
# Removed 1 orphan: lymphoidTissueDNase.bw (hub removed from all-bio group)
# Renamed 2 old symlinks: boneMarrowTpH3K27ac.bw -> tpBoneMarrowH3K27ac.bw
#                          boneMarrowTpH3K4me3.bw -> tpBoneMarrowH3K4me3.bw

##############################################################################
# Step 3: Validate local files against ENCODE portal
##############################################################################

# Run the validation script to verify md5sums of local files against the
# ENCODE REST API and generate S3 URL mapping for bigComposite tracks.

cd /hive/users/lrnassar/claude/RM34923
python3 kent/src/hg/makeDb/scripts/encode4regulation/validate_encode_urls.py

# This produces:
#   encode4_url_mapping.tsv  — maps accessions to S3 URLs with md5 validation
#   encode4_validation.log   — detailed log of any mismatches
# All S3 URLs verified accessible via bigWigInfo/bigBedInfo (0 failures).

##############################################################################
# Step 4: Generate multiWig trackDb stanzas
##############################################################################

# The multiWig organ-averaged tracks (H3K27ac, DNase, ATAC, H3K4me3, CTCF, Txn)
# are generated from the hub.txt and assembled into the main wgEncodeReg4.ra.

cd /hive/users/lrnassar/claude/RM34923
python3 kent/src/hg/makeDb/scripts/encode4regulation/generate_multiwig_ra.py \
    > multiwig_output.ra

# The output was manually integrated into wgEncodeReg4.ra along with:
#   - SuperTrack definition (priority 0.9, group=regulation)
#   - TF rPeak track (bigBed 12+ with decorator, 911-factor filterValues)
#   - Include directives for the 3 bigComposite files
# The main file is hand-maintained:
#   kent/src/hg/makeDb/trackDb/human/hg38/wgEncodeReg4.ra (~3,461 lines)

# multiWig track details:
#   wgEncodeReg4MarkH3k27ac — full visibility (priority 1.4)
#   wgEncodeReg4Dnase       — hidden (priority 1.1)
#   wgEncodeReg4Atac        — hidden (priority 1.2)
#   wgEncodeReg4MarkH3k4me3 — hidden (priority 1.3)
#   wgEncodeReg4MarkCtcf    — hidden (priority 1.5)
#   wgEncodeReg4Txn         — hidden (priority 1.6)
#   wgEncodeReg4TfPeaks     — hidden (priority 1.7)

# Each multiWig has tissue-only subtracks first, then all-biosamples subtracks.
# Tissue-only subtracks use tp{Organ}{Assay}.bw gbdb symlinks pointing to
# tp.{organ}.{Assay}.bigWig files (genuine tissue/primary cell averages).
# All-biosamples subtracks use {organ}{Assay}.bw symlinks pointing to
# ave.{organ}.{Assay}.bigWig files (averages including cell lines).
#
# Subtrack counts per assay (tissue-only + all-biosamples):
#   H3K27ac: 17 + 38 = 55
#   DNase:   24 + 40 = 64
#   ATAC:     8 + 26 = 34
#   H3K4me3: 19 + 38 = 57
#   CTCF:    15 + 36 = 51
#   RNA:     30 + 72 = 102 (strand pairs: Plus/Minus)

##############################################################################
# Step 5: Generate bigComposite (faceted) individual experiment tracks
##############################################################################

# The three individual experiment composites use the new bigComposite/faceted
# format (refs #36320). This was a two-step process:
#
# Step 5a: Generate traditional composites from hub
#   generate_composites.py parses hub.txt and creates compositeTrack-on-style
#   .ra files with subGroups, views, dimensions, etc.
#
# Step 5b: Convert to bigComposite faceted format
#   convert_to_bigcomposite.py reads those .ra files, strips subGroups/views,
#   adds metaDataUrl + primaryKey, generates metadata TSVs, and uses S3 URLs
#   from the validation mapping (Step 3).

cd /hive/users/lrnassar/claude/RM34923
python3 kent/src/hg/makeDb/scripts/encode4regulation/generate_composites.py
python3 kent/src/hg/makeDb/scripts/encode4regulation/convert_to_bigcomposite.py

# This overwrites the .ra files in place and creates metadata TSVs:
#   /gbdb/hg38/encode4/regulation/wgEncodeReg4Epigenetics_metadata.tsv
#   /gbdb/hg38/encode4/regulation/wgEncodeReg4RnaSeq_metadata.tsv
#   /gbdb/hg38/encode4/regulation/wgEncodeReg4TfChip_metadata.tsv
#
# Output .ra files (in kent/src/hg/makeDb/trackDb/human/hg38/):
#   wgEncodeReg4Epigenetics.ra — 6,353 subtracks (3,199 signal + 3,154 peak)
#     Facets: Assay, Organ, Biosample Type, Life Stage, Data Type
#   wgEncodeReg4RnaSeq.ra      — 1,046 subtracks
#     Facets: Organ, Biosample Type, Life Stage, Strand
#   wgEncodeReg4TfChip.ra      — 4,964 subtracks (2,502 peak + 2,462 signal)
#     Facets: TF, Organ, Biosample Type, Life Stage, Data Type
#
# Faceted UI features:
#   - _Biosample column hidden from facets via underscore prefix (refs #36320)
#   - All facet values capitalized (e.g., "Cell line", "Adult")
#   - longLabels cleaned: underscores replaced with spaces for readability
#   - colorSettingsUrl for facet color indicators:
#     Epigenetics: colored by Assay (epi_colors.json)
#     RnaSeq: colored by Organ (organ_colors.json)
#     TfChip: no facet colors (uses score-based spectrum coloring)
#
# Default-ON subtracks (tissue samples per Weng lab request):
#   Epigenetics (28): brain/heart/liver × 5 assays × signal+peak
#     Brain: middle frontal area 46, mild cognitive impairment, male 89y (4 assays, no ATAC)
#     Heart: right atrium auricular region, female 51y (5 assays)
#     Liver: right lobe, female 53y (5 assays)
#   RNA-seq (6): brain/heart/liver × Plus+Minus strand
#     Brain: dorsolateral prefrontal cortex, female 75y
#     Heart: right atrium auricular region, female 51y
#     Liver: right lobe, female 53y
#   TF ChIP (8): brain/heart/liver tissue CTCF + liver POLR2A
#
# All subtracks use S3 URLs (encode-public.s3.amazonaws.com) for bigDataUrl.

##############################################################################
# Step 5c: Track color adjustments
##############################################################################

# Organ colors follow the Weng lab canonical color mapping:
#   https://wiki.wenglab.org/references/color-mappings/
# 18 colors were darkened for contrast against white browser background while
# preserving hue (canonical colors designed for dark portal background).
# 1 organ previously using default gray was given its canonical color:
#   urinary bladder: 194,33,39
# Color mapping wiki linked in Display Conventions of all multiWig HTML pages.

##############################################################################
# Step 5d: Biosample column cleanup (2026-04-08)
##############################################################################

# The _Biosample column in the Epigenetics metadata TSV had redundant
# assay+type suffixes on peak entries (e.g., "Homo sapiens A549 ATAC peak").
# This information is already in the Assay and Data Type columns.
# Removed suffixes from 3,154 entries using:

cd /hive/users/lrnassar/claude/RM34923
python3 -c "
import re
with open('/gbdb/hg38/encode4/regulation/wgEncodeReg4Epigenetics_metadata.tsv') as f:
    lines = f.readlines()
out = [lines[0]]
for line in lines[1:]:
    cols = line.rstrip('\n').split('\t')
    if len(cols) >= 6:
        cols[5] = re.sub(r' (ATAC|CTCF|DNase|H3K27ac|H3K4me3) peak$', '', cols[5])
    out.append('\t'.join(cols) + '\n')
with open('/gbdb/hg38/encode4/regulation/wgEncodeReg4Epigenetics_metadata.tsv', 'w') as f:
    f.writelines(out)
"

##############################################################################
# Step 6: Create HTML description pages
##############################################################################

# 11 HTML files in kent/src/hg/makeDb/trackDb/human/hg38/:
#   wgEncodeReg4.html                — SuperTrack overview
#   wgEncodeReg4MarkH3k27ac.html     — H3K27ac layered signal
#   wgEncodeReg4Dnase.html           — DNase layered signal
#   wgEncodeReg4Atac.html            — ATAC layered signal
#   wgEncodeReg4MarkH3k4me3.html     — H3K4me3 layered signal
#   wgEncodeReg4MarkCtcf.html        — CTCF layered signal
#   wgEncodeReg4Txn.html             — Transcription layered signal
#   wgEncodeReg4TfPeaks.html         — TF rPeaks
#   wgEncodeReg4Epigenetics.html     — Individual epigenetics composite
#   wgEncodeReg4RnaSeq.html          — Individual RNA-seq composite
#   wgEncodeReg4TfChip.html          — Individual TF ChIP composite
#
# Each layered track HTML includes an organ/tissue availability table.
# All HTMLs include Data Access sections with bigWigToWig/bigBedToBed examples.
# Production lab credits are assay-specific, matching the upstream hub.
# ENCODE color mapping wiki linked in Display Conventions sections.

##############################################################################
# Step 7: Add related tracks, trackDb include, ENCODE3 rename
##############################################################################

# Added to kent/src/hg/makeDb/trackDb/human/hg38/trackDb.ra:
#   include wgEncodeReg4.ra alpha

# Added reciprocal entries to relatedTracks.ra:
#   hg38 wgEncodeReg4 wgEncodeReg Previous ENCODE3 Regulation track
#   hg38 wgEncodeReg wgEncodeReg4 New ENCODE4 Regulation track
#   hg38 wgEncodeReg4 cCREs Related ENCODE4 cCRE annotations
#   hg38 cCREs wgEncodeReg4 Related ENCODE4 regulation data
# Note: track1's "why" text describes track2 (the link destination).

# ENCODE3 renamed to "ENCODE3 Regulation" via alpha release tags:
#   wgEncodeReg.alpha.ra: shortLabel "ENCODE3 Regulation", snowflake pennant

##############################################################################
# Step 8: Release tags (ENCODE3 transition)
##############################################################################

# On alpha (dev): ENCODE4 visible, ENCODE3 hidden with snowflake
# On beta/public: ENCODE3 visible as-is, ENCODE4 not visible
#
# Approach: Duplicate wgEncodeReg.ra into wgEncodeReg.alpha.ra with:
#   - superTrack on hide (hidden by default)
#   - shortLabel "ENCODE3 Regulation"
#   - pennantIcon snowflake.png (deprecation notice)
#   - Inner includes tagged with release alpha
# The original wgEncodeReg.ra gets inner includes tagged beta,public.
# Key: inner include directives must also carry release tags.
#
# trackDb.ra includes:
#   include wgEncodeReg.ra beta,public
#   include wgEncodeReg.alpha.ra alpha
#   include wgEncodeReg4.ra alpha

##############################################################################
# Step 9: Load trackDb
##############################################################################

cd /cluster/home/lrnassar/kent/src/hg/makeDb/trackDb

# Sandbox (personal testing):
make DBS=hg38

# Dev (hgwdev):
make alpha DBS=hg38

##############################################################################
# Step 10: Cleanup — delete local ENCFF files (now S3-served)
##############################################################################

# Individual experiment files (ENCFF*.bigWig, ENCFF*.bigBed) and ?proxy=TRUE
# download artifacts were deleted from the source hub directory, since they
# are now served via S3 URLs. Only the organ-averaged multiWig files
# (referenced by gbdb symlinks) and hub.txt/HTML docs were preserved.
#
# hg38: Deleted 7,598 files (3.6 TB freed)

##############################################################################
# Step 10: Default sort by experiment to group peak/signal pairs (2026-04-13)
##############################################################################

# The Epigenetics and TF ChIP bigComposite tracks each contain both peak and
# signal subtracks for the same experiments. To group these pairs adjacent in
# the configure page, added an _Experiment column to the metadata TSVs
# containing the ENCSR experiment accession (extracted from shortLabel in the
# .ra stanzas). The underscore prefix hides it from facet filters.
# Added defaultSortField _Experiment to the 2 composite stanzas.
# Also fixed pre-existing CRLF line endings in the Epigenetics metadata TSV.
#
# Composites updated:
#   wgEncodeReg4Epigenetics: 3,152 peak/signal pairs + 49 singletons
#   wgEncodeReg4TfChip:      2,461 peak/signal pairs + 42 singletons
# RNA-seq not changed (signal only, no peak/signal pairing).

##############################################################################
# Step 11: Link accession column to ENCODE portal (2026-04-17)
##############################################################################

# Jonathan added a new trackDb setting 'subtrackUrl' for faceted composites
# (refs #36320, note 157). The setting takes a URL pattern where '$$' is
# replaced with the value from the primaryKey column of the metadata TSV.
# When set, the primary-key column on the track configuration page becomes
# a clickable hyperlink.
#
# Since our primaryKey is 'accession' (an ENCFF file accession), added:
#   subtrackUrl https://www.encodeproject.org/files/$$/
# to all 3 hg38 bigComposites (Epigenetics, RnaSeq, TfChip).
#
# Also appended a sentence in the Data Access section of each composite's
# HTML noting that clicking any accession leads to the corresponding file
# details page on the ENCODE portal.

##############################################################################
# Track hierarchy and disk usage summary
##############################################################################

# Regulation group priority order: cCREs (0.8) > wgEncodeReg4 (0.9) > wgEncodeReg (1)
#
# wgEncodeReg4 (superTrack, priority 0.9, group=regulation)
# ├── wgEncodeReg4MarkH3k27ac (multiWig, full)                     priority 1.4
# ├── wgEncodeReg4Dnase       (multiWig, hide)                     priority 1.1
# ├── wgEncodeReg4Atac        (multiWig, hide)                     priority 1.2
# ├── wgEncodeReg4MarkH3k4me3 (multiWig, hide)                     priority 1.3
# ├── wgEncodeReg4MarkCtcf    (multiWig, hide)                     priority 1.5
# ├── wgEncodeReg4Txn         (multiWig, hide)                     priority 1.6
# ├── wgEncodeReg4TfPeaks     (bigBed 12+ with decorator, hide)    priority 1.7
# ├── wgEncodeReg4Epigenetics (bigComposite faceted, 6,353, 28 ON) priority 2.0
# ├── wgEncodeReg4RnaSeq      (bigComposite faceted, 1,046, 6 ON)  priority 2.1
# └── wgEncodeReg4TfChip      (bigComposite faceted, 4,964, 8 ON)  priority 2.2

# Disk usage (/gbdb/hg38/encode4/regulation/):
#   organAve tp (tissue-only):    1.8 TB (113 files)
#   organAve ave (all-bio + RNA): 2.6 TB (250 files)
#   tfRpeak:                      3 GB (2 files)
#   metadata + JSON:              3 MB (5 files)
#   nose DNase:                   680 MB (1 file)
#   Total:                        4.4 TB (371 files)

# File list: /hive/users/lrnassar/claude/RM34923/gbdb_file_list.txt
