Help - CLASHub 1.0

1. What CLASH data is currently available in CLASHub?

CLASHub hosts data from four species: Human, Mouse, Drosophila melanogaster, and Caenorhabditis elegans. Below is the summary of available datasets:

Human

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	ZSWIM8 Knockout (#)	BioProject Number	SRR Number
A549	—	6	6	PRJNA1166120	SRR34738798, SRR34738799, SRR34738800, SRR34738801, SRR34738802, SRR34738803, SRR34738804, SRR34738805, SRR34738790, SRR34738791, SRR34738792, SRR34738793
Colorectal tissue	2	—	—	PRJNA1166120	SRR37216684, SRR37216685
D425	3	—	—	PRJNA1166120	SRR34757946, SRR34757949, SRR34757950
ES2	—	3	3	PRJNA1166120	SRR34757940, SRR34757941, SRR34757942, SRR34757943, SRR34757944, SRR34757945
HCT116	5	—	3	GSE164634, PRJNA1166120	SRR13415087, SRR13415088, SRR13415089, SRR13415090, SRR13415091, SRR34757939, SRR34757947, SRR34757948
HEK293T	8	—	—	GSE198250, PRJNA1166120	SRR18281055, SRR18281057, SRR18281067, SRR18281068, SRR34761041, SRR34761042, SRR34761043, SRR34761044
HepG2	3	—	—	PRJNA1166120	SRR34783077, SRR34783079, SRR34783080
H1299	—	3	3	PRJNA1166120	SRR34768260, SRR34768261, SRR34768262, SRR34768263, SRR34768274, SRR34768275
MB002	—	4	4	PRJNA1166120	SRR34783070, SRR34783071, SRR34783072, SRR34783073, SRR34783074, SRR34783075, SRR34783076, SRR34783078
MDA-MB-231	—	6	6	PRJNA1166120	SRR30817646, SRR30817647, SRR30817648, SRR30817649, SRR30817650, SRR30817651, SRR34738794, SRR34738795, SRR34738796, SRR34738797, SRR34738806, SRR34738807
OVCAR8	—	3	3	PRJNA1166120	SRR34768264, SRR34768265, SRR34768266, SRR34768267, SRR34768276, SRR34768277
TIVE-EX-LTC	3	—	—	GSE101978	SRR5876947, SRR5876948, SRR5876949
T98G	—	3	3	PRJNA1166120	SRR34743309, SRR34743310, SRR34743311, SRR34743312, SRR34743317, SRR34743318
U87MG	—	3	3	PRJNA1166120	SRR34743313, SRR34743314, SRR34743315, SRR34743316, SRR34743319, SRR34743320
501Mel	—	3	3	PRJNA1166120	SRR34768268, SRR34768269, SRR34768270, SRR34768271, SRR34768272, SRR34768273

Mouse

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	Zswim8 Knockout (#)	BioProject Number	SRR Number
HE2.1B	6	—	—	GSE124687	SRR8395242, SRR8395243, SRR8395244, SRR8395245, SRR8395246, SRR8395247
MEF	—	2	2	PRJNA1166120	SRR34793109, SRR34793110, SRR34793111, SRR34793112
Striatal cell	—	4	4	PRJNA1093144	SRR28497185, SRR28497186, SRR28497189, SRR28497190, SRR28497197, SRR28497198, SRR28497199, SRR28497200
3T12	3	—	—	GSE124687	SRR8395248, SRR8395249, SRR8395250
Cortex tissue	8	—	—	GSE73058	SRR2413277, SRR2413278, SRR2413282, SRR2413289, SRR2413290, SRR2413300, SRR2413301, SRR2413302
Heart tissue	2	—	—	PRJNA1166120	SRR34793107, SRR34793108
Kidney tissue	2	—	—	PRJNA1166120	SRR34793105, SRR34793106

Drosophila melanogaster

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	Dora Knockout (#)	BioProject Number	SRR Number
S2 cells	—	3	3	PRJNA896239	SRR22129325, SRR22129327, SRR22129328, SRR22129284, SRR22129287, SRR22129298

Caenorhabditis elegans

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	Ebax Knockout (#)	BioProject Number	SRR Number
Embryo	—	4	4	GSE303817	—
mid-L4 stage	—	4	—	PRJNA328816	SRR3882724, SRR3882949, SRR3882950, SRR3882951

2. What Gene Expression Profile data is available in CLASHub?

CLASHub hosts Gene Expression Profile data from four species: Human, Mouse, Drosophila melanogaster, and Caenorhabditis elegans. Below is the summary of available datasets:

Human

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	ZSWIM8 Knockout (#)	BioProject Number	SRR Number
A549	7	—	—	GSE263036, GSE212057, GSE199309	SRR28535493, SRR28535494, SRR28535495, SRR21237863, SRR21237869, SRR21237879, SRR18462418
D425	5	—	—	GSE151810, GSE185024, GSE123760	SRR11924485, SRR11924486, SRR16119415, SRR16119416, SRR8315029
ES2	6	—	—	GSE218794, GSE245778	SRR22410790, SRR22410791, SRR22410792, SRR26439462, SRR26439463, SRR26439464
HEK293T	7	—	—	GSE231583, GSE196043	SRR24421974, SRR24421975, SRR24421976, SRR18074813, SRR18074814, SRR18074815, SRR18074816
HeLa	7	—	—	GSE273634, GSE218727, GSE199309	SRR30058518, SRR30058519, SRR30058520, SRR22407570, SRR22407571, SRR22407572, SRR18462415
HepG2	5	—	—	GSE224980, GSE264010	SRR28685775, SRR28685776, SRR28685777, SRR23387178, SRR23387179
H1299	4	—	—	GSE212057, GSE199309	SRR21237865, SRR21237873, SRR21237881, SRR18462412
K562	6	—	—	GSE199309, GSE167869	SRR18462409, SRR13800753, SRR13800754, SRR13800737, SRR13800738, SRR13800739
MB002	5	—	—	GSE229150, GSE261568	SRR28341540, SRR28341541, SRR28341542, SRR28341543
MCF7	7	—	—	GSE195761, GSE178905, GSE163791	SRR17944548, SRR17944549, SRR14915857, SRR14915858, SRR13296901, SRR13296902, SRR13296903
MDA-MB-231	6	—	—	GSE178532	SRR11544576, SRR11544577, SRR11544578, SRR14870088, SRR14870089, SRR14870090
OVCAR8	4	—	—	GSE246325	SRR26536798, SRR26536799, SRR26536802, SRR26536803
T98G	5	—	—	GSE112241, PRJNA580150	SRR10358029, SRR10358030, SRR10358031, SRR6881782, SRR6881783
U87MG	6	—	—	GSE147626, GSE235568	SRR11433766, SRR11433767, SRR11433768, SRR24991947, SRR24991948, SRR24991949
501Mel	7	—	—	PRJNA515302, GSE104869	SRR8473015, SRR8473019, SRR8473020, SRR6163777, SRR6163778, SRR6163779, SRR6163780

Mouse

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	Zswim8 Knockout (#)	BioProject Number	SRR Number
Eye	—	3	3	GSE231447	SRR24391488, SRR24391489, SRR24391526, SRR24391480, SRR24391481, SRR24391536
Forebrain	—	3	3	GSE231447	SRR24391522, SRR24391523, SRR24391534, SRR24391514, SRR24391515, SRR24391547
Heart	—	3	3	GSE231447	SRR24391502, SRR24391503, SRR24391533, SRR24391510, SRR24391511, SRR24391543
Hindbrain	—	3	3	GSE231447	SRR24391520, SRR24391521, SRR24391538, SRR24391512, SRR24391513, SRR24391546
Intestine	—	3	3	GSE231447	SRR24391494, SRR24391495, SRR24391530, SRR24391486, SRR24391487, SRR24391545
Kidney	—	3	3	GSE231447	SRR24391490, SRR24391491, SRR24391531, SRR24391482, SRR24391483, SRR24391539
Liver	—	3	3	GSE231447	SRR24391492, SRR24391493, SRR24391527, SRR24391484, SRR24391485, SRR24391540
Lung	—	3	3	GSE231447	SRR24391500, SRR24391501, SRR24391532, SRR24391508, SRR24391509, SRR24391542
Muscle	—	3	3	GSE231447	SRR24391518, SRR24391519, SRR24391525, SRR24391478, SRR24391479, SRR24391535
Placenta	—	3	3	GSE231447	SRR24391516, SRR24391517, SRR24391524, SRR24391476, SRR24391477, SRR24391537
Skin	—	3	3	GSE231447	SRR24391496, SRR24391497, SRR24391528, SRR24391504, SRR24391505, SRR24391541
Stomach	—	3	3	GSE231447	SRR24391498, SRR24391499, SRR24391529, SRR24391506, SRR24391507, SRR24391544
Embryonic Stem Cell	2	—	—	PRJEB27315	ERR2640636, ERR2640637
iNeuron	3	—	—	PRJEB27315	ERR2640652, ERR2640653, ERR2640654
MEF	3	—	—	GSE239373	SRR25443485, SRR25443484, SRR25443483
Neural Precursor	2	—	—	PRJEB27315	ERR2640640, ERR2640641
Striatal cell	—	4	4	PRJNA1093144	SRR34804890, SRR34804891, SRR34804892, SRR34804893, SRR34804894, SRR34804895, SRR34804896, SRR34804897

Drosophila melanogaster

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	Dora Knockout (#)	BioProject Number	SRR Number
S2 cells	5	—	3	GSE196837	SRR18048483, SRR18048484, SRR18048425, SRR18048423, SRR18048424, SRR18048427, SRR18048468, SRR18048426
0–4 h Embryos	4	—	—	GSE196837	SRR18048437, SRR18048436, SRR18048435, SRR18048446
8–12 h Embryos	6	—	4	GSE196837	SRR18048461, SRR18048433, SRR18048512, SRR18048481, SRR18048482, SRR18048434, SRR18048499, SRR18048531, SRR18048442, SRR18048532
12–16 h Embryos	6	—	4	GSE196837	SRR18048539, SRR18048525, SRR18048508, SRR18048459, SRR18048432, SRR18048465, SRR18048448, SRR18048497, SRR18048529, SRR18048516
16–20 h Embryos	5	—	4	GSE196837	SRR18048421, SRR18048538, SRR18048479, SRR18048463, SRR18048527, SRR18048542, SRR18048443, SRR18048495, SRR18048501
Fly Non-targeting Control	—	3	—	PRJNA896239	SRR22129292, SRR22129294, SRR22129296

Caenorhabditis elegans

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	Ebax Knockout (#)	BioProject Number	SRR Number
Embryos	4	—	—	PRJNA922944	SRR23049957, SRR23049959, SRR23049928, SRR23049954
L1	5	—	2	GSE68588, GSE262626, GSE267368	SRR2010468, SRR2010469, SRR28479534, SRR29013568, SRR29013569, SRR29013570, SRR29013571
L2	3	—	—	GSE266398	SRR28868053, SRR28868054, SRR28868055
L3	3	—	—	PRJNA684142	SRR13238604, SRR13238605, SRR13238606
L4	3	—	—	PRJNA922944	SRR23049963, SRR23049955, SRR23049961
Adult	4	—	—	PRJNA922944, GSE267368	SRR23049965, SRR23049966, SRR23049906, SRR23049937

3. What miRNA Expression Profile data is available in CLASHub?

CLASHub hosts miRNA Expression Profile data from four species: Human, Mouse, Drosophila melanogaster, and Caenorhabditis elegans. Below is the summary of available datasets:

Human

Sample Name	Wild Type (#)	Non-targeting sgRNA Control (#)	ZSWIM8 Knockout (#)	BioProject Number	SRR Number
A549	—	3	3	GSE163387	SRR13264637, SRR13264638, SRR13264639, SRR13264640, SRR13264641, SRR13264642
HEK293T	—	3	3	GSE123627, GSE158025	SRR12650650, SRR12650651, SRR12650652, SRR12650653, SRR12650654, SRR12650655
HeLa	—	3	3	GSE123627, GSE163387	SRR13264643, SRR13264644, SRR13264645, SRR13264646, SRR13264647, SRR13264648
K562	6	—	6	GSE158025, GSE163388	SRR12650656, SRR12650657, SRR12650658, SRR13264707, SRR13264708, SRR13264709, SRR12650659, SRR12650660, SRR12650661, SRR13264710, SRR13264711, SRR13264712
MCF7	—	2	3	GSE163388	SRR13264649, SRR13264650, SRR13264651, SRR13264652, SRR13264653

Mouse

Sample Name	wild type (#)	Non-targeting sgRNA Control (#)	Zswim8 Knockout (#)	BioProject Number	SRR Number
Brain	3	—	3	GSE235065	SRR24941005, SRR24941026, SRR24940996, SRR24941021, SRR24941036, SRR24941000
Heart	3	—	3	GSE235065	SRR24941003, SRR24941027, SRR24940995, SRR24941022, SRR24941035, SRR24940999
Kidney	3	—	3	GSE235065	SRR24941001, SRR24940993, SRR24941029, SRR24941011, SRR24941033, SRR24941017
Liver	3	—	3	GSE235065	SRR24941004, SRR24940989, SRR24941030, SRR24941010, SRR24941032, SRR24941016
Lung	3	—	3	GSE235065	SRR24940992, SRR24940998, SRR24941018, SRR24941008, SRR24941031, SRR24941015
Intestine	3	—	3	GSE235065	SRR24941002, SRR24940994, SRR24941028, SRR24941023, SRR24941012, SRR24941034
Neuron	—	3	2	GSE163387	SRR13264632, SRR13264633, SRR13264634, SRR13264635, SRR13264636
MEF	—	6	6	GSE163387, GSE158025	SRR13264626, SRR13264627, SRR13264628, SRR12650662, SRR12650663, SRR12650664, SRR13264629, SRR13264630, SRR13264631, SRR12650665, SRR12650666, SRR12650667
Stomach	3	—	3	GSE235065	SRR24941020, SRR24941009, SRR24940990, SRR24941006, SRR24941025, SRR24941013
Skin	3	—	3	GSE235065	SRR24941019, SRR24940991, SRR24940997, SRR24941024, SRR24941007, SRR24941014
Striatal cell	—	4	4	PRJNA1093144	SRR28497187, SRR28497188, SRR28497191, SRR28497192, SRR28497193, SRR28497194, SRR28497195, SRR28497196

Drosophila melanogaster

Sample Name	Wild Type (#)	Non-targeting sgRNA Control (#)	Dora Knockout (#)	BioProject Number	SRR Number
S2 cells	3	—	3	GSE163388	SRR13264713, SRR13264714, SRR13264715, SRR13264716, SRR13264717, SRR13264718

Caenorhabditis elegans

Sample Name	Wild Type (#)	Non-targeting sgRNA Control (#)	Ebax Knockout (#)	BioProject Number	SRR Number
Early Embryo	2	—	2	GSE267367	SRR29013903, SRR29013904, SRR29013905, SRR29013906
Late Embryo	2	—	2	GSE267367	SRR29013899, SRR29013900, SRR29013901, SRR29013902
L1	4	—	4	GSE267367	SRR29013871, SRR29013872, SRR29013873, SRR29013874, SRR29013895, SRR29013896, SRR29013897, SRR29013898
L2	2	—	2	GSE267367	SRR29013891, SRR29013892, SRR29013893, SRR29013894
L3	2	—	2	GSE267367	SRR29013887, SRR29013888, SRR29013889, SRR29013890
L4	5	—	4	GSE267367	SRR29013866, SRR29013867, SRR29013868, SRR29013869, SRR29013870, SRR29013883, SRR29013884, SRR29013885, SRR29013886
Gravid adult	2	—	2	GSE267367	SRR29013879, SRR29013880, SRR29013881, SRR29013882
Glp-4	2	—	2	GSE267367	SRR29013875, SRR29013876, SRR29013877, SRR29013878

4. How is CLASH data analyzed in CLASHub?

Step 1: Data Upload and Input
CLASHub accepts paired-end FASTQ files or clean single-end FASTA files. Users provide the following information to start the analysis:
1.1 Paired-end Adapter Sequences:
5′ Adapter Sequence (default): GATCGTCGGACTGTAGAACT
3′ Adapter Sequence (default): TGGAATTCTCGGGTGCCAAG
1.2 UMI Configuration: Users specify 5′ and 3′ Unique Molecular Identifier (UMI) lengths. Setting both to 0 automatically skips deduplication and UMI-trimming.
1.3 Target species: (e.g., Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans)
1.4 Output file names & Email address
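
The inputs above can be pictured as a small configuration record. The Python sketch below is illustrative only: the class name, field names, and the choice of 0 as the default UMI length are assumptions, not part of CLASHub's interface. The two adapter defaults are the ones listed in 1.1.

```python
from dataclasses import dataclass

@dataclass
class ClashJobConfig:
    """Hypothetical container for the Step 1 inputs (names are assumptions)."""
    species: str                 # e.g. "Homo sapiens"
    output_name: str
    email: str
    adapter_5p: str = "GATCGTCGGACTGTAGAACT"  # default 5' adapter from 1.1
    adapter_3p: str = "TGGAATTCTCGGGTGCCAAG"  # default 3' adapter from 1.1
    umi_5p_len: int = 0          # assumed default; 0 means no 5' UMI
    umi_3p_len: int = 0          # assumed default; 0 means no 3' UMI

    @property
    def skip_umi_steps(self) -> bool:
        # Both lengths at 0 means deduplication and UMI trimming are skipped.
        return self.umi_5p_len == 0 and self.umi_3p_len == 0
```

For example, `ClashJobConfig("Homo sapiens", "run1", "user@example.org").skip_umi_steps` evaluates to `True` because both UMI lengths are 0.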

Step 2: Data Preprocessing
CLASHub automatically processes the uploaded data. For paired-end FASTQ files, the preprocessing pipeline includes:
2.1 Adapter Trimming: Adapter sequences are removed using cutadapt (v2.10).
2.2 Read Merging: Overlapping paired-end reads are merged using PEAR (v0.9.6).
2.3 Redundancy Collapse & UMI Trimming: If UMIs are present, redundant reads are collapsed using fastx_collapser, and UMIs are trimmed. If UMIs are absent (lengths = 0), this step is bypassed.
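
The collapse-then-trim logic of step 2.3 can be sketched in a few lines of Python. This is a simplified illustration, assuming the UMIs sit at the extreme 5′ and 3′ ends of each merged read. CLASHub itself uses fastx_collapser; the function name and in-memory approach here are hypothetical.

```python
from collections import Counter

def collapse_and_trim(reads, umi5=0, umi3=0):
    """Collapse identical reads (UMI included), then strip the UMIs.

    Hypothetical helper for illustration; not CLASHub's actual code.
    """
    if umi5 == 0 and umi3 == 0:
        # No UMIs configured: skip deduplication and trimming entirely.
        return list(reads)
    unique = Counter(reads)  # same sequence + same UMI = one molecule
    trimmed = []
    for seq in unique:
        if len(seq) <= umi5 + umi3:
            continue  # nothing would remain after removing the UMIs
        trimmed.append(seq[umi5:len(seq) - umi3])
    return trimmed
```

Collapsing happens before trimming so that two molecules with the same insert but different UMIs are still counted separately.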

Step 3: Genome Mapping & Peak Calling
Cleaned sequences are aligned to the reference genome.
3.1 Downsampling: To prevent memory overload, files exceeding 20 million reads are downsampled prior to mapping.
3.2 Alignment: Reads are aligned using HISAT2, sorted with SAMtools, and converted to BED format.
3.3 Peak Calling: Piranha assesses target site confidence via peak-calling to identify high-confidence binding sites.
3.4 Visualization: BigWig (bw) files are automatically generated for direct inspection of read coverage in genome browsers like IGV.
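
The help text does not state how the downsampling in 3.1 is performed. One common way to cap a read set at a fixed size without loading it all into memory is reservoir sampling; the sketch below assumes that approach, and the function name and fixed seed are illustrative choices.

```python
import random

def downsample(reads, max_reads=20_000_000, seed=0):
    """Reservoir sampling: keep at most max_reads, each read equally likely.

    Assumed technique for illustration; CLASHub's method may differ.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if i < max_reads:
            reservoir.append(read)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < max_reads:
                reservoir[j] = read     # replace a random earlier pick
    return reservoir
```

Inputs at or below the limit pass through unchanged and in their original order.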

Step 4: Hybrid Identification
The cleaned data is processed to identify miRNA-target hybrids using:
4.1 hyb: Aligns reads to the reference transcript database using bowtie2.
4.2 Reference Database: Includes Ensembl genome assemblies and mature miRNAs from miRBase.
4.3 Binding Stability: Free energy (ΔG) and pairing patterns are calculated using UNAFold (v3.8).

Step 5: Conservation Score Calculation
Conservation scores assess evolutionary conservation of miRNA binding sites using phyloP tracks from the UCSC Genome Browser (e.g., hg38.phyloP100way for human, mm39.phyloP35way for mouse).
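
As an illustration, a per-site score can be obtained by averaging the per-base phyloP values across the binding site. Whether CLASHub uses the mean or another summary is not stated here, so the averaging, the function name, and the dictionary input format below are all assumptions.

```python
def site_conservation(phylop, start, end):
    """Mean phyloP score over the half-open interval [start, end).

    `phylop` maps 0-based position -> score for one chromosome
    (hypothetical input format). Positions missing from the track
    are ignored; returns None if no position has a score.
    """
    scores = [phylop[pos] for pos in range(start, end) if pos in phylop]
    if not scores:
        return None
    return sum(scores) / len(scores)
```

Positive phyloP values indicate slower-than-neutral evolution (conservation), so a higher site average suggests a more conserved binding site.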

Step 6: Output Results
The final output includes an HTML summary report and a detailed results table featuring miRNA Name, Pairing Pattern, Gene Info, Conservation Score, Free Energy, Transcript Annotation, Piranha Peak p-values, and Normalized Hybrid Abundance.

5. How is miRNA AQ-seq data analyzed in CLASHub?

Step 1: Data Upload and Input
Users upload miRNA sequencing data in one of three supported formats:
1.1 Paired-End FASTQ (.gz) or Single-End FASTQ (.gz): Requires adapter sequences.
1.2 Cleaned Single-End FASTA (.gz): Does not require adapter sequences.
1.3 UMI Configuration: For libraries with UMIs (e.g., AQ-seq), specify the UMI length. For standard small RNA-seq libraries (e.g., Illumina TruSeq or NEBNext) lacking UMIs, set lengths to 0.

Step 2: Data Preprocessing
CLASHub processes uploaded data to produce clean FASTA files:
2.1 Adapter Trimming: Adapters are removed using cutadapt.
2.2 Read Merging: For paired-end files, reads are merged using PEAR.
2.3 Redundancy Collapse & UMI Trimming: If UMIs are specified (>0), PCR duplicates are collapsed via fastx_collapser and UMIs trimmed. If UMI lengths are 0, these steps are automatically skipped.

Step 3: miRNA Identification and Quantification
The cleaned data is analyzed for miRNA quantification using CLASHub.py.
3.1 miRNA Mapping: The first 18 nucleotides of each trimmed read are perfectly matched to mature miRNA sequences from miRBase (Release 22.1).
3.2 Quantification: Both total miRNA expression levels and isoform-specific abundances (capturing 3′ variations) are estimated.
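
The prefix-matching idea in 3.1 and 3.2 can be sketched as follows. This is not the code in CLASHub.py: the function name, the input format, and the toy miRNA name in the usage note are hypothetical.

```python
from collections import Counter, defaultdict

def quantify_mirnas(reads, mature, prefix_len=18):
    """Assign reads to miRNAs by exact match of their first prefix_len nt.

    `mature` maps miRNA name -> mature sequence (DNA alphabet).
    Simplification: miRNAs sharing the same prefix collide, last one wins.
    """
    prefix_to_name = {
        seq[:prefix_len]: name
        for name, seq in mature.items()
        if len(seq) >= prefix_len
    }
    totals = Counter()               # total reads per miRNA
    isoforms = defaultdict(Counter)  # per-miRNA counts of each full read
    for read in reads:
        name = prefix_to_name.get(read[:prefix_len])
        if name is None:
            continue                 # no perfect 18-nt match
        totals[name] += 1
        isoforms[name][read] += 1    # 3' variants differ after the prefix
    return totals, isoforms
```

With `mature = {"miR-X": "TGAGGTAGTAGGTTGTATAGTT"}`, a read that is one nucleotide shorter at the 3′ end still counts toward the `miR-X` total but is recorded as a separate isoform.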

Step 4: Output Results
The analysis generates a Total miRNA Table, an Isoform Expression Table, and a Summary HTML Report with key preprocessing and alignment metrics.

6. How is RNA-seq data analyzed in CLASHub?

The RNA-seq pipeline integrates HISAT2, StringTie, and DESeq2, with automated QC, optional Exon-Intron Split Analysis (EISA), and auto-repair mechanisms.

Step 1: Data Upload and Configuration
Users configure Adapter Sequences, UMI lengths (if applicable), Library Type (Stranded vs. Unstranded), and optionally enable EISA to distinguish post-transcriptional regulation.

Step 2: Preprocessing, Alignment & QC
2.1 Auto-Repair: Broken paired-end reads are automatically checked and repaired using repair.sh to maintain read integrity.
2.2 Trimming: Adapters and specified UMIs are removed using Cutadapt.
2.3 Alignment: Reads are aligned to the reference genome using HISAT2. Strand-specific flags are applied based on the library configuration.
2.4 Quality Check: RSeQC calculates read distribution across genomic features to verify library quality.
2.5 Sorting: SAM files are sorted to BAM using SAMtools.

Step 3: Standard Quantification
3.1 Abundance Estimation: StringTie quantifies gene expression using full Ensembl annotations to generate Transcripts Per Million (TPM).
3.2 Count Generation: The prepDE.py3 script extracts raw read counts for differential analysis.
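
For readers unfamiliar with TPM, the short Python sketch below shows the standard definition: length-normalize each transcript, then rescale so the values sum to one million. StringTie computes TPM internally; this function is only an illustration of the unit, and its name and inputs are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts Per Million from read counts and transcript lengths.

    Both arguments map transcript ID -> value (hypothetical input format).
    """
    rates = {t: counts[t] / lengths_bp[t] for t in counts}  # reads per base
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: rate / total * 1e6 for t, rate in rates.items()}
```

Because of the rescaling, TPM values are comparable within a sample, while the raw counts from prepDE.py3 are what DESeq2 needs for differential analysis.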

Step 4: EISA Quantification (Optional Add-on)
If enabled, CLASHub performs parallel quantification using custom Exon-only and Intron-only GTF files (with overlapping genes excluded and boundaries masked) to generate separate count matrices for intronic and exonic reads.

Step 5: Differential Expression & Classification
5.1 Standard DE: DESeq2 calculates differential expression.
5.2 EISA Classification: If EISA is selected, changes are classified as Post-transcriptional (exons and introns diverge), Transcriptional (track together), or Ambiguous.
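
The three EISA categories can be illustrated with a simple rule on the exonic and intronic log2 fold changes of one gene. The thresholds and decision rules below are assumptions chosen for the example; CLASHub's actual cutoffs are not documented here.

```python
def classify_eisa(d_exon, d_intron, min_change=1.0, tolerance=0.5):
    """Label one gene from its exonic and intronic log2 fold changes.

    min_change and tolerance are hypothetical thresholds.
    """
    if abs(d_exon) < min_change:
        return "Ambiguous"              # exonic change too small to call
    same_direction = d_exon * d_intron > 0
    if same_direction and abs(d_exon - d_intron) <= tolerance:
        return "Transcriptional"        # introns track exons
    if abs(d_intron) <= tolerance:
        return "Post-transcriptional"   # exons move, introns stay flat
    return "Ambiguous"
```

The reasoning is that intronic reads mostly reflect nascent transcription, so a change seen only in exonic reads points to regulation after transcription, such as miRNA-mediated decay.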

Step 6: Output Files
Outputs include QC reports (HTML), standard DE tables (DESeq2 output), TPM/Count matrices, and—if EISA is enabled—classification tables isolating regulatory mechanisms.

7. How is cumulative fraction curve analysis performed in CLASHub?

Step 1: Data Upload and Input
Users upload a differential gene expression CSV file containing GeneName, BaseMean, and log2FoldChange. A BaseMean threshold (default: 100) filters out low-expression transcripts to ensure robust results.
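
A minimal Python sketch of the BaseMean filter, using the three column names given above. The function name is hypothetical, and treating the threshold as inclusive (BaseMean >= 100) is an assumption.

```python
import csv
import io

def filter_by_basemean(csv_text, threshold=100.0):
    """Return {GeneName: log2FoldChange} for rows with BaseMean >= threshold."""
    kept = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            base_mean = float(row["BaseMean"])
            lfc = float(row["log2FoldChange"])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values (e.g. NA)
        if base_mean >= threshold:
            kept[row["GeneName"]] = lfc
    return kept
```

Low-count genes have noisy fold-change estimates, which is why they are removed before the cumulative curves are drawn.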

Step 2: Target Identification
Target genes are classified into two groups:
2.1 CLASH-Derived Targets: Identified via experimental CLASH data (Conserved and All targets).
2.2 TargetScan-Derived Targets: Predicted interactions extracted from TargetScan databases.

Step 3: Curve Generation and Analysis Modes
The tool compares fold change distributions between miRNA targets and non-target genes using two available modes:
3.1 Standard Analysis: Groups targets by broad conservation status.
3.2 Stringent Filtering: Narrows the analysis specifically to the top 25% of high-efficacy targets based on TargetScan Context++ scores, revealing more pronounced repression patterns.
Statistical differences between target groups and background non-targets are quantified via Mann–Whitney U tests.
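
The two computations behind this step, the cumulative fraction curve and the Mann–Whitney U statistic, can be sketched in plain Python. The function names are hypothetical, and the pairwise U calculation is written for clarity rather than speed; in practice a library routine such as `scipy.stats.mannwhitneyu` would also supply the p-value.

```python
def ecdf(values):
    """Return sorted values and the cumulative fraction at each one."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def mann_whitney_u(targets, background):
    """U statistic for `targets`: number of (target, background) pairs
    where the target value is larger, with ties counting 0.5."""
    u = 0.0
    for t in targets:
        for b in background:
            if t > b:
                u += 1.0
            elif t == b:
                u += 0.5
    return u
```

If miRNA targets are repressed, their log2 fold changes sit lower than the background, so their curve shifts to the left and the U statistic for the target group falls well below half the number of pairs.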

Step 4: Output Results
Outputs include SVG files of the Cumulative Fraction Curves visually plotting the repression shifts, alongside a comprehensive merged CSV dataset that annotates each gene with its specific target classification (e.g., top 25% Context++, high-confidence CLASH overlaps, or non-targets).