Frequently Asked Questions (FAQs)
-
CLASHub hosts data from four species: Human, Mouse, Drosophila melanogaster, and Caenorhabditis elegans. Below is the summary of available datasets:
Human
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | ZSWIM8 Knockout (#) | BioProject Number | SRR Number
A549 | — | 6 | 6 | PRJNA1166120 | SRR34738798, SRR34738799, SRR34738800, SRR34738801, SRR34738802, SRR34738803, SRR34738804, SRR34738805, SRR34738790, SRR34738791, SRR34738792, SRR34738793
D425 | 3 | — | — | PRJNA1166120 | SRR34757946, SRR34757949, SRR34757950
ES2 | — | 3 | 3 | PRJNA1166120 | SRR34757940, SRR34757941, SRR34757942, SRR34757943, SRR34757944, SRR34757945
HCT116 | 5 | — | 3 | GSE164634, PRJNA1166120 | SRR13415087, SRR13415088, SRR13415089, SRR13415090, SRR13415091, SRR34757939, SRR34757947, SRR34757948
HEK293T | 8 | — | — | GSE198250, PRJNA1166120 | SRR18281055, SRR18281057, SRR18281067, SRR18281068, SRR34761041, SRR34761042, SRR34761043, SRR34761044
HepG2 | 3 | — | — | PRJNA1166120 | SRR34783077, SRR34783079, SRR34783080
Huh-7.5 | 12 | — | — | GSE73057 | SRR2413175, SRR2413176, SRR2413177, SRR2413178, SRR2413179, SRR2413180, SRR2413181, SRR2413182, SRR2413183, SRR2413184, SRR2413185, SRR2413186
H1299 | — | 3 | 3 | PRJNA1166120 | SRR34768260, SRR34768261, SRR34768262, SRR34768263, SRR34768274, SRR34768275
Liver tissue | 1 | — | — | PRJNA1166120 | SRR34783163
MB002 | — | 4 | 4 | PRJNA1166120 | SRR34783070, SRR34783071, SRR34783072, SRR34783073, SRR34783074, SRR34783075, SRR34783076, SRR34783078
MDA-MB-231 | — | 6 | 6 | PRJNA1166120 | SRR30817646, SRR30817647, SRR30817648, SRR30817649, SRR30817650, SRR30817651, SRR34738794, SRR34738795, SRR34738796, SRR34738797, SRR34738806, SRR34738807
OVCAR8 | — | 3 | 3 | PRJNA1166120 | SRR34768264, SRR34768265, SRR34768266, SRR34768267, SRR34768276, SRR34768277
TIVE-EX-LTC | 3 | — | — | GSE101978 | SRR5876947, SRR5876948, SRR5876949
T98G | — | 3 | 3 | PRJNA1166120 | SRR34743309, SRR34743310, SRR34743311, SRR34743312, SRR34743317, SRR34743318
U87MG | — | 3 | 3 | PRJNA1166120 | SRR34743313, SRR34743314, SRR34743315, SRR34743316, SRR34743319, SRR34743320
501Mel | — | 3 | 3 | PRJNA1166120 | SRR34768268, SRR34768269, SRR34768270, SRR34768271, SRR34768272, SRR34768273
Mouse
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Zswim8 Knockout (#) | BioProject Number | SRR Number
HE2.1B | 6 | — | — | GSE124687 | SRR8395242, SRR8395243, SRR8395244, SRR8395245, SRR8395246, SRR8395247
MEF | — | 2 | 2 | PRJNA1166120 | SRR34793109, SRR34793110, SRR34793111, SRR34793112
Striatal cell | — | 4 | 4 | PRJNA1093144 | SRR28497185, SRR28497186, SRR28497189, SRR28497190, SRR28497197, SRR28497198, SRR28497199, SRR28497200
3T12 | 3 | — | — | GSE124687 | SRR8395248, SRR8395249, SRR8395250
Cortex tissue | 16 | — | — | GSE73058 | SRR2413277, SRR2413278, SRR2413279, SRR2413282, SRR2413284, SRR2413286, SRR2413288, SRR2413289, SRR2413290, SRR2413293, SRR2413297, SRR2413298, SRR2413299, SRR2413300, SRR2413301, SRR2413302
Heart tissue | 2 | — | — | PRJNA1166120 | SRR34793107, SRR34793108
Kidney tissue | 2 | — | — | PRJNA1166120 | SRR34793105, SRR34793106
Drosophila melanogaster
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Dora Knockout (#) | BioProject Number | SRR Number
S2 cells | — | 3 | 3 | PRJNA896239 | SRR22129325, SRR22129327, SRR22129328, SRR22129284, SRR22129287, SRR22129298
Caenorhabditis elegans
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Ebax Knockout (#) | BioProject Number | SRR Number
Embryo | — | 4 | 4 | GSE303817 | —
L3 stage | — | 7 | — | GSE56180 | SRR1207389, SRR1207390, SRR1207391, SRR1207392, SRR1207393, SRR1207394, SRR1207395
mid-L4 stage | — | 5 | — | PRJNA328816 | SRR3882724, SRR3882728, SRR3882949, SRR3882950, SRR3882951
-
CLASHub hosts gene expression profile data from four species: Human, Mouse, Drosophila melanogaster, and Caenorhabditis elegans. Below is the summary of available datasets:
Human
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | ZSWIM8 Knockout (#) | BioProject Number | SRR Number
A549 | 7 | — | — | GSE263036, GSE212057, GSE199309 | SRR28535493, SRR28535494, SRR28535495, SRR21237863, SRR21237869, SRR21237879, SRR18462418
D425 | 5 | — | — | GSE151810, GSE185024, GSE123760 | SRR11924485, SRR11924486, SRR16119415, SRR16119416, SRR8315029
ES2 | 6 | — | — | GSE218794, GSE245778 | SRR22410790, SRR22410791, SRR22410792, SRR26439462, SRR26439463, SRR26439464
HEK293T | 7 | — | — | GSE231583, GSE196043 | SRR24421974, SRR24421975, SRR24421976, SRR18074813, SRR18074814, SRR18074815, SRR18074816
HeLa | 7 | — | — | GSE273634, GSE218727, GSE199309 | SRR30058518, SRR30058519, SRR30058520, SRR22407570, SRR22407571, SRR22407572, SRR18462415
HepG2 | 5 | — | — | GSE224980, GSE264010 | SRR28685775, SRR28685776, SRR28685777, SRR23387178, SRR23387179
H1299 | 4 | — | — | GSE212057, GSE199309 | SRR21237865, SRR21237873, SRR21237881, SRR18462412
K562 | 6 | — | — | GSE199309, GSE167869 | SRR18462409, SRR13800753, SRR13800754, SRR13800737, SRR13800738, SRR13800739
MB002 | 4 | — | — | GSE261568 | SRR28341540, SRR28341541, SRR28341542, SRR28341543
MCF7 | 7 | — | — | GSE195761, GSE178905, GSE163791 | SRR17944548, SRR17944549, SRR14915857, SRR14915858, SRR13296901, SRR13296902, SRR13296903
MDA-MB-231 | 3 | — | — | GSE178532 | SRR14870088, SRR14870089, SRR14870090
OVCAR8 | 4 | — | — | GSE246325 | SRR26536798, SRR26536799, SRR26536802, SRR26536803
T98G | 5 | — | — | GSE112241, PRJNA580150 | SRR10358029, SRR10358030, SRR10358031, SRR6881782, SRR6881783
U87MG | 6 | — | — | GSE147626, GSE235568 | SRR11433766, SRR11433767, SRR11433768, SRR24991947, SRR24991948, SRR24991949
501Mel | 7 | — | — | PRJNA515302, GSE104869 | SRR8473015, SRR8473019, SRR8473020, SRR6163777, SRR6163778, SRR6163779, SRR6163780
Mouse
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Zswim8 Knockout (#) | BioProject Number | SRR Number
Adrenal Gland | 4 | — | — | PRJNA375882 | SRR5273702, SRR5273670, SRR5273654, SRR5273686
Bone Marrow | 4 | — | — | PRJNA375882 | SRR5273648, SRR5273680, SRR5273664, SRR5273696
Brain | 8 | — | — | PRJNA375882 | SRR5273637, SRR5273639, SRR5273641, SRR5273673, SRR5273657, SRR5273635, SRR5273689, SRR5273705
Embryonic Stem Cell | 2 | — | — | PRJEB27315 | ERR2640636, ERR2640637
Forestomach | 4 | — | — | PRJNA375882 | SRR5273662, SRR5273694, SRR5273678, SRR5273646
Heart | 4 | — | — | PRJNA375882 | SRR5273651, SRR5273683, SRR5273667, SRR5273699
iNeuron | 3 | — | — | PRJEB27315 | ERR2640652, ERR2640653, ERR2640654
Kidney | 4 | — | — | PRJNA375882 | SRR5273655, SRR5273671, SRR5273703, SRR5273687
Large Intestine | 4 | — | — | PRJNA375882 | SRR5273676, SRR5273692, SRR5273660, SRR5273644
Liver | 8 | — | — | PRJNA375882 | SRR5273636, SRR5273656, SRR5273640, SRR5273634, SRR5273638, SRR5273672, SRR5273704, SRR5273688
Lung | 4 | — | — | PRJNA375882 | SRR5273668, SRR5273700, SRR5273684, SRR5273652
MEF | 3 | — | — | GSE239373 | SRR25443485, SRR25443484, SRR25443483
Muscle | 4 | — | — | PRJNA375882 | SRR5273643, SRR5273691, SRR5273659, SRR5273675
Neural Precursor | 2 | — | — | PRJEB27315 | ERR2640640, ERR2640641
Ovary | 2 | — | — | PRJNA375882 | SRR5273665, SRR5273649
Placenta | 6 | — | — | GSE252281 | SRR27386997, SRR27386999, SRR27386998, SRR27387000, SRR27387002, SRR27387001
Skin | 5 | — | — | GSE222026 | SRR22952493, SRR22952494, SRR22952490, SRR22952491, SRR22952492
Small Intestine | 4 | — | — | PRJNA375882 | SRR5273661, SRR5273693, SRR5273677, SRR5273645
Spleen | 4 | — | — | PRJNA375882 | SRR5273653, SRR5273685, SRR5273669, SRR5273701
Stomach | 4 | — | — | PRJNA375882 | SRR5273647, SRR5273663, SRR5273679, SRR5273695
Striatal cell | — | 4 | 4 | PRJNA1093144 | SRR34804890, SRR34804891, SRR34804892, SRR34804893, SRR34804894, SRR34804895, SRR34804896, SRR34804897
Testis | 2 | — | — | PRJNA375882 | SRR5273681, SRR5273697
Thymus | 4 | — | — | PRJNA375882 | SRR5273650, SRR5273682, SRR5273698, SRR5273666
Vesicular Gland | 3 | — | — | PRJNA375882 | SRR5273658, SRR5273674, SRR5273690
Drosophila melanogaster
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Dora Knockout (#) | BioProject Number | SRR Number
S2 cells | 5 | 4 | 3 | GSE196837, PRJNA896239 | SRR18048483, SRR18048484, SRR18048425, SRR18048423, SRR18048424, SRR22129330, SRR22129281, SRR22129317, SRR22129316, SRR18048427, SRR18048468, SRR18048426
0–4 h Embryos | 4 | — | — | GSE196837 | SRR18048437, SRR18048436, SRR18048435, SRR18048446
8–12 h Embryos | 6 | — | 4 | GSE196837 | SRR18048461, SRR18048433, SRR18048512, SRR18048481, SRR18048482, SRR18048434, SRR18048499, SRR18048531, SRR18048442, SRR18048532
12–16 h Embryos | 6 | — | 4 | GSE196837 | SRR18048539, SRR18048525, SRR18048508, SRR18048459, SRR18048432, SRR18048465, SRR18048448, SRR18048497, SRR18048529, SRR18048516
16–20 h Embryos | 5 | — | 4 | GSE196837 | SRR18048421, SRR18048538, SRR18048479, SRR18048463, SRR18048527, SRR18048542, SRR18048443, SRR18048495, SRR18048501
Fly Non-targeting Control | — | 3 | — | PRJNA896239 | SRR22129292, SRR22129294, SRR22129296
Caenorhabditis elegans
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Ebax Knockout (#) | BioProject Number | SRR Number
Embryos | 4 | — | — | PRJNA922944 | SRR23049957, SRR23049959, SRR23049928, SRR23049954
L1 | 6 | 2 | 2 | GSE68588, GSE262626, GSE267368 | SRR2010468, SRR2010469, SRR28479531, SRR28479532, SRR28479533, SRR28479534
L2 | 3 | — | — | GSE266398 | SRR28868053, SRR28868054, SRR28868055
L3 | 3 | — | — | PRJNA684142 | SRR13238604, SRR13238605, SRR13238606
L4 | 3 | — | — | PRJNA922944 | SRR23049963, SRR23049955, SRR23049961
Adult | 4 | 1 | 1 | PRJNA922944, GSE267368 | SRR23049965, SRR23049966, SRR23049906, SRR23049937
-
CLASHub hosts microRNA expression profile data from four species: Human, Mouse, Drosophila melanogaster, and Caenorhabditis elegans. Below is the summary of available datasets:
Human
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | ZSWIM8 Knockout (#) | BioProject Number | SRR Number
A549 | — | 3 | 3 | GSE163387 | SRR13264637, SRR13264638, SRR13264639, SRR13264640, SRR13264641, SRR13264642
HEK293T | 2 | 3 | 3 | GSE123627, GSE158025 | SRR8311265, SRR8311266, SRR12650650, SRR12650651, SRR12650652, SRR12650653, SRR12650654, SRR12650655
HeLa | 2 | 3 | 3 | GSE123627, GSE163387 | SRR13377179, SRR13377180, SRR13264643, SRR13264644, SRR13264645, SRR13264646, SRR13264647, SRR13264648
K562 | 6 | — | 6 | GSE158025, GSE163388 | SRR12650656, SRR12650657, SRR12650658, SRR13264707, SRR13264708, SRR13264709, SRR12650659, SRR12650660, SRR12650661, SRR13264710, SRR13264711, SRR13264712
MCF7 | — | 2 | 3 | GSE163388 | SRR13264649, SRR13264650, SRR13264651, SRR13264652, SRR13264653
Mouse
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Zswim8 Knockout (#) | BioProject Number | SRR Number
Brain | 4 | — | 3 | GSE235065, GSE148686 | SRR24941005, SRR11547029, SRR24941026, SRR24940996, SRR24941021, SRR24941036, SRR24941000
Heart | 4 | — | 5 | GSE235065, GSE148686, GSE231448 | SRR24941003, SRR24941027, SRR24940995, SRR11547034, SRR24941022, SRR24391641, SRR24391629, SRR24941035, SRR24940999
Kidney | 4 | — | 5 | GSE235065, GSE148686, GSE231448 | SRR24941001, SRR24940993, SRR24941029, SRR11547035, SRR24941011, SRR24391611, SRR24391615, SRR24941033, SRR24941017
Liver | 4 | — | 5 | GSE235065, GSE148686 | SRR24941004, SRR24940989, SRR24941030, SRR11547036, SRR24941010, SRR24391636, SRR24391616, SRR24941032, SRR24941016
Lung | 4 | — | 5 | GSE235065, GSE148686 | SRR24940992, SRR24940998, SRR24941018, SRR24391640, SRR24941008, SRR24391628, SRR11547037, SRR24941031, SRR24941015
Intestine | 3 | — | 5 | GSE235065, GSE231448 | SRR24941002, SRR24940994, SRR24941028, SRR24941023, SRR24941012, SRR24391637, SRR24391617, SRR24941034
Neuron | — | 3 | 2 | GSE163387 | SRR13264632, SRR13264633, SRR13264634, SRR13264635, SRR13264636
MEF | — | 6 | 6 | GSE163387, GSE158025 | SRR13264626, SRR13264627, SRR13264628, SRR12650662, SRR12650663, SRR12650664, SRR13264629, SRR13264630, SRR13264631, SRR12650665, SRR12650666, SRR12650667
Stomach | 4 | — | 5 | GSE235065, GSE148686, GSE231448 | SRR24941020, SRR24941009, SRR24940990, SRR11547042, SRR24941006, SRR24391639, SRR24941025, SRR24391619, SRR24941013
Skin | 3 | — | 5 | GSE235065, GSE231448 | SRR24941019, SRR24940991, SRR24940997, SRR24941024, SRR24391638, SRR24941007, SRR24391618, SRR24941014
Striatal cell | — | 2 | 2 | PRJNA1093144 | SRR28497187, SRR28497188, SRR28497191, SRR28497192, SRR28497193, SRR28497194, SRR28497195, SRR28497196
Drosophila melanogaster
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Dora Knockout (#) | BioProject Number | SRR Number
S2 cells | 3 | — | 3 | GSE163388 | SRR13264713, SRR13264714, SRR13264715, SRR13264716, SRR13264717, SRR13264718
Caenorhabditis elegans
Sample Name | Wild Type (#) | Non-targeting sgRNA Control (#) | Ebax Knockout (#) | BioProject Number | Data Source
Early Embryo | 2 | — | 2 | GSE267367 | SRR29013903, SRR29013904, SRR29013905, SRR29013906
Late Embryo | 2 | — | 2 | GSE267367 | SRR29013899, SRR29013900, SRR29013901, SRR29013902
L1 | 2 | — | 2 | GSE267367 | SRR29013895, SRR29013896, SRR29013897, SRR29013898
L2 | 2 | — | 2 | GSE267367 | SRR29013891, SRR29013892, SRR29013893, SRR29013894
L3 | 2 | — | 2 | GSE267367 | SRR29013887, SRR29013888, SRR29013889, SRR29013890
L4 | 3 | — | 2 | GSE267367 | SRR29013866, SRR29013867, SRR29013868, SRR29013869, SRR29013870
Gravid adult | 2 | — | 2 | GSE267367 | SRR29013879, SRR29013880, SRR29013881, SRR29013882
Glp-4 | 2 | — | 2 | GSE267367 | SRR29013875, SRR29013876, SRR29013877, SRR29013878
-
Step 1: Data Upload and Input
CLASHub accepts paired-end FASTQ files or clean single-end FASTA files. Users need to provide minimal information to initiate the analysis.
1.1 Paired-end Adapter Sequences:
5′ Adapter Sequence (default): GATCGTCGGACTGTAGAACT
3′ Adapter Sequence (default): TGGAATTCTCGGGTGCCAAG
1.2 Target species: (e.g., Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans)
1.3 Output file names
1.4 Email address for result notification
Step 2: Data Preprocessing
CLASHub automatically processes the uploaded data. For paired-end FASTQ files, the preprocessing pipeline includes:
2.1 Adapter Trimming: Adapter sequences are removed using cutadapt (v2.10) (Martin 2011).
2.2 Read Merging: Overlapping paired-end reads are merged using PEAR (v0.9.6) (Zhang et al. 2014).
2.3 Redundancy Collapse: Redundant reads, often from PCR amplification, are collapsed using fastx_collapser (v0.0.14).
2.4 UMI Trimming: Unique Molecular Identifiers (UMIs) are trimmed to produce clean FASTA files.
For single-end FASTA files, these preprocessing steps are not required.
Step 3: Hybrid Identification
The cleaned data is processed to identify miRNA-target hybrids. The analysis uses:
3.1 hyb (Travis et al., 2014): Aligns reads to the reference transcript database using bowtie2 (v2.5.3).
3.2 Reference Database: Includes Ensembl genome (Harrison et al., 2024) assemblies (e.g., GRCh38 for Homo sapiens, GRCm39 for Mus musculus, BDGP6.46 for Drosophila melanogaster, and WBcel235 for Caenorhabditis elegans) and mature miRNAs from miRBase (Release 22.1).
3.3 Binding Stability: Free energy (ΔG, kcal/mol) is calculated using UNAfold (v3.8) (Markham and Zuker, 2008).
Step 4: Conservation Score Calculation
Conservation scores are calculated using CLASHub.py, available on GitHub (https://github.com/UF-Xie-Lab/CLASHub/), and phyloP tracks from the UCSC Genome Browser. These scores assess the evolutionary conservation of miRNA binding sites (Perez et al. 2024). The following phyloP tracks are used for different species:
4.1 hg38.phyloP100way for Homo sapiens (human)
4.2 mm39.phyloP35way for Mus musculus (mouse)
4.3 dm6.phyloP124way for Drosophila melanogaster (fruit fly)
4.4 ce11.phyloP135way for Caenorhabditis elegans (nematode)
Step 5: Hybrid Quantification and Site Type Analysis
All identified miRNA-target hybrids are quantified, and site types are classified based on sequence matching. The site types include: Offset 6mer, 6mer, 7mer-A1, 7mer-m8, and 8mer. These results are summarized into a comprehensive table, including expression levels and detailed hybrid information.
Step 6: Output Results
The final output is presented in a detailed table and summary HTML report. The table includes the following columns:
6.1 miRNA Name (from the miRBase database)
6.2 Pairing Pattern (includes miRNA sequence, target sequence, and their base pairing relationships, analyzed via UNAfold)
6.3 Gene Name (from the Ensembl database)
6.4 Gene ID (from the Ensembl database)
6.5 Conservation Score (calculated using phyloP tracks from the UCSC Genome Browser: hg38.phyloP100way for Homo sapiens, mm39.phyloP35way for Mus musculus, dm6.phyloP124way for Drosophila melanogaster, and ce11.phyloP135way for Caenorhabditis elegans)
6.6 Free Energy (ΔG, kcal/mol, calculated via UNAfold)
6.7 Gene Type (e.g., mRNA, lncRNA)
6.8 Element Region (e.g., CDS, 5'UTR, or 3'UTR)
6.9 Genomic Position
6.10 Binding Site Type (e.g., 8mer, 7mer-A1, 7mer-m8, 6mer, or non-seed match)
6.11 Number of Datasets with Hybrid Occurrence (specifies the datasets where hybrids occur, including wild-type samples, non-targeting sgRNA control samples, or ZSWIM8 knockout samples)
6.12 Normalized Hybrid Abundance (quantifies the abundance of hybrids across wild-type samples, non-targeting sgRNA control samples, or ZSWIM8 knockout samples)
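The binding site types reported in column 6.10 follow the usual seed-match categories. The sketch below shows one way to assign them from sequence alone. It assumes the common definitions (a 6mer pairs with miRNA nucleotides 2-7, a 7mer-m8 with nucleotides 2-8, a 7mer-A1 is a 6mer followed by an A opposite position 1, an 8mer combines both, and an offset 6mer pairs with nucleotides 3-8) and perfect Watson-Crick pairing. The function name classify_site is hypothetical and this is not taken from CLASHub.py.

```python
# Illustrative seed-match classifier (assumed definitions, not CLASHub.py code).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def revcomp(rna: str) -> str:
    """Reverse complement of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def classify_site(mirna: str, target: str) -> str:
    """Return the strongest seed-match type found anywhere in the target."""
    m = mirna.upper().replace("T", "U")
    t = target.upper().replace("T", "U")
    six = revcomp(m[1:7])        # pairs with miRNA nucleotides 2-7
    seven_m8 = revcomp(m[1:8])   # pairs with miRNA nucleotides 2-8
    offset_six = revcomp(m[2:8])  # pairs with miRNA nucleotides 3-8
    # Check from the most to the least extensive match.
    if seven_m8 + "A" in t:
        return "8mer"
    if seven_m8 in t:
        return "7mer-m8"
    if six + "A" in t:
        return "7mer-A1"
    if six in t:
        return "6mer"
    if offset_six in t:
        return "Offset 6mer"
    return "non-seed match"


LET7A = "UGAGGUAGUAGGUUGUAUAGUU"  # hsa-let-7a-5p, used as a toy input
print(classify_site(LET7A, "AAACUACCUCAGG"))  # 8mer
```

Checking the longest match first matters because every 8mer also contains a 7mer and a 6mer.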
-
Step 1: Data Upload and Input
Users upload miRNA sequencing data in one of three supported formats:
1.1 Paired-End FASTQ (.gz): Requires 5′ and 3′ adapter sequences. Default adapter sequences:
5′ Adapter Sequence (default): GATCGTCGGACTGTAGAACT
3′ Adapter Sequence (default): TGGAATTCTCGGGTGCCAAG
1.2 Single-End FASTQ (.gz): Requires only the 3′ adapter sequence. Default 3′ adapter:
3′ Adapter Sequence (default): TGGAATTCTCGGGTGCCAAG
1.3 Cleaned Single-End FASTA (.gz): Does not require adapter sequences.
Additional inputs include the output file name, target species (Homo sapiens, Mus musculus, Drosophila melanogaster, or Caenorhabditis elegans), and a valid email address for result notification.
Step 2: Data Preprocessing
CLASHub processes uploaded data to produce clean FASTA files for analysis:
2.1 Adapter Trimming: Adapter sequences are removed using cutadapt (v2.10) (Martin 2011) for both paired-end and single-end FASTQ files.
2.2 Read Merging: For paired-end FASTQ files, overlapping reads are merged using PEAR (v0.9.6) (Zhang et al. 2014). Single-end files bypass this step.
2.3 Redundancy Collapse: Redundant sequences, often from PCR amplification, are removed using fastx_collapser (v0.0.14). This step applies to both paired-end and single-end files.
2.4 UMI Trimming: Unique Molecular Identifiers (UMIs) are trimmed by removing four nucleotides from both 5′ and 3′ ends using cutadapt. This step applies to paired-end and single-end FASTQ files but is skipped for cleaned single-end FASTA files.
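To make the order of steps 2.3 and 2.4 concrete, here is a minimal pure-Python sketch of the same two operations. It is only an illustration of the logic, not a substitute for fastx_collapser or cutadapt. The function names and toy reads are hypothetical. Collapsing before UMI trimming means that identical inserts carrying different UMIs stay as separate entries, which is the usual reason for this ordering.

```python
# Toy illustration of redundancy collapse followed by UMI trimming.
# Function names and example reads are assumptions made for this sketch.

def collapse_reads(reads):
    """Keep one copy of each distinct read, preserving first-seen order."""
    return list(dict.fromkeys(reads))


def trim_umi(seq, umi_len=4):
    """Remove umi_len nucleotides from both the 5' and 3' ends."""
    if len(seq) <= 2 * umi_len:
        return ""  # nothing left once both UMIs are removed
    return seq[umi_len:-umi_len]


def preprocess(reads, umi_len=4):
    """Collapse exact duplicates first, then strip the UMIs."""
    return [trim_umi(read, umi_len) for read in collapse_reads(reads)]


reads = [
    "AAAATGAGGTAGTAGGCCCC",  # insert flanked by UMIs AAAA / CCCC
    "AAAATGAGGTAGTAGGCCCC",  # exact duplicate, likely a PCR copy
    "GGGGTGAGGTAGTAGGTTTT",  # same insert, different UMIs: a distinct molecule
]
print(preprocess(reads))  # ['TGAGGTAGTAGG', 'TGAGGTAGTAGG']
```

The two surviving entries share the same insert but represent two original molecules, so both count toward abundance.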
Step 3: miRNA Identification and Quantification
The cleaned data is analyzed for miRNA quantification using a custom Python script, CLASHub.py, available on GitHub (https://github.com/UF-Xie-Lab/CLASHub/).
3.1 miRNA Mapping: Reads are aligned to mature miRNA sequences from miRBase (Release 22.1).
3.2 Sequence Matching: The first 18 nucleotides of each read are matched to the corresponding miRNA in the database for accurate identification.
3.3 Quantification: Total miRNA expression levels and isoform-specific abundances (capturing variations at the 3′ end) are calculated.
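The matching and counting described in 3.2 and 3.3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the CLASHub.py implementation: it assumes that a read is assigned to a miRNA when its first 18 nucleotides equal the first 18 nucleotides of the mature sequence, and that each distinct read sequence counts as one isoform. The function name quantify and the two-entry reference are hypothetical.

```python
from collections import Counter, defaultdict

# Sketch of prefix-based miRNA assignment (assumed logic, toy reference).

def quantify(reads, mature, prefix_len=18):
    """Return (total counts per miRNA, per-miRNA isoform counts)."""
    # Index mature miRNAs by their first prefix_len nucleotides.
    # miRNAs sharing the same prefix would collide here; a real pipeline
    # needs a rule for such families.
    index = {seq[:prefix_len]: name for name, seq in mature.items()}
    totals = Counter()
    isoforms = defaultdict(Counter)
    for read in reads:
        name = index.get(read[:prefix_len])
        if name is None:
            continue  # read does not start with a known miRNA prefix
        totals[name] += 1
        isoforms[name][read] += 1  # full read keeps the 3' end variation
    return totals, isoforms


# Toy reference in DNA alphabet; in practice this would come from miRBase.
mature = {
    "hsa-let-7a-5p": "TGAGGTAGTAGGTTGTATAGTT",
    "hsa-miR-21-5p": "TAGCTTATCAGACTGATGTTGA",
}
reads = [
    "TGAGGTAGTAGGTTGTATAGTT",   # canonical let-7a
    "TGAGGTAGTAGGTTGTATAGT",    # let-7a trimmed by one nt at the 3' end
    "TAGCTTATCAGACTGATGTTGAC",  # miR-21 with one extra 3' nucleotide
    "ACGT",                     # unrelated short read
]
totals, isoforms = quantify(reads, mature)
print(dict(totals))
```

Keeping the full read as the isoform key is what lets 3′ end variants be reported separately while still contributing to the total.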
Step 4: Output Results
The AQ-miRNA-seq analysis generates the following outputs:
4.1 Total miRNA Table: Contains miRNA names and their total expression levels.
4.2 Isoform Expression Table: Provides isoform-specific abundances, including sequence variations at the 3′ ends.
4.3 Summary HTML Report: Includes key metrics such as the total number of uploaded reads, reads remaining after preprocessing, and the proportion of reads successfully aligned to known miRNAs.
-
Step 1: Data Upload and Input
Users upload RNA sequencing data in paired-end FASTQ (.gz) format. Required inputs include:
1.1 Adapter Sequences: Default adapter sequences for trimming are:
5′ Adapter Sequence (default): AGATCGGAAGAGCGTCGTGTA
3′ Adapter Sequence (default): AGATCGGAAGAGCACACGTCT
1.2 Output File Name: User-defined name for the output files.
1.3 Target Species: Homo sapiens (GRCh38), Mus musculus (GRCm39), Drosophila melanogaster (BDGP6), or Caenorhabditis elegans (WBcel235).
1.4 Email Address: Results will be sent to the provided email.
Step 2: Data Preprocessing
Uploaded data undergoes preprocessing to prepare for gene expression analysis:
2.1 Adapter Trimming: Adapter sequences are removed using cutadapt (v2.10) (Martin 2011).
2.2 Genome Mapping: Trimmed reads are aligned to the reference genome using HISAT2 (v2.2.1). The supported genome assemblies include:
2.2.1 Homo sapiens (GRCh38)
2.2.2 Mus musculus (GRCm39)
2.2.3 Drosophila melanogaster (BDGP6)
2.2.4 Caenorhabditis elegans (WBcel235)
Step 3: Gene Expression Quantification
3.1 Gene Expression Quantification: Gene expression levels are quantified using StringTie (v2.2.1) to calculate Transcripts Per Million (TPM), providing normalized expression values for all detected genes.
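For readers unfamiliar with TPM, the sketch below shows the generic definition: each gene's count is divided by its length in kilobases, and the resulting rates are rescaled so they sum to one million. StringTie derives TPM from its own coverage estimates, so this is an illustration of the normalization only. The function name tpm and the example numbers are made up.

```python
# Generic TPM calculation from counts and lengths (illustrative only).

def tpm(counts, lengths_bp):
    """Convert per-gene read counts to Transcripts Per Million."""
    # Length-normalized rate: reads per kilobase of transcript.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Rescale so that all values sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


example = tpm({"geneA": 10, "geneB": 20}, {"geneA": 1000, "geneB": 2000})
print(example)  # both genes get 500000.0: geneB has twice the reads but twice the length
```

Because TPM values always sum to the same total, they are comparable across genes within a sample, whereas the differential analysis relies on raw counts instead.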
Step 4: Differential Gene Expression Analysis
4.1 Differential Gene Expression Analysis: To identify differentially expressed genes, raw counts are generated using the prepDE.py3 script (part of StringTie). These raw counts are analyzed with DESeq2 (v1.44) to detect statistically significant changes in gene expression between experimental groups (e.g., treatment vs. control).
Step 5: Output Results
The RNA-seq analysis generates the following outputs using a custom Python script, CLASHub.py, available on GitHub (https://github.com/UF-Xie-Lab/CLASHub/):
5.1 TPM Expression Table: Provides normalized gene expression values (TPM).
5.2 Raw Count Table: Contains unprocessed gene-level counts for downstream analysis.
5.3 DESeq2 Results Table: Includes log2 fold changes, adjusted p-values, and base mean values for differentially expressed genes.
5.4 Summary HTML Report: Summarizes the total number of uploaded reads, trimmed reads, mapped reads, and key metrics for both gene expression quantification and differential analysis.
-
Step 1: Data Upload and Input
Users upload a differential gene expression CSV file with at least three required columns:
1.1 GeneName: Specifies gene names used for analysis.
1.2 BaseMean: Filters out genes with low expression (default threshold: 100).
1.3 log2FoldChange: Indicates the direction and magnitude of differential regulation.
Additional inputs include the target species (Homo sapiens, Mus musculus, Drosophila melanogaster, or Caenorhabditis elegans), the miRNA name, a BaseMean threshold, an output file name, and a valid email address for result delivery.
Step 2: Target Identification
Target genes are classified into two groups for analysis:
2.1 CLASH-Derived Targets:
• Conserved Targets: Identified in at least two cell types for Homo sapiens (15 cell types) and Mus musculus (7 cell types), containing 8mer or 7mer seed matches with conservation scores (phyloP > 0). For Drosophila melanogaster, conserved targets are restricted to the S2 cell line, and for Caenorhabditis elegans, they are limited to embryonic datasets.
• All Targets: Includes all genes containing seed matches (8mer, 7mer, or 6mer), covering mRNA and ncRNA biotypes.
2.2 TargetScan-Derived Targets:
TargetScan predictions are derived from the file Summary_Counts.txt, which classifies targets into conserved and non-conserved categories:
• Conserved Targets: miRNA targets specifically annotated as conserved in the TargetScan database.
• All Targets: Includes both conserved and non-conserved sites, providing a comprehensive list of predicted miRNA interactions.
Users can access the Summary_Counts.txt file for different species from the TargetScan database:
• Homo sapiens (Human): Release 8, September 2021
• Mus musculus (Mouse): Release 8, September 2021
• Drosophila melanogaster (Fly): Release 7.2, October 2018
• Caenorhabditis elegans (Worm): Release 6.2, June 2012
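The conserved CLASH-derived target criteria in 2.1 (an 8mer or 7mer seed match, phyloP > 0, seen in at least two cell types) can be expressed as a simple filter. The sketch below is one possible reading, not the CLASHub.py logic: it assumes the cell-type requirement is evaluated per gene rather than per binding site, and the record fields (gene, cell_type, site_type, phylop) and gene names are hypothetical.

```python
from collections import defaultdict

# Assumed reading of the "conserved target" rule; field names are hypothetical.
SEVEN_OR_EIGHT_MER = {"8mer", "7mer-A1", "7mer-m8"}


def conserved_targets(hybrids, min_cell_types=2):
    """Genes with a qualifying site observed in enough distinct cell types."""
    seen_in = defaultdict(set)
    for hybrid in hybrids:
        qualifies = (
            hybrid["site_type"] in SEVEN_OR_EIGHT_MER and hybrid["phylop"] > 0
        )
        if qualifies:
            seen_in[hybrid["gene"]].add(hybrid["cell_type"])
    return {
        gene for gene, cells in seen_in.items() if len(cells) >= min_cell_types
    }


hybrids = [
    {"gene": "GENE1", "cell_type": "A549", "site_type": "8mer", "phylop": 2.1},
    {"gene": "GENE1", "cell_type": "HEK293T", "site_type": "7mer-m8", "phylop": 0.4},
    {"gene": "GENE2", "cell_type": "A549", "site_type": "6mer", "phylop": 3.0},
    {"gene": "GENE2", "cell_type": "HepG2", "site_type": "8mer", "phylop": 1.0},
    {"gene": "GENE3", "cell_type": "A549", "site_type": "8mer", "phylop": -0.5},
    {"gene": "GENE3", "cell_type": "ES2", "site_type": "8mer", "phylop": 0.9},
]
print(conserved_targets(hybrids))  # {'GENE1'}
```

For species with a single CLASH source (S2 cells or embryos), min_cell_types would have to be set to 1 under this reading.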
Step 3: Cumulative Fraction Curve Generation
The uploaded gene expression data is used to compare fold change distributions between miRNA targets (identified via CLASH or TargetScan) and non-target genes. Genes with BaseMean values below the specified threshold are excluded to ensure robust results. The analysis generates cumulative fraction curves to visualize regulatory trends, such as miRNA upregulation leading to the repression of target genes relative to non-target genes.
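As a rough illustration of this step, the sketch below reads a CSV with the three required columns, drops genes whose BaseMean is below the threshold, splits the remaining log2 fold changes into targets and non-targets, and computes the empirical cumulative fraction for each group. The function names, the toy CSV, and the target set are assumptions made for the example. Plotting is left out.

```python
import csv
import io

# Sketch of the BaseMean filter and cumulative fraction calculation.
# Names and data are illustrative, not taken from CLASHub.py.

def split_fold_changes(rows, targets, base_mean_min=100.0):
    """Return (target, non-target) log2 fold changes after the BaseMean filter."""
    target_fc, non_target_fc = [], []
    for row in rows:
        if float(row["BaseMean"]) < base_mean_min:
            continue  # exclude genes below the expression threshold
        value = float(row["log2FoldChange"])
        if row["GeneName"] in targets:
            target_fc.append(value)
        else:
            non_target_fc.append(value)
    return target_fc, non_target_fc


def ecdf(values):
    """Empirical cumulative fraction: list of (x, fraction of values <= x)."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


CSV_TEXT = """GeneName,BaseMean,log2FoldChange
A,500,-1.0
B,50,-2.0
C,300,0.5
D,150,-0.2
"""
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
target_fc, non_target_fc = split_fold_changes(rows, targets={"A", "B", "D"})
print(ecdf(target_fc))      # [(-1.0, 0.5), (-0.2, 1.0)]
print(ecdf(non_target_fc))  # [(0.5, 1.0)]
```

A target curve shifted to the left of the non-target curve indicates that targets are, on average, more repressed.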
Step 4: Output Results
The cumulative fraction curve analysis produces the following outputs:
4.1 Cumulative Fraction Curves: Visualize fold change distributions for miRNA targets versus non-targets.
4.2 Summary Report: Includes key statistics such as the number of input genes, filtered genes (based on BaseMean), and target gene classifications (e.g., conserved or all targets).
Results are summarized using a custom Python script, CLASHub.py, available on GitHub (https://github.com/UF-Xie-Lab/CLASHub/).