Story
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes
Key takeaway
Researchers developed a framework that detects toxic memes more accurately by combining text, images, and external knowledge, which could help make social media platforms safer for users.
Quick Explainer
The researchers proposed a hybrid approach called KID-VLM to detect toxicity in online memes. It combines knowledge distilled from large vision-language models with explicit relational semantics from knowledge graphs. This allows the model to capture both implicit and explicit contextual cues needed for accurate toxicity assessment. The key steps are: 1) distilling contextual knowledge from a large teacher model into a more compact student model, 2) infusing relational knowledge from a knowledge graph through graph-based reasoning, and 3) fusing the distilled multimodal representation with the graph-based representation. This knowledge-enhanced approach outperforms existing compact and large-scale models on benchmark toxicity detection tasks while maintaining an efficient model size.
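To make that three-step flow concrete, here is a minimal PyTorch skeleton of how the pieces fit together; the class name ToyKIDVLM and its simple linear layers are illustrative stand-ins under assumed dimensions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ToyKIDVLM(nn.Module):
    """Illustrative skeleton of the three-step flow (hypothetical names and sizes)."""

    def __init__(self, dim: int = 512, graph_dim: int = 256):
        super().__init__()
        # 1) compact student encoder (stands in for the CLIP-based backbone)
        self.student = nn.Linear(dim, dim)
        # 2) graph reasoner (stands in for R-GCN over a ConceptNet subgraph)
        self.graph_reasoner = nn.Linear(graph_dim, dim)
        # 3) gated fusion of the distilled and graph-based representations
        self.gate = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, 2)  # toxic vs. non-toxic

    def forward(self, meme_feats: torch.Tensor, graph_feats: torch.Tensor) -> torch.Tensor:
        h_student = self.student(meme_feats)        # distilled multimodal representation
        h_graph = self.graph_reasoner(graph_feats)  # knowledge-infused representation
        g = torch.sigmoid(self.gate(torch.cat([h_student, h_graph], dim=-1)))
        fused = g * h_student + (1 - g) * h_graph   # gated fusion
        return self.classifier(fused)

# Toy usage with random stand-in features for a batch of 4 memes
logits = ToyKIDVLM()(torch.randn(4, 512), torch.randn(4, 256))
```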
Deep Dive
Technical Deep Dive: Just KIDDIN - Detecting INdecent Memes
Overview
This work proposes a novel hybrid neurosymbolic framework, called Knowledge-Infused Distilled Vision-Language Model (KID-VLM), for detecting toxicity in online multimodal environments like memes. The key innovations are:
- Integrating implicit contextual knowledge distilled from Large Vision-Language Models (LVLMs) and explicit relational semantics infused from Knowledge Graphs (KGs) to enhance compact vision-language models.
- Utilizing CLIP as the backbone and LLaVA-NeXT for caption generation, followed by knowledge distillation and graph-based reasoning over ConceptNet to capture both implicit and explicit contextual cues (a minimal encoding sketch follows this list).
- Achieving state-of-the-art performance on two benchmark datasets (HatefulMemes and HarMeme) while maintaining a compact model size (~500M parameters).
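As a rough illustration of the CLIP-based encoding step mentioned above, the sketch below embeds a meme image and its overlaid text with Hugging Face's CLIP wrappers; the checkpoint name, the placeholder image, and the simple concatenation at the end are assumptions for illustration, not the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption for illustration, not the paper's exact setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="white")  # stand-in for a meme image
overlay_text = "example meme caption"                # OCR'd meme text or a generated caption

inputs = processor(text=[overlay_text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# A simple joint representation; the fusion used in the paper is learned, not a plain concat.
joint = torch.cat([image_emb, text_emb], dim=-1)
```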
Problem & Context
- Online platforms have seen a surge in the dissemination of harmful, context-dependent content like memes, which can convey toxic messages through sarcasm, irony, or cultural references.
- Current methods rely solely on training data and pre-trained models, limiting their ability to capture complex contextual nuances required for accurate toxicity assessment.
- Large multimodal systems demand high computational resources, hindering their deployment in resource-constrained settings.
Methodology
- Knowledge Distillation (KD): Distill implicit contextual knowledge from the teacher model (LLaVA-NeXT) into the student model (CLIP-based) using a consistency loss (see the loss sketch after this list).
- Knowledge Infusion (KI): Incorporate explicit relational knowledge from the ConceptNet KG by constructing a joint working graph and applying a Relational Graph Convolutional Network (R-GCN) for reasoning (also sketched below).
- Multimodal Fusion: Fuse the distilled multimodal representation and the graph-based representation using a Gated Fusion mechanism.
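The exact form of the consistency loss is not spelled out here, so the sketch below uses a common formulation: KL divergence over temperature-softened logits plus a cosine alignment term between teacher and student embeddings. The function name, temperature, and weighting are hypothetical.

```python
import torch
import torch.nn.functional as F

def consistency_kd_loss(student_emb: torch.Tensor,
                        teacher_emb: torch.Tensor,
                        student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 2.0,
                        alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical consistency loss: softened-logit KL plus embedding alignment."""
    # KL divergence between softened teacher and student class distributions
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # Cosine alignment pulls student embeddings toward the teacher's
    # (a projection layer would be needed if the two dimensions differ).
    align = 1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return alpha * kd + (1 - alpha) * align

# Toy usage with random stand-in tensors for a batch of 8 memes
loss = consistency_kd_loss(torch.randn(8, 512), torch.randn(8, 512),
                           torch.randn(8, 2), torch.randn(8, 2))
```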
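For the knowledge-infusion step, a minimal R-GCN pass over a toy ConceptNet-style working graph might look like the following (assuming torch_geometric is available); the graph construction, relation set, and hop count are simplified stand-ins for the paper's joint working graph.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import RGCNConv

# Toy "working graph": 5 concept nodes, 2 relation types (e.g., RelatedTo, IsA)
x = torch.randn(5, 128)                                   # initial node features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])   # source -> target edges
edge_type = torch.tensor([0, 1, 0, 1])                    # relation id per edge
graph = Data(x=x, edge_index=edge_index, edge_type=edge_type)

class ToyRGCN(torch.nn.Module):
    def __init__(self, in_dim=128, hid_dim=128, num_relations=2, num_hops=2):
        super().__init__()
        # One R-GCN layer per hop: stacking layers widens the reasoning neighborhood,
        # loosely mirroring the Hop 1 / Hop 2 variants discussed in the results.
        self.layers = torch.nn.ModuleList(
            [RGCNConv(in_dim if i == 0 else hid_dim, hid_dim, num_relations)
             for i in range(num_hops)])

    def forward(self, x, edge_index, edge_type):
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index, edge_type))
        return x.mean(dim=0)  # pooled graph representation, later fused with the distilled one

graph_repr = ToyRGCN()(graph.x, graph.edge_index, graph.edge_type)
```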
Data & Experimental Setup
- Evaluated on two benchmark datasets: HatefulMemes and HarMeme.
- Compared against compact VLMs like HateClipper, RGCL, Pro-Cap, and large models like MMBT, CLIP.
- Optimized hyperparameters using Optuna, including GNN architecture, fusion mechanisms, and loss weighting.
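A minimal Optuna loop of the kind described above might look like this; the search-space names and the placeholder objective are illustrative assumptions, not the paper's actual tuning script, which would train and validate the full model inside the objective.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search space mirroring the kinds of choices listed above (illustrative names/ranges).
    num_hops = trial.suggest_int("gnn_hops", 1, 2)
    fusion = trial.suggest_categorical("fusion", ["gated", "concat"])
    kd_weight = trial.suggest_float("kd_loss_weight", 0.1, 1.0)
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)

    # Placeholder: a real objective would train with these settings and
    # return the validation AUC or F1 instead of this synthetic score.
    val_score = 1.0 - abs(kd_weight - 0.5) - 0.01 * num_hops
    return val_score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```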
Results
- KID-VLM outperforms all baselines on both datasets:
  - HatefulMemes: 10.6% improvement in F1 score and 0.5% in AUC on the unseen split.
  - HarMeme: 6.3% improvement in F1 and 3.2% in AUC.
- The Hop 2 variant of KID-VLM shows the best overall performance, highlighting the benefits of broader contextual understanding.
- Ablation studies demonstrate the complementary benefits of KD and KI, with the full KID-VLM (both combined) achieving the best results.
Interpretation
- Knowledge-enhanced representations lead to better separation between toxic and non-toxic content in the latent space, reducing ambiguity and misclassifications.
- KID-VLM's compact size (~500M parameters) enables efficient training and deployment, making it scalable for real-world toxicity detection applications.
- Multi-hop knowledge extraction helps the model generalize better, especially on unseen data, by capturing extended contextual cues.
Limitations & Uncertainties
- The model's reliance on ConceptNet may limit generalizability to datasets beyond the two examined.
- Graph-based methods can increase computational complexity, potentially affecting scalability for larger datasets.
- Potential for bias propagation from pre-trained models and KGs, as well as inherited hallucination issues from LLaVA.
What Comes Next
- Exploring more diverse datasets to assess the model's broader generalizability.
- Investigating strategies to improve scalability, such as more efficient graph reasoning approaches.
- Addressing potential biases through careful evaluation and the use of diverse data sources.