Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes

Computing · Artificial Intelligence

Key takeaway

Researchers developed a framework to better detect toxic content online by combining text and images, which could help make social media platforms safer for users.


Quick Explainer

The researchers proposed a hybrid approach called KID-VLM to detect toxicity in online memes. It combines knowledge distilled from large vision-language models with explicit relational semantics from knowledge graphs. This allows the model to capture both implicit and explicit contextual cues needed for accurate toxicity assessment. The key steps are: 1) distilling contextual knowledge from a large teacher model into a more compact student model, 2) infusing relational knowledge from a knowledge graph through graph-based reasoning, and 3) fusing the distilled multimodal representation with the graph-based representation. This knowledge-enhanced approach outperforms existing compact and large-scale models on benchmark toxicity detection tasks while maintaining an efficient model size.

Deep Dive

Technical Deep Dive: Just KIDDIN - Detecting INdecent Memes

Overview

This work proposes a novel hybrid neurosymbolic framework, called Knowledge-Infused Distilled Vision-Language Model (KID-VLM), for detecting toxicity in online multimodal environments like memes. The key innovations are:

  • Integrating implicit contextual knowledge distilled from Large Vision-Language Models (LVLMs) and explicit relational semantics infused from Knowledge Graphs (KGs) to enhance compact vision-language models.
  • Utilizing CLIP as the backbone and LLaVA-NeXT for caption generation, followed by knowledge distillation and graph-based reasoning over ConceptNet to capture both implicit and explicit contextual cues (a minimal encoding sketch follows this list).
  • Achieving state-of-the-art performance on two benchmark datasets (HatefulMemes and HarMeme) while maintaining a compact model size (~500M parameters).
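To make the backbone side concrete, here is a minimal sketch of how a CLIP-style student could encode a meme's image and overlaid text (or an LVLM-generated caption) into a joint representation. It uses Hugging Face `transformers` CLIP classes; the checkpoint name, the placeholder caption, and the simple concatenation step are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch only: encodes a meme image plus its text/caption with CLIP.
# The checkpoint, caption handling, and concatenation are assumptions,
# not the exact configuration used in the KID-VLM paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("meme.png")               # the meme image (placeholder path)
text = "when you realize ..."                # overlaid text or an LVLM caption

inputs = processor(text=[text], images=image, return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Joint multimodal representation: concatenate projected image and text embeddings.
image_emb = outputs.image_embeds             # shape (1, 512)
text_emb = outputs.text_embeds               # shape (1, 512)
meme_repr = torch.cat([image_emb, text_emb], dim=-1)
print(meme_repr.shape)                       # torch.Size([1, 1024])
```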

Problem & Context

  • Online platforms have seen a surge in the dissemination of harmful, context-dependent content like memes, which can convey toxic messages through sarcasm, irony, or cultural references.
  • Current methods rely solely on training data and pre-trained models, limiting their ability to capture complex contextual nuances required for accurate toxicity assessment.
  • Large multimodal systems demand high computational resources, hindering their deployment in resource-constrained settings.

Methodology

  1. Knowledge Distillation (KD): Distill implicit contextual knowledge from the teacher model (LLaVA-NeXT) into the compact student model (CLIP-based) using a consistency loss (see the first sketch after this list).
  2. Knowledge Infusion (KI): Incorporate explicit relational knowledge from the ConceptNet KG by constructing a joint working graph and reasoning over it with a Relational Graph Convolutional Network (R-GCN).
  3. Multimodal Fusion: Fuse the distilled multimodal representation with the graph-based representation via a gated fusion mechanism (the R-GCN and fusion modules are sketched in the second snippet below).
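A hedged sketch of step 1: the paper describes a consistency loss for distilling the teacher's contextual knowledge into the student, but its exact form is not reproduced here, so the snippet below uses a common choice (cosine alignment between projected student embeddings and frozen teacher embeddings) purely as an illustration.

```python
# Illustrative consistency loss for knowledge distillation (step 1).
# The cosine-alignment formulation and projection head are assumptions;
# the paper's actual loss may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyDistillationLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Project student features into the teacher's embedding space.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        s = F.normalize(self.proj(student_feats), dim=-1)
        t = F.normalize(teacher_feats.detach(), dim=-1)   # teacher stays frozen
        # Encourage the student to match the teacher's representation direction.
        return (1.0 - (s * t).sum(dim=-1)).mean()

# Usage: combine with the supervised toxicity loss via a weighting coefficient.
kd_loss_fn = ConsistencyDistillationLoss(student_dim=1024, teacher_dim=4096)
student_feats = torch.randn(8, 1024)   # e.g., CLIP-based meme representations
teacher_feats = torch.randn(8, 4096)   # e.g., LLaVA-NeXT hidden states
loss = kd_loss_fn(student_feats, teacher_feats)
```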
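Steps 2 and 3 in sketch form: a small R-GCN encoder over a ConceptNet-derived subgraph (using PyTorch Geometric's `RGCNConv`) and a sigmoid-gated blend of the graph representation with the distilled multimodal representation. The layer count, dimensions, mean pooling, and gating formula are common conventions assumed for illustration, not the paper's exact architecture.

```python
# Illustrative knowledge-infusion (R-GCN) and gated-fusion modules (steps 2-3).
# Dimensions, pooling, and the gating formula are assumptions for illustration.
import torch
import torch.nn as nn
from torch_geometric.nn import RGCNConv

class GraphEncoder(nn.Module):
    """Two-layer R-GCN over a ConceptNet-style subgraph with typed edges."""
    def __init__(self, in_dim: int, hidden_dim: int, num_relations: int):
        super().__init__()
        self.conv1 = RGCNConv(in_dim, hidden_dim, num_relations)
        self.conv2 = RGCNConv(hidden_dim, hidden_dim, num_relations)

    def forward(self, x, edge_index, edge_type):
        h = torch.relu(self.conv1(x, edge_index, edge_type))
        h = self.conv2(h, edge_index, edge_type)
        return h.mean(dim=0, keepdim=True)      # simple mean pooling over nodes

class GatedFusion(nn.Module):
    """Sigmoid gate that blends the distilled multimodal and graph representations."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, multimodal_repr, graph_repr):
        g = torch.sigmoid(self.gate(torch.cat([multimodal_repr, graph_repr], dim=-1)))
        return g * multimodal_repr + (1.0 - g) * graph_repr

# Toy usage with random tensors standing in for real features.
num_nodes, feat_dim, num_relations = 12, 256, 8
x = torch.randn(num_nodes, feat_dim)                   # node features
edge_index = torch.randint(0, num_nodes, (2, 30))      # subgraph edges
edge_type = torch.randint(0, num_relations, (30,))     # ConceptNet relation ids

graph_repr = GraphEncoder(feat_dim, 256, num_relations)(x, edge_index, edge_type)
multimodal_repr = torch.randn(1, 256)                  # distilled student output
fused = GatedFusion(256)(multimodal_repr, graph_repr)  # -> toxicity classifier head
```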

Data & Experimental Setup

  • Evaluated on two benchmark datasets: HatefulMemes and HarMeme.
  • Compared against compact baselines such as MMBT, CLIP, HateClipper, RGCL, and Pro-Cap, as well as larger vision-language models.
  • Optimized hyperparameters with Optuna, including the GNN architecture, fusion mechanism, and loss weighting (an illustrative search-space sketch follows this list).
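The kind of search described above can be reproduced in spirit with a few lines of Optuna. The parameter names, ranges, and the stubbed objective below are illustrative assumptions, not the study configuration actually used.

```python
# Illustrative Optuna search over GNN depth, fusion mechanism, and loss weighting.
# Parameter names, ranges, and the placeholder objective are assumptions.
import optuna

def train_and_evaluate(num_gnn_layers, hidden_dim, fusion, kd_weight, lr) -> float:
    """Placeholder: train a KID-VLM-style model with these settings and return validation AUC."""
    # Stub so the sketch runs end to end; replace with real training/evaluation.
    return 0.5 + 0.01 * num_gnn_layers + 0.1 * kd_weight

def objective(trial: optuna.Trial) -> float:
    num_gnn_layers = trial.suggest_int("num_gnn_layers", 1, 3)
    hidden_dim = trial.suggest_categorical("hidden_dim", [128, 256, 512])
    fusion = trial.suggest_categorical("fusion", ["gated", "concat", "attention"])
    kd_weight = trial.suggest_float("kd_weight", 0.1, 1.0)
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    return train_and_evaluate(num_gnn_layers, hidden_dim, fusion, kd_weight, lr)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```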

Results

  • KID-VLM outperforms all baselines on both datasets:
    • HatefulMemes: 10.6% improvement in F1 score and 0.5% in AUC on the unseen split.
    • HarMeme: 6.3% improvement in F1 and 3.2% in AUC.
  • The Hop 2 variant of KID-VLM shows the best overall performance, highlighting the benefits of broader contextual understanding.
  • Ablation studies demonstrate the complementary benefits of KD and KI, with KID-VLM achieving the best results.

Interpretation

  • Knowledge-enhanced representations lead to better separation between toxic and non-toxic content in the latent space, reducing ambiguity and misclassifications.
  • KID-VLM's compact size (~500M parameters) enables efficient training and deployment, making it scalable for real-world toxicity detection applications.
  • Multi-hop knowledge extraction helps the model generalize better, especially on unseen data, by capturing extended contextual cues (a toy multi-hop example follows this list).
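To illustrate what "multi-hop" means here, the toy snippet below pulls one-hop and two-hop neighborhoods around a grounded concept using NetworkX's `ego_graph`. The mini-graph and concept names are invented; the paper's actual working-graph construction over ConceptNet is far larger and more involved.

```python
# Toy illustration of multi-hop (hop-2) neighborhood extraction around a concept.
# The mini-graph and concepts are invented; ConceptNet itself is much larger.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("meme", "joke"), ("joke", "sarcasm"), ("sarcasm", "irony"),
    ("joke", "humor"), ("humor", "offense"), ("offense", "toxicity"),
])

# 1-hop vs 2-hop context around the grounded concept "joke".
hop1 = nx.ego_graph(G, "joke", radius=1)
hop2 = nx.ego_graph(G, "joke", radius=2)

print(sorted(hop1.nodes()))   # ['humor', 'joke', 'meme', 'sarcasm']
print(sorted(hop2.nodes()))   # adds 'irony' and 'offense': broader context for reasoning
```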

Limitations & Uncertainties

  • The model's reliance on ConceptNet may limit generalizability to datasets beyond the two examined.
  • Graph-based methods can increase computational complexity, potentially affecting scalability for larger datasets.
  • Potential for bias propagation from pre-trained models and KGs, as well as inherited hallucination issues from LLaVA.

What Comes Next

  • Exploring more diverse datasets to assess the model's broader generalizability.
  • Investigating strategies to improve scalability, such as more efficient graph reasoning approaches.
  • Addressing potential biases through careful evaluation and the use of diverse data sources.
