Specifically, JPA proceeds as follows (each step is illustrated by a code sketch after the list):
- Prompt Construction: Given a sensitive prompt \( p_t = [p_1, p_2, ..., p_n] \), JPA prepends \( k \) learnable tokens \( [v_1, ..., v_k] \) to form the adversarial prompt \( p_a = [v_1, ..., v_k, p_1, ..., p_n] \).
- Concept Direction via Antonyms: Using \( N \) antonym pairs \( (r_i^+, r_i^-) \) generated by ChatGPT, JPA computes the concept direction in the embedding space as \( r = \frac{1}{N}\sum_{i=1}^{N} \left[ \mathcal{T}(r_i^+) - \mathcal{T}(r_i^-) \right] \), where \( \mathcal{T}(\cdot) \) denotes the text encoder.
- Embedding Modification: The original prompt embedding is modified to inject the NSFW concept: \( \mathcal{T}(p_r) = \mathcal{T}(p_t) + \lambda \cdot r \), where \( \lambda \) controls the strength of the injected concept.
- Prompt Search via Cosine Similarity: The goal is to find a prompt \( p_a \) whose embedding is closest to \( \mathcal{T}(p_r) \): \( \max_{p_a} \frac{\mathcal{T}(p_a) \cdot \mathcal{T}(p_r)}{\|\mathcal{T}(p_a)\| \cdot \|\mathcal{T}(p_r)\|} \).
- Optimization in Discrete Space: JPA uses Projected Gradient Descent (PGD) with a softmax relaxation over the vocabulary: \( \text{embed}[i] = \sum_{k=1}^{L} \frac{e^{v_{ik}}}{\sum_{h=1}^{L} e^{v_{ih}}} E_k \), where \( L \) is the vocabulary size, \( E_k \) is the embedding of the \( k \)-th word, and \( v_{ik} \) is the logit for word \( k \) at position \( i \).
- Discrete Prompt Extraction: After optimization, the token at each position \( i \) is selected as \( v_i = \arg\max_k v_{ik} \).
- Gradient Masking for Safety: To avoid selecting blocked or overly sensitive words, JPA applies gradient masking by assigning a large negative value (e.g., \( -10^9 \)) to the logits of sensitive tokens, effectively preventing their selection during optimization.
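To make the pipeline concrete, the sketches below walk through these steps in PyTorch. They are minimal sketches, not the authors' implementation: `encode_text` is a stub standing in for the text encoder \( \mathcal{T}(\cdot) \) (in JPA, a CLIP-style text encoder), and all names, dimensions, and hyperparameter values are placeholder assumptions. First, the concept direction \( r \):

```python
import torch

DIM = 768  # text-embedding width (assumption, e.g. CLIP ViT-L/14)

def encode_text(prompt: str) -> torch.Tensor:
    """Stand-in for the text encoder T(.). A real attack would call the
    target model's text encoder; this stub returns a pseudo-embedding
    (repeatable within a run) so the sketch executes end to end."""
    torch.manual_seed(hash(prompt) % (2**31))
    return torch.randn(DIM)

def concept_direction(antonym_pairs):
    """r = (1/N) * sum_i [ T(r_i^+) - T(r_i^-) ]"""
    diffs = [encode_text(pos) - encode_text(neg) for pos, neg in antonym_pairs]
    return torch.stack(diffs).mean(dim=0)

pairs = [("explicit", "modest"), ("graphic", "tame")]  # placeholder antonym pairs
r = concept_direction(pairs)
```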
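Continuing the sketch, the embedding modification is a single vector operation; the value of \( \lambda \) here is an arbitrary placeholder, not a value from the paper:

```python
lam = 3.0                              # lambda, injection strength (placeholder value)
p_t = "a figure standing in the rain"  # example sensitive prompt p_t
target = encode_text(p_t) + lam * r    # T(p_r) = T(p_t) + lambda * r
```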
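The \( k \) learnable tokens are kept as a \( k \times L \) matrix of logits \( v_{ik} \); a row-wise softmax mixes the vocabulary embedding table \( E \) so the prompt stays differentiable. The toy vocabulary size and the stand-in for the frozen suffix embeddings are assumptions:

```python
VOCAB = 1000      # toy vocabulary size L (CLIP's BPE vocabulary is ~49k)
K, N_TOKENS = 8, 6  # number of learnable tokens k and prompt length n (placeholders)

E = torch.randn(VOCAB, DIM)                         # stand-in for the token-embedding table E_k
logits = torch.zeros(K, VOCAB, requires_grad=True)  # v_{ik}: one logit per (position i, word k)
suffix_embed = torch.randn(N_TOKENS, DIM)           # stand-in for the embeddings of [p_1, ..., p_n]

def adversarial_prompt_embed(logits: torch.Tensor) -> torch.Tensor:
    # embed[i] = sum_k softmax(v_i)_k * E_k -- differentiable surrogate for token lookup
    prefix = torch.softmax(logits, dim=-1) @ E       # (K, DIM) soft embeddings of v_1..v_k
    # p_a = [v_1, ..., v_k, p_1, ..., p_n]: learnable prefix + frozen original prompt
    return torch.cat([prefix, suffix_embed], dim=0)  # (K + N_TOKENS, DIM)
```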
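The search then maximizes cosine similarity between a pooled embedding of \( p_a \) and the target. Plain gradient steps stand in for the paper's PGD below, and mean pooling is an assumption about how \( \mathcal{T}(p_a) \) is reduced to one vector:

```python
import torch.nn.functional as F

opt = torch.optim.SGD([logits], lr=0.1)  # optimizer choice is an assumption
for step in range(200):
    e_a = adversarial_prompt_embed(logits).mean(dim=0)  # pooled T(p_a) (pooling is assumed)
    loss = -F.cosine_similarity(e_a, target, dim=0)     # maximize cos(T(p_a), T(p_r))
    opt.zero_grad()
    loss.backward()
    opt.step()
```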
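Gradient masking is an additive logit mask: pinning blocked tokens at a large negative value drives their softmax weight to (near) zero, so they neither attract probability mass during optimization nor survive the final argmax. The blocked ids here are hypothetical; inside the loop above, `masked_prompt_embed` would replace `adversarial_prompt_embed`:

```python
blocked_ids = [7, 42]            # hypothetical ids of blocked/sensitive tokens
mask = torch.zeros(VOCAB)
mask[blocked_ids] = -1e9         # huge negative logit => softmax weight ~ 0

def masked_prompt_embed(logits: torch.Tensor) -> torch.Tensor:
    # Same relaxation as before, but sensitive words get (near-)zero mixture weight
    prefix = torch.softmax(logits + mask, dim=-1) @ E
    return torch.cat([prefix, suffix_embed], dim=0)
```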
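Finally, the relaxed prefix is discretized position by position; mapping ids back to words assumes access to the encoder's real tokenizer:

```python
token_ids = (logits + mask).argmax(dim=-1)  # v_i = argmax_k v_{ik}; masked words can never win
print(token_ids.tolist())
# adv_prefix = [tokenizer.decode([i]) for i in token_ids.tolist()]  # needs the real tokenizer
```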
Together, these steps allow JPA to generate adversarial prompts that evade safety filters while maintaining semantic alignment with the original intent.