Changes from all commits
42 commits
650369c
added toxicity detection validators
rkritika1508 Apr 1, 2026
949647d
fixed import error
rkritika1508 Apr 1, 2026
da50537
removed redundant validators
rkritika1508 Apr 2, 2026
9ab64c7
Added NSFW text validator
rkritika1508 Apr 2, 2026
b64d0e9
fixed test
rkritika1508 Apr 2, 2026
57d97b2
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 2, 2026
09b6a05
fix: profanity free validator description
dennyabrain Apr 6, 2026
f4a11fa
doc: updated details of sentence parameter
dennyabrain Apr 7, 2026
f330f1b
fix: remove vscode files
dennyabrain Apr 7, 2026
51c9266
Added integration tests
rkritika1508 Apr 7, 2026
141e5fc
Merge branch 'main' into feat/toxicity-hub-validators
rkritika1508 Apr 7, 2026
c76f829
added integration tests
rkritika1508 Apr 7, 2026
baac9e4
fix: profanity free validator description
dennyabrain Apr 6, 2026
627fb4f
Added integration tests
rkritika1508 Apr 7, 2026
8b3da89
validator config: add name to config (#79)
nishika26 Apr 7, 2026
cc0bb14
added integration tests
rkritika1508 Apr 7, 2026
3037eb8
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 7, 2026
b69883d
added integration tests
rkritika1508 Apr 7, 2026
8f67176
updated readme
rkritika1508 Apr 7, 2026
affe72d
Added installation of huggingface model in dockerfile
rkritika1508 Apr 7, 2026
8b0a183
resolved comment
rkritika1508 Apr 7, 2026
14f6dc1
removed blank line
rkritika1508 Apr 7, 2026
74f8a82
updated policies for llama guard
rkritika1508 Apr 7, 2026
6676414
fixed tests
rkritika1508 Apr 7, 2026
0d15d0c
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 7, 2026
6443c1b
updated readme and fixed llama guard inference
rkritika1508 Apr 8, 2026
af933ef
fixed test
rkritika1508 Apr 8, 2026
9b6616a
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 9, 2026
9aca5f2
Merge branch 'main' into feat/toxicity-hub-validators
rkritika1508 Apr 10, 2026
664ded8
resolved comments
rkritika1508 Apr 10, 2026
0ce6ebb
Added evaluation readme (#82)
rkritika1508 Apr 10, 2026
ba27b80
resolved comments
rkritika1508 Apr 10, 2026
d7c5eba
resolved comments
rkritika1508 Apr 10, 2026
02fd043
fixed llama guard
rkritika1508 Apr 10, 2026
d9569ba
Merge branch 'feat/toxicity-hub-validators' into feat/toxicity-huggin…
rkritika1508 Apr 10, 2026
31af2f6
Toxicity Detection validators (#80)
rkritika1508 Apr 10, 2026
a061af8
Merge branch 'main' into feat/toxicity-huggingface-model
rkritika1508 Apr 10, 2026
88c1b56
removed unnecessary changes
rkritika1508 Apr 10, 2026
5b2fe3b
fix: update default nsfw_text model to michellejieli/NSFW_text_classi…
rkritika1508 Apr 10, 2026
fd3cddc
fix: use textdetox/xlmr-large-toxicity-classifier as default nsfw_tex…
rkritika1508 Apr 10, 2026
7264771
updated readme
rkritika1508 Apr 10, 2026
217ba9b
Merge branch 'main' into feat/toxicity-huggingface-model
nishika26 Apr 13, 2026
8 changes: 8 additions & 0 deletions backend/Dockerfile
@@ -47,6 +47,14 @@ RUN --mount=type=cache,target=/root/.cache/uv \
# Install pinned spaCy model in the final environment used at runtime.
RUN python -m pip install --no-deps "${SPACY_MODEL_WHEEL_URL}"

# Set HuggingFace cache directory
ENV HF_HOME=/app/hf_cache

# Pre-download HuggingFace model
RUN /app/.venv/bin/python -c "from transformers import AutoTokenizer, AutoModelForSequenceClassification; \
AutoTokenizer.from_pretrained('textdetox/xlmr-large-toxicity-classifier', cache_dir='/app/hf_cache'); \
AutoModelForSequenceClassification.from_pretrained('textdetox/xlmr-large-toxicity-classifier', cache_dir='/app/hf_cache')"

# -------------------------------
# Entrypoint (runtime setup)
# -------------------------------
2 changes: 2 additions & 0 deletions backend/app/api/API_USAGE.md
@@ -100,6 +100,7 @@ Endpoint:
Optional filters:
- `ids=<uuid>&ids=<uuid>`
- `stage=input|output`
- `type=uli_slur_match|pii_remover|gender_assumption_bias|ban_list|llm_critic|topic_relevance|llamaguard_7b|profanity_free|nsfw_text`
- `type=uli_slur_match|pii_remover|gender_assumption_bias|ban_list|llm_critic|topic_relevance|llamaguard_7b|profanity_free`

Example:
@@ -461,6 +462,7 @@ From `validators.json`:
- `topic_relevance`
- `llamaguard_7b`
- `profanity_free`
- `nsfw_text`

Source of truth:
- `backend/app/core/validators/validators.json`
1 change: 1 addition & 0 deletions backend/app/core/enum.py
@@ -35,3 +35,4 @@ class ValidatorType(Enum):
LLMCritic = "llm_critic"
LlamaGuard7B = "llamaguard_7b"
ProfanityFree = "profanity_free"
NSFWText = "nsfw_text"
44 changes: 43 additions & 1 deletion backend/app/core/validators/README.md
@@ -14,6 +14,7 @@ Current validator manifest:
- `topic_relevance` (source: `local`)
- `llamaguard_7b` (source: `hub://guardrails/llamaguard_7b`)
- `profanity_free` (source: `hub://guardrails/profanity_free`)
- `nsfw_text` (source: `hub://guardrails/nsfw_text`)

## Configuration Model

@@ -409,7 +410,47 @@ Notes / limitations:
- `on_fail=fix` returns `""` on failure — LlamaGuard has no programmatic fix, so `safe_text` will be `""` and the response `metadata.reason` will identify this validator as the cause.
- LlamaGuard policy classification may produce false positives in news, clinical, or legal contexts.

### 8) Profanity Free Validator (`profanity_free`)
### 8) NSFW Text Validator (`nsfw_text`)

Code:

- Config: `backend/app/core/validators/config/nsfw_text_safety_validator_config.py`
- Source: Guardrails Hub (`hub://guardrails/nsfw_text`)

What it does:

- Classifies text as NSFW (not safe for work) using a [HuggingFace transformer model](https://huggingface.co/textdetox/xlmr-large-toxicity-classifier).
- Validates at the sentence level by default; fails if any sentence exceeds the configured threshold.

Why this is used:

- Catches sexually explicit or otherwise inappropriate content that may not be covered by profanity or slur lists.
- Model-based approach handles paraphrased or implicit NSFW content better than keyword matching.

Recommendation:

- `input` and `output`
- Why `input`: prevents explicit user messages from being processed or logged.
- Why `output`: prevents the model from returning NSFW content to end users.
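
For illustration, a minimal request payload sketch mirroring the integration tests added in this PR; the identifier values are placeholders, not real IDs:

```python
import json

# Placeholder identifiers -- real deployments supply their own UUIDs.
payload = {
    "request_id": "req-0001",
    "organization_id": "org-0001",
    "project_id": "proj-0001",
    "input": "user message to screen",
    # The same validator entry works for input-stage and output-stage calls.
    "validators": [{"type": "nsfw_text", "threshold": 0.8, "on_fail": "exception"}],
}

print(json.dumps(payload, indent=2))
```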

Comment on lines +432 to +435
⚠️ Potential issue | 🟡 Minor

nsfw_text recommendation is inconsistent with the default stage strategy.

Line 432 recommends nsfw_text for both input and output, but the default stage strategy list (Lines 516-517) does not include it. This can mislead default rollout configs.

📌 Proposed doc fix
- - Input guardrails: `pii_remover`, `uli_slur_match`, `ban_list`, `topic_relevance` (when scope enforcement is needed), `profanity_free`, `llamaguard_7b`
- - Output guardrails: `pii_remover`, `uli_slur_match`, `gender_assumption_bias`, `ban_list`, `profanity_free`, `llamaguard_7b`
+ - Input guardrails: `pii_remover`, `uli_slur_match`, `ban_list`, `topic_relevance` (when scope enforcement is needed), `profanity_free`, `llamaguard_7b`, `nsfw_text`
+ - Output guardrails: `pii_remover`, `uli_slur_match`, `gender_assumption_bias`, `ban_list`, `profanity_free`, `llamaguard_7b`, `nsfw_text`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/validators/README.md` around lines 432 - 435, The README
shows `nsfw_text` recommended for both `input` and `output` but the "default
stage strategy" list omits it; update the documentation so they match by either
adding `nsfw_text` to the default stage strategy list or removing it from the
`input`/`output` recommendation. Locate the `nsfw_text` mention and the `default
stage strategy` section (the list that defines the default rollout strategy) and
make them consistent, ensuring the doc clearly states whether `nsfw_text` is
enabled by default for `input` and/or `output`.

Parameters / customization:

- `threshold: float` (default: `0.8`) — probability threshold above which text is classified as NSFW
- `validation_method: str` (default: `"sentence"`) — granularity of validation:
- `"sentence"`: each sentence is classified independently; validation fails if **any** sentence exceeds the threshold. Preferred when inputs are multi-sentence and you want to catch a single offensive sentence without failing the whole message.
- `"full"`: the entire text is passed as one unit for classification. Use when inputs are short (single-sentence messages or responses) or when you want to evaluate overall tone rather than per-sentence content.
- `device: str | None` (default: `"cpu"`) — inference device (`"cpu"` or `"cuda"`)
- `model_name: str | None` (default: `"textdetox/xlmr-large-toxicity-classifier"`) — HuggingFace model identifier used for classification. Another supported value is `"michellejieli/NSFW_text_classifier"`.
- `on_fail`
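
The sentence/full distinction can be sketched in plain Python. Note that `stub_score` is a toy stand-in for the HuggingFace classifier, not the real scoring logic:

```python
def validate(text, score, threshold=0.8, validation_method="sentence"):
    """Return True if text passes, False if it is flagged as NSFW.

    `score` stands in for the model: a callable mapping a string to a
    probability in [0, 1].
    """
    if validation_method == "full":
        return score(text) <= threshold
    # Sentence mode: the whole input fails if ANY sentence exceeds the threshold.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return all(score(s) <= threshold for s in sentences)

def stub_score(text):
    # Toy scorer: fraction of words equal to the (stand-in) trigger word.
    words = text.lower().split()
    return words.count("explicit") / max(len(words), 1)

text = "A harmless sentence. An explicit one."
print(validate(text, stub_score, threshold=0.3, validation_method="sentence"))  # False
print(validate(text, stub_score, threshold=0.3, validation_method="full"))      # True
```

Sentence mode is stricter on mixed content: one offending sentence fails the whole input, even when the offending words are diluted across a longer message.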

Notes / limitations:

- Model runs locally; first use will download the model weights unless pre-cached.
- Default model is English-focused; multilingual NSFW detection may require a different `model_name`.
- No programmatic fix is applied — with `on_fail=fix`, `safe_text` will be `""` and the response `metadata.reason` will identify this validator as the cause.
- **Latency**: this validator runs a local transformer model on CPU. For short, single-turn WhatsApp-style messages, sentence-level inference typically adds ~200–500 ms per request on CPU. Use `validation_method="full"` for shorter inputs to avoid per-sentence overhead. For high-throughput deployments, consider using GPU (`device="cuda"`) or moving this validator to async post-processing rather than the synchronous request path.
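
The async post-processing suggestion can be sketched with the standard library; `check_nsfw` here is a hypothetical stand-in for the real validator call, not part of this codebase:

```python
from concurrent.futures import ThreadPoolExecutor

def check_nsfw(text):
    # Stand-in for the real model call (~200-500 ms of CPU inference in practice).
    return {"flagged": "explicit" in text.lower()}

executor = ThreadPoolExecutor(max_workers=2)

def handle_request(text):
    # Reply immediately; moderation runs off the request path and can flag
    # the exchange afterwards (e.g., for review or message revocation).
    moderation = executor.submit(check_nsfw, text)
    return {"reply": "...", "moderation": moderation}

result = handle_request("Tell me about renewable energy sources.")
print(result["moderation"].result()["flagged"])  # False for this clean input
```

The trade-off: the user may briefly see content that is later flagged, so this pattern suits review-and-revoke workflows rather than hard blocking.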

### 9) Profanity Free Validator (`profanity_free`)

Code:

@@ -491,6 +532,7 @@ Tuning strategy:
- `backend/app/core/validators/config/gender_assumption_bias_safety_validator_config.py`
- `backend/app/core/validators/config/topic_relevance_safety_validator_config.py`
- `backend/app/core/validators/config/llamaguard_7b_safety_validator_config.py`
- `backend/app/core/validators/config/nsfw_text_safety_validator_config.py`
- `backend/app/core/validators/config/profanity_free_safety_validator_config.py`
- `backend/app/schemas/guardrail_config.py`
- `backend/app/schemas/validator_config.py`
22 changes: 22 additions & 0 deletions backend/app/core/validators/config/nsfw_text_safety_validator_config.py
@@ -0,0 +1,22 @@
from typing import Literal, Optional

from guardrails.hub import NSFWText

from app.core.validators.config.base_validator_config import BaseValidatorConfig


class NSFWTextSafetyValidatorConfig(BaseValidatorConfig):
type: Literal["nsfw_text"]
threshold: float = 0.8
validation_method: str = "sentence"
device: Optional[str] = "cpu"
model_name: Optional[str] = "textdetox/xlmr-large-toxicity-classifier"

def build(self):
return NSFWText(
threshold=self.threshold,
validation_method=self.validation_method,
device=self.device,
model_name=self.model_name,
on_fail=self.resolve_on_fail(),
)
5 changes: 5 additions & 0 deletions backend/app/core/validators/validators.json
@@ -39,6 +39,11 @@
"type": "profanity_free",
"version": "0.1.0",
"source": "hub://guardrails/profanity_free"
},
{
"type": "nsfw_text",
"version": "0.1.0",
"source": "hub://guardrails/nsfw_text"
}
]
}
4 changes: 4 additions & 0 deletions backend/app/schemas/guardrail_config.py
@@ -27,6 +27,9 @@
from app.core.validators.config.llamaguard_7b_safety_validator_config import (
LlamaGuard7BSafetyValidatorConfig,
)
from app.core.validators.config.nsfw_text_safety_validator_config import (
NSFWTextSafetyValidatorConfig,
)
from app.core.validators.config.profanity_free_safety_validator_config import (
ProfanityFreeSafetyValidatorConfig,
)
@@ -39,6 +42,7 @@
LLMCriticSafetyValidatorConfig,
PIIRemoverSafetyValidatorConfig,
LlamaGuard7BSafetyValidatorConfig,
NSFWTextSafetyValidatorConfig,
ProfanityFreeSafetyValidatorConfig,
TopicRelevanceSafetyValidatorConfig,
],
156 changes: 156 additions & 0 deletions backend/app/tests/test_guardrails_api_integration.py
@@ -323,6 +323,65 @@ def test_input_guardrails_with_llamaguard_7b_on_unsafe_text(integration_client):
assert body["success"] is False


# ---------------------------------------------------------------------------
# NSFWText
# ---------------------------------------------------------------------------


def test_input_guardrails_with_nsfw_text_on_explicit_content(integration_client):
response = integration_client.post(
VALIDATE_API_PATH,
json={
"request_id": request_id,
"organization_id": organization_id,
"project_id": project_id,
"input": "She slowly undressed and they had explicit sex.",
"validators": [{"type": "nsfw_text", "on_fail": "exception"}],
},
)

assert response.status_code == 200
body = response.json()
assert body["success"] is False


def test_input_guardrails_with_nsfw_text_with_low_threshold(integration_client):
"""A low threshold makes the detector more sensitive — explicit content is caught."""
response = integration_client.post(
VALIDATE_API_PATH,
json={
"request_id": request_id,
"organization_id": organization_id,
"project_id": project_id,
"input": "She slowly undressed and they had explicit sex.",
"validators": [
{"type": "nsfw_text", "threshold": 0.1, "on_fail": "exception"}
],
},
)

assert response.status_code == 200
body = response.json()
assert body["success"] is False


def test_input_guardrails_with_nsfw_text_exception_action(integration_client):
response = integration_client.post(
VALIDATE_API_PATH,
json={
"request_id": request_id,
"organization_id": organization_id,
"project_id": project_id,
"input": "Explicit sexual content goes here.",
"validators": [{"type": "nsfw_text", "on_fail": "exception"}],
},
)

assert response.status_code == 200
body = response.json()
assert body["success"] is False


# ---------------------------------------------------------------------------
# Combinations of toxicity detectors
# ---------------------------------------------------------------------------
@@ -373,3 +432,100 @@ def test_input_guardrails_with_profanity_free_and_llamaguard_7b_unsafe_text(
assert response.status_code == 200
body = response.json()
assert body["success"] is False


def test_input_guardrails_with_nsfw_text_and_profanity_free_on_explicit_profane_text(
integration_client,
):
"""Profane+explicit content: profanity_free reliably catches the profanity regardless
of whether the NSFWText model is available, guaranteeing a failure response."""
response = integration_client.post(
VALIDATE_API_PATH,
json={
"request_id": request_id,
"organization_id": organization_id,
"project_id": project_id,
"input": "What the fuck, this is explicit adult sexual content.",
"validators": [
{"type": "profanity_free", "on_fail": "exception"},
{"type": "nsfw_text", "on_fail": "exception"},
],
},
)

assert response.status_code == 200
body = response.json()
assert body["success"] is False


def test_input_guardrails_with_nsfw_text_and_slur_match_on_explicit_slur_text(
integration_client,
):
"""Slur+explicit content: uli_slur_match reliably catches the slur regardless
of whether the NSFWText model is available, guaranteeing a failure response."""
response = integration_client.post(
VALIDATE_API_PATH,
json={
"request_id": request_id,
"organization_id": organization_id,
"project_id": project_id,
"input": "This chakki has explicit sexual content.",
"validators": [
{"type": "uli_slur_match", "severity": "all", "on_fail": "exception"},
{"type": "nsfw_text", "on_fail": "exception"},
],
},
)

assert response.status_code == 200
body = response.json()
assert body["success"] is False


def test_input_guardrails_with_profanity_free_and_ban_list_clean_text(
integration_client,
):
"""Clean text passes both profanity_free and ban_list checks unchanged."""
response = integration_client.post(
VALIDATE_API_PATH,
json={
"request_id": request_id,
"organization_id": organization_id,
"project_id": project_id,
"input": "Tell me about renewable energy sources.",
"validators": [
{"type": "profanity_free"},
{"type": "ban_list", "banned_words": ["fossil"]},
],
},
)

assert response.status_code == 200
body = response.json()
assert body["success"] is True
assert body["data"][SAFE_TEXT_FIELD] == "Tell me about renewable energy sources."


def test_input_guardrails_with_lexical_toxicity_detectors_on_clean_text(
integration_client,
):
"""Clean text passes uli_slur_match, profanity_free, and ban_list unchanged."""
response = integration_client.post(
VALIDATE_API_PATH,
json={
"request_id": request_id,
"organization_id": organization_id,
"project_id": project_id,
"input": "What are some healthy breakfast options?",
"validators": [
{"type": "uli_slur_match", "severity": "all"},
{"type": "profanity_free"},
{"type": "ban_list", "banned_words": ["junk"]},
],
},
)

assert response.status_code == 200
body = response.json()
assert body["success"] is True
assert body["data"][SAFE_TEXT_FIELD] == "What are some healthy breakfast options?"