strip option preserves inner text of removed tags — request for content-removing option #259

@Taeknology

Description

Environment

  • markdownify: 1.2.2
  • Python: 3.12.3

Current Behavior

strip=["script"] removes the tag but preserves its inner text as plain text:

```python
from markdownify import markdownify as md

html = '<p>Hello</p><script>alert("js noise")</script><p>World</p>'
print(md(html, strip=["script"]))

# Output:
# Hello
#
# alert("js noise")   ← script content remains as plain text
#
# World
```

Root Cause

process_tag() collects and joins all child text before checking
should_convert_tag(). By the time get_conv_fn_cached() returns None
for a stripped tag, the child text has already been assembled:

```python
def process_tag(self, node, ...):
    # Children are processed unconditionally; strip has no effect here
    child_strings = [self.process_element(el, ...) for el in children_to_convert]
    text = ''.join(child_strings)

    # should_convert_tag() is checked only here, too late to suppress children
    convert_fn = self.get_conv_fn_cached(node.name)  # returns None for stripped tags
    if convert_fn is not None:
        text = convert_fn(node, text, ...)

    return text  # child text is always returned

```
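
To make the ordering problem concrete, here is a minimal toy model of the same control flow. It is not markdownify's code: `Node` and `convert` are hypothetical names invented for this illustration, and only `<p>` gets any formatting. It shows that a tag with no conversion function still returns its already-joined child text, while an early return skips the children entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy DOM node (hypothetical): children are Node instances or plain strings."""
    name: str
    children: list = field(default_factory=list)


def convert(node, strip=(), decompose=()):
    if isinstance(node, str):
        return node
    if node.name in decompose:
        # Early return: children are never visited, so no text can leak
        return ""
    # Children are joined before the tag itself is considered
    text = "".join(convert(child, strip, decompose) for child in node.children)
    if node.name in strip:
        # No conversion is applied, but the child text survives
        return text
    if node.name == "p":
        return text + "\n\n"
    return text


doc = Node("body", [
    Node("p", ["Hello"]),
    Node("script", ['alert("js noise")']),
    Node("p", ["World"]),
])

stripped = convert(doc, strip={"script"})
decomposed = convert(doc, decompose={"script"})
print(repr(stripped))
print(repr(decomposed))
```

In this model, `stripped` still contains the script body, while `decomposed` contains only the two paragraphs.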

Proposed Solution

Add a decompose parameter that suppresses child processing entirely:

```python
md(html, decompose=["script", "style", "noscript"])
# → Hello\n\nWorld

```

Implementation: return '' early in process_tag(), before any children are
processed, when the tag is in the decompose list.

Workaround

```python
from bs4 import BeautifulSoup
from markdownify import markdownify as md

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()  # removes the tag together with its contents
markdown = md(str(soup), heading_style="ATX")
```

Use Case

This affects web scraping pipelines where <script> and <style> content
is meaningless as plain text — analytics snippets, minified JS bundles, and
inline CSS all leak into the markdown output when using strip.
