Environment
- markdownify: 1.2.2
- Python: 3.12.3
Current Behavior
`strip=["script"]` removes the tag but preserves its inner text as plain text:
```python
from markdownify import markdownify as md

html = """
Hello
<script>alert("js noise")</script>
World
"""
print(md(html, strip=["script"]))
```

Output:

```
Hello
alert("js noise")   ← script content remains as plain text
World
```
Root Cause
`process_tag()` collects and joins all child text before checking
`should_convert_tag()`. By the time `get_conv_fn_cached()` returns `None`
for a stripped tag, the child text has already been assembled:
```python
def process_tag(self, node, ...):
    # Children processed unconditionally — strip has no effect here
    child_strings = [self.process_element(el, ...) for el in children_to_convert]
    text = ''.join(child_strings)

    # should_convert_tag() checked only here — too late to suppress children
    convert_fn = self.get_conv_fn_cached(node.name)  # returns None for stripped tags
    if convert_fn is not None:
        text = convert_fn(node, text, ...)
    return text  # child text always returned
```
Proposed Solution
Add a `decompose` parameter that suppresses child processing entirely:
```python
md(html, decompose=["script", "style", "noscript"])
# → 'Hello\n\nWorld'
```
Implementation: return `''` early in `process_tag()` when the tag is in the
`decompose` list.
Workaround
```python
from bs4 import BeautifulSoup
from markdownify import markdownify as md

soup = BeautifulSoup(html, "html.parser")  # html as in the example above
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()  # drop the tag and everything inside it
print(md(str(soup), heading_style="ATX"))
```
```
Use Case
This affects web scraping pipelines where `<script>` and `<style>` content
is meaningless as plain text: analytics snippets, minified JS bundles, and
inline CSS all leak into the markdown output when using `strip`.