-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Commit ab55d86
⚡️ Speed up method
### 📄 234% (2.34x) speedup for ***`ElementHtml._get_children_html` in
`unstructured/partition/html/convert.py`***
⏱️ Runtime : **`12.3 milliseconds`** **→** **`3.69 milliseconds`** (best
of `101` runs)
### 📝 Explanation and details
Here is a **faster rewrite** of your program, based on your line
profiling results, the imported code constraints, and the code logic.
### Key optimizations.
- **Avoid repeated parsing:** The hotspot is in recursive calls to
`child.get_html_element(**kwargs)`, each of which is re-creating a new
`BeautifulSoup` object in every call.
Solution: **Pass down and reuse a single `BeautifulSoup` instance** when
building child HTML elements.
- **Minimize object creation:** Create `soup` once at the *topmost* call
and reuse for all children and subchildren.
- **Reduce .get_text_as_html use:** Optimize to only use the soup
instance when really necessary and avoid repeated blank parses.
- **Avoid double wrapping:** Only allocate wrappers and new tags if
absolutely required.
- **General micro-optimizations:** Use `None` instead of `or []`,
fast-path checks on empty children, etc.
- **Preserve all comments and signatures as specified.**
Below is the optimized version.
### Explanation of improvements
- **Soup passing**: The `get_html_element` method now optionally
receives a `_soup` kwarg. At the top of the tree, it is `None`, so a new
one is created. Then, for all descendants, the same `soup` instance is
passed via `_soup`, avoiding repeated parsing and allocation.
- **Children check**: `self.children` is checked once, and the attribute
itself is kept as a list (not or-ed with empty list at every call).
- **No unnecessary soup parsing**: `get_text_as_html()` doesn't need a
soup argument, since it only returns a Tag (from the parent module).
- **No changes to existing comments, new comments added only where logic
was changed.**
- **Behavior (output and signature) preserved.**
This **avoids creating thousands of BeautifulSoup objects recursively**,
which was the primary bottleneck found in the profiler. The result is
vastly improved performance, especially for large/complex trees.
✅ **Correctness verification report:**
| Test | Status |
| --------------------------- | ----------------- |
| ⚙️ Existing Unit Tests | 🔘 **None Found** |
| 🌀 Generated Regression Tests | ✅ **768 Passed** |
| ⏪ Replay Tests | ✅ **1 Passed** |
| 🔎 Concolic Coverage Tests | 🔘 **None Found** |
|📊 Tests Coverage | 100.0% |
<details>
<summary>🌀 Generated Regression Tests and Runtime</summary>
```python
from abc import ABC
from typing import Any, List, Optional, Union
# imports
import pytest # used for our unit tests
from bs4 import BeautifulSoup, Tag
from unstructured.partition.html.convert import ElementHtml
# --- Minimal stubs for dependencies ---
class Metadata:
def __init__(self, text_as_html: Optional[str] = None):
self.text_as_html = text_as_html
class Element:
def __init__(self, text="", category="default", id="0", metadata=None):
self.text = text
self.category = category
self.id = id
self.metadata = metadata or Metadata()
# --- The function and class under test ---
HTML_PARSER = "html.parser"
# --- Test helpers ---
class DummyElementHtml(ElementHtml):
"""A concrete subclass for testing, with optional custom tag."""
def __init__(self, element, children=None, html_tag="div"):
super().__init__(element, children)
self._html_tag = html_tag
# --- Unit tests for _get_children_html ---
@pytest.fixture
def soup():
# Fixture for a BeautifulSoup object
return BeautifulSoup("", HTML_PARSER)
def make_tag(soup, name, text=None, **attrs):
tag = soup.new_tag(name)
if text:
tag.string = text
for k, v in attrs.items():
tag[k] = v
return tag
# 1. BASIC TEST CASES
def test_single_child_basic(soup):
"""Single child: Should wrap parent and child in a div, in order."""
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element("Child", category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child])
# Prepare the parent tag
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
# Call _get_children_html
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
divs = result.find_all("div", recursive=False)
def test_multiple_children_basic(soup):
"""Multiple children: All children should be appended in order."""
parent_el = Element("Parent", category="parent", id="p1")
child1_el = Element("Child1", category="child", id="c1")
child2_el = Element("Child2", category="child", id="c2")
child1 = DummyElementHtml(child1_el)
child2 = DummyElementHtml(child2_el)
parent = DummyElementHtml(parent_el, children=[child1, child2])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
divs = result.find_all("div", recursive=False)
def test_no_children_returns_parent_wrapped(soup):
"""No children: Should still wrap parent in a div."""
parent_el = Element("Parent", category="parent", id="p1")
parent = DummyElementHtml(parent_el, children=[])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
inner_divs = result.find_all("div", recursive=False)
def test_children_with_different_tags(soup):
"""Children with different HTML tags should be preserved."""
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element("Child", category="child", id="c1")
child = DummyElementHtml(child_el, html_tag="span")
parent = DummyElementHtml(parent_el, children=[child])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# 2. EDGE TEST CASES
def test_empty_element_text_and_children(soup):
"""Parent and children have empty text."""
parent_el = Element("", category="parent", id="p1")
child_el = Element("", category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child])
parent_tag = make_tag(soup, "div", "", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
divs = result.find_all("div", recursive=False)
def test_deeply_nested_children(soup):
"""Test with deep nesting (e.g., 5 levels)."""
# Build a chain: root -> c1 -> c2 -> c3 -> c4 -> c5
el = Element("root", category="cat0", id="id0")
node = DummyElementHtml(el)
for i in range(1, 6):
el = Element(f"c{i}", category=f"cat{i}", id=f"id{i}")
node = DummyElementHtml(el, children=[node])
# At the top, node is the outermost parent
parent_tag = make_tag(soup, "div", "c5", **{"class": "cat5", "id": "id5"})
codeflash_output = node._get_children_html(soup, parent_tag); result = codeflash_output
# Should have one child at each level
current = result
for i in range(6):
divs = [c for c in current.contents if isinstance(c, Tag)]
current = divs[0]
def test_html_injection_in_text(soup):
"""Child text that looks like HTML should be escaped, not parsed as HTML."""
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element("<b>bold</b>", category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# The child div should have literal text, not a <b> tag inside
child_div = result.find_all("div", recursive=False)[1]
def test_children_with_duplicate_ids(soup):
"""Multiple children with the same id."""
parent_el = Element("Parent", category="parent", id="p1")
child1_el = Element("Child1", category="child", id="dup")
child2_el = Element("Child2", category="child", id="dup")
child1 = DummyElementHtml(child1_el)
child2 = DummyElementHtml(child2_el)
parent = DummyElementHtml(parent_el, children=[child1, child2])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# Both children should be present, even with duplicate ids
divs = result.find_all("div", recursive=False)
def test_children_with_none(soup):
"""Children list contains None (should ignore or raise)."""
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element("Child", category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child, None])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
# Should raise AttributeError when trying to call get_html_element on None
with pytest.raises(AttributeError):
parent._get_children_html(soup, parent_tag)
# 3. LARGE SCALE TEST CASES
def test_many_children_performance(soup):
"""Test with 500 children: structure and order."""
parent_el = Element("Parent", category="parent", id="p1")
children = [DummyElementHtml(Element(f"Child{i}", category="child", id=f"c{i}")) for i in range(500)]
parent = DummyElementHtml(parent_el, children=children)
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
divs = result.find_all("div", recursive=False)
def test_large_tree_width_and_depth(soup):
"""Test with a tree of width 10 and depth 3 (total 1 + 10 + 100 = 111 nodes)."""
def make_tree(depth, width):
if depth == 0:
return []
return [
DummyElementHtml(
Element(f"Child{depth}_{i}", category="cat", id=f"id{depth}_{i}"),
children=make_tree(depth-1, width)
)
for i in range(width)
]
parent_el = Element("Root", category="root", id="root")
children = make_tree(2, 10) # depth=2, width=10 at each node
parent = DummyElementHtml(parent_el, children=children)
parent_tag = make_tag(soup, "div", "Root", **{"class": "root", "id": "root"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# The first level should have 1 parent + 10 children
divs = result.find_all("div", recursive=False)
# Each child should have its own children (10 each)
for child_div in divs[1:]:
sub_divs = child_div.find_all("div", recursive=False)
def test_large_text_content(soup):
"""Test with a single child with a very large text string."""
large_text = "A" * 10000
parent_el = Element("Parent", category="parent", id="p1")
child_el = Element(large_text, category="child", id="c1")
child = DummyElementHtml(child_el)
parent = DummyElementHtml(parent_el, children=[child])
parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# The child div should contain the large text exactly
child_div = result.find_all("div", recursive=False)[1]
def test_children_with_varied_tags_and_attributes(soup):
"""Test children with different tags and extra attributes."""
parent_el = Element("P", category="parent", id="p")
child1_el = Element("C1", category="c1", id="c1")
child2_el = Element("C2", category="c2", id="c2")
child1 = DummyElementHtml(child1_el, html_tag="section")
child2 = DummyElementHtml(child2_el, html_tag="article")
parent = DummyElementHtml(parent_el, children=[child1, child2])
parent_tag = make_tag(soup, "header", "P", **{"class": "parent", "id": "p"})
codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from abc import ABC
from typing import Any, Optional, Union
# imports
import pytest # used for our unit tests
from bs4 import BeautifulSoup, Tag
from unstructured.partition.html.convert import ElementHtml
# Minimal stub for Element and its metadata
class Metadata:
def __init__(self, text_as_html=None):
self.text_as_html = text_as_html
class Element:
def __init__(self, text="", category=None, id=None, metadata=None):
self.text = text
self.category = category or "default-category"
self.id = id or "default-id"
self.metadata = metadata or Metadata()
HTML_PARSER = "html.parser"
# ---------------------------
# Unit tests for _get_children_html
# ---------------------------
# Helper subclass to expose _get_children_html for testing
class TestElementHtml(ElementHtml):
def public_get_children_html(self, soup, element_html, **kwargs):
return self._get_children_html(soup, element_html, **kwargs)
# Override get_html_element to avoid recursion issues in tests
def get_html_element(self, **kwargs: Any) -> Tag:
soup = BeautifulSoup("", HTML_PARSER)
element_html = self.get_text_as_html()
if element_html is None:
element_html = soup.new_tag(name=self.html_tag)
self._inject_html_element_content(element_html, **kwargs)
element_html["class"] = self.element.category
element_html["id"] = self.element.id
self._inject_html_element_attrs(element_html)
if self.children:
return self._get_children_html(soup, element_html, **kwargs)
return element_html
# ---- BASIC TEST CASES ----
def test_single_child_basic():
# Test with one parent and one child
parent_elem = Element(text="Parent", category="parent-cat", id="parent-id")
child_elem = Element(text="Child", category="child-cat", id="child-id")
child = TestElementHtml(child_elem)
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
# Call the function
result = parent.public_get_children_html(soup, parent_html)
def test_multiple_children_basic():
# Parent with two children
parent_elem = Element(text="P", category="p-cat", id="p-id")
child1_elem = Element(text="C1", category="c1-cat", id="c1-id")
child2_elem = Element(text="C2", category="c2-cat", id="c2-id")
child1 = TestElementHtml(child1_elem)
child2 = TestElementHtml(child2_elem)
parent = TestElementHtml(parent_elem, children=[child1, child2])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_no_children_returns_wrapper_with_only_parent():
# Parent with no children, should still wrap parent_html in a div
parent_elem = Element(text="Solo", category="solo-cat", id="solo-id")
parent = TestElementHtml(parent_elem, children=[])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_children_are_nested():
# Test with a deeper hierarchy: parent -> child -> grandchild
grandchild_elem = Element(text="GC", category="gc-cat", id="gc-id")
grandchild = TestElementHtml(grandchild_elem)
child_elem = Element(text="C", category="c-cat", id="c-id")
child = TestElementHtml(child_elem, children=[grandchild])
parent_elem = Element(text="P", category="p-cat", id="p-id")
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
child_div = result.contents[1]
grandchild_div = child_div.contents[1]
# ---- EDGE TEST CASES ----
def test_empty_text_and_attributes():
# Parent and child with empty text and missing attributes
parent_elem = Element(text="", category="", id="")
child_elem = Element(text="", category="", id="")
child = TestElementHtml(child_elem)
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_child_with_html_content():
# Child with HTML in text_as_html, should parse as HTML element
child_elem = Element(text="ignored", category="cat", id="cid",
metadata=Metadata(text_as_html="<span>HTMLChild</span>"))
child = TestElementHtml(child_elem)
parent_elem = Element(text="Parent", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
child_html = result.contents[1]
def test_parent_with_html_content_and_children():
# Parent with HTML in text_as_html, children as normal
parent_elem = Element(text="ignored", category="pcat", id="pid",
metadata=Metadata(text_as_html="<h1>Header</h1>"))
child_elem = Element(text="Child", category="ccat", id="cid")
child = TestElementHtml(child_elem)
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = parent.get_text_as_html()
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_children_with_duplicate_ids():
# Children with the same id, should not raise errors, but both ids should be present
child_elem1 = Element(text="A", category="cat", id="dup")
child_elem2 = Element(text="B", category="cat", id="dup")
child1 = TestElementHtml(child_elem1)
child2 = TestElementHtml(child_elem2)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=[child1, child2])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_children_with_various_html_tags():
# Children with different html_tag settings
class CustomElementHtml(TestElementHtml):
_html_tag = "section"
child_elem = Element(text="Sec", category="cat", id="cid")
child = CustomElementHtml(child_elem)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_html_tag_property_override():
# Test that html_tag property is respected
class CustomElementHtml(TestElementHtml):
@Property
def html_tag(self):
return "article"
child_elem = Element(text="Art", category="cat", id="cid")
child = CustomElementHtml(child_elem)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_inject_html_element_attrs_is_called():
# Test that _inject_html_element_attrs is called (by side effect)
class AttrElementHtml(TestElementHtml):
def _inject_html_element_attrs(self, element_html: Tag) -> None:
element_html["data-test"] = "called"
child_elem = Element(text="Child", category="cat", id="cid")
child = AttrElementHtml(child_elem)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = AttrElementHtml(parent_elem, children=[child])
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
# ---- LARGE SCALE TEST CASES ----
def test_large_number_of_children():
# Test with 500 children
num_children = 500
children = [TestElementHtml(Element(text=f"Child{i}", category="cat", id=f"id{i}")) for i in range(num_children)]
parent_elem = Element(text="Parent", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=children)
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
def test_large_depth_of_nesting():
# Test with 100 nested single-child levels
depth = 100
current = TestElementHtml(Element(text=f"Level{depth}", category="cat", id=f"id{depth}"))
for i in range(depth-1, 0, -1):
current = TestElementHtml(Element(text=f"Level{i}", category="cat", id=f"id{i}"), children=[current])
parent = current
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent.element.text
parent_html["class"] = parent.element.category
parent_html["id"] = parent.element.id
result = parent.public_get_children_html(soup, parent_html)
# Traverse down the nesting, checking text at each level
node = result
for i in range(1, depth+1):
if len(node.contents) > 1:
node = node.contents[1]
else:
break
def test_large_tree_with_breadth_and_depth():
# 10 children, each with 10 children (total 1 + 10 + 100 = 111 nodes)
children = []
for i in range(10):
grandchildren = [TestElementHtml(Element(text=f"GC{i}-{j}", category="gcat", id=f"gid{i}-{j}")) for j in range(10)]
child = TestElementHtml(Element(text=f"C{i}", category="ccat", id=f"cid{i}"), children=grandchildren)
children.append(child)
parent_elem = Element(text="P", category="pcat", id="pid")
parent = TestElementHtml(parent_elem, children=children)
soup = BeautifulSoup("", HTML_PARSER)
parent_html = soup.new_tag("div")
parent_html.string = parent_elem.text
parent_html["class"] = parent_elem.category
parent_html["id"] = parent_elem.id
result = parent.public_get_children_html(soup, parent_html)
for i, child_div in enumerate(result.contents[1:]):
for j, gc_div in enumerate(child_div.contents[1:]):
pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
```
</details>
To edit these changes `git checkout
codeflash/optimize-ElementHtml._get_children_html-mcsd67co` and push.
[](https://codeflash.ai)
---------
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Saurabh Misra <[email protected]>ElementHtml._get_children_html
by 234% (#4087)1 parent 6aee131 commit ab55d86Copy full SHA for ab55d86
File tree
Expand file treeCollapse file tree
2 files changed
+6
-3
lines changedFilter options
- unstructured/partition/html
Expand file treeCollapse file tree
2 files changed
+6
-3
lines changedCollapse file: CHANGELOG.md
+1-1Lines changed: 1 addition & 1 deletion
- Display the source diff
- Display the rich diff
Original file line number | Diff line number | Diff line change | |
---|---|---|---|
| |||
1 | 1 |
| |
2 | 2 |
| |
3 | 3 |
| |
4 |
| - | |
| 4 | + | |
5 | 5 |
| |
6 | 6 |
| |
7 | 7 |
| |
|
Collapse file: unstructured/partition/html/convert.py
unstructured/partition/html/convert.py
Copy file name to clipboardExpand all lines: unstructured/partition/html/convert.py+5-2Lines changed: 5 additions & 2 deletions
Original file line number | Diff line number | Diff line change | |
---|---|---|---|
| |||
56 | 56 |
| |
57 | 57 |
| |
58 | 58 |
| |
59 |
| - | |
| 59 | + | |
60 | 60 |
| |
61 | 61 |
| |
62 | 62 |
| |
63 | 63 |
| |
64 |
| - | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
65 | 68 |
| |
66 | 69 |
| |
67 | 70 |
| |
|
0 commit comments