-
Hi @glypt! I'm Dosu and I’m helping the docling team. Neither DocumentConverter nor HybridChunker is fully thread-safe for concurrent document processing. DocumentConverter uses a lock to protect pipeline initialization, but once a pipeline, model, or tokenizer has been created it may be shared across threads, which can cause race conditions. This is especially likely because components like HybridChunker and HuggingFaceTokenizer are not inherently thread-safe and use cached properties or internal state that can be mutated during processing.

Protecting just the tokenizer with a lock helps, but it is not enough: avoid sharing any pipeline, model, or chunker instance (including HybridChunker) across threads. Instead, instantiate separate objects per thread, or use thread-local storage so that each thread has its own instances. The same applies to other ML components in the pipeline, such as OCR and layout models, which may not be thread-safe if shared across threads. For large-scale or distributed workloads, consider using docling-serve, which is designed for parallel document conversion at scale (source).
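As a rough illustration of the thread-local approach, here is a minimal sketch using only the Python standard library. `StandInConverter`, `get_thread_instance`, `process`, and `run_all` are hypothetical names made up for this example; `StandInConverter` is a placeholder for whatever non-thread-safe object you would build per thread (for instance a DocumentConverter or HybridChunker), not docling's actual API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One namespace per thread: attributes set here are invisible to other threads.
_local = threading.local()


def get_thread_instance(name, factory):
    """Return the calling thread's own instance, creating it on first use."""
    instance = getattr(_local, name, None)
    if instance is None:
        instance = factory()
        setattr(_local, name, instance)
    return instance


class StandInConverter:
    """Hypothetical placeholder for a non-thread-safe, expensive-to-build object."""

    def __init__(self):
        # Remember which thread built this instance.
        self.owner = threading.get_ident()

    def convert(self, path):
        # Report the owning thread and the thread actually doing the work.
        return path, self.owner, threading.get_ident()


def process(path):
    # Each worker thread lazily builds, then reuses, its own converter.
    converter = get_thread_instance("converter", StandInConverter)
    return converter.convert(path)


def run_all(paths, max_workers=4):
    # pool.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, paths))
```

In real use you would pass your own factory in place of `StandInConverter`, one that builds the converter (and, separately, the chunker and tokenizer) with your pipeline options. Each worker then loads its own models, so memory use grows with the number of threads; that trade-off is the cost of not sharing state.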
-
Is there a plan to work on a thread-safe version of these components (DocumentConverter and HybridChunker)?
-
Hi,
I would like to use docling with accurate OCR (I have already gone through the options I need) and run it asynchronously. I want to use all the cores available on the machine, probably with a ThreadPoolExecutor, and launch the pipeline processing of each document on a worker thread. I tried this quickly but I seem to have a race condition. So far I have only protected the tokenizer, which is shared among all the DocumentConverter instances, with a lock. What else should I pay attention to? Do you have an idea where the race condition might come from? Are DocumentConverter and HybridChunker thread-safe?