feat: Added a configurable connection timeout to TikaDocumentConverter #10294
asisdrico wants to merge 2 commits into deepset-ai:main
Conversation
---
enhancements: >
  the conversion of longer documents or documents that make heavy use of tesseract when using the TikaDocumentConverter may fail
  with a connection timeout error, because the tika library has a default connection timeout of 60 seconds. This enhances the
  TikaDocumentConverter with a configurable timeout. The default timeout stays at 60 seconds.

  ```python
  from haystack.components.converters.tika import TikaDocumentConverter

  converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)
  ```
Let's rephrase the text. Also, the library we use for our release notes relies on reStructuredText formatting rather than Markdown, so I've updated the code block accordingly:
---
enhancements:
  - |
    The ``TikaDocumentConverter`` now supports a configurable connection timeout. This helps prevent conversion failures for long-running documents caused by Tika's default 60 second timeout. The default remains unchanged.

    .. code-block:: python

        from haystack.components.converters.tika import TikaDocumentConverter

        converter = TikaDocumentConverter(tika_url=tika_url, timeout=300)
# we extract the content as XHTML to preserve the structure of the document as much as possible
# this works for PDFs, but does not work for other file types (DOCX)

requestOptions = {"headers": {}, "timeout": self.timeout, "verify": False}
Are the headers and verify keys needed? They seem unrelated to the timeout feature.
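For context on this question, here is a minimal sketch of how the options could be merged. It assumes that tika-python's callServer copies its defaults and then updates them with the caller's requestOptions; that is an assumption about the library internals and should be checked against the installed version. The helper name `merge_request_options` is hypothetical and only illustrates the merge.

```python
# Defaults modeled on the requestOptionsDefault dict quoted in the PR description.
# The empty "headers" dict is a placeholder for whatever headers the library builds.
REQUEST_OPTIONS_DEFAULT = {"timeout": 60, "headers": {}, "verify": False}


def merge_request_options(request_options=None):
    """Hypothetical helper: overlay caller-supplied options on top of the defaults."""
    effective = REQUEST_OPTIONS_DEFAULT.copy()
    effective.update(request_options or {})
    return effective


# Passing only the timeout leaves the other defaults untouched.
print(merge_request_options({"timeout": 300}))
```

If the library does merge options this way, passing `{"timeout": self.timeout}` alone would be enough, and the `headers` and `verify` keys could be dropped from the PR.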
@asisdrico thanks for the changes! For testing one of our integration tests in
Hey @asisdrico, are you able to continue working on this?
Sorry, I'm a bit pressed for time at the moment. I expect to continue working on it over the weekend. I hope that is okay.
Proposed Changes:
Converting long documents, or documents that make heavy use of Tesseract, with the TikaDocumentConverter may fail with a connection timeout error, because the tika library has a default connection timeout of 60 seconds. This PR adds a configurable timeout to the TikaDocumentConverter. The default timeout stays at 60 seconds.
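A minimal sketch of the idea, not the actual diff in this PR: the converter stores a `timeout` and forwards it to the parser call as `requestOptions`. The class name `TimeoutConverterSketch`, the injectable `parse_fn`, and the default URL are assumptions introduced so the example runs without a Tika server; in the real component the call would presumably go to the tika library's parser.

```python
from typing import Any, Callable, Dict, Optional


class TimeoutConverterSketch:
    """Hypothetical stand-in for a converter with a configurable timeout."""

    def __init__(
        self,
        tika_url: str = "http://localhost:9998/tika",  # assumed default URL
        timeout: int = 60,  # same default as the tika library
        parse_fn: Optional[Callable[..., Dict[str, Any]]] = None,
    ):
        if timeout <= 0:
            raise ValueError("timeout must be a positive number of seconds")
        self.tika_url = tika_url
        self.timeout = timeout
        self._parse_fn = parse_fn

    def convert(self, data: bytes) -> Optional[str]:
        if self._parse_fn is None:
            raise RuntimeError("no parse function configured")
        # Forward only the timeout; other request options are left to the parser.
        result = self._parse_fn(
            data,
            serverEndpoint=self.tika_url,
            requestOptions={"timeout": self.timeout},
        )
        return result.get("content")


def fake_parse(data, serverEndpoint, requestOptions):
    """Fake parser that reports which timeout it received."""
    return {"content": f"{len(data)} bytes, timeout={requestOptions['timeout']}"}


converter = TimeoutConverterSketch(timeout=300, parse_fn=fake_parse)
print(converter.convert(b"%PDF"))
```

Injecting the parse function keeps the sketch testable in isolation: a unit test can check that the configured timeout reaches the parser without starting a Tika container.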
How did you test it?
The error first came up when I tried to convert the following document with the TikaDocumentConverter:
https://www.koalitionsvertrag2025.de/sites/www.koalitionsvertrag2025.de/files/koav_2025.pdf
When Tika runs in Docker with Tesseract enabled, it first runs OCR on the images on the first page and then processes the text. I always ran into the 60-second timeout of the HTTP connection, which is configured by default in the tika library (https://github.com/chrismattmann/tika-python) in the callServer method:
requestOptionsDefault = {
    'timeout': 60,
    'headers': headers,
    'verify': False
}
After setting the timeout to 300 seconds, the conversion of the above document worked flawlessly. The timeout actually needed depends on the performance of the machine used.