Here we can plan the next releases of Tesseract.
Future releases
Here are some ideas for future Tesseract releases.
Modernize the code using C++11 (see discussions here and here).
Use LLVM’s tools: clang-format, clang-tidy, scan-build, sanitizers.
Replace more Tesseract data types by C++ standard types (GenericVector, …), especially for the API.
Add a JSON (or XML) output format. It would be used both for full OCR and for psm 2 (layout information only).
Add option to use alternative binarization methods from leptonica.
Add an option to output separate files for multipage input (out1.hocr, out2.hocr …).
Add a multi-threading option to the command line (OpenMP will be disabled at runtime in this mode).
Explore the option to use Protocol Buffers or FlatBuffers for the traineddata.
Improve error handling and don’t ignore return values from functions (see discussion).
Replace tprintf etc. with a more advanced logging API that supports log levels.
Requirements (see also discussion):
Log levels:
Related issues:
Useful links:
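As a rough illustration of the logging item above, the sketch below shows one way a level-aware replacement for tprintf could look. It is not an existing Tesseract API: the `LogLevel` values, the `Logger` class, and its method names are all hypothetical, and the printf-style signature is assumed only to keep call sites close to current tprintf usage.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical log levels, ordered from least to most severe.
enum class LogLevel { kDebug = 0, kInfo = 1, kWarning = 2, kError = 3 };

// Hypothetical logger: messages below the threshold are dropped.
class Logger {
 public:
  explicit Logger(LogLevel threshold) : threshold_(threshold) {}

  void set_threshold(LogLevel level) { threshold_ = level; }

  // Formats and writes the message to stderr if its level passes the
  // threshold. Returns true if the message was emitted.
  bool Log(LogLevel level, const char* format, ...) {
    if (level < threshold_) return false;
    char buffer[1024];
    va_list args;
    va_start(args, format);
    std::vsnprintf(buffer, sizeof(buffer), format, args);
    va_end(args);
    last_message_ = buffer;
    std::fprintf(stderr, "[%s] %s\n", Name(level), buffer);
    return true;
  }

  // Last emitted message, kept so callers and tests can inspect it.
  const std::string& last_message() const { return last_message_; }

 private:
  static const char* Name(LogLevel level) {
    switch (level) {
      case LogLevel::kDebug: return "DEBUG";
      case LogLevel::kInfo: return "INFO";
      case LogLevel::kWarning: return "WARNING";
      case LogLevel::kError: return "ERROR";
    }
    return "UNKNOWN";
  }

  LogLevel threshold_;
  std::string last_message_;
};
```

With a threshold of `LogLevel::kWarning`, a call such as `log.Log(LogLevel::kInfo, "page %d", 1)` is dropped, while `log.Log(LogLevel::kError, "page %d failed", 2)` is written to stderr. A real design would also need to settle thread safety and output targets, which this sketch leaves open.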
4.0.0
See the release notes.
See also the discussion for issue #1423.
Open issues which should be fixed
https://github.com/zdenop/tessdata_downloader
Depending on available resources and opinions, these suggestions will either be added to the planning for the next or a future release or abandoned.
This will make the command slower, because each file must be opened and parsed. Add this as --list-langs-details, or as --list-lang-details for a single language file selected by its language code?
tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.
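A minimal sketch of the lookup proposed above, assuming that sub-language files use the usual `.traineddata` suffix and sit in the same directory as the parent file. The helper name `ResolveSublangPath` is hypothetical and is not part of Tesseract.

```cpp
#include <string>

// Hypothetical helper: build the path of a sub-language file so that it is
// looked up next to the parent traineddata file rather than in the tessdata
// root. Both '/' and '\\' are accepted as directory separators.
std::string ResolveSublangPath(const std::string& parent_traineddata,
                               const std::string& sublang) {
  const std::string::size_type slash = parent_traineddata.find_last_of("/\\");
  // Keep the trailing separator; use an empty prefix if there is no directory.
  const std::string dir = (slash == std::string::npos)
                              ? std::string()
                              : parent_traineddata.substr(0, slash + 1);
  return dir + sublang + ".traineddata";
}
```

For a parent at `tessdata/script/Latin.traineddata` and a sub-language `eng`, this yields `tessdata/script/eng.traineddata`. Whether to fall back to the tessdata directory when that file is missing is a separate design decision.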
In addition to the current proprietary format Tesseract could also support ZIP archives (see discussion).
A possible implementation using libarchive is available, but needs more testing.
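Supporting both formats side by side means a loader has to tell them apart. The sketch below is not the libarchive-based implementation mentioned above. It only shows one plausible check: a test for the standard ZIP signatures at the start of the file. The function name `LooksLikeZip` is hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Returns true if the buffer starts with a ZIP signature: "PK\x03\x04"
// (local file header) or "PK\x05\x06" (end-of-central-directory record,
// which is what an empty archive starts with).
bool LooksLikeZip(const unsigned char* data, std::size_t size) {
  if (size < 4) return false;
  static const unsigned char kLocalHeader[4] = {'P', 'K', 0x03, 0x04};
  static const unsigned char kEmptyArchive[4] = {'P', 'K', 0x05, 0x06};
  return std::memcmp(data, kLocalHeader, 4) == 0 ||
         std::memcmp(data, kEmptyArchive, 4) == 0;
}
```

A loader could call this on the first bytes of a file and fall back to the current format whenever it returns false.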
Tesseract 4.0 should be a full replacement for Tesseract 3.05 and have the same features when used with the old OCR engine (--oem 0). The following regressions still need verification (are they really regressions, or are they just missing features for LSTM):
These features still work with the old OCR engine (--oem 0), but are missing and desired for LSTM.
#### Black list / White list (see issue)
Here is a workaround. Fixed in 4.1.0.
Here we collect important issues and features for the release(s) following 4.0.0.
This does not include OpenCL or the old Tesseract engine.
Mostly solved, but could be improved.
It looks like it is not possible to run more than one training in the same process. The pull request describes a possible fix, but does not include a complete implementation (low priority).