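createWorker arguments changed
Setting a non-default language and OEM now happens in createWorker, e.g. createWorker("chi_sim", 1)
worker.initialize and worker.loadLanguage functions now do nothing and can be deleted from code
Loading and initializing the language and OEM now happens in createWorker; to re-initialize an existing worker, use worker.reinitialize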
In other words, code should be modified from this:
const worker = await Tesseract.createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const ret = await worker.recognize(file);
To this:
const worker = await Tesseract.createWorker("eng");
const ret = await worker.recognize(file);
Breaking Changes Impacting Fewer Users
Users who set corePath will need to update the contents of their corePath directory. corePath should point to a directory that contains all 4 of the files below from Tesseract.js-core v5:
tesseract-core.wasm.js
tesseract-core-simd.wasm.js
tesseract-core-lstm.wasm.js
tesseract-core-simd-lstm.wasm.js
worker.detect function disabled by default. To enable it, set legacyCore: true and legacyLang: true in createWorker options, e.g.:
Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});
Language data is now loaded from jsdelivr by default (rather than GitHub pages)
tesseract.dev.js and worker.dev.js removed
Tesseract.recognize and Tesseract.detect: use worker.recognize and worker.detect instead

Tesseract contains 2 recognition models, LSTM and Legacy. The vast majority of users only use the LSTM model (the default). However, the Legacy model takes up more space, and previous versions of Tesseract.js loaded all of the resources required for both models. This resulted in significant wasted network activity. For example, for Chinese (simplified), 73% of the size of the code and data was attributable to the (usually) unused Legacy model.
What justifies the breaking changes to createWorker/loadLanguage/initialize?
The primary reason is that these changes are necessary to facilitate the major improvement of v5: significantly reducing file sizes. How this reduction is achieved is described in the answer directly above. As Tesseract.js is a JavaScript library generally run in the browser, having reasonable file sizes is a high priority. This is especially true as use on mobile devices becomes more common. Making this improvement would have been impossible without combining createWorker/loadLanguage/initialize.
Previously, the user specified which recognition model (OEM) to use during initialize. As initialize was run after createWorker and loadLanguage (which load the code and language data required for each model), there was no way for these functions to load only the data required for the chosen model. By combining these functions, Tesseract.js knows which model is being used before it loads code or data, so it can load only the required resources.
In addition to this primary reason, combining these functions should simplify the process of creating a worker. The large number of functions required to create a new worker (4 in v3 and 3 in v4) was pushing some users towards using Tesseract.recognize instead (as this handles everything in a single function). Simplifying the process of creating a new worker will hopefully result in more users using workers, which is more efficient than Tesseract.recognize (which creates and destroys a worker every time it is used).
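The efficiency point above can be illustrated with a minimal sketch that creates one worker, reuses it for several images, and terminates it at the end. The helper name recognizeAll is hypothetical, and the tesseract parameter is assumed to be the imported Tesseract.js v5 module (it is passed in so the function can also be tried with a stub).

```javascript
// Hypothetical helper: reuse a single worker for many images instead of
// calling Tesseract.recognize, which creates and destroys a worker per call.
// `tesseract` is assumed to be the imported Tesseract.js v5 module.
async function recognizeAll(tesseract, files, lang = "eng") {
  const worker = await tesseract.createWorker(lang);
  try {
    const texts = [];
    for (const file of files) {
      // Jobs are run one after another on the same worker.
      const ret = await worker.recognize(file);
      texts.push(ret.data.text);
    }
    return texts;
  } finally {
    // Release the worker even if recognition throws.
    await worker.terminate();
  }
}
```

With the real library this would be called as recognizeAll(Tesseract, [fileA, fileB]); the worker startup cost is then paid once rather than once per image.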
Within createWorker, if you set oem to 0 (Tesseract Legacy) or 2 (Tesseract Legacy + LSTM), code and language data for both the Legacy and LSTM models will be loaded automatically. You can force both models to be loaded regardless of oem by setting legacyCore: true and legacyLang: true in the createWorker options. For example:
const worker = await Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});
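As a rough illustration of the loading rule described above, the sketch below decides whether Legacy resources are needed and which of the four core files listed earlier would match. Both function names are hypothetical and are not part of the Tesseract.js API. The mapping from file-name suffixes to builds ("-lstm" for LSTM-only, "-simd" for a SIMD build) is an assumption inferred from the file names, not the library's actual selection code.

```javascript
// Hypothetical helper (not part of Tesseract.js): mirrors the rule that
// OEM 0 or 2, or the explicit options, cause Legacy resources to be loaded.
function needsLegacy(oem = 1, options = {}) {
  // OEM 0 = Tesseract Legacy, OEM 2 = Tesseract Legacy + LSTM.
  const oemUsesLegacy = oem === 0 || oem === 2;
  return {
    legacyCore: oemUsesLegacy || options.legacyCore === true,
    legacyLang: oemUsesLegacy || options.legacyLang === true,
  };
}

// Hypothetical helper: picks one of the four core file names.
// Assumption: "-lstm" marks the LSTM-only build and "-simd" the SIMD build.
function pickCoreFile(oem, options, simdSupported) {
  const { legacyCore } = needsLegacy(oem, options);
  const simd = simdSupported ? "-simd" : "";
  const lstm = legacyCore ? "" : "-lstm";
  return `tesseract-core${simd}${lstm}.wasm.js`;
}
```

Under these assumptions, a default LSTM worker on a SIMD-capable browser would map to the smaller tesseract-core-simd-lstm.wasm.js, while setting legacyCore: true would map to tesseract-core-simd.wasm.js.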
If your application re-initializes existing workers with a different language or OEM, this is now achieved using worker.reinitialize (rather than worker.loadLanguage and worker.initialize). For example, the following snippet recognizes file using the LSTM model, and then switches to the Legacy model and re-runs recognition.
const worker = await Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});
const retLSTM = await worker.recognize(file);
await worker.reinitialize("eng", 0);
const retLegacy = await worker.recognize(file);
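The same pattern can be generalized into a small loop. The sketch below is a hypothetical helper (the name recognizeWithEachOem is not part of Tesseract.js) that runs recognition once per OEM on a single worker. It assumes the worker was created with legacyCore: true and legacyLang: true so that both models are available, and that worker.reinitialize returns a promise that should be awaited before recognizing.

```javascript
// Hypothetical helper: run recognition once per OEM on the same worker,
// switching models with worker.reinitialize between runs.
// Assumes the worker was created with legacyCore/legacyLang enabled.
async function recognizeWithEachOem(worker, file, lang, oems) {
  const results = {};
  for (const oem of oems) {
    // Wait for the model switch to finish before recognizing.
    await worker.reinitialize(lang, oem);
    const ret = await worker.recognize(file);
    results[oem] = ret.data.text;
  }
  return results;
}
```

For instance, recognizeWithEachOem(worker, file, "eng", [1, 0]) would return the recognized text from the LSTM model under key 1 and from the Legacy model under key 0.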
How does this release impact iOS compatibility?
iOS v17.0 and v17.1 include a bug that causes the Legacy + LSTM build of Tesseract.js to crash. Apple patched this issue in iOS v17.2. This bug does not impact the LSTM-only build, which became the default in Tesseract.js v5. Therefore, developers who want their application to be compatible with iOS v17.0 and v17.1 are advised to upgrade to Tesseract.js v5. Discussion regarding this issue is documented in #804.
Start by reviewing the examples directory; most uses of Tesseract.js have a corresponding example. If you are struggling to upgrade your project after reviewing both this issue and the examples, feel free to open a new GitHub issue.