Draft: OCR Caching Files & Command Batch Generations (!2) · Merge requests · project / entity_to_text

💬 Description

Add a caching layer that lets developers store OCR'ed files, so the same file is not sent to Tika more than once.

🔢 To Review Caching

Use the new local file storage service to cache OCR'ed files:

// Call this at least once anywhere in the code (e.g. module.install) to prepare the storage.
\Drupal::service('entity_to_text_tika.storage.local_file')->prepareStorage();

// Load the already OCR'ed file if possible to avoid unnecessary calls to Tika.
$body = \Drupal::service('entity_to_text_tika.storage.local_file')->load($file, 'eng+fra');

if (!$body) {
  // When the OCR'ed file is not available, run Tika over the file.
  $body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');
  // Save the OCR'ed file for the next run.
  \Drupal::service('entity_to_text_tika.storage.local_file')->save($file, $body, 'eng+fra');
}
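
The load, extract, and save steps above follow a cache-aside pattern. As a minimal sketch (the helper name `ocr_cached_text()` and its callable parameters are hypothetical and not part of the module), the pattern could be wrapped in a single function so callers do not repeat the `if (!$body)` check:

```php
<?php

/**
 * Hypothetical cache-aside helper, not part of the module.
 *
 * The three callables stand in for the storage load(), the extractor
 * fromFileToText() and the storage save() calls shown above, and are
 * assumed to accept the same arguments in the same order.
 */
function ocr_cached_text($file, string $langcode, callable $load, callable $extract, callable $save): string {
  // Try the cache first.
  $body = $load($file, $langcode);
  if ($body) {
    return (string) $body;
  }

  // Cache miss: run the extraction, then store the result for the next run.
  $body = (string) $extract($file, $langcode);
  $save($file, $body, $langcode);

  return $body;
}
```

In a Drupal context, the callables could plausibly be `[$storage, 'load']`, `[$extractor, 'fromFileToText']` and `[$storage, 'save']`, where `$storage` and `$extractor` are the two services used above. Like the original snippet, an empty cached body is treated as a miss.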

🔢 To Review Command

The module exposes a Drush command to generate OCR for all Drupal files.

This command is intended to be used sporadically, as it can be resource intensive. Its purpose is to generate OCR for all files that have not been OCR'ed yet. This may be useful after an initial install, after a new OCR language has been added, or right after a file migration.

# Warm up all files that do not already have an associated .ocr file.
drush e2t:t:w
# Warm up all files, even those that have already been processed.
drush e2t:t:w --force
# Warm up only the file with FID 2.
drush e2t:t:w --fid=2

Update the "Unreleased" section of the CHANGELOG.md with chan

GitHub PR: https://github.com/antistatique/drupal-entity-to-text/pull/12

Edited Apr 25, 2024 by Kevin Wenger
