
Draft: OCR Caching Files & Command Batch Generations

Kevin Wenger requested to merge 10x/ocr-caching-files into 1.0.x

💬 Description

Adds a caching layer that lets developers store OCR'ed files and reuse them, so the same file does not have to be sent to Tika again.

🔢 To Review Caching

  1. Use the new way to cache OCR files
// Call this at least once somewhere in the code (e.g. module.install) to prepare the storage.
\Drupal::service('entity_to_text_tika.storage.local_file')->prepareStorage();

// Load the already OCR'ed file if possible to avoid unnecessary calls to Tika.
$body = \Drupal::service('entity_to_text_tika.storage.local_file')->load($file, 'eng+fra');

if (!$body) {
  // No cached OCR result is available, so run Tika over the file.
  $body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');
  // Save the OCR'ed file for the next run.
  \Drupal::service('entity_to_text_tika.storage.local_file')->save($file, $body, 'eng+fra');
}

🔢 To Review Command

The module exposes a Drush command to generate OCR for all Drupal files.

This command is intended to be used sporadically, as it can be resource intensive. Its purpose is to generate OCR for all files that have not been OCR'ed yet. This may be useful after an initial install, after a new OCR language has been added, or right after a file migration.

# Warm up all files that do not already have an associated .ocr file.
drush e2t:t:w
# Warm up all files, even those that have already been processed.
drush e2t:t:w --force
# Warm up only the file with FID 2.
drush e2t:t:w --fid=2
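
To review the command against a handful of specific files rather than the whole site, the `--fid` form above can be wrapped in a small loop. This is only a sketch: the `warmup_fids` function and the `DRUSH` variable are hypothetical helpers for this example and are not part of the module. Only `e2t:t:w` and `--fid` come from this merge request.

```shell
#!/bin/sh
# Hypothetical helper: warm up the OCR cache for a list of file IDs.
# DRUSH is an assumption of this sketch; override it if drush is not
# on the PATH (for example DRUSH=vendor/bin/drush).
DRUSH="${DRUSH:-drush}"

warmup_fids() {
  failed=0
  for fid in "$@"; do
    # Run the warmup command for one file; keep going if it fails.
    if ! "$DRUSH" e2t:t:w --fid="$fid"; then
      echo "OCR warmup failed for FID $fid" >&2
      failed=1
    fi
  done
  # Non-zero exit status if at least one file could not be processed.
  return "$failed"
}
```

Calling `warmup_fids 2 5 9` would then run the command once per file ID and report any file that failed without stopping the remaining ones.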
  • Update the "Unreleased" section of the CHANGELOG.md with chan

Github PR: https://github.com/antistatique/drupal-entity-to-text/pull/12

Edited by Kevin Wenger
