    Entity to Text

    This suite is primarily a set of APIs and tools to improve the developer experience.

    This module provides a number of utility and helper APIs for developers to transform content into plain text.

    Use Entity to Text if

    • You need the plain-text content of nodes to index it in a search engine (Solr, Elasticsearch, ...).
    • You want the plain text of a node's paragraphs for SEO or JSON-LD.
    • You need to transform "Node entity" field(s) into plain-text content.
    • You need to transform "Paragraphs entity" field(s) into plain-text content.
    • You need to transform a "File entity" into plain-text content through Tika.

    Dependencies

    The main module requires the library ezyang/htmlpurifier.

    The submodule entity_to_text_tika requires the library vaites/php-apache-tika. The submodule entity_to_text_paragraphs requires the library drupal/paragraphs.

    Which version should I use?

    Drupal Core   Entity to Text
    -----------   --------------
    8.x           -
    9.x           1.0.x
    10.x          1.1.x
    11.x          1.1.x

    Getting Started

    We highly recommend installing the module with Composer.

    $ composer require drupal/entity_to_text

    Examples

    Node fields to text

    Usage

    /** @var string $field_body_content */
    $field_body_content = \Drupal::service('entity_to_text.extractor.node_to_text')->fromFieldtoText('body', $node);
    /** @var string $field_foo_content */
    $field_foo_content = \Drupal::service('entity_to_text.extractor.node_to_text')->fromFieldtoText('field_foo', $node);
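    When several fields feed a single search document, the extracted strings usually need to be merged. The following is a minimal sketch of one way to do that; combine_field_texts() is a hypothetical helper and not part of Entity to Text. It only assumes that each extracted value is a string, as the @var annotations above indicate.

    ```php
    <?php

    /**
     * Joins several extracted field texts into one string for indexing.
     *
     * Hypothetical helper, not provided by the module.
     *
     * @param string[] $texts
     *   Plain-text values, e.g. the results of fromFieldtoText().
     */
    function combine_field_texts(array $texts): string {
      $clean = [];
      foreach ($texts as $text) {
        $text = (string) $text;
        // Collapse runs of whitespace; keep the input if the regex fails.
        $normalized = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
        // Skip fields that turned out to be empty.
        if ($normalized !== '') {
          $clean[] = $normalized;
        }
      }
      // Separate fields with a blank line.
      return implode("\n\n", $clean);
    }
    ```

    It could then be called as combine_field_texts([$field_body_content, $field_foo_content]).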

    Paragraphs to text

    Prerequisite

    • The entity_to_text_paragraphs module is enabled.

    Usage

    /** @var array[] $bodies */
    $bodies = \Drupal::service('entity_to_text_paragraphs.extractor.paragraphs_to_text')->fromParagraphToText($node->field_paragraphs);

    File to text

    Prerequisite

    • Tika is reachable as a RESTful API via the Tika server.
    • The entity_to_text_tika module is enabled.
    • The connection is configured in settings.php:
    /**
     * Apache Tika connection.
     */
    $settings['entity_to_text_tika.connection']['host'] = 'tika';
    $settings['entity_to_text_tika.connection']['port'] = '9998';
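    If the Tika host differs between environments, the same settings could be read from environment variables. This is only a sketch of a settings.php fragment: the variable names TIKA_HOST and TIKA_PORT are assumptions, not something the module defines, and the fallbacks are the values shown above.

    ```php
    // Fragment for settings.php. TIKA_HOST and TIKA_PORT are hypothetical
    // environment variable names; getenv() returns FALSE when a variable is
    // unset, so the ?: operator falls back to the default.
    $settings['entity_to_text_tika.connection']['host'] = getenv('TIKA_HOST') ?: 'tika';
    $settings['entity_to_text_tika.connection']['port'] = getenv('TIKA_PORT') ?: '9998';
    ```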

    Usage

    /** @var \Drupal\file\Entity\File $file */
    $file = $file_item->entity;
    $body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');

    For advanced usage, avoid repeated calls to Tika by using a cached OCR file:

    // Call this at least once anywhere in the code (e.g. module.install) to prepare the storage.
    \Drupal::service('entity_to_text_tika.storage.local_file')->prepareStorage();

    // Load the already OCR'ed file if possible to avoid unnecessary calls to Tika.
    $body = \Drupal::service('entity_to_text_tika.storage.local_file')->load($file, 'eng+fra');

    if (!$body) {
      // The OCR'ed file is not available, so run Tika over the file.
      $body = \Drupal::service('entity_to_text_tika.extractor.file_to_text')->fromFileToText($file, 'eng+fra');
      // Save the OCR'ed file for the next run.
      \Drupal::service('entity_to_text_tika.storage.local_file')->save($file, $body, 'eng+fra');
    }
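    The load, extract, and save steps above can be wrapped in a small reusable function. The sketch below is one possible shape, not the module's own API: load_or_extract() is a hypothetical helper, and the three callables stand in for the storage and extractor calls from the example. Unlike the if (!$body) check above, it treats any non-empty string (including '0') as a cache hit.

    ```php
    <?php

    /**
     * Returns cached text when available, otherwise extracts and stores it.
     *
     * Hypothetical helper, not provided by the module.
     *
     * @param callable $load    Returns the cached text, or NULL/FALSE/'' on a miss.
     * @param callable $extract Returns the freshly extracted text.
     * @param callable $save    Receives the extracted text and stores it.
     */
    function load_or_extract(callable $load, callable $extract, callable $save): string {
      $body = $load();
      if (is_string($body) && $body !== '') {
        // Cache hit: no extraction needed.
        return $body;
      }
      // Cache miss: extract once and store the result for the next run.
      $body = (string) $extract();
      $save($body);
      return $body;
    }

    // Possible wiring inside Drupal, using the services shown above:
    //
    // $storage = \Drupal::service('entity_to_text_tika.storage.local_file');
    // $extractor = \Drupal::service('entity_to_text_tika.extractor.file_to_text');
    // $body = load_or_extract(
    //   fn () => $storage->load($file, 'eng+fra'),
    //   fn () => $extractor->fromFileToText($file, 'eng+fra'),
    //   fn (string $text) => $storage->save($file, $text, 'eng+fra')
    // );
    ```

    Passing callables keeps the helper independent of the Drupal container, so it can be exercised with in-memory stand-ins.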

    Generate OCR via CLI

    The module provides a Drush command that generates OCR (Optical Character Recognition) output for all files within Drupal. Use it sparingly, as it can be resource-intensive.

    Its main purpose is to generate OCR for files that have not been processed yet. It works with the cached OCR files described in the advanced usage above, and it is especially useful after a fresh installation, after adding a new OCR language, or during file migrations.

    # Warm up all files that do not already have an associated .ocr file.
    drush e2t:t:w
    # Warm up all files, even those that have already been processed.
    drush e2t:t:w --force
    # Warm up the file with FID 2.
    drush e2t:t:w --fid=2

    Supporting organizations

    This project is sponsored by Antistatique, a Swiss web agency. Visit us at www.antistatique.net or contact us.

    Credits

    Entity to Text is currently maintained by Kevin Wenger. Thank you to all our wonderful contributors too.