Hello,

I have justed tested the module with the version 1.0-beta10 and tika 1.19.

I works fine for file fields directly on the index entity type and on index on media entity.

But my use case is that I want to index attachment with a following structure:

node > block content > media > file

Each ">" is an entity reference field.

The problem is that in search_api_attachments/src/Plugin/search_api/processor/FilesExtractor.php

  protected function getFileFieldsAndFileEntityItems() {
    $file_elements = [];

    // Retrieve file fields of indexed bundles.
    foreach ($this->getIndex()->getDatasources() as $datasource) {
      if ($datasource->getPluginId() == 'entity:file') {
        $file_elements[static::SAA_FILE_ENTITY] = $this->t('File entity');
      }
      foreach ($datasource->getPropertyDefinitions() as $property) {
        if ($property instanceof FieldDefinitionInterface) {
          if ($property->getType() == 'file') {
            $file_elements[$property->getName()] = $property->getLabel();
          }
          if ($property->getType() == "entity_reference") {
            if ($property instanceof FieldConfig) {
              $deps = $property->getDependencies();
              if (in_array('media.type.file', $deps['config'])) {
                $file_elements[$property->getName()] = $property->getLabel();
              }
            }
          }
        }
      }
    }
    return $file_elements;
  }

There is no possibility to explore deeply nested entity reference and there is a hardcoded dependency on the "media.type.file" media type.

Should a recursive scan of the entity reference fields can be done or should the architecture of the module should change completely?

Thanks for any help.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Grimreaper created an issue. See original summary.

izus’s picture

Hi,

about the media :
yes, we don't support yet all the file fields in media but only the default one (the core one)
in FilesExtractor addFieldValues() we limit that to have a first working media integration ready to work

 // Supporting only the default media file field for now.
                    if ($media->bundle() == 'file') {

it definitely needs more test and love, maybe in a separate issue. i didn't had a lot of feedback about this feature. so thanks for that :)

about the entity reference : yes, i think we can try that for now : in the indexed entity, the list of fields that can be indexed should be: the file fields + the file fields of the references entities (by the way this will make the media stuff more generic and not limited the the default file field)

Entity reference always remines me that a infinite loop can occur, idk if core is protected against that :
node 1 > block content 1 > media 1 > file 1 AND node 1
(idk why a media would reference a node but this seems possible in admin/structure/media/manage/file/fields/add-field), and in that case we'll have an infinite loop. i suggest we don't search in file entity reference fields at all in a first time. that means that at the first time we find a file entity we don't go deeper in the search of other nested referenced files. hoping this case wont be needed !

Thanks

sam.spinoy@gmail.com’s picture

Any update on this? I had to revert to an older version because this patch has the functionality described in this ticket.

apaderno’s picture

Version: 8.x-1.0-beta10 » 8.x-1.x-dev
Grimreaper’s picture

Hello,

To fix this issue and all related issues limiting the usage of this module, I think a deep rework on the module will be necessary.

Instead of a Search API plugin that adds info on the fly, I think to add a new base field on File entity directly.

This base field will be a computed base field, and the extraction will happens when computing this base field value.

So, it will no matter anymore how (the depth of recursion) you access your file field, directly on content or using media, or etc.

This will allow to expose extracted value on APIs, JSON:API for example.

This new architecture will allow to have features like, "I want to know from which file field the extracted content is from", etc.

This is also almost what does the "file_extracted_text" field formatter.

About existing features on the Search API plugin:
- excluded_extensions
- number_indexed
- number_first_bytes
- max_filesize
- excluded_private
- excluded_mimes

It will still be possible to handle it, but no more in addFieldValues(). But like HTML filter, where you config on which field you want the processor to be enabled.

About the view filter :
- still possible to limit the query, either by checking if using the name of this base field or directly this base field.

"file_extracted_text" field formatter:
- excluded_extensions: still possible
- max_filesize: still possible
- excluded_private: still possible

I agree that it will be almost a complete rewrite of the module. Maybe a 8.x-2.x branch or 2.0.x now that semantic versionning is here.

Opinions on that?

izus’s picture

excellent suggestion !
and yes, when we request the comuted field value, w'll call the current "extract or get from cache" method
go go go ! let it be :)

ps: will open the new suggested branch as soon as a working patch landed

Grimreaper’s picture

Assigned: Unassigned » Grimreaper

Thanks for the feedback :)

I will start today, and I think finish this weekend.

Grimreaper’s picture

And here is the patch. Maybe a WIP patch because it does not rewrite the whole module to use this.

I hesitated between "string_long" and "map" for the base field type.

I think it would had been possible to use an "->addConstraint" on the base field directly if we had the following settings ad global settings:
- excluded_extensions
- number_indexed
- number_first_bytes
- max_filesize
- excluded_private
- excluded_mimes

But as it will be per Search:API index and in the field formatter, impossible to use that.

I think a new service should be created ExtractionManager, this service will handle extraction and cache, it should not be the Search API plugin that should do that. So that would avoid a lot of code duplication and alleviate the Search API plugin.

During the making of this patch, I found a bug #3126679: Impossible to configure search api attachments and the more I look at the code, the more I want to rewrite the module.

I don't know if it will be possible to work on a dedicated branch, not a standard one like "8.x-2.x", a "refactoring" branch (before merge on 2.0.x) to avoid to have a lot of patches to do. Or if you prefer that I make a fork on Github or Gitlab or etc. And ping in an issue when it will be ready and ok for me?

I think the queue stuff to delay extraction should be optional, so a settings should be added to opt-in or not.

izus’s picture

hi,
thanks for this first spot

i didn't take time to review it yet but the most important for now is to set a workflow, so here are the bases for me :

1) i created the 8.x-2.x branch https://git.drupalcode.org/project/search_api_attachments/-/branches and pushed the code into it
(if renaming is necessary it can be done at any time ;))
2) yes for a service to extract
3) ok for an optional queue, but i'd prefer to have it enabled by default and let users disable it if they want to (actually it's vital for site that relay on remote servers for extraction (solr / tika) as the remote service can be down sometimes ;))
4) you can still make patches or work in a public repo and ping here for a merge in the new branch. if the repo is public i can set it as an upstream and rebase from it when a milestone is ready ;)

Thanks a lot :)

Grimreaper’s picture

Hello,

Thanks a lot for the quick replies.

Ok, I will:

  1. make a fork on Github
  2. create a d.o issue to have an issue ID to commit
  3. make all the rework I want
  4. ping back in the d.o issue when ok for me to discuss with you

:)

Grimreaper’s picture

Hello,

And here is my work: https://www.drupal.org/project/file_extractor

Explanations are here #3126845: Version 2.0.0

As said in the above issue, during the refactoring I found #2844979: Index nested attachments (eg. entity reference, paragraphs) - arbitrary level, so this issue was in fact a duplicate.

If you want to discuss more on the fork, it can be done on #3126845: Version 2.0.0.

@izus, if you want to become co-maintainer on https://www.drupal.org/project/file_extractor, I will be happy with that. I don't know what you want to do with Search API Attachments, if you want to still maintain it or recommend File Extractor and put Search API Attachments in "maintenance fix only"/"deprecated".

herved’s picture

Thank you for this patch, it works great overall.
I found a minor issue when the extraction_method is not set, it should exit otherwise an error is thrown.

This seems to be handled in the latest version of the file_extractor module (which also contains many more great additions) so ATM I'm thinking to switch from search_api_attachments to file_extractor.
But before I pull the plug, I was wondering if there was a consensus or not regarding search_api_attachments 2.x vs file_extractor 2.x ?
Is search_api_attachments going to be deprecated in favor of file_extractor so that we all focus on a single code base?

Thank you,
Hervé

Grimreaper’s picture

Hello,

Thanks @herved for your feedbacks.

I receive no reply since my last comment. Sorry about that.

Regards,

izus’s picture

Hi all,

sorry for the delay
I don't plan to deprecate search_api_attachments

Thank you

liquidcms’s picture

just tested patch with a file field in an attached paragraph field. Took a while to find the field as it doesnt match with the README:

"8) Go to admin/config/search/search-api/index/my_index/fields/add/nojs and:
- in the General section, add the "Search api attachments: My pdfs" field.
- in the Content section, add the "Title".
- in the Content section, add the "Body"."

but i within standard field additions visible when expanding the Paragraph field (i think as mentioned above)..

Works great. Thank you for awesome patch. :)

ianwensink’s picture

This patch works great for me too. I think this is the perfect way to have the computed field, instead of the General field. Thanks for the work!

ianwensink’s picture

I'm running into a bug where extractOrGetFromCache is called when computed the value. I think this call should be wrapped inside a isFileIndexable call, just like all of the other occurrences.

Specifically when trying to upload a file as a logo inside my theme, it will error on the $file->id() not being present. That's okay, since I don't need to extract text from it. Calling isFileIndexable() beforehand will prevent Drupal from crashing I think.

Another, less fancy, route is to wrap extractOrGetFromCache inside a try/catch. The try/catch is what I'm doing for now to work around this issue.

Asterovim’s picture

Thanks you. I can confirm the patch search_api_attachments-index_nested-3008580-17.patch work's pefectly.
Drupal : 9.3.6 and Search api Attachments 9.0.0

valgibson’s picture

Can confirm that the patch in #17 also applies perfectly to Drupal 9.5.3 (saa 9.0.0). Hope this will be commited in the next stable release.

Asterovim’s picture

Hello.

It's work's with the patch into Drupal Core 10.0.9 and Search API attachments 9.0.1.

Thanks you.