Skip to content

Indexation in scheduled mode process same products more than once #37916

@gimanik

Description

@gimanik

Preconditions and environment

  • 2.4.5-p3 (but it's older problem and still exists on development branch)
  • Indexers in scheduled mode
  • Default batch size: 1000
  • In indexer change-log more than 1000 unique entity_id's to be indexed and lot of duplicated records (preferable 10 000+)

Steps to reproduce

Run indexer in mview mode, for example by running cron job indexer_update_all_views.
Example based on catalog_product_price index:
Current version_id = 40000
Last version_id = 66630
Index schedule_status (unique product_ids): idle (1093 in backlog)

Expected result

Only a total of 1093 products should be indexed, each product once.

Actual result

Some products are indexed more than once.
Magento\Framework\Mview\View uses Magento\Framework\Mview\View\ChangeLogBatchWalker to iterate through collection
However first batch is taken by select:
SELECT DISTINCT `catalog_product_price_cl`.`entity_id` FROM `catalog_product_price_cl` WHERE (version_id > 40000) AND (version_id <= 66630) GROUP BY `entity_id` LIMIT 1000
which returns 1000 IDs which are passed to indexer and indexed.
Second batch:
SELECT DISTINCT `catalog_product_price_cl`.`entity_id` FROM `catalog_product_price_cl` WHERE (version_id > 41000) AND (version_id <= 66630) GROUP BY `entity_id` LIMIT 1000
which also returns 1000 IDs which are passed to indexer and indexed.
This is because we take unique IDs from whole range, and we increase only vsFrom by batchSize (@see Magento\Framework\Mview\View)
In my case to index those unique 1093 products it needed to run 27 batches, each batch processed this amount of products:
1000
1000
1000
968
927
885
843
802
760
718
677
635
593
552
510
468
427
385
343
302
260
218
177
135
93
52
12

Which adds up to total of 14742 indexer executions, so the last 12 products were reindexed 27 times in each batch.

Additional information

Magento\Framework\Mview\View
Each batch query:
Current: $vsFrom to $currentVersionId
Should be: $vsFrom to ($vsFrom + $batchSize).
So instead of batch queries:

WHERE (version_id > 40000) AND (version_id <= 66630) WHERE (version_id > 41000) AND (version_id <= 66630) WHERE (version_id > 42000) AND (version_id <= 66630) WHERE (version_id > 43000) AND (version_id <= 66630) WHERE (version_id > 44000) AND (version_id <= 66630) WHERE (version_id > 45000) AND (version_id <= 66630) WHERE (version_id > 46000) AND (version_id <= 66630) WHERE (version_id > 47000) AND (version_id <= 66630) WHERE (version_id > 48000) AND (version_id <= 66630) WHERE (version_id > 49000) AND (version_id <= 66630) WHERE (version_id > 50000) AND (version_id <= 66630) WHERE (version_id > 51000) AND (version_id <= 66630) WHERE (version_id > 52000) AND (version_id <= 66630) WHERE (version_id > 53000) AND (version_id <= 66630) WHERE (version_id > 54000) AND (version_id <= 66630) WHERE (version_id > 55000) AND (version_id <= 66630) WHERE (version_id > 56000) AND (version_id <= 66630) WHERE (version_id > 57000) AND (version_id <= 66630) WHERE (version_id > 58000) AND (version_id <= 66630) WHERE (version_id > 59000) AND (version_id <= 66630) WHERE (version_id > 60000) AND (version_id <= 66630) WHERE (version_id > 61000) AND (version_id <= 66630) WHERE (version_id > 62000) AND (version_id <= 66630) WHERE (version_id > 63000) AND (version_id <= 66630) WHERE (version_id > 64000) AND (version_id <= 66630) WHERE (version_id > 65000) AND (version_id <= 66630) WHERE (version_id > 66000) AND (version_id <= 66630) 

Each batch query should look like:

WHERE (version_id > 40000) AND (version_id <= 41000) WHERE (version_id > 41000) AND (version_id <= 42000) WHERE (version_id > 42000) AND (version_id <= 43000) WHERE (version_id > 43000) AND (version_id <= 44000) WHERE (version_id > 44000) AND (version_id <= 45000) WHERE (version_id > 45000) AND (version_id <= 46000) WHERE (version_id > 46000) AND (version_id <= 47000) WHERE (version_id > 47000) AND (version_id <= 48000) WHERE (version_id > 48000) AND (version_id <= 49000) WHERE (version_id > 49000) AND (version_id <= 50000) WHERE (version_id > 50000) AND (version_id <= 51000) WHERE (version_id > 51000) AND (version_id <= 52000) WHERE (version_id > 52000) AND (version_id <= 53000) WHERE (version_id > 53000) AND (version_id <= 54000) WHERE (version_id > 54000) AND (version_id <= 55000) WHERE (version_id > 55000) AND (version_id <= 56000) WHERE (version_id > 56000) AND (version_id <= 57000) WHERE (version_id > 57000) AND (version_id <= 58000) WHERE (version_id > 58000) AND (version_id <= 59000) WHERE (version_id > 59000) AND (version_id <= 60000) WHERE (version_id > 60000) AND (version_id <= 61000) WHERE (version_id > 61000) AND (version_id <= 62000) WHERE (version_id > 62000) AND (version_id <= 63000) WHERE (version_id > 63000) AND (version_id <= 64000) WHERE (version_id > 64000) AND (version_id <= 65000) WHERE (version_id > 65000) AND (version_id <= 66000) WHERE (version_id > 66000) AND (version_id <= 66630) 

Release note

No response

Triage and priority

  • Severity: S0 - Affects critical data or functionality and leaves users without workaround.
  • Severity: S1 - Affects critical data or functionality and forces users to employ a workaround.
  • Severity: S2 - Affects non-critical data or functionality and forces users to employ a workaround.
  • Severity: S3 - Affects non-critical data or functionality and does not force users to employ a workaround.
  • Severity: S4 - Affects aesthetics, professional look and feel, “quality” or “usability”.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions