TimeWindowPartitionMapping(start_offset=0,end_offset=1) does not respect allow_nonexistent_upstream_partitions #26935

Open
mmutso-boku opened this issue Jan 8, 2025 · 0 comments
Labels: area: partitions (Related to Partitions), type: bug (Something isn't working)


mmutso-boku commented Jan 8, 2025

What's the issue?

I have a case where one partition of a downstream asset depends on two partitions of the upstream asset: the same partition and the next one. E.g., downstream partition 2025-01-01-05:00 needs 2025-01-01-05:00 and 2025-01-01-06:00 from the upstream.

For this, I use

partition_mapping=TimeWindowPartitionMapping(start_offset=0,
                                                 end_offset=1)

as the API docs quite clearly state that this is how it works:

end_offset (int): If not 0, then the ends of the upstream windows are shifted by this
offset relative to the ends of the downstream windows. For example, if start_offset=0
and end_offset=1, then the downstream partition "2022-07-04" would map to the upstream
partitions "2022-07-04" and "2022-07-05". If the upstream and downstream
PartitionsDefinitions are different, then the offset is in the units of the downstream.
Defaults to 0.
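
To make the documented mapping concrete, here is a minimal plain-Python sketch of the window arithmetic the docstring describes. It is not Dagster's implementation: the helper name mapped_upstream_keys, the hourly step, and the key format are assumptions chosen to match the example partitions in this issue.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d-%H:%M"  # assumed partition key format, as in this issue


def mapped_upstream_keys(downstream_key, start_offset=0, end_offset=0,
                         step=timedelta(hours=1), fmt=FMT):
    """Hypothetical helper: list the upstream keys a downstream key maps to."""
    window_start = datetime.strptime(downstream_key, fmt)
    window_end = window_start + step
    # Shift the start and end of the window by the offsets, in partition units.
    upstream_start = window_start + start_offset * step
    upstream_end = window_end + end_offset * step
    keys = []
    current = upstream_start
    while current < upstream_end:
        keys.append(current.strftime(fmt))
        current += step
    return keys


print(mapped_upstream_keys("2025-01-01-05:00", start_offset=0, end_offset=1))
# → ['2025-01-01-05:00', '2025-01-01-06:00']
```

With start_offset=0 and end_offset=1, the downstream partition maps to itself plus the next upstream partition, which is the dependency needed here.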

The issue is that the default (or explicit) allow_nonexistent_upstream_partitions=False appears to be ignored. This is what happens with the two partitions from the example above:

  1. 2025-01-01-05:00 in the upstream appears and gets materialized
  2. 2025-01-01-05:00 in the downstream gets materialized <-- This shouldn't happen
  3. 2025-01-01-06:00 in the upstream appears and gets materialized
  4. 2025-01-01-05:00 in the downstream gets materialized again

What did you expect to happen?

What is actually wanted and expected (continuation from above):

  1. 2025-01-01-05:00 in the upstream appears and gets materialized
  2. 2025-01-01-05:00 in the downstream does not get materialized, because the 06:00 partition in the upstream does not exist yet or is not materialized
  3. 2025-01-01-06:00 in the upstream appears and gets materialized
  4. 2025-01-01-05:00 in the downstream gets materialized
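
The expected gating can be written as a small predicate: the downstream partition should run only once every mapped upstream partition has been materialized. The sketch below is plain Python, not Dagster code. The names required_upstream_keys and should_materialize are hypothetical, and the hourly step is assumed from the example keys.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d-%H:%M"      # assumed partition key format
STEP = timedelta(hours=1)   # assumed partition size for the example keys


def required_upstream_keys(downstream_key):
    """Hypothetical helper for start_offset=0, end_offset=1: same and next key."""
    start = datetime.strptime(downstream_key, FMT)
    return [(start + i * STEP).strftime(FMT) for i in range(2)]


def should_materialize(downstream_key, materialized_upstream):
    """True only if every required upstream partition is already materialized."""
    return all(key in materialized_upstream
               for key in required_upstream_keys(downstream_key))


materialized = set()

materialized.add("2025-01-01-05:00")  # step 1: upstream 05:00 materializes
print(should_materialize("2025-01-01-05:00", materialized))  # → False (step 2)

materialized.add("2025-01-01-06:00")  # step 3: upstream 06:00 materializes
print(should_materialize("2025-01-01-05:00", materialized))  # → True (step 4)
```

The observed behaviour corresponds to the predicate returning True at step 2, when only the 05:00 upstream partition exists.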

How to reproduce?

from datetime import datetime

from dagster import (AssetDep, AutoMaterializePolicy, TimeWindowPartitionMapping,
                     TimeWindowPartitionsDefinition, asset)


# Upstream asset: 5-minute partitions, eagerly auto-materialized.
@asset(partitions_def=TimeWindowPartitionsDefinition(start=datetime(2025, 1, 8, 9),
                                                     end=datetime(2025, 1, 8, 11),
                                                     fmt='%Y-%m-%d-%H:%M',
                                                     cron_schedule='*/5 * * * *'),
       auto_materialize_policy=AutoMaterializePolicy.eager(max_materializations_per_minute=1))
def testing_asset1() -> None:
    return


# Downstream asset: each partition depends on the same and the next upstream partition.
@asset(deps=[AssetDep(
    asset=testing_asset1,
    partition_mapping=TimeWindowPartitionMapping(start_offset=0,
                                                 end_offset=1))],
       auto_materialize_policy=AutoMaterializePolicy.eager(max_materializations_per_minute=5),
       partitions_def=TimeWindowPartitionsDefinition(start=datetime(2025, 1, 8, 9),
                                                     end=datetime(2025, 1, 8, 11),
                                                     fmt='%Y-%m-%d-%H:%M',
                                                     cron_schedule='*/5 * * * *'))
def testing_asset2() -> None:
    return

The datetimes need adjusting to the current time. This is a simple upstream-downstream chain, partitioned in 5-minute chunks for faster results.
Each downstream partition depends on two upstream partitions: the same partition and the next one (start_offset=0, end_offset=1).

Turn on the automation sensor and observe that each downstream partition is always materialized twice. The first materialization is premature: it happens when the "same" upstream partition is materialized but the "next" upstream partition does not exist yet. The second happens when the "next" upstream partition appears and materializes, which matches what the partition_mapping defines and is expected. The first materialization is not expected and should not happen.

Dagster version

1.9.6

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@mmutso-boku mmutso-boku added the type: bug Something isn't working label Jan 8, 2025
@garethbrickman garethbrickman added the area: partitions Related to Partitions label Jan 8, 2025