Speed up on, saving to cloud #214

Merged
merged 30 commits into main from concurrent-jobs on Jan 7, 2025
Conversation

@peterdudfield (Contributor) commented on Jan 2, 2025

Pull Request

Description

  • Only use processes, not threads, in the cloud
  • Reduce the number of chunks in lat, lon and variables from 8 to 2, which reduces the number of files by a factor of 64, from ~50,000 down to ~700. This makes writing to S3 much quicker and cuts the run time from about 1 hour to about 10 minutes (see the sketch below).
  • For the archive, all the options default to what they were before
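As a rough, purely illustrative sketch of the file-count arithmetic (the dimension names are just for the example, not the consumer's actual chunking config):

# Illustrative only: the number of Zarr chunk files scales with the product
# of the number of chunks per dimension, so going from 8 chunks to 2 in each
# of three dimensions cuts the file count by (8 / 2) ** 3 == 64.
chunks_before = {"latitude": 8, "longitude": 8, "variable": 8}
chunks_after = {"latitude": 2, "longitude": 2, "variable": 2}

def n_chunk_files(chunks_per_dim: dict[str, int]) -> int:
    """Chunk files written per slice of the remaining (unchanged) dimensions."""
    n = 1
    for count in chunks_per_dim.values():
        n *= count
    return n

print(n_chunk_files(chunks_before) / n_chunk_files(chunks_after))  # 64.0, i.e. ~50,000 files -> ~700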

openclimatefix/ocf-infrastructure#697

How Has This Been Tested?

  • CI tests
  • Ran locally
  • Tested on Development

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

@peterdudfield requested a review from devsjc on January 2, 2025 13:06
@peterdudfield marked this pull request as draft on January 2, 2025 13:11
@peterdudfield changed the title from "add NUMBER_CONCURRENT_JOBS" to "Speed up on, saving to cloud" on Jan 3, 2025
@peterdudfield marked this pull request as ready for review on January 3, 2025 10:07
@devsjc (Collaborator) left a comment

I can see the utility in this, but I'd personally change the implementation a little. Additionally, there are some consequences to this change that you might not be aware of; I'll try to explain here:

A bit of context first: the previous consumer iteration had just one chunk for the geographical coordinates. This was fine because, at any given point when making the dataset, all of that data was available to be written. The old consumer would pull all the raw data, merge it all in memory, and then write it to the store. The pre-merge step was memory-intensive and slow, but it guaranteed that the whole chunk of geographic data was always written at once.

The new consumer doesn't do this - it pulls the raw files and writes them on the fly to specified regions of the store, which is what enables parallel processing. However, for this to be safe, each individual write has to cover whole chunks, i.e. no chunk can be bigger than the data expected to arrive in a single file. See

Concurrent writes with region are safe as long as they modify distinct chunks in the underlying Zarr arrays (or use an appropriate lock).

from https://docs.xarray.dev/en/stable/user-guide/io.html?appending-to-existing-zarr-stores=#distributed-writes

Most sources of NWP data provide files per parameter, so as long as there is one parameter per chunk in the store, they can be written in parallel. But some (CEDA, for one) split their files by area, so there are four files each covering a different geographical region. As such, there have to be at least four chunks per geographical coordinate in the store, whether it's being saved to S3 or otherwise, or else the regional writing will break.
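To make that concrete, here is a minimal sketch of the pattern (illustrative only - the store path, variable names and array sizes are made up, not the consumer's actual code): initialise the store metadata first, then have each worker write its own region, with the chunking chosen so that every incoming file covers whole chunks.

# Minimal sketch of safe parallel region writes; requires xarray, dask and zarr.
import numpy as np
import xarray as xr

# A store whose latitude axis arrives as four separate area files,
# so it needs at least four chunks along latitude.
ds = xr.Dataset(
    {"temperature": (("latitude", "longitude"), np.zeros((64, 64)))},
    coords={"latitude": np.arange(64), "longitude": np.arange(64)},
).chunk({"latitude": 16, "longitude": 64})

# Write the metadata only, laying out the chunk grid without any data.
ds.to_zarr("store.zarr", mode="w", compute=False)

# Each worker then writes one region. The region lines up exactly with whole
# chunks, so concurrent writes touch distinct chunks and no lock is needed
# (per the xarray distributed-writes docs quoted above).
part = (
    ds.isel(latitude=slice(0, 16))
    .drop_vars("longitude")  # vars lacking the region's dims must be dropped
    .load()
)
part.to_zarr("store.zarr", region={"latitude": slice(0, 16)})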

The long and short of this is that the default chunking modification isn't a coordinate issue but a RawRepository one - it depends on how each repository provides its data. By using 8 by default I had a catch-all for everything covered so far, but I appreciate that it is only necessary for the sources that provide multiple area files. As such, I think this PR should be slightly reworked to move this logic out of the NWPDimensionCoordinateMap class and into the RepositoryMetadata. Alternatively, we could drop the chunk size down to equal the length of each coordinate and use an appropriate lock, as instructed by the docs - but that might lose us some of the concurrency performance gains.

I'm happy to make the changes - hopefully my reasoning above makes sense?

@@ -194,6 +196,7 @@ def _download(self, url: str) -> ResultE[pathlib.Path]:
).with_suffix(".grib").expanduser()

# Only download the file if not already present
log.info("Checking for local file: '%s'", local_path)
@devsjc (Collaborator) commented:

I think this and the below should be debug logs

@devsjc (Collaborator) commented:

I had it as threads because it's IO that's intensive in each iteration, as opposed to compute. How come you changed it to processes? Also, if concurrency is set to True, why would n_jobs then want to be set to 1? Would that not make it not concurrent again?

if os.getenv("CONCURRENCY", "True").capitalize() == "False":
    prefer = "threads"

concurrency = os.getenv("CONCURRENCY", "True").capitalize() == "False"
@peterdudfield (Contributor, author) commented:

Sorry, there is some funny logic here that is not clear - I will tidy it up.
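For reference, a quick illustrative trace (not from the PR itself) of what that snippet evaluates to, which shows why the naming reads inverted:

# The flag named `concurrency` ends up True exactly when the CONCURRENCY
# env var is set to a "false"-looking string.
import os

for value in ("True", "False", "false", None):
    if value is None:
        os.environ.pop("CONCURRENCY", None)  # unset, so the default "True" applies
    else:
        os.environ["CONCURRENCY"] = value
    concurrency = os.getenv("CONCURRENCY", "True").capitalize() == "False"
    print(f"CONCURRENCY={value!r} -> concurrency={concurrency}")

# CONCURRENCY='True'  -> concurrency=False
# CONCURRENCY='False' -> concurrency=True
# CONCURRENCY='false' -> concurrency=True
# CONCURRENCY=None    -> concurrency=False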

@peterdudfield (Contributor, author) replied:

> I can see the utility in this, but I'd personally change the implementation a little. [...]

Yeah, it seems a long way round to do it, but I couldn't quite work out how to change it otherwise. Perhaps there's a way you can modify it.

@devsjc merged commit 0f51e30 into main on Jan 7, 2025
4 checks passed
@devsjc deleted the concurrent-jobs branch on January 7, 2025 09:10