Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable GPU execution of atm_recover_large_step_variables via OpenACC #1220

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

gdicker1
Copy link
Collaborator

This PR enables GPU calculation of post-acoustic step variable reconstitution by adding OpenACC directives to the atm_recover_large_step_variables_work routine.

Timing information for the OpenACC data transfers in this routine is captured in the log file by a new timer: atm_recover_large_step_variables [ACC_data_xfer]

Invariant fields used in the work routine are copied in during mpas_atm_dynamics_init and deleted in mpas_atm_dynamics_finalize by building off the fields already handled in previous PRs.

@gdicker1
Copy link
Collaborator Author

NOTE: the changes in this PR slightly change answers. The changes seem to come from a couple of loops. In my regional test case, if the calculations of rho_p, rho_zz, rtheta_p, exner, and pressure_p are done on the CPU instead, no differences occur.

First this loop

         do k = 1, nVertLevels
            rho_p(k,iCell) = rho_p_save(k,iCell) + rho_pp(k,iCell)
            rho_zz(k,iCell) = rho_p(k,iCell) + rho_base(k,iCell)
         end do

And in this loop

            do k = 1, nVertLevels
               rtheta_p(k,iCell) = rtheta_p_save(k,iCell) + rtheta_pp(k,iCell) &
                                 - dt * rho_zz(k,iCell) * rt_diabatic_tend(k,iCell)
               theta_m(k,iCell) = (rtheta_p(k,iCell) + rtheta_base(k,iCell))/rho_zz(k,iCell)
               exner(k,iCell) = (zz(k,iCell)*(rgas/p0)*(rtheta_p(k,iCell)+rtheta_base(k,iCell)))**rcv
               ! pressure_p is perturbation pressure
               pressure_p(k,iCell) = zz(k,iCell) * rgas * (exner(k,iCell)*rtheta_p(k,iCell)+rtheta_base(k,iCell)  &
                                                          * (exner(k,iCell)-exner_base(k,iCell)))
            end do

@mgduda mgduda self-requested a review July 31, 2024 21:29
@mgduda mgduda added Atmosphere OpenACC Work related to OpenACC acceleration of code labels Jul 31, 2024
@mgduda
Copy link
Contributor

mgduda commented Aug 2, 2024

Although there isn't anything in particular that looks suspect to me, it does seem like the differences in results after the first timestep of a 12-km regional simulation comparing with the current develop branch are a bit larger than I would have expected (based on the differences we see before and after the merge of #1216).

I also get a core dump at the end of the simulation on Derecho when using the following modules:

module --force purge
ml ncarenv/23.09    
ml craype/2.7.23    
ml nvhpc/24.3
ml ncarcompilers/1.0.0
ml cray-mpich/8.1.27
ml parallel-netcdf/1.12.3
ml cuda/12.2.1

It may be worth taking a second look at the changes in this PR to see if we can find anything that's subtly off.

@gdicker1 gdicker1 force-pushed the atmosphere/acc_recover_large_step branch from 168d6e3 to 4a923ad Compare August 29, 2024 21:42
@gdicker1
Copy link
Collaborator Author

@mgduda, this is ready for another round of review!

There are slight answer differences, but isolated just to exner and pressure_p from what I can tell. No answer differences occur if the code below is added after the loop to calculate rtheta_p, theta_m, exner, and pressure_p (near line 2892 in this PR).

         ! Calculate exner and pressure_p on CPU
         !$acc update host(rtheta_p)
         do iCell=cellStart,cellEnd
            do k = 1, nVertLevels
               exner(k,iCell) = (zz(k,iCell)*(rgas/p0)*(rtheta_p(k,iCell)+rtheta_base(k,iCell)))**rcv
               ! pressure_p is perturbation pressure
               pressure_p(k,iCell) = zz(k,iCell) * rgas * (exner(k,iCell)*rtheta_p(k,iCell)+rtheta_base(k,iCell)  &
                                                          * (exner(k,iCell)-exner_base(k,iCell)))
            end do
         end do
         !$acc update device(exner,pressure_p)

@mgduda
Copy link
Contributor

mgduda commented Dec 11, 2024

@gdicker1 If I use the current HEAD of the develop branch (ce4d670) as a baseline, I find that in addition to the code you mention above to compute exner and pressure_p on the host, I also need to add

         rw(1,iCell) = 0.0

and

         rw(nVertLevels+1,iCell) = 0.0

around lines 2860 and 2873 (in this PR branch) in order to get identical results with the baseline.
It makes sense to me that this might be needed, since we create rather than copyin the rw field; but what surprises me is that this isn't needed in the PR branch (if we use 63fdcf7 as the baseline, and don't merge onto the HEAD of the develop branch). Perhaps we are just lucky in getting some zeros in the memory that's given to us on the device?

If you agree that we should be initializing the rw field at the first and last level, can you add those changes to the PR branch?

@gdicker1
Copy link
Collaborator Author

@gdicker1 If I use the current HEAD of the develop branch (ce4d670) as a baseline, I find that in addition to the code you mention above to compute exner and pressure_p on the host, I also need to add ...

I agree with those changes, and will get them pushed. I would assume we were just getting lucky before this point!

@gdicker1
Copy link
Collaborator Author

@mgduda this is ready for re-review. Apologies for not pinging after I pushed cde6447.

@gdicker1
Copy link
Collaborator Author

@mgduda, ready for another review!

@mgduda mgduda self-requested a review January 14, 2025 23:45
@mgduda
Copy link
Contributor

mgduda commented Jan 15, 2025

@gdicker1 Even after applying the patch to compute exner and pressure_p on the host, I'm seeing answer differences with the changes in this PR. Can you verify that you are able to get bitwise-identical results when correcting the computation of just exner and pressure_p?

@mgduda
Copy link
Contributor

mgduda commented Jan 16, 2025

@gdicker1 It looks like there's a possible bug in your commit f631856: wwAvg and ruAvg are now in a delete clause rather than a copyout clause.

@gdicker1 gdicker1 force-pushed the atmosphere/acc_recover_large_step branch from cc7e7b7 to adfdc2f Compare January 16, 2025 18:18
@gdicker1
Copy link
Collaborator Author

Force-push cc7e7b7 to adfdc2f to rebase this branch on develop. (Currently 4ea6abb, the merge of #1223 onto develop).

@gdicker1
Copy link
Collaborator Author

This should be ready for another review after pushing fixup c29fcb5.

@mgduda, does this pass as bitwise identical with this output from compare_netcdf.py?

Comparing files: f1 - f2
        f1=baseline_acc_1736985688/restart.2019-09-01_00.06.00.nc
        f2=merge-dev-1223_1737050011/restart.2019-09-01_00.06.00.nc

            Variable  Min       Max       
=========================================
mminlu is not a numeric field and will not be compared
initial_time is not a numeric field and will not be compared
xtime is not a numeric field and will not be compared
              relhum -10.240860  106.321869
        refl10cm_max -35.000000  2.681815

Above, the baseline results compared with this PR with:

  1. The PR branch modified so the calculations for exner and pressure_p are conducted on the host
  2. Both the baseline and test compiled with OPENACC=true

@gdicker1
Copy link
Collaborator Author

@mgduda, does this pass as bitwise identical with this output from compare_netcdf.py?
...

I found the difference, I have different output requests in my streams.atmosphere for the baseline and the test run. This means some diagnostics are computed on different schedules. I'll ping again once I've made sure the runs are exactly the same.

@gdicker1
Copy link
Collaborator Author

@mgduda, using the same streams.atmosphere settings, I now get the same results. This is ready for review!

These changes enable the GPU execution of the
atm_recover_large_step_variables_work subroutine by adding OpenACC directives.
In order to factor out the time to transfer data between CPU and GPU within
this routine, the new timer 'atm_recover_large_step_variables [ACC_data_xfer]'
has been added to the log file.

Changes include:
- Preparing the routine for porting. Modifying whitespace to make regions clear,
  changing implicit loop assignments to be explicit, and fusing some loops.
- Adding OpenACC parallel and loop directives to the do-loops.
- Managing the invariant fields needed for this routine in
  mpas_atm_dynamics_{init,finalize} so they are available across timesteps.
- Managing the other fields needed in the routine with OpenACC directives and
  using default(present) to ensure data isn't missed. default(present) clauses
  cause a run-time error if data isn't present and will help as we fuse data
  regions.
@gdicker1 gdicker1 force-pushed the atmosphere/acc_recover_large_step branch from baa8b40 to be338c9 Compare January 17, 2025 23:12
@gdicker1
Copy link
Collaborator Author

Force-push baa8b40 to be338c9 squashes this into one commit.

@mgduda mgduda self-requested a review January 18, 2025 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Atmosphere OpenACC Work related to OpenACC acceleration of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants