
bug in backward() or help needed #2077

Open
@Apogeum12

Description

I have a strange issue with backward(). I have two generators, gen1 and gen2, and I compute the loss in three ways: loss_1, loss_2, and loss_3.

All computations for gen1 are fine.
Part 1.
let out = gen1.forward(input);
let out2 = gen2.forward(out.clone());
// ... calculate loss
then total_loss = loss_1 + loss_2 + loss_3
and:
let grads1 = total_loss.backward();
let grads_gen1 = GradientsParams::from_grads(grads1, &gen1);
This works fine.

And then:
Part 2.
let out3 = gen2.forward(out);
let out = gen1.forward(out3);
// ... calculate loss
then, in a similar way but with the other arguments, total_loss = loss_1 + loss_2 + loss_3
and:
let grads2 = total_loss.backward();
let grads_gen2 = GradientsParams::from_grads(grads2, &gen2);

and update the models:
// ... .step() ...

All computations are done in one loop. For grads2 I hit an issue: when I remove loss_3 from total_loss everything works fine, but with loss_3 included I get this error message:

thread 'main' panicked at /home/euuki/.cargo/registry/src/index.crates.io-6f17d22bba15001f/burn-tensor-0.13.2/src/tensor/api/numeric.rs:22:9:
=== Tensor Operation Error ===
  Operation: 'Add'
  Reason:
    1. The provided tensors have incompatible shapes. Incompatible size at dimension '2' => '1278 != 1280', which can't be broadcasted. Lhs tensor shape [2, 1, 1278, 1278], Rhs tensor shape [2, 1, 1280, 1280]. 
    2. The provided tensors have incompatible shapes. Incompatible size at dimension '3' => '1278 != 1280', which can't be broadcasted. Lhs tensor shape [2, 1, 1278, 1278], Rhs tensor shape [2, 1, 1280, 1280]. 

stack backtrace:
   0:     0x577d25479995 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
   1:     0x577d2549e41b - core::fmt::write::hc090a2ffd6b28c4a
   2:     0x577d2547773f - std::io::Write::write_fmt::h8898bac6ff039a23
   3:     0x577d2547976e - std::sys_common::backtrace::print::ha96650907276675e
   4:     0x577d2547aa29 - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
   5:     0x577d2547a76d - std::panicking::default_hook::h207342be97478370
   6:     0x577d2547aec3 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
   7:     0x577d2547ada4 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
   8:     0x577d25479e59 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
   9:     0x577d2547aad7 - rust_begin_unwind
  10:     0x577d24f95ff3 - core::panicking::panic_fmt::hdc63834ffaaefae5
  11:     0x577d25085a41 - core::panicking::panic_display::hd504bfa7a23e079b
  12:     0x577d24f6910d - burn_tensor::tensor::api::numeric::<impl burn_tensor::tensor::api::base::Tensor<B,_,K>>::add::panic_cold_display::haa334297998f63f1
  13:     0x577d2507fd6f - burn_tensor::tensor::api::numeric::<impl burn_tensor::tensor::api::base::Tensor<B,_,K>>::add::h197481e80e899342
  14:     0x577d2503c0b7 - burn_autodiff::grads::Gradients::register::h7f50eb7d3a39e84f
  15:     0x577d25018acf - <burn_autodiff::ops::module::<impl burn_tensor::tensor::ops::modules::base::ModuleOps<burn_autodiff::backend::Autodiff<B,C>> for burn_autodiff::backend::Autodiff<B,C>>::conv2d::Conv2DWithBias as burn_autodiff::ops::backward::Backward<B,4_usize,3_usize>>::backward::h0eb6b130b28fdf5d
  16:     0x577d25060b8f - <burn_autodiff::ops::base::OpsStep<B,T,SB,_,_> as burn_autodiff::graph::base::Step>::step::hd9a3762bb2722aab
  17:     0x577d25446e8d - burn_autodiff::runtime::server::AutodiffServer::backward::h034fbb21f4457df1
  18:     0x577d24fcf75a - <burn_autodiff::runtime::mutex::MutexClient as burn_autodiff::runtime::client::AutodiffClient>::backward::h235c73a9c48299ab
  19:     0x577d2504a5c2 - burn_test::nn::flowscaller::deg_training::flow_net::h0a08effeaee47cf1
  20:     0x577d250a7f28 - burn_test::main::h3db9588c4d2b6cd5
  21:     0x577d24fc2a33 - std::sys_common::backtrace::__rust_begin_short_backtrace::h1ab658f13ba31837
  22:     0x577d2503da39 - std::rt::lang_start::{{closure}}::ha5d1a6d22c45cdd7
  23:     0x577d25472ff0 - std::rt::lang_start_internal::h3ed4fe7b2f419135
  24:     0x577d250a8035 - main
  25:     0x731bf462a1ca - __libc_start_call_main
                               at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  26:     0x731bf462a28b - __libc_start_main_impl
                               at ./csu/../csu/libc-start.c:360:3
  27:     0x577d24f96745 - _start
  28:                0x0 - <unknown>
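For context on the panic above: the failing `Add` happens inside `Gradients::register` during the conv2d backward step, and the message reflects the standard elementwise broadcast rule — two dimensions are compatible only when they are equal or one of them is 1. Here dims 2 and 3 are `1278 != 1280` and neither is 1, so the gradient tensors cannot be added. A minimal sketch of that rule (the function name is illustrative, not Burn's API):

```rust
/// Standard elementwise-broadcast compatibility check, as used by most
/// tensor libraries: compare shapes from the trailing dimension; each pair
/// must be equal or contain a 1.
fn broadcastable(lhs: &[usize], rhs: &[usize]) -> bool {
    lhs.iter()
        .rev()
        .zip(rhs.iter().rev())
        .all(|(&l, &r)| l == r || l == 1 || r == 1)
}

fn main() {
    // The shapes from the panic: dims 2 and 3 differ (1278 vs 1280)
    // and neither is 1, so the Add cannot broadcast.
    assert!(!broadcastable(&[2, 1, 1278, 1278], &[2, 1, 1280, 1280]));
    // By contrast, a 1 in a mismatched position would broadcast fine.
    assert!(broadcastable(&[2, 1, 1280, 1280], &[2, 1, 1, 1280]));
    println!("ok");
}
```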

Here 1280 is the tensor's output width and height. All padding etc. should be correct for gen1 and gen2, because otherwise Part 1 wouldn't work. Do you have any idea or suggestion about what the cause could be? I wanted to write on Discord in the #help channel, but with the logs the message exceeded the 2000-character limit.
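One possible source of a 1278-vs-1280 mismatch (this is an assumption about the unshown model definitions, not something confirmed by the report): a conv layer somewhere in the gen2 → gen1 round trip uses a 3×3 kernel with padding 0, which shrinks each spatial dimension by 2, while the Part 1 ordering happens to mask it. The standard conv output-size formula makes the arithmetic easy to check:

```rust
/// Output spatial size of a convolution with dilation 1:
/// out = (in + 2*padding - (kernel - 1) - 1) / stride + 1
fn conv_out(input: usize, kernel: usize, padding: usize, stride: usize) -> usize {
    (input + 2 * padding - (kernel - 1) - 1) / stride + 1
}

fn main() {
    // A 3x3 kernel with padding 0 shrinks 1280 -> 1278 ...
    assert_eq!(conv_out(1280, 3, 0, 1), 1278);
    // ... while padding 1 preserves the spatial size.
    assert_eq!(conv_out(1280, 3, 1, 1), 1280);
}
```

It may be worth printing the shapes of out, out3, and the loss inputs in Part 2: if any forward output is [2, 1, 1278, 1278] rather than [2, 1, 1280, 1280], the mismatch enters there rather than in the autodiff runtime.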

Labels: bug (Something isn't working)