best effort subscription not working between two computers #30

Open · ghost opened this issue Oct 17, 2018 · 32 comments

@ghost

ghost commented Oct 17, 2018

What works:

  • run ros2 run demo_nodes_py talker on one computer
  • run ros2 run demo_nodes_py listener on another computer
    This seems to work because the talker publishes as reliable and the listener subscribes as reliable.

What doesn't work:

  • run ros2 run demo_nodes_py talker on one computer
  • run ros2 topic echo /chatter std_msgs/String on another computer
    The failure seems related to the fact that ros2 topic echo subscribes as best effort. If this is run on a single computer, everything works fine, but something about best-effort subscription isn't working between computers.

This is a major roadblock that will keep us from updating to Bouncy.

@ClarkTucker
Contributor

Works for me... is there something more you could share that describes what happens?

[screenshots attached: ros2_1, ros2_2]

@ghost
Author

ghost commented Oct 17, 2018

We're running through an unmanaged switch. Both computers are plugged into it, and there's no gateway.

@ClarkTucker
Contributor

That configuration seems like it should work.

@ghost
Author

ghost commented Oct 17, 2018

What configuration are you running?

@ClarkTucker
Contributor

The same.

@ghost
Author

ghost commented Oct 17, 2018

I've emailed you a link to our build.

@ClarkTucker
Contributor

Your build also works for me.

Would it be possible to get a network packet capture taken on one of the two hosts? Start the capture, run the two test programs, wait for a bit (30 seconds?), then stop capture...
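
For example, a capture taken with something like the following should work (the interface name is just a placeholder; Wireshark or dumpcap would do equally well):

sudo tcpdump -i <interface> -w chatter_capture.pcap udp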

ctucker@ubuntu_2:~/asi_ros2$ ros2 topic echo /chatter std_msgs/String
data: 'Hello World: 2'

data: 'Hello World: 3'

data: 'Hello World: 4'

data: 'Hello World: 5'

data: 'Hello World: 6'

data: 'Hello World: 7'

@ghost
Author

ghost commented Oct 17, 2018

on_listener_computer.pcapng.tar.gz

It doesn't look like I was seeing any of the autodiscovery from the other computer, but maybe I was just looking at it wrong.

@ClarkTucker
Contributor

Yep. Can you take a capture on the other computer?

@ghost
Author

ghost commented Oct 17, 2018

test_two.tar.gz
It looks like the echo computer saw autodiscovery traffic this time (weird). I've attached captures taken at the same time on both the talker and echo computers.

@ClarkTucker
Contributor

In that last set of captures, it looks like discovery completed successfully, and I can see that there was a match on the /chatter topic. However, no DATA messages show up at all.
Is it possible that you are running a firewall on either machine?

@ghost
Author

ghost commented Oct 17, 2018

The built-in ufw is the only one I know of, and it's disabled on both computers. And messages do get through if we subscribe as reliable. It's just the best-effort subscription (echo) that doesn't work.

@ClarkTucker
Contributor

Hmmm. I get very different captures when I run the two programs:

ros2 run demo_nodes_py talker
ros2 topic echo /chatter

They create only a single DDS DataWriter / DataReader on the "/chatter" topic, and none of the others that I see in your capture[s] (for example, "/talker/get_parametersReply", "/talker/get_parameter_typesReply", etc.).

Are you running a different test?

@ghost
Author

ghost commented Oct 17, 2018

Ah. The other computer was accidentally running the C++ talker, which includes the parameter services; the Python nodes don't. We could make another capture without it if that helps.

@ClarkTucker
Contributor

OK, that explains it, I just wanted to make sure I was looking at the right thing.

@ClarkTucker
Contributor

I still can't reproduce this locally...
Let's try using the 'log' version of the coredx library:

  1. Find the location of the libdds_cf.so file
  2. Rename that file to be libdds_cf_nolog.so: mv libdds_cf.so libdds_cf_nolog.so
  3. Create a link to the logging library: ln -s libdds_cf_log.so libdds_cf.so

Then, set the DDS_DEBUG environment variable to 7, and run the test:

export DDS_DEBUG=7
ros2 run demo_nodes_py talker 2>&1 | grep -E 'chatter|UDP' > talker_debug.log

And, for completeness, you could do the same on the 'echo' side.
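
For example, the echo side might look something like this (same DDS_DEBUG setting; the log filename here is just a suggestion):

export DDS_DEBUG=7
ros2 topic echo /chatter std_msgs/String 2>&1 | grep -E 'chatter|UDP' > echo_debug.log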

I would expect the log to look a little like this:

...
1539870361.028823409: UDP         : DATA   : read msg from 127.0.0.1:43700 (fd 6) (748 bytes)
1539870361.028854505: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.028872756: UDP         : DATA   : read msg from 127.0.0.1:43700 (fd 6) (112 bytes)
1539870361.028900436: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.028918015: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.028937638:             : DISCVRY: EXISTING WRITER...alive on topic rt/chatter
1539870361.028947979: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.028969146: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.079378326:             : DATA   : Reader(     DCPSPublication) [01060A00.00460000.2FBB0001.000003C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079400643: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079433241:             : DATA   : Reader(    DCPSSubscription) [01060A00.00460000.2FBB0001.000004C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079436873: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079446069:             : DATA   : Reader(  ParticipantMessage) [01060A00.00460000.2FBB0001.000200C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079449156: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079521036: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.079548636: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.079566803: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870362.033470897:             : DATA   : Writer(          rt/chatter): new change 1
...

@ChuiVanfleet

Clark, I've been working with Bryant on this issue. Here are the logs:
chatter_with_debug.tar.gz
There are 4 files:

  • debug log of the listener
  • debug log of the talker while the listener was running
  • debug log of the echo
  • debug log of the talker while ros2 topic echo was running

We really appreciate your help on this. Let me know if there is anything else we can do to help resolve this.

thanks.

@ClarkTucker
Contributor

OK. That's very helpful. I can verify that the talker is sending samples in both scenarios. They are sent over multicast (and apparently not received). When matched with the listener (reliable), we also send a heartbeat (multicast + unicast). This allows the listener to NACK the missing sample which is then [re]sent via unicast.

When matched with echo (best_effort), the sample is sent over multicast only. This, as in the listener scenario, is not received.

So, the question is, why are the multicast 'chatter' samples not being received at the listener/echo machine? [The earlier captures show that at least some of the 'discovery' data is successfully transferred...]

Could you rerun the echo scenario with an additional debug setting:

export COREDX_UDP_DEBUG=66

And a slightly different grep:

grep -E 'chatter|UDP|IP'

This should show us specifically which interface[s] coredx is trying to write to.
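
Putting it together, the echo-side rerun would be something like this (keeping DDS_DEBUG=7 from the previous step; the output filename is just a suggestion):

export DDS_DEBUG=7
export COREDX_UDP_DEBUG=66
ros2 topic echo /chatter std_msgs/String 2>&1 | grep -E 'chatter|UDP|IP' > echo_udp_ip_debug.log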

@ChuiVanfleet

ChuiVanfleet commented Oct 18, 2018

Here you go. Thank you for the quick response!

echo_udp_ip_debug.log

Also, for what it's worth, the talker is running on the 172.31.255.112 computer, and the listener is running on the 172.31.255.103 computer.

@ClarkTucker
Contributor

Cool, thanks. Could you send the 'talker' side as well?

@ChuiVanfleet

My bad. We ran both talker and echo again.

echo_with_debug_66.tar.gz

@ClarkTucker
Contributor

I think I've got it. Because the two computers share a 'common' IP address [172.17.0.1], we are incorrectly(?) inferring that the two applications (talker + echo) are hosted on the same computer. This impacts how we write multicast packets, resulting in the observed behavior.

  1. If the 'common' 172.17.0.1 address is not required, then my first recommendation would be to remove or reconfigure it so that the two machines no longer share the same address.

  2. If that is not possible, then you could configure CoreDX to not use that address. This can be achieved by setting the IP address explicitly with export COREDX_IP_ADDR=172.31.255.xyz (see the sketch after this list). Alternatively, by tailoring the UDP transport configuration [would require mods to rmw_coredx -- it currently just uses a default udp transport configuration].

  3. Finally, you could configure CoreDX to ignore the fact that it thinks the two applications are hosted on the same machine. The setting CoreDX_UdpTransportConfig.try_to_keep_mcast_local = FALSE (0) should do the trick. [This would also require some modification of the rmw_coredx layer to support udp transport configuration.]
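
As a rough sketch of option 2, using the host addresses mentioned earlier in this thread (substitute whatever addresses are actually assigned to your machines):

# on the talker machine (172.31.255.112)
export COREDX_IP_ADDR=172.31.255.112
ros2 run demo_nodes_py talker

# on the echo machine (172.31.255.103)
export COREDX_IP_ADDR=172.31.255.103
ros2 topic echo /chatter std_msgs/String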

@ChuiVanfleet

So I'm confused about this 'common' IP address. In all the logs that we've sent you, all other NICs were disabled, leaving only the connection on the 172.31.255.1/24 subnet. Where is this 172.17.0.1 address coming from? Is that the UDP multicast address?

Thanks for helping me understand.

@ChuiVanfleet

So setting the COREDX_IP_ADDR variable appears to work for us.

@ClarkTucker
Contributor

CoreDX queries the OS for all the 'up' network interfaces.
For example, on the .103 machine, we get this:


1539879209.990466447: IP          : TRANSPT: INTERFACES: 
1539879209.990468904: IP          : TRANSPT:    IfIndex: 13 family IPv4  addr: 172.17.0.1:0 mcast: 1 loop: 0
1539879209.990470701: IP          : TRANSPT:    IfIndex: 18 family IPv4  addr: 172.31.255.103:0 mcast: 1 loop: 0
1539879209.990472709: IP          : TRANSPT:    IfIndex: 18 family IPv6  addr: fe80:0:0:0:fa7d:947:76ea:5884,0 (scp:18) mcast: 1 loop: 0

@ClarkTucker
Contributor

And, by default, we will make use of all 'up' interfaces.

I'm glad to hear that the COREDX_IP_ADDR setting worked.

@ChuiVanfleet

So we both do have Docker installed, which is using that 172.17.0.1 IP address. Let me try disabling that network interface and try again. Do you have Docker installed on your two test machines as well?
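
For reference, 172.17.0.1 is typically the address of Docker's default docker0 bridge (that interface name is the usual default, not something confirmed from these logs), so a quick test might be:

sudo ip link set docker0 down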

@ClarkTucker
Contributor

Nope. Just a single interface.

@ChuiVanfleet

We just removed the Docker IP interface and everything appears to be working correctly. Even if Docker is installed on only one of the computers, CoreDX works fine.

If I understand correctly, and correct me if I'm wrong, CoreDX checks the IP addresses of the publisher and subscriber to determine whether they are on the same computer. However, when Docker is installed on both machines, both expose the same 172.17.0.1 address, so CoreDX will always assume that the publisher and subscriber are on the same machine. Could it be changed to use something more unique, like a MAC address, instead?

Thank you for your help!

@ClarkTucker
Contributor

In general, I think your analysis is correct. However, I would phrase it slightly differently, to indicate that it is not really tied to Docker and that the behavior is not mandatory:

Each CoreDX participant checks the IP address of each discovered peer participant to determine whether they are on the same computer. In cases where identical IP addresses are detected, CoreDX will, by default, assume that the two participants are on the same machine. This default behavior can be disabled with the CoreDX_UdpTransportConfig.try_to_keep_mcast_local flag.

Concerning using a MAC address for this test: the only information we are guaranteed to have about a peer is its IP address. We don't have any information about the MAC addresses of discovered peers; otherwise, that might be a better test.

@ChuiVanfleet

Okay. I understand. Thanks again for your help and quick replies!

@ClarkTucker
Contributor

OK, thanks for your patience and help as we worked through this! I really appreciate it!
