best effort subscription not working between two computers #30

Open · ghost opened this issue Oct 17, 2018 · 32 comments

@ghost

ghost commented Oct 17, 2018

What works:

  • run ros2 run demo_nodes_py talker on one computer
  • run ros2 run demo_nodes_py listener on another computer
    This seems to work because the talker publishes as reliable and the listener subscribes as reliable.

What doesn't work:

  • run ros2 run demo_nodes_py talker on one computer
  • run ros2 topic echo /chatter std_msgs/String on another computer
    The failure seems related to the fact that ros2 topic echo subscribes as best effort. If this is run on a single computer, everything works fine, but something about best-effort subscription isn't working between computers.

This is a major roadblock that will keep us from updating to Bouncy.

@ClarkTucker
Contributor

Works for me... is there something more you could share that describes what happens?

[screenshots attached: ros2_1, ros2_2]

@ghost
Author

ghost commented Oct 17, 2018

We're running through an unmanaged switch. Both computers are plugged into it, and there's no gateway.

@ClarkTucker
Contributor

That configuration seems like it should work.

@ghost
Author

ghost commented Oct 17, 2018

What configuration are you running?

@ClarkTucker
Contributor

The same.

@ghost
Author

ghost commented Oct 17, 2018

I've emailed you a link to our build.

@ClarkTucker
Contributor

Your build also works for me.

Would it be possible to get a network packet capture taken on one of the two hosts? Start the capture, run the two test programs, wait for a bit (30 seconds?), then stop capture...
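
For example, a capture taken with something like the following should work (the interface name is just a placeholder; Wireshark or dumpcap would do equally well):

sudo tcpdump -i <interface> -w chatter_capture.pcap udp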

ctucker@ubuntu_2:~/asi_ros2$ ros2 topic echo /chatter std_msgs/String
data: 'Hello World: 2'

data: 'Hello World: 3'

data: 'Hello World: 4'

data: 'Hello World: 5'

data: 'Hello World: 6'

data: 'Hello World: 7'

@ghost
Author

ghost commented Oct 17, 2018

on_listener_computer.pcapng.tar.gz

It doesn't look like I was seeing any of the autodiscovery from the other computer, but maybe I was just looking at it wrong.

@ClarkTucker
Contributor

Yep. Can you take a capture on the other computer?

@ghost
Author

ghost commented Oct 17, 2018

test_two.tar.gz
It looks like the echo computer saw autodiscovery traffic this time (weird). I've attached captures taken at the same time on both the talker and echo computers.

@ClarkTucker
Contributor

In that last set of captures, it looks like discovery completed successfully, and I can see that there was a match on the /chatter topic. However, no DATA messages show up at all.
Is it possible that you are running a firewall on either machine?

@ghost
Author

ghost commented Oct 17, 2018

The built-in ufw is the only one I know of, and it's disabled on both computers. And messages do get through if we subscribe as reliable. It's just the best-effort subscription (echo) that doesn't work.

@ClarkTucker
Contributor

Hmmm. I get very different captures when I run the two programs:

ros2 run demo_nodes_py talker
ros2 topic echo /chatter

They create only a single DDS DataWriter / DataReader on the "/chatter" topic, and none of the others that I see in your capture[s] (for example, "/talker/get_parametersReply", "/talker/get_parameter_typesReply", etc.).

Are you running a different test?

@ghost
Author

ghost commented Oct 17, 2018

Ah. The other computer was accidentally running the C++ talker, which includes the parameter services; the Python nodes don't. We could make another capture without it if that helps.

@ClarkTucker
Contributor

OK, that explains it, I just wanted to make sure I was looking at the right thing.

@ClarkTucker
Contributor

I still can't reproduce this locally...
Let's try using the 'log' version of the coredx library:

  1. Find the location of the libdds_cf.so file
  2. Rename that file to be libdds_cf_nolog.so: mv libdds_cf.so libdds_cf_nolog.so
  3. Create a link to the logging library: ln -s libdds_cf_log.so libdds_cf.so

Then, set the DDS_DEBUG environment variable to 7, and run the test:

export DDS_DEBUG=7
ros2 run demo_nodes_py talker 2>&1 | grep -E 'chatter|UDP' > talker_debug.log

And, for completeness, you could do the same on the 'echo' side.
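
For example, the echo side might look something like this (same DDS_DEBUG setting; the log filename here is just a suggestion):

export DDS_DEBUG=7
ros2 topic echo /chatter std_msgs/String 2>&1 | grep -E 'chatter|UDP' > echo_debug.log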

I would expect the log to look a little like this:

...
1539870361.028823409: UDP         : DATA   : read msg from 127.0.0.1:43700 (fd 6) (748 bytes)
1539870361.028854505: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.028872756: UDP         : DATA   : read msg from 127.0.0.1:43700 (fd 6) (112 bytes)
1539870361.028900436: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.028918015: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.028937638:             : DISCVRY: EXISTING WRITER...alive on topic rt/chatter
1539870361.028947979: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.028969146: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (76 bytes)
1539870361.079378326:             : DATA   : Reader(     DCPSPublication) [01060A00.00460000.2FBB0001.000003C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079400643: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079433241:             : DATA   : Reader(    DCPSSubscription) [01060A00.00460000.2FBB0001.000004C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079436873: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079446069:             : DATA   : Reader(  ParticipantMessage) [01060A00.00460000.2FBB0001.000200C7] sending ACKNACK to Locator( UDPv4     U  Address: 10.0.0.70 port:7410)
1539870361.079449156: UDP         : DATA   : write msg UNICAST to 10.0.0.70:7410 (fd: 10) (72 bytes)
1539870361.079521036: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.079548636: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870361.079566803: UDP         : DATA   : read msg from 10.0.0.70:49112 (fd 8) (72 bytes)
1539870362.033470897:             : DATA   : Writer(          rt/chatter): new change 1
...

@ChuiVanfleet

Clark, I've been working with Bryant on this issue. Here are the logs:
chatter_with_debug.tar.gz
There are 4 files:

  • debug log of the listener
  • debug log of the talker while the listener was running
  • debug log of the echo
  • debug log of the talker while ros2 topic echo was running

We really appreciate your help on this. Let me know if there is anything else we can do to help resolve this.

thanks.

@ClarkTucker
Contributor

OK. That's very helpful. I can verify that the talker is sending samples in both scenarios. They are sent over multicast (and apparently not received). When matched with the listener (reliable), we also send a heartbeat (multicast + unicast). This allows the listener to NACK the missing sample which is then [re]sent via unicast.

When matched with echo (best_effort), the sample is sent over multicast only. This, as in the listener scenario, is not received.

So, the question is, why are the multicast 'chatter' samples not being received at the listener/echo machine? [The earlier captures show that at least some of the 'discovery' data is successfully transferred...]

Could you rerun the echo scenario with an additional debug setting:

export COREDX_UDP_DEBUG=66

And a slightly different grep:

grep -E 'chatter|UDP|IP'

This should show us specifically which interface[s] coredx is trying to write to.
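
Putting it together, the echo-side rerun would be something like this (keeping DDS_DEBUG=7 from the previous step; the output filename is just a suggestion):

export DDS_DEBUG=7
export COREDX_UDP_DEBUG=66
ros2 topic echo /chatter std_msgs/String 2>&1 | grep -E 'chatter|UDP|IP' > echo_udp_ip_debug.log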

@ChuiVanfleet

ChuiVanfleet commented Oct 18, 2018

Here you go. Thank you for the quick response!

echo_udp_ip_debug.log

Also, for what it's worth, the talker is running on the 172.31.255.112 computer, and the listener is running on the 172.31.255.103 computer.

@ClarkTucker
Contributor

Cool, thanks. Could you send the 'talker' side as well?

@ChuiVanfleet

My bad. We ran both talker and echo again.

echo_with_debug_66.tar.gz

@ClarkTucker
Contributor

I think I've got it. Because the two computers share a 'common' IP address [172.17.0.1], we are incorrectly(?) inferring that the two applications (talker + echo) are hosted on the same computer. This impacts how we write multicast packets, resulting in the observed behavior.

  1. If the 'common' 172.17.0.1 address is not required, then my first recommendation would be to remove or reconfigure it so that the two machines no longer share the same address.

  2. If that is not possible, then you could configure CoreDX to not use that address. This can be achieved by setting the IP address explicitly with export COREDX_IP_ADDR=172.31.255.xyz (see the sketch after this list). Alternatively, by tailoring the UDP transport configuration [would require mods to rmw_coredx -- it currently just uses a default udp transport configuration].

  3. Finally, you could configure CoreDX to ignore the fact that it thinks the two applications are hosted on the same machine. The setting CoreDX_UdpTransportConfig.try_to_keep_mcast_local = FALSE (0) should do the trick. [This would also require some modification of the rmw_coredx layer to support udp transport configuration.]
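
As a rough sketch of option 2, using the host addresses mentioned earlier in this thread (substitute whatever addresses are actually assigned to your machines):

# on the talker machine (172.31.255.112)
export COREDX_IP_ADDR=172.31.255.112
ros2 run demo_nodes_py talker

# on the echo machine (172.31.255.103)
export COREDX_IP_ADDR=172.31.255.103
ros2 topic echo /chatter std_msgs/String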

@ChuiVanfleet

So I'm confused about this 'common' IP address. In all the logs that we've sent you, all other NICs were disabled, leaving only the connection on the 172.31.255.1/24 subnet. Where is this 172.17.0.1 address coming from? Is that the UDP multicast address?

Thanks for helping me understand.

@ChuiVanfleet

So setting the COREDX_IP_ADDR variable appears to work for us.

@ClarkTucker
Contributor

CoreDX queries the OS for all the 'up' network interfaces.
For example, on the .103 machine, we get this:


1539879209.990466447: IP          : TRANSPT: INTERFACES: 
1539879209.990468904: IP          : TRANSPT:    IfIndex: 13 family IPv4  addr: 172.17.0.1:0 mcast: 1 loop: 0
1539879209.990470701: IP          : TRANSPT:    IfIndex: 18 family IPv4  addr: 172.31.255.103:0 mcast: 1 loop: 0
1539879209.990472709: IP          : TRANSPT:    IfIndex: 18 family IPv6  addr: fe80:0:0:0:fa7d:947:76ea:5884,0 (scp:18) mcast: 1 loop: 0

@ClarkTucker
Contributor

And, by default, we will make use of all 'up' interfaces.

I'm glad to hear that the COREDX_IP_ADDR setting worked.

@ChuiVanfleet

So we both do have Docker installed, which is using that 172.17.0.1 IP address. Let me try disabling that network interface and try again. Do you have Docker installed on your two test machines as well?
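
For reference, 172.17.0.1 is typically the address of Docker's default docker0 bridge (that interface name is the usual default, not something confirmed from these logs), so a quick test might be:

sudo ip link set docker0 down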

@ClarkTucker
Contributor

Nope. Just a single interface.

@ChuiVanfleet

We just removed the Docker IP interface and everything appears to be working correctly. Even if Docker is installed on only one of the computers, CoreDX works fine.

If I understand correctly, and correct me if I'm wrong, CoreDX checks the IP addresses of the publisher and subscriber to determine whether they are on the same computer. However, when Docker is installed on both machines, both expose the same 172.17.0.1 address, so CoreDX will always assume that the publisher and subscriber are on the same machine. Could it be changed to use something more unique, like a MAC address, instead?

Thank you for your help!

@ClarkTucker
Contributor

In general, I think your analysis is correct. However, I would phrase it slightly differently, to indicate that it is not really tied to Docker and that the behavior is not mandatory:

Each CoreDX participant checks the IP address of each discovered peer participant to determine whether they are on the same computer. In cases where identical IP addresses are detected, CoreDX will, by default, assume that the two participants are on the same machine. This default behavior can be disabled with the CoreDX_UdpTransportConfig.try_to_keep_mcast_local flag.

Concerning using a MAC address for this test: the only information we are guaranteed to have about a peer is its IP address. We don't have any information about the MAC addresses of discovered peers; otherwise, that might be a better test.

@ChuiVanfleet

Okay. I understand. Thanks again for your help and quick replies!

@ClarkTucker
Contributor

OK, thanks for your patience and help as we worked through this! I really appreciate it!
