Skip to content

External reconnect fails as long as "old" connection has not been terminated gracefully #843

Closed
@n0l0cale

Description

Describe the bug
RaspberryMatic seems not accept new connections, as long as the old one have not been terminated.

To Reproduce
Steps to reproduce the behavior:

  1. connect RaspberryMatic to e.g. Home Assistant
  2. configure some sensors to be monitored by your control software
  3. restart your RaspberryMatic Software

Expected behavior
A reconnect should be successful.

System information (please complete the following information):

  • Version RaspberryMatic 3.51.6.20200420
  • Hardware RaspberryPi3 B+
  • for reference: Homematic Version 0.110.1

Additional context
This issue has also been discussed in https://community.home-assistant.io/t/homematic-reconnect-does-not-work/161853/18

Activity

danielperna84

danielperna84 commented on May 22, 2020

@danielperna84

To clarify, this is about XML-RPC. But only for HmIP. There are no problems with the old wired and wireless protocols.
That being said, I don't believe RaspberryMatic can do anything about this.

n0l0cale

n0l0cale commented on May 22, 2020

@n0l0cale
Author

thought maybe we can come a step further when more guys are involved.....

jens-maus

jens-maus commented on May 22, 2020

@jens-maus
Owner

@n0l0cale Please provide adequate example XMLRPC communication output between home assistent and the CCU/RaspberryMatic/HmIPServer where the issue is demonstrated. Only with detailed information you will have any chance that this issue will get fixed by eQ3.

added
⚓ upstream issueThis is a bug/issue for/in upstream software (OCCU, etc.)
❓ undecidedNo decision to accept or reject ticket yet
on May 22, 2020
n0l0cale

n0l0cale commented on May 22, 2020

@n0l0cale
Author

@danielperna84 - could you please give me assistance with providing such a sample? I think you are deeper in this than I am....

danielperna84

danielperna84 commented on May 22, 2020

@danielperna84

@n0l0cale
I'll see if I can come up with a script that continuously reproduces the error. Maybe over the weekend.

danielperna84

danielperna84 commented on May 22, 2020

@danielperna84

Ok, so I have created the following test-script: https://gist.github.com/danielperna84/28653b1ebf760064cdbecf5f4828f5f3

Adjust the variables at the top (the IP addresses mainly), then execute it with python3 hmip-xml-rpc-bug.py. Don't forget to allow the incoming traffic on the specified port.

On the CCU / RaspberryMatic you'll get something the following when the second init request has been sent:

==> hmserver.log <==
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.LegacyServiceHandler INFO  [vert.x-worker-thread-2] (un)registerCallback on LegacyServiceHandler called from url: http://192.168.1.30:32001
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.LegacyServiceHandler INFO  [vert.x-worker-thread-0] init finished
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer INFO  [vert.x-worker-thread-3] Added InterfaceId: testclient
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer ERROR [vert.x-worker-thread-3] Serialization Exception: Could not add interface: testclient
de.eq3.cbcs.legacy.communication.rpc.RpcSerializationException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]
Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRequestResponseProcessor.parseResponse(XmlRequestResponseProcessor.java:173)
        at de.eq3.cbcs.legacy.communication.rpc.internal.transport.http.HttpTransport.sendRequest(HttpTransport.java:107)
        at de.eq3.cbcs.legacy.communication.rpc.internal.rpc.RpcClient.sendRequest(RpcClient.java:94)
        at de.eq3.cbcs.legacy.communication.rpc.internal.rpc.RpcClient.invoke(RpcClient.java:82)
        at com.sun.proxy.$Proxy22.listDevices(Unknown Source)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.LegacyBackendClient.listDevices(LegacyBackendClient.java:139)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.DeviceUtil.synchronizedBackendDevices(DeviceUtil.java:144)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer.handle(InterfaceInitializer.java:109)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer.handle(InterfaceInitializer.java:26)
        at io.vertx.core.eventbus.impl.HandlerRegistration.deliver(HandlerRegistration.java:212)
        at io.vertx.core.eventbus.impl.HandlerRegistration.handle(HandlerRegistration.java:191)
        at io.vertx.core.eventbus.impl.EventBusImpl.lambda$deliverToHandler$3(EventBusImpl.java:505)
        at io.vertx.core.impl.ContextImpl.lambda$wrapTask$2(ContextImpl.java:337)
        at io.vertx.core.impl.TaskQueue.lambda$new$0(TaskQueue.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]
Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.nextTag(XMLStreamReaderImpl.java:1257)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseObject(XmlRpcParser.java:437)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseValue(XmlRpcParser.java:394)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseArray(XmlRpcParser.java:521)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseValue(XmlRpcParser.java:387)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseMethodResponse(XmlRpcParser.java:164)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRequestResponseProcessor.parseResponse(XmlRequestResponseProcessor.java:171)
        ... 16 more

This only happens for HmIP. Doing the same with port 2001 shows no errors in the hmserver.log (although I don't get ANY logs for 2001).

At the bottom of the script the wait times could be increased. What will be observed is, that after the first init we receive events from the devices (if we wait long enough for events to happen obviously). After the second init we don't receive any events anymore. I assume this is a result of the exception.

That being said, the reason for this behavior seems to be, that we don't just return an empty list to the CCU after the init - it already has the device descriptions from the first init. If a clearDevices() is added after the first time.sleep(5), we don't get any errors on the CCU during the second init. This fits the observation in Home Assistant, which needs to be restarted to receive events again. The restart of Home Assistant clears the local devices, which is why upon the next startup it will be in the state like this script is during the first init.

So to sum this up: the HMIPServer on the CCU fails to accept init-requests if the listDevices-call it makes to our server returns more than just an empty list. With the old RF and wired devices this works without any issues.
From my point of view this definitely is a bug, because there would be no point in asking the remote server for the known devices if the presence of devices leads to a failure. On top of that, the CCU should not require the remote-end to clear its knowledge about present devices and set them up all the way from the beginning.

As far as I am aware of, the XML-RPC API documentation does not state, that the format of the response to the listDevices call has to match a specific format. Hence copying what the CCU itself returns when calling listDevices (that's essentially what the script is doing) should not lead to such an error. And again, this works for the old wired and RF interfaces. So for HmIP it should either behave the same way, or the limitations need to be communicated clearly.

Edit:
One more detail I found while looking at this on RaspberryMatic with strace is, that the response of our server always gets cut to 8192 bytes. The (redacted) trace (done on a different machine) looks like this:

...
837   send(101, "POST / HTTP/1.1\r\nContent-Length: 158\r\nHost: 192.168.1.142:43043\r\nConnection: Keep-Alive\r\nUser-Agent: Apache-HttpClient/4.5.2 (Java/1.8.0_232)\r\n\r\n
<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><methodCall><methodName>listDevices</methodName><params><param><value>my-hmip</value></param></params></methodCall>", 303, 0) = 303
837   poll([{fd=101, events=POLLIN|POLLERR}], 1, 600000) = 1 ([{fd=101, revents=POLLIN}])
837   recv(101, "HTTP/1.0 200 OK\r\nServer: BaseHTTP/0.6 Python/3.7.3\r\nDate: Wed, 29 Apr 2020 22:04:01 GMT\r\nContent-type: text/xml\r\nContent-length: 145462\r\n\r\n
<?xml version='1.0'?>\n<methodResponse>\n<params>\n<param>\n<value><array><data>\n<value><struct>\n<member>\n<name>TYPE</name>\n<value><string>HMIP-PSM</string></value>...............<string>SERVICE</string></value>\n</data></array></value>\n</member>\n<member>\n<name>FIRMWARE</name>\n<value><string>2.6.2</string></value>\n<", 8192, MSG_DONTWAIT) = 1599
837   dup2(55, 101)                     = 101
837   close(101)                        = 0
837   write(2, "de.eq3.cbcs.legacy.communication.rpc.RpcSerializationException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]\nMessage: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT", 198) = 198
...

Obviously the XML is bad since it ends abruptly. So the exception makes sense. The question is why the CCU only receives the first 8192 bytes. I also traced the other end, and the data that gets sent as a response is not truncated. This leads me to believe, that on the CCU there seems to be some buffer-limitation for incoming data.

n0l0cale

n0l0cale commented on Jun 3, 2020

@n0l0cale
Author

@jens-maus hi man - should I support you in writing a bug report to the upstream project? Do you have any contact details or do you want to do it by your own? Do you need maybe any further information?

jens-maus

jens-maus commented on Jun 3, 2020

@jens-maus
Owner

@n0l0cale Nope. here is the correct place for this kind of bug report on HmIPServer because eq3 ist more or less listening/reading here when I point them at this ticket (will do). So thanks for the infos here. As soon as I had time to reproduce it I will flag it as reproduced and then its' eq3 task to investigate+fix.

jens-maus

jens-maus commented on Jun 3, 2020

@jens-maus
Owner

@danielperna84 I finally tried to reproduce the issue using your hmip-xml-rpc-bug.py script and here are my results:

On the CCU / RaspberryMatic you'll get something the following when the second init request has been sent:

==> hmserver.log <==
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.LegacyServiceHandler INFO  [vert.x-worker-thread-2] (un)registerCallback on LegacyServiceHandler called from url: http://192.168.1.30:32001
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.LegacyServiceHandler INFO  [vert.x-worker-thread-0] init finished
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer INFO  [vert.x-worker-thread-3] Added InterfaceId: testclient
May 22 22:49:03 de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer ERROR [vert.x-worker-thread-3] Serialization Exception: Could not add interface: testclient
de.eq3.cbcs.legacy.communication.rpc.RpcSerializationException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]
Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRequestResponseProcessor.parseResponse(XmlRequestResponseProcessor.java:173)
        at de.eq3.cbcs.legacy.communication.rpc.internal.transport.http.HttpTransport.sendRequest(HttpTransport.java:107)
        at de.eq3.cbcs.legacy.communication.rpc.internal.rpc.RpcClient.sendRequest(RpcClient.java:94)
        at de.eq3.cbcs.legacy.communication.rpc.internal.rpc.RpcClient.invoke(RpcClient.java:82)
        at com.sun.proxy.$Proxy22.listDevices(Unknown Source)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.LegacyBackendClient.listDevices(LegacyBackendClient.java:139)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.DeviceUtil.synchronizedBackendDevices(DeviceUtil.java:144)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer.handle(InterfaceInitializer.java:109)
        at de.eq3.cbcs.legacy.bidcos.rpc.internal.InterfaceInitializer.handle(InterfaceInitializer.java:26)
        at io.vertx.core.eventbus.impl.HandlerRegistration.deliver(HandlerRegistration.java:212)
        at io.vertx.core.eventbus.impl.HandlerRegistration.handle(HandlerRegistration.java:191)
        at io.vertx.core.eventbus.impl.EventBusImpl.lambda$deliverToHandler$3(EventBusImpl.java:505)
        at io.vertx.core.impl.ContextImpl.lambda$wrapTask$2(ContextImpl.java:337)
        at io.vertx.core.impl.TaskQueue.lambda$new$0(TaskQueue.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[42,20]
Message: found: CHARACTERS, expected START_ELEMENT or END_ELEMENT
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.nextTag(XMLStreamReaderImpl.java:1257)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseObject(XmlRpcParser.java:437)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseValue(XmlRpcParser.java:394)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseArray(XmlRpcParser.java:521)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseValue(XmlRpcParser.java:387)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRpcParser.parseMethodResponse(XmlRpcParser.java:164)
        at de.eq3.cbcs.legacy.communication.rpc.internal.format.xml.XmlRequestResponseProcessor.parseResponse(XmlRequestResponseProcessor.java:171)
        ... 16 more

I was able to reproduce these error outputs in hmserver.log as soon as an listDevices() answer is sent to HmIPServer with a non-empty device list. So yes, there indeed seems to be an issue in dealing with receiving a list of devices with a listDevices() rpc execution in HmIPServer.

This only happens for HmIP. Doing the same with port 2001 shows no errors in the hmserver.log (although I don't get ANY logs for 2001).

Please note that the BidCos-RF service on Port 2001 is not provided by HmIPServer but the rfd service which in fact in a C++ tool and output its debug/error output directly to syslog, hence you have to look into /var/log/messages for potential error messages for the rfd/BidCos-RF service on Port 2001.

At the bottom of the script the wait times could be increased. What will be observed is, that after the first init we receive events from the devices (if we wait long enough for events to happen obviously). After the second init we don't receive any events anymore. I assume this is a result of the exception.

This I could not reproduce here. In fact, even after the second init() I perfectly receive events, at least for the virtual devices I could still trigger in the WebUI.

That being said, the reason for this behavior seems to be, that we don't just return an empty list to the CCU after the init - it already has the device descriptions from the first init. If a clearDevices() is added after the first time.sleep(5), we don't get any errors on the CCU during the second init. This fits the observation in Home Assistant, which needs to be restarted to receive events again. The restart of Home Assistant clears the local devices, which is why upon the next startup it will be in the state like this script is during the first init.

Can you please provide example output of your script showing the exact payload of the error? Would really help if one can see that after the second init() no events are received anymore. As said, I can perfectly use the WebUI here and do a PRESS_SHORT or PRESS_LONG on the internal/virtual devices the WebUI provides and then I can see these events in the python script, even after the second init.

From my point of view this definitely is a bug, because there would be no point in asking the remote server for the known devices if the presence of devices leads to a failure. On top of that, the CCU should not require the remote-end to clear its knowledge about present devices and set them up all the way from the beginning.

I fully agree with you that it should be perfectly able to directly provide a device list in the listDevices() answer and the XMLRPC service of HmIPServer should then just diff the provided devices and just call newDevices() with the devices missing. However, please note that the XMLRPC Service in HmIPServer is unfortunately quite legacy and has some shortcomings/bugs. But I will try to get eQ3 informed and see that this issue will be resolved. As a workaround (until the issue is fixed), I would suggest that you change the initialization in home assistent that no device list is provided in the listDevices() execution for the HmIP service query and that you do the diffing (old vs new devices) on your own. AFAIK ioBroker and other third-party services are doing the same and do not rely on the XMLRPC service for doing the listDevices() based diffing as one would expect. See here, for example:

https://github.com/ioBroker/ioBroker.hm-rpc/blob/master/hm-rpc.js#L1013

There, ioBroker.hm-rpc is doing the same for the HmIP service and I think for the exact same reason.

As far as I am aware of, the XML-RPC API documentation does not state, that the format of the response to the listDevices call has to match a specific format. Hence copying what the CCU itself returns when calling listDevices (that's essentially what the script is doing) should not lead to such an error. And again, this works for the old wired and RF interfaces. So for HmIP it should either behave the same way, or the limitations need to be communicated clearly.

Nope, the format should be perfectly fine as you provided it, at least AFAICS.

One more detail I found while looking at this on RaspberryMatic with strace is, that the response of our server always gets cut to 8192 bytes. The (redacted) trace (done on a different machine) looks like this:

...
837   recv(101, "HTTP/1.0 200 OK\r\nServer: BaseHTTP/0.6 Python/3.7.3\r\nDate: Wed, 29 Apr 2020 22:04:01 GMT\r\nContent-type: text/xml\r\nContent-length: 145462\r\n\r\n
<?xml version='1.0'?>\n<methodResponse>\n<params>\n<param>\n<value><array><data>\n<value><struct>\n<member>\n<name>TYPE</name>\n<value><string>HMIP-PSM</string></value>...............<string>SERVICE</string></value>\n</data></array></value>\n</member>\n<member>\n<name>FIRMWARE</name>\n<value><string>2.6.2</string></value>\n<", 8192, MSG_DONTWAIT) = 1599
...

Obviously the XML is bad since it ends abruptly. So the exception makes sense. The question is why the CCU only receives the first 8192 bytes. I also traced the other end, and the data that gets sent as a response is not truncated. This leads me to believe, that on the CCU there seems to be some buffer-limitation for incoming data.

Thanks for that one. This indeed looks like there is a buffer limiting factor here that could be one of the main reasons why using listDevices() with the HmIPServer XMLRPC service is not working correctly ATM. I will forward these observations to my eQ3 contacts and hope they will fix it in the next release.

16 remaining items

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    ⚓ upstream issueThis is a bug/issue for/in upstream software (OCCU, etc.)🐛 bug-reportSomething isn't working🏷️ HmIPServerThis refs the HmIPServer component👍 importantThis is an important issue/ticket with high priority

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      External reconnect fails as long as "old" connection has not been terminated gracefully · Issue #843 · jens-maus/RaspberryMatic