Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#251 Health check circuit breaker #282

Conversation

EtherZa
Copy link
Contributor

@EtherZa EtherZa commented Jul 17, 2024

Addition of a health check circuit breaker for consumers (SlimMessageBus.Host.CircuitBreaker.HealthCheck).

Consumers can be optionally associated with one or more tags (and severities) in order to temporarily pause the consumer while a health check is failing.

The HealthCheckBackgroundService taps into the published health reports to distribute health status changes generated by the Microsoft.Extensions.Diagnostics.HealthChecks library. It is added as a hosted service to catch reports outside of the life time of the SlimMessageBus.

    // add health checks with tags
    builder.Services
        .AddHealthChecks()
        .AddCheck<StorageHealthCheck>("Storage", tags: ["Storage"]);
        .AddCheck<SqlServerHealthCheck>("SqlServer", tags: ["Sql"]);

    builder.Services
        .AddSlimMessageBus(mbb => {
            ...

            mbb.Consume<Message>(cfg => {
                ...

                // configure consumer to monitor tag/state
                cfg.PauseOnUnhealthyCheck("Storage");
                cfg.PauseOnDegradedHealthCheck("Sql");
            })
        })

To add the functionality, a few questionable changes have been made. Please shout if you are not comfortable with any of them/would prefer an alternate implementation.

  • IEnumerable<AbstractConsumerSettings> is required by AbstractConsumer's constructor (retrieve associated circuit breaker configuration)
  • AbstractConsumerSettings maintains a reference to the associated MessageBusSettings (IServiceProvider to create the IConsumerCircuitBreaker instances and to retrieve the associated bus name for logging)
  • AbstractConsumerBuilder extends IHasPostConfigurationActions that will be executed by the MessageBusBuilder pipeline (circuit breakers service registrations)

A sample application has been included (Sample.CircuitBreaker.HealthCheck) to demonstrate the functionality with health checks that fail randomly.

Ideas for future improvements

Although I have no plans on doing this now, the functionality could be extended by incorporating an IConsumerInterceptor to watch for exceptions and, on catching one, initiate an ad hoc health check (once per exception type/period/etc). The evaluation period could then be reduced from the standard health check publishing interval to a reactive check based on an application failure backed up by the default cadence.

@EtherZa EtherZa force-pushed the feature/251-health-check-circuit-breaker branch 5 times, most recently from 0e0cf3b to d1865a5 Compare July 17, 2024 07:29
@EtherZa
Copy link
Contributor Author

EtherZa commented Jul 17, 2024

Update 2:
That wasn't it either.

OutboxTests.Given_CommandHandlerInTransaction_When_ExceptionThrownDuringHandlingRaisedAtTheEnd_Then_TransactionIsRolledBack_And_NoDataSaved_And_NoEventRaised is attempting a distributed transaction promotion.

[xUnit.net 00:01:07.39] [08:10:42 DBG] SlimMessageBus.Host.Outbox.OutboxForwardingPublishInterceptor Forwarding published message of type CustomerCreatedEvent to the outbox
[xUnit.net 00:01:07.39] [08:10:42 ERR] Microsoft.EntityFrameworkCore.Update An exception occurred in the database while saving changes for context type 'SlimMessageBus.Host.Outbox.DbContext.Test.DataAccess.CustomerContext'.
[xUnit.net 00:01:07.39] System.PlatformNotSupportedException: This platform does not support distributed transactions.
[xUnit.net 00:01:07.39] at System.Transactions.Oletx.OletxTransactionManager.CreateTransaction(TransactionOptions options)
[xUnit.net 00:01:07.39] at System.Transactions.TransactionStatePromoted.EnterState(InternalTransaction tx)
[xUnit.net 00:01:07.39] at System.Transactions.EnlistableStates.Promote(InternalTransaction tx)
[xUnit.net 00:01:07.39] at System.Transactions.Transaction.Promote()

Update:
Sql also throws this "connectivity" exception when a task is cancelled while spooling the data.

Condition added to catch SqlException and check if the token has been cancelled. If so a TaskCanceledException exception will be thrown instead.

https://learn.microsoft.com/en-us/answers/questions/207655/sqlexception-with-error-code-0-and-message-a-sever


@zarusz It looks like sql may be experiencing an issue.

[07:36:13 ERR] SlimMessageBus.Host.Outbox.Services.OutboxSendingTask Error while processing outbox messages
 Microsoft.Data.SqlClient.SqlException (0x80131904): A severe error occurred on the current command.  The results, if any, should be discarded.
 Operation cancelled by user.
    at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
    at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, SqlCommand command, Boolean callerHasConnectionLock, Boolean asyncClose)
    at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
    at Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
    at Microsoft.Data.SqlClient.SqlDataReader.get_MetaData()
    at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
    at Microsoft.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader(Boolean isInternal, Boolean forDescribeParameterEncryption)
    at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
    at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync(IAsyncResult asyncResult)
    at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
 --- End of stack trace from previous location ---
    at SlimMessageBus.Host.Outbox.Sql.SqlOutboxRepository.LockAndSelect(String instanceId, Int32 batchSize, Boolean tableLock, TimeSpan lockDuration, CancellationToken token) in /_/src/SlimMessageBus.Host.Outbox.Sql/SqlOutboxRepository.cs:line 54
    at SlimMessageBus.Host.Outbox.Services.OutboxSendingTask.SendMessages(IServiceProvider serviceProvider, IOutboxRepository outboxRepository, CancellationToken cancellationToken) in /_/src/SlimMessageBus.Host.Outbox/Services/OutboxSendingTask.cs:line 245
    at SlimMessageBus.Host.Outbox.Services.OutboxSendingTask.Run() in /_/src/SlimMessageBus.Host.Outbox/Services/OutboxSendingTask.cs:line 173
 ClientConnectionId:39c33641-24b9-4302-92bb-4b475a76b5ad

@EtherZa EtherZa force-pushed the feature/251-health-check-circuit-breaker branch from de87cc0 to 5a0fc7b Compare July 17, 2024 07:58
@EtherZa EtherZa force-pushed the feature/251-health-check-circuit-breaker branch 2 times, most recently from b4b46b6 to 4be0dbe Compare July 19, 2024 16:20
@zarusz
Copy link
Owner

zarusz commented Jul 20, 2024

@EtherZa excellent feature and nice proposed API. Agreed that in the future a consumer interceptor could also observe for failure rates and proactively switch off the circuit - that could be added later.

I will be able to give it some review in 1.5 weeks as I am out now without my laptop.

@EtherZa
Copy link
Contributor Author

EtherZa commented Jul 20, 2024

@zarusz No worries. I just hope that's a pleasure trip :)

docs/intro.t.md Outdated Show resolved Hide resolved
@@ -2,66 +2,122 @@

public abstract class AbstractConsumer : IAsyncDisposable, IConsumerControl
Copy link
Owner

@zarusz zarusz Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we invert the responsibilities here?

As it is the SlimMessageBus.Host knows about circuit breakers while its an opt-in plugin. I sense a leakage of concerns here.

What if the circuit breaker plugin (and its background host task) could traverse the IMasterBus and its child buses and Consumers to find the ones with matching tags and either start or stop them depending on the circuit status?

That way we contain the circuit breaker logic in the plugin assembly, and the .Host layer is free of that logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had a look at this approach, but if a circuit breaker traverses IMasterBus as you suggest:

  1. it will be ignorant to any other status changes made by other workflows and then may unwittingly reverse an intended configuration change
  2. be side stepped by toggling the IConsumerControl methods directly via MessageBusBase.Consumers (unable to start with circuit breakers already in the loop)
  3. require a centralised co-ordinator to aggregate state for all circuit breakers (shifting complexity but still resulting in an inferior experience)

The alternate approach that I investigated was to create a shim/proxy that intercepts the Consumers property in MessageBusBase. It could then trip/open the circuits as required, but would need to change the collection to be that of IConsumerControl so that a proxy could intercept the methods. This approach then cuts off all access to the underlying consumers.

ie.

protected void AddConsumer(AbstractConsumer consumer) => _consumers.Add(new CircuitBreakerProxy(consumer));

In the end the supplied implementation rendered the best results at the expense of being more invasive than ideal, but please let me know if you have another approach.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel pretty strongly we need to go for externalization of this logic outside of the .Host layer and into the plugin layer which needs to coordinate consumers and maintain the Started/Stopped state based on the respective circuit state (I see the plugin acting as an adapter).

If anything we could add a factory method to be able to wrap these consumers into some proxies (from the circuit plugin) - or another smart design.

I am less concerned that someone will use IConsumerControl explicitly when using a circuit breaker. And even if so, then if the consumers were never started (or are stopped imperatively) the breaker logic could detect that and could withhold from starting such consumers if the breaker is on again.

The one thing that might get tricky with externalized logic is to maintain identity (or a pointer) to the consumer instance, but if anything we could (assign some identity to these consumers to find them easily by ID). However recalling vaguely the AbstractConsumer instances should remain (not being disposed) between Start/Stop cycles.

I am open to prototyping this logic on the side together if that helps express my intent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll need to dig into it a little bit, but I am not sure when I will be able to do so. I have some pretty tight deadlines over the next 3 months which are going to leave little time unfortunately.

I think that I have a fair grasp of what you are after, but there are some immediate issues that will need to be overcome as you have described:

  1. MessageBusBase does create/dispose the consumers between Start/Stop cycles.
  2. There is no reference between configuration and consumer. I added this in the current PR but making it externally visible is not ideal as the objects are not immutable.
  3. MessageBusHostedService can be replaced to manage the running state, but due to 1, there is no interception point to stop a consumer from starting if the circuit has already been tripped.

@EtherZa EtherZa force-pushed the feature/251-health-check-circuit-breaker branch from 4be0dbe to 947df45 Compare July 30, 2024 16:28
Copy link

sonarqubecloud bot commented Aug 3, 2024

@zarusz
Copy link
Owner

zarusz commented Oct 19, 2024

@EtherZa would it be possible to either:

  1. Rebase on release/v3, resolve few conflicts and change PR target to release/v3? I could merge as is and then refactor based on my earlier ask (shift some responsibilities).
  2. Incorporate my earlier feedback. I understand you might have other priorities, hence option 1.

@EtherZa EtherZa force-pushed the feature/251-health-check-circuit-breaker branch from 947df45 to e17fbf0 Compare October 20, 2024 01:51
Signed-off-by: Richard Pringle <richardpringle@gmail.com>
@EtherZa EtherZa force-pushed the feature/251-health-check-circuit-breaker branch from e17fbf0 to 3ca9335 Compare October 20, 2024 02:45
@EtherZa
Copy link
Contributor Author

EtherZa commented Oct 20, 2024

@zarusz I have rebased the change on release/v3. If you have the appetite to do so, you are more than welcome to jump in and make the changes. Work committments are still a little on the heavy side at the moment.

Copy link

@zarusz zarusz changed the base branch from master to release/v3 October 20, 2024 05:20
@zarusz zarusz changed the base branch from release/v3 to feature/v3_circuit_breaker December 25, 2024 14:50
@zarusz zarusz merged commit d63b275 into zarusz:feature/v3_circuit_breaker Dec 25, 2024
3 checks passed
@zarusz zarusz mentioned this pull request Dec 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants