
Feature/simple process worker #76

Merged: 26 commits into spulec:master from psarma89:feature/simple-process-worker on May 14, 2021

Conversation

@psarma89 (Contributor) commented Apr 7, 2021

About

  • This PR creates a SimpleProcessWorker and SimpleManagerWorker
  • Removes the internal queue
  • Combines the functionality of the ReadWorker and ProcessWorker into the SimpleProcessWorker (a rough sketch follows this list)
  • Adds a simple-worker flag to use the new SimpleManagerWorker
  • Keeps support for multiple queues, concurrency, batchsize, and interval; removes prefetch-multiplier support when using SimpleManagerWorker
  • Conducted throughput testing to make sure that SimpleProcessWorker does not lose any throughput when compared to ProcessWorker
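
For reviewers skimming, here is a rough sketch of the structural change only. It is not the actual implementation: the class shape, method names, and boto3 calls are assumptions based on the snippets quoted later in this thread.

```python
import boto3

LONG_POLLING_INTERVAL = 20  # seconds; the review below notes the current interval is 20s


class SimpleProcessWorker:
    """Sketch: read from SQS and process in the same step, with no internal
    queue handing messages from a ReadWorker to a ProcessWorker."""

    def __init__(self, queue_url, batchsize=1):
        self.queue_url = queue_url
        self.batchsize = batchsize
        self.connection = boto3.client("sqs")

    def run(self):
        while True:
            response = self.connection.receive_message(
                QueueUrl=self.queue_url,
                MaxNumberOfMessages=self.batchsize,
                WaitTimeSeconds=LONG_POLLING_INTERVAL,
            )
            for message in response.get("Messages", []):
                # Process immediately after reading; nothing sits in a local
                # queue while its visibility timeout ticks down.
                self.process_message(message)
                self.connection.delete_message(
                    QueueUrl=self.queue_url,
                    ReceiptHandle=message["ReceiptHandle"],
                )

    def process_message(self, message):
        ...  # decode the task payload and invoke the registered task
```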

To Do

  • See if it makes sense to make things more DRY between ReadWorker, ProcessWorker and SimpleProcessWorker by subclassing from a common parent
  • See if it makes sense to make things more DRY between SimpleManagerWorker and ManagerWorker by subclassing from a common parent
  • Add tests for new classes
  • Update README

Decisions

  • Decision 1: For SimpleProcessWorker, do we want to keep track of the number of messages_processed for its memory implications, and is this still relevant?
    • Decision 1a: If we keep this, does a message that is grabbed and then discarded due to the visibility timeout count against messages_processed?
  • Decision 2: Should read_message in SimpleProcessWorker break, or skip the visibility timeout check entirely since we are doing it in process_message?
  • Decision 3: Should we immediately make messages visible that have failed, but have a long visibility timeout?
  • Decision 4: Do we support batchsize in SimpleProcessWorker?
  • Decision 5: Support multiple sqs queues for SimpleManagerWorker?
  • Decision 6: Support concurrency for SimpleManagerWorker?

@psarma89 (Contributor Author) commented Apr 7, 2021

@spulec @jimjshields @nadlerjessie Curious about your thoughts on the decision points we need to settle to finish SimpleProcessWorker, and on a general review of this PR.

@coveralls commented Apr 7, 2021

Coverage decreased (-0.1%) to 97.531% when pulling bcfdcc1 on psarma89:feature/simple-process-worker into d0c8e3d on spulec:master.

@nadlerjessie left a comment

I'm excited about this @psarma89!! Overall this looks good, and I agree there's an opportunity to refactor to avoid repeated code. My biggest sticking point is the idea of discarding messages because they exceed the visibility timeout; I think it will cause more trouble than it's worth. Removing batch size should remove the potential gotcha. I don't have a huge need for batch size, but I don't know about others.

I left comments inline which address some of your questions, but to be explicit, my answers follow each question:

  • Decision 1: For SimpleProcessWorker, do we want to keep track of the number of messages_processed for its memory implications, and is this still relevant? I don't see the harm in keeping it.
  • Decision 1a: If we keep this, does a message that is grabbed and then discarded due to the visibility timeout count against messages_processed? I don't think we should discard messages due to the visibility timeout; more comments in the PR.
  • Decision 2: Should read_message in SimpleProcessWorker break, or skip the visibility timeout check entirely since we are doing it in process_message? Same as above.
  • Decision 3: Should we immediately make messages visible that have failed, but have a long visibility timeout? I don't understand the question. Can you give an example?
  • Decision 4: Do we support batchsize in SimpleProcessWorker? I would rather have concurrency than batchsize. That might address some of my visibility timeout concerns too.
  • Decision 5: Support multiple sqs queues for SimpleManagerWorker? I don't think it's necessary.
  • Decision 6: Support concurrency for SimpleManagerWorker? I think it's nice to keep it.

pyqs/worker.py Outdated
Comment on lines 318 to 327
if int(end - start) >= self.visibility_timeout:
    # Don't add any more messages since they have re-appeared in the
    # sqs queue. Instead just reset and get fresh messages from the
    # sqs queue
    msg = (
        "Clearing Local messages since we exceeded "
        "their visibility_timeout"
    )
    logger.warning(msg)
    break


I know you asked about wanting both checks but I'm wondering - do we want either of them? My understanding about the initial motivation was that we wanted to avoid processing messages multiple times. The internal queue made it more likely that the message was pulled and the visibility timeout expired before we processed it. @spulec could speak to that more.

It's still possible to execute a message twice. Let's say we have a batchsize of 2, a vis timeout of 60 seconds, and the first message takes 65 seconds. Message 2 could exceed the vis timeout before we ever run it.

However, I don't know if the risk of running it multiple times is worse than the risk of pulling a message and never executing it. If we have a max receives of 1, that means message number 2 ends up in the DLQ without ever being run. IMO this is the largest pain point with pyqs today.

If we want to keep it, at least now there's a way around it that wasn't available with the internal queue, but it still sets up a potential trap.

@psarma89 (Contributor Author) replied:

That is my understanding as well. It is there to avoid double processing a message. Would this not be relevant, for example, when sending email campaigns in clementine, or are there application checks to avoid it? I am OK with not doing this check, adding better documentation to explain the implications, and putting the onus on the user.

@nadlerjessie replied:

I think most apps that have max receives > 1 should have logic to address a message getting picked up again. I suppose it's possible the checks wouldn't catch it if two are picked up at the exact same time. I think an at-least-once pattern is preferable to what we have here, which can lead to messages never being processed, but I don't know if that's true for all applications.

The real issue here is batchsize. If we remove batchsize I don't think this check does anything (we read the message and immediately start processing it). Then we can safely remove these checks and the possibility of something going straight to the DLQ without the task ever executing.

messages = self._get_connection().receive_message(
    QueueUrl=self.queue_url,
    MaxNumberOfMessages=self.batchsize,
    WaitTimeSeconds=LONG_POLLING_INTERVAL,
)


Do we want to keep the LONG_POLLING_INTERVAL? That means it could take 20 seconds after writing an item to SQS to pick it up. It would be nice if this were configurable, or if the wait time were 0 when batchsize = 1. It's more API calls, but I don't think that matters much.
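
A minimal sketch of that suggestion against the snippet quoted above; the conditional and surrounding names are illustrative only and are not part of this PR:

```python
# Use a zero wait time when batchsize is 1 instead of the fixed
# LONG_POLLING_INTERVAL, at the cost of more receive_message calls.
wait_time = 0 if self.batchsize == 1 else LONG_POLLING_INTERVAL

messages = self._get_connection().receive_message(
    QueueUrl=self.queue_url,
    MaxNumberOfMessages=self.batchsize,
    WaitTimeSeconds=wait_time,
)
```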

@psarma89 (Contributor Author) replied:

This is already tracked in issue #60, and yes, maybe we can make it configurable, but this generally applies to both SimpleProcessWorker and ReadWorker, so maybe we can deal with it in a separate PR.

Comment on lines +299 to +301
if self.messages_processed \
        >= self._messages_to_process_before_shutdown:
    self.shutdown()


You asked about whether or not we want to keep messages_processed. What are the shutdown implications if we don't? I'm not worried about the memory from keeping a counter.

@psarma89 (Contributor Author) replied:

In hindsight, it's probably safer to keep, so unless we all strongly disagree we will most likely just keep it, but I wanted to surface it.

@psarma89 (Contributor Author) commented Apr 8, 2021

@nadlerjessie regarding your point about Decision 3 and the comment below: I was thinking of this for situations where we have a moderate visibility timeout, let's say 20 mins, and max receives > 1. If something fails quickly but has a chance of succeeding if retried (say Databricks was busy, but 10 mins later it could be fine), do we want to wait the full 20 mins to retry that message, or make it immediately visible? I think this is an edge case, and interesting to think about, but maybe the safest thing we can do is nothing. I may be underestimating the complexity of making it immediately visible, and of taking that ownership away from the engineer, who understands their application and sets appropriate visibility timeouts. So I'm not opposed to doing nothing here.

Decision 3: Should we immediately make messages visible that have failed, but have a long visibility timeout? I don't understand the question. Can you give an example?
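
For concreteness, "making a failed message immediately visible" would mean resetting its remaining visibility timeout through the standard SQS API. A minimal sketch, assuming boto3; the function and its arguments are illustrative, not something this PR adds:

```python
import boto3

sqs = boto3.client("sqs")


def make_visible_now(queue_url, receipt_handle):
    # Zero out the remaining visibility timeout so SQS can re-deliver the
    # failed message right away instead of after the full timeout elapses.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=0,
    )
```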

@psarma89 (Contributor Author) commented Apr 9, 2021

As an update, here is where we landed on the decisions:

Decisions

  • Decision 1: For SimpleProcessWorker, do we want to keep track of the number of messages_processed for its memory implications, and is this still relevant?
    • We want to keep this as a best practice to prevent potential memory leaks in the worker since you can't control what might actually be run on the worker
    • Decision 1a: If we keep this, does a message that is grabbed and then discarded due to the visibility timeout count against messages_processed?
      • We decided to remove the check for visibility timeout, so this becomes irrelevant
      • Removing the visibility timeout check and allowing the worker to process a message even past its visibility timeout means we solve the problem of never processing a message when max_receives=1 and the visibility timeout was set too short and gets exceeded
      • It increases the probability that we process a message more than once, especially if batchsize > 1, but the developer can handle this by checking whether the message has already been processed (see the sketch after this list). It shifts the onus to the developer to account for this, which is better than trying to prevent it in the library
  • Decision 2: Should read_message in SimpleProcessWorker break, or skip the visibility timeout check entirely since we are doing it in process_message?
    • See above explanation of Decision 1a, but this also becomes irrelevant
  • Decision 3: Should we immediately make messages visible that have failed, but have a long visibility timeout?
    • We defer addressing this concern at the moment, since it is the responsibility of the developer to understand their application code and set appropriate visibility timeouts
  • Decision 4: Do we support batchsize in SimpleProcessWorker?
    • We will continue to support batchsize but default to batchsize=1 when using the --simple-worker flag
    • Making batchsize 1 and not allowing it to be configurable would mean relying on the latency of sqs sending us one message at a time; for fast messages, that would increase latency disproportionately and reduce throughput
  • Decision 5: Support multiple sqs queues for SimpleManagerWorker?
    • We will continue to support multiple sqs queues
  • Decision 6: Support concurrency for SimpleManagerWorker?
    • We will continue to support concurrency
  • We will not make LONG_POLLING_INTERVAL configurable in this PR
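
As referenced under Decision 1a above, here is a sketch of the kind of application-side check that makes at-least-once delivery safe. The dedupe store and helper names are placeholders, not part of pyqs:

```python
processed_ids = set()  # stand-in for a durable store such as a database table


def do_work(message):
    ...  # the actual task body


def handle(message):
    # With the visibility timeout check removed, a message may be delivered
    # more than once, so the task checks before doing any work.
    message_id = message["MessageId"]
    if message_id in processed_ids:
        return  # already handled; safe to skip
    do_work(message)
    processed_ids.add(message_id)
```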

psarma89 added 20 commits April 9, 2021 13:28
@psarma89 (Contributor Author) commented:
@jimjshields @nadlerjessie @spulec I finished the refactoring work and added tests to this PR. It is ready for a full review. I will update the README once everyone has had a chance to review. In the meantime, I will be re-running a subset of my earlier throughput testing to confirm the behavior is the same.

psarma89 force-pushed the feature/simple-process-worker branch from 2f8b820 to b8d63ee on May 13, 2021 at 17:39
psarma89 force-pushed the feature/simple-process-worker branch from 8b4c6a2 to bcfdcc1 on May 14, 2021 at 18:24
spulec merged commit 44da356 into spulec:master on May 14, 2021
psarma89 deleted the feature/simple-process-worker branch on March 14, 2024