
Fix plugin lock when restoring containers #25005

Closed
wants to merge 1 commit

Conversation

Contributor

@ohmystack ohmystack commented Jul 25, 2016

The problem occurs when starting dockerd with multiple containers that use the
same unavailable plugin.
The containers are restored one by one, not asynchronously, so dockerd
has to wait N * (~16 secs) to restore all the containers.
(N is the number of containers using the unavailable plugin.)

How to reproduce

Precondition: I have an rbd-docker-plugin, with the plugin name "rbd".

Step 1. Run the rbd-docker-plugin and dockerd.
Step 2. Create 2 containers using the rbd plugin:

docker create --name cont01 --volume-driver=rbd -v docker_volumes/web1:/aaa ubuntu:14.04 sleep 5
docker create --name cont02 --volume-driver=rbd -v docker_volumes/web1:/aaa ubuntu:14.04 sleep 5

Step 3. Stop the rbd-docker-plugin.
Step 4. Restart dockerd. (Now dockerd is unable to connect to plugin rbd.)

Then, you will find the dockerd logs like this:

WARN[0001] Unable to locate plugin: rbd, retrying in 1s
WARN[0002] Unable to locate plugin: rbd, retrying in 2s
WARN[0004] Unable to locate plugin: rbd, retrying in 4s
WARN[0008] Unable to locate plugin: rbd, retrying in 8s
ERRO[0016] get docker_volumes/web1: Error looking up volume plugin rbd: plugin not found
WARN[0016] Unable to locate plugin: rbd, retrying in 1s
WARN[0017] Unable to locate plugin: rbd, retrying in 2s
WARN[0019] Unable to locate plugin: rbd, retrying in 4s
WARN[0023] Unable to locate plugin: rbd, retrying in 8s
ERRO[0031] get docker_volumes/web1: Error looking up volume plugin rbd: plugin not found

In total, this takes 2 * 16 seconds = 32 secs.

Should be

Every container that waits for the plugin should start waiting at the same time, like this:

WARN[0001] Unable to locate plugin: rbd, retrying in 1s
WARN[0002] Unable to locate plugin: rbd, retrying in 2s
WARN[0004] Unable to locate plugin: rbd, retrying in 4s
WARN[0008] Unable to locate plugin: rbd, retrying in 8s
ERRO[0016] get docker_volumes/web1: Error looking up volume plugin rbd: plugin not found
ERRO[0016] get docker_volumes/web1: Error looking up volume plugin rbd: plugin not found

In total, this takes 16 seconds.

The cause

Though daemon.prepareMountPoints(c) runs in separate goroutines,
containers are still initialized synchronously because each goroutine
blocks at "GetWithRef".

When multiple goroutines call drivers.lookup for the same volume concurrently, the following happens (a simplified sketch follows this list):

  • There are 2 steps in lookup to initialize a volume driver.
    • Step 1. Get the plugin via plugin.LookupWithCapability(name, extName) and put it into storage.plugins[name].
    • Step 2. Put the plugin into drivers.extensions[name].
  • The wait loop (which waits ~16s in total) is in Step 1. lookup holds a lock (covering Step 1 + Step 2) at the very beginning, before Step 1's own lock, so the 2nd goroutine's Step 1 cannot enter its wait loop; every goroutine's Step 1 becomes synchronous.
    • When the 2nd goroutine's Step 1 finally runs, it doesn't know that the plugin with this name has already failed (in the 1st goroutine).
    • So the 2nd goroutine waits again, another 16s.
    • In total, it becomes N * 16s.
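To make the serialization concrete, here is a minimal, self-contained Go sketch (not the real moby code: lookupPlugin, driverLock, and the timings are placeholders). Because the per-name lock is held across the whole retry loop, the second goroutine cannot even start waiting until the first has burned its full backoff.

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// lookupPlugin stands in for plugin.LookupWithCapability plus its
// exponential-backoff retry loop (1s+2s+4s+8s, roughly 16s in the real
// daemon; shortened here so the example finishes quickly).
func lookupPlugin(name string) error {
	time.Sleep(150 * time.Millisecond) // pretend to retry and then give up
	return errors.New("plugin not found: " + name)
}

var driverLock sync.Mutex // stands in for the per-name lock taken at the top of lookup

// lookup mimics the problematic shape: the lock wraps both the slow Step 1
// (plugin lookup with backoff) and the fast Step 2 (registering the driver).
func lookup(name string) error {
	driverLock.Lock()
	defer driverLock.Unlock()
	// Step 1: slow, and it fails only after the whole backoff.
	if err := lookupPlugin(name); err != nil {
		return err
	}
	// Step 2 (store into drivers.extensions[name]) would go here on success.
	return nil
}

func main() {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two containers restoring volumes from the same "rbd" plugin
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println(lookup("rbd"))
		}()
	}
	wg.Wait()
	// Prints roughly 300ms, i.e. N * (backoff time): the waits are serialized.
	fmt.Println("elapsed:", time.Since(start).Round(10*time.Millisecond))
}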

Here's the fix

  1. The 1st goroutine holds the lock.
  2. The 2nd, 3rd, ... goroutines wait for the lock.
  3. If the 1st goroutine's Step 1 + Step 2 fails, it tells the 2nd, 3rd, ... goroutines to return an error instead of doing Step 1 + Step 2 again.

Only after getting the volume driver successfully should the volume be locked by name.

There is no need to lock drivers when looking up a driver, because
"lookup" actually does a GetOrCreate. When it calls
get(name string) in pkg/plugins/plugins.go, there is already a lock that
ensures the same plugin won't be registered twice. So we can remove the
outer lock in "lookup"; otherwise it blocks "prepareMountPoints".
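A hedged sketch of that idea (the names store, getPlugin, and extensions are placeholders; this is the shape of the approach rather than the actual diff): the slow lookup runs outside the store's lock, and only the map write is locked, relying on the plugin package's own lock to prevent double registration.

package volumedrivers

import "sync"

type driver struct{ name string }

type store struct {
	mu         sync.Mutex
	extensions map[string]*driver
}

func newStore() *store {
	return &store{extensions: make(map[string]*driver)}
}

// getPlugin stands in for get(name) in pkg/plugins: it holds its own lock, so
// the same plugin is never registered twice even without an outer lock here.
func getPlugin(name string) (*driver, error) {
	return &driver{name: name}, nil
}

func (s *store) lookup(name string) (*driver, error) {
	// Fast path: the driver is already registered.
	s.mu.Lock()
	if d, ok := s.extensions[name]; ok {
		s.mu.Unlock()
		return d, nil
	}
	s.mu.Unlock()

	// Step 1: the slow plugin lookup (with its retry backoff) runs WITHOUT an
	// outer lock, so concurrent callers wait in parallel instead of in series.
	d, err := getPlugin(name)
	if err != nil {
		return nil, err
	}

	// Step 2: register the driver, locking only the map access.
	s.mu.Lock()
	defer s.mu.Unlock()
	if existing, ok := s.extensions[name]; ok {
		return existing, nil // another goroutine won the race; reuse its driver
	}
	s.extensions[name] = d
	return d, nil
}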

@ohmystack
Contributor Author

@cpuguy83 Would you mind reviewing?

@cpuguy83
Member

@ohmystack I'll have to re-review the whole system. These locks are there for a reason, so just removing them seems dangerous without careful review.

Also, I'm not sure how this would change the sequential-ness, since it still needs to wait while doing the lookup. It seems you've just changed where it waits... but again, I'll have to go through the whole system.

Also ping @anusha-ragunathan

@anusha-ragunathan
Contributor

anusha-ragunathan commented Jul 26, 2016

lookup() uses two types of locks:

  • driverLock is of type locker.Locker, a granular lock which blocks any other access to the given driver "name".
    drivers.driverLock.Lock(name)
    defer drivers.driverLock.Unlock(name)
  • drivers.Lock() protects the entire "extensions" map. It's a global lock. Any access to the extensions map should use this global lock.
    drivers.Lock()
    ext, ok := drivers.extensions[name]
    drivers.Unlock()

Both are necessary to prevent races. Refer to https://github.com/docker/docker/blob/3d13fddd2bc4d679f0eaa68b0be877e5a816ad53/pkg/locker/README.md for more info.
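For readers following along, here is a simplified sketch of how the two locks described above fit together in a lookup-style function. The type and field names are placeholders; only the pkg/locker API (New, Lock, Unlock) is taken from the repository.

package volumedrivers

import (
	"fmt"
	"sync"

	"github.com/docker/docker/pkg/locker"
)

// Driver is a stand-in for the volume driver interface.
type Driver interface {
	Name() string
}

type driverStore struct {
	sync.Mutex                   // the global lock: protects the extensions map
	driverLock *locker.Locker    // the granular lock: serializes work per driver name
	extensions map[string]Driver
}

func newDriverStore() *driverStore {
	return &driverStore{
		driverLock: locker.New(),
		extensions: make(map[string]Driver),
	}
}

func (d *driverStore) lookup(name string) (Driver, error) {
	d.driverLock.Lock(name) // blocks only callers asking for the same name
	defer d.driverLock.Unlock(name)

	d.Lock() // any access to the extensions map goes through the global lock
	ext, ok := d.extensions[name]
	d.Unlock()
	if ok {
		return ext, nil
	}

	// The slow plugin lookup (with its retry backoff) and the registration
	// into d.extensions would happen here, still under driverLock for "name".
	return nil, fmt.Errorf("volume driver %q not found", name)
}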

@ohmystack
Contributor Author

ohmystack commented Jul 27, 2016

@anusha-ragunathan Agree. drivers.driverLock.Lock(name) protects the "GetOrCreate" action for specific drivers.extensions[name].

After looking more carefully, I find the problem is this:

  • There are 2 steps in lookup to initialize a volume driver.
    • Step 1. Get the plugin via plugin.LookupWithCapability(name, extName) and put it into storage.plugins[name].
    • Step 2. Put the plugin into drivers.extensions[name].
  • The wait loop (which waits ~16s in total) is in Step 1. lookup holds a lock at the beginning to prevent Step 2 races; however, this lock is taken before Step 1, so the 2nd goroutine's Step 1 cannot enter its wait loop, and Step 1 becomes synchronous.
    • When the 2nd goroutine's Step 1 finally runs, it doesn't know that the plugin with this name has already failed (in the 1st goroutine).
    • So the 2nd goroutine waits again, another 16s.
    • In total, it becomes N * 16s.

So, the fix can be:

  1. The 1st goroutine holds the lock.
  2. The 2nd, 3rd, ... goroutines wait for the lock.
  3. If the 1st goroutine's Step 1 fails, it tells the 2nd, 3rd, ... goroutines to return an error instead of doing Step 1 again.

Do you agree with this logic?
I'll push an updated commit soon.

@ohmystack
Contributor Author

@cpuguy83 @anusha-ragunathan

  • I reverted the change in my old commit.
  • Added a "SetError" method to pkg/locker/locker.go, which tells the other goroutines that are waiting for the lock to do the same thing to give up and not repeat the failing operation.

What do you think about this change? (If it looks OK, I will squash the commits.)

@ohmystack
Contributor Author

ohmystack commented Jul 29, 2016

About the method name: maybe CancelWithError is a better name than SetError?

CancelWithError indicates that the other goroutines waiting for the lock will get an error instead of acquiring the lock; their locker.Lock(name) call is canceled by this caller.
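A hedged, self-contained sketch of what such a cancelable per-name lock and its caller could look like. Everything here is illustrative: in the PR itself the existing Lock was changed to return an error (see the diff comments below), whereas this sketch uses a separate LockWithError name, keeps no waiter count, and leaves the error sticky for simplicity.

package locker

import "sync"

// nameLock is one named lock; err is set by CancelWithError and observed by
// goroutines that were waiting in LockWithError.
type nameLock struct {
	mu  sync.Mutex
	err error
}

// CancelableLocker is a per-name lock whose holder can fail all waiters.
type CancelableLocker struct {
	mu    sync.Mutex
	locks map[string]*nameLock
}

func NewCancelable() *CancelableLocker {
	return &CancelableLocker{locks: make(map[string]*nameLock)}
}

func (l *CancelableLocker) get(name string) *nameLock {
	l.mu.Lock()
	defer l.mu.Unlock()
	nl, ok := l.locks[name]
	if !ok {
		nl = &nameLock{}
		l.locks[name] = nl
	}
	return nl
}

// LockWithError blocks like a normal Lock, but if a previous holder gave up
// via CancelWithError, the waiter gets that error back and does NOT hold the
// lock afterwards.
func (l *CancelableLocker) LockWithError(name string) error {
	nl := l.get(name)
	nl.mu.Lock()
	if nl.err != nil {
		err := nl.err
		nl.mu.Unlock() // pass the lock on so the next waiter sees the error too
		return err
	}
	return nil
}

// Unlock releases the named lock on the success path.
func (l *CancelableLocker) Unlock(name string) {
	l.get(name).mu.Unlock()
}

// CancelWithError records err and releases the lock; the waiters then wake up
// one by one, see the error, and return it instead of repeating the slow job.
// (Simplification: here the error stays set forever; the proposal in this PR
// only cancels the goroutines that are already waiting for this name.)
func (l *CancelableLocker) CancelWithError(name string, err error) {
	nl := l.get(name)
	nl.err = err
	nl.mu.Unlock()
}

// Example caller: the first goroutine pays the ~16s plugin backoff once; the
// later ones fail fast with the same error.
func lookupDriver(l *CancelableLocker, name string, slowLookup func(string) error) error {
	if err := l.LockWithError(name); err != nil {
		return err // canceled by an earlier goroutine; do not Unlock here
	}
	if err := slowLookup(name); err != nil {
		l.CancelWithError(name, err) // releases the lock and fails the waiters
		return err
	}
	l.Unlock(name)
	return nil
}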

@ohmystack
Contributor Author

CC @thaJeztah

@GordonTheTurtle GordonTheTurtle added the dco/no Automatically set by a bot when one of the commits lacks proper signature label Aug 1, 2016
@GordonTheTurtle GordonTheTurtle removed the dco/no Automatically set by a bot when one of the commits lacks proper signature label Aug 1, 2016
@ohmystack
Contributor Author

From the Jenkins logs, 2 tests failed because of "No space left on device" on the test machines... 😓
I'll re-trigger the tests later.

@@ -90,6 +93,13 @@ func (l *Locker) Lock(name string) {
// once locked then we can decrement the number of waiters for this lock
nameLock.Lock()
Member

Isn't this where it's going to block at?

I think we'll need to consider a different implementation, possibly a condition variable that can broadcast to all waiters on error or signal a single waiter when the lock is ready.

Contributor Author

@ohmystack ohmystack Aug 1, 2016

Yes, @cpuguy83. All goroutines block here, but only one can call CancelWithError, and that goroutine unlocks the lock. After that, one of the other goroutines continues from this line, checks the error, gives up the lock, and returns the error, as if its lock were canceled. Though the waiters are canceled one by one, it is very quick because the check happens inside locker.Locker for "name".

As an alternative, like you said, I've thought about using sync.Cond to replace the sync.Mutex:

locker.Lock("name")
if locker.IsFirstOneGetTheLock("name") {   // Maybe mix sync.Once to do this
  // do the stuff, get the plugin (wait 16s), get err
  if err != nil {
    locker.SetError("name", err)
  }
  locker.Broadcast("name")
} else {
  for condition()  {    // How to define the condition here?
    locker.Wait("name")
  }
  locker.Unlock("name)
}

Its advantage is that pkg/locker looks much more like a "lock" package. However, it is more complicated and harder to use.

Member

No need to expose the condition variable as the API, only need to use it internally instead of using a literal sync.Mutex.

Contributor Author

I got it. But I have a concern: if we put the "only do once during a locker[name]'s lifecycle" logic inside pkg/locker, the package may not be so universal.
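For reference, a hedged sketch of what keeping the condition variable internal could look like (illustrative names only, not moby code): Lock and Unlock keep their usual shape, a Broadcast wakes every waiter at once when the holder cancels, and nothing cond-related appears in the public API.

package locker

import "sync"

// condLock is the internal state for one name; the condition variable never
// leaves the package, matching the suggestion above.
type condLock struct {
	mu     sync.Mutex
	cond   *sync.Cond
	locked bool
	err    error // set once the owning goroutine gives up
}

// CondLocker is a per-name lock built on sync.Cond instead of a bare Mutex.
type CondLocker struct {
	mu    sync.Mutex
	locks map[string]*condLock
}

func NewCond() *CondLocker {
	return &CondLocker{locks: make(map[string]*condLock)}
}

func (l *CondLocker) get(name string) *condLock {
	l.mu.Lock()
	defer l.mu.Unlock()
	c, ok := l.locks[name]
	if !ok {
		c = &condLock{}
		c.cond = sync.NewCond(&c.mu)
		l.locks[name] = c
	}
	return c
}

// Lock blocks until the named lock is free, or returns the error that a
// previous holder broadcast via CancelWithError (in which case the caller
// does NOT hold the lock).
func (l *CondLocker) Lock(name string) error {
	c := l.get(name)
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.locked && c.err == nil {
		c.cond.Wait()
	}
	if c.err != nil {
		return c.err
	}
	c.locked = true
	return nil
}

// Unlock releases the lock and wakes a single waiter.
func (l *CondLocker) Unlock(name string) {
	c := l.get(name)
	c.mu.Lock()
	c.locked = false
	c.mu.Unlock()
	c.cond.Signal()
}

// CancelWithError releases the lock and wakes every waiter at once; each of
// them returns err from Lock instead of acquiring the lock. (As in the other
// sketch, the error is kept sticky here for simplicity.)
func (l *CondLocker) CancelWithError(name string, err error) {
	c := l.get(name)
	c.mu.Lock()
	c.locked = false
	c.err = err
	c.mu.Unlock()
	c.cond.Broadcast()
}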

@cpuguy83
Member

cpuguy83 commented Aug 1, 2016

The behavior still feels a little strange, but I see what it's trying to accomplish.

@anusha-ragunathan This shouldn't be a problem at all with the new plugin arch, should it?

@ohmystack
Contributor Author

ohmystack commented Aug 1, 2016

@cpuguy83 @anusha-ragunathan I think the new one will be fine.

In the new plugin arch,

  • Step 1 and Step 2 are separated. This is good: cmd/dockerd/daemon.go does pluginInit before any containers are restored, so containers don't have to worry about the startup order or wait 16s.
  • The get in the new plugin/manager.go only does the "get" action, so the lock is held very briefly. The old plugin arch tries to load at the end of get, and that load takes a 16s backoff.

HOWEVER, if old plugins run in the new arch without plugins.json, or they are not listed in plugins.json, the problem is still there. The daemon will still block at daemon.prepareMountPoints(c).

I have built with DOCKER_EXPERIMENTAL=1 and tested it.

drivers.driverLock.Lock(name)
if err := drivers.driverLock.Lock(name); err != nil {
return nil, fmt.Errorf("Error looking up volume plugin %s: %v", name, err)
}
defer drivers.driverLock.Unlock(name)
Contributor

Lock() failure already Unlocks. Calling Unlock on defer on an already Unlocked lock will cause runtime errors.

Contributor Author

But the defer is after the if err check.
If there is an error, it returns immediately at "line 96", so there is no chance to run the "line 98" defer Unlock.

Contributor

My bad, what you have is fine. I'm used to seeing this format:
err := Lock()
defer Unlock()
if err != nil {
    // handle error
}

@anusha-ragunathan
Contributor

The new plugin arch doesn't have the exponential-backoff retry logic like legacy; all plugins are started alongside the daemon. Old plugins cannot run using the new infra. They need to be rewritten.

If a goroutine holds a lock for a long time to do a job and it gets
an error, this goroutine should tell the others, who are also waiting for
this lock to do the same job, that the job will return an error, so they
don't waste their time again. This happens when looking up a volume plugin:
every time, dockerd waits 16s without knowing that the same volume/plugin
has already failed before.

CancelWithError indicates that other goroutines waiting for the lock will
get an error instead of acquiring the lock; their locker.Lock(name) call is
canceled by this caller.

Signed-off-by: ohmystack <jun.jiang02@ele.me>
@ohmystack
Contributor Author

I did a squash, and re-triggered the tests.

@ohmystack
Contributor Author

According to #23446

Note: backwards compatibility with existing plugins is maintained

Though the new arch is fine, this fix is still needed for legacy volume plugins.

@anusha-ragunathan
Contributor

@ohmystack: What "backwards compatibility with existing plugins is maintained" means is that the new plugin arch will not invalidate legacy plugins; legacy plugins will continue to work in parallel with the new plugins.

@ohmystack
Contributor Author

I see. Let me put it this way: the new plugin arch is fine, but this bug still needs to be fixed for legacy plugins.

@@ -309,16 +309,20 @@ func (s *VolumeStore) create(name, driverName string, opts, labels map[string]st
// This makes sure there are no races between checking for the existence of a volume and adding a reference for it
func (s *VolumeStore) GetWithRef(name, driverName, ref string) (volume.Volume, error) {
name = normaliseVolumeName(name)
Contributor

@anusha-ragunathan anusha-ragunathan Aug 2, 2016

What is the intent of the changes in store.go? Are you trying to fix all callsites of Locker.Lock to set and catch appropriate errors? If so, there's more (eg, libcontainer/client.go and even VolumeStore's Get(), CreateWithRef(), etc) and I'd recommend doing that in a separate PR.

Contributor Author

@ohmystack ohmystack Aug 2, 2016

These two locks prevent different races, which you analyzed in previous comments. If a lock can be held for a long time, such as 16s, we should catch the error to prevent the failure from happening again and again. That is why I catch the error at these 2 places; these 2 locks may take 16s until they unlock.

If the lock action is very quick, it is totally fine to ignore the error.
If a lock uses CancelWithError, every Lock call on that lock should be wrapped in an if err block; if the lock doesn't use CancelWithError, it is fine to ignore the error.

Contributor Author

And I think you're right, maybe there are many "slow" locks in other places. It is a good idea to fix them in another PR.

Contributor

GetDriver() calls lookup(), which can be a long-running operation. There are other calls that hold a Locker.Lock and call GetDriver, for example Remove().

Contributor Author

@ohmystack ohmystack Aug 2, 2016

Makes sense. Maybe in the next PR? This time, restoring containers at daemon start only needs these 2 places. If I remove the GetWithRef part, the fix won't work anymore... 😥

Contributor

Sounds good.

Contributor

I'm thinking it would be better if there's a separate version of Lock called LockWithError() that checks for error. Callers that use this version should always use CancelWithError() to set errors. This would be cleaner. What do you think?

Contributor Author

@ohmystack ohmystack Aug 3, 2016

Maybe not...
Lock() and LockWithError() cannot be mixed for one named lock. Once the lock is created in locks[name], it should be either "cancelable" or "un-cancelable".

So another bool field, cancelable, needs to be added to the lockCtr struct, and then we need to check it in both the old Lock() and the new LockWithError().
When writing the logic, if Lock() meets a cancelable locker, the problem appears: the old Lock() cannot return an error to indicate that it is locking the wrong type of locker...

Besides, adding a new LockWithError() method still cannot ensure that developers will catch the error correctly and not call Unlock after LockWithError() returns an error.

Contributor

@anusha-ragunathan anusha-ragunathan Aug 3, 2016

When Lock() encounters a cancelable locker, it can panic. This will ensure that the developer consciously chooses a cancelable Lock vs a normal Lock. What we don't want is to turn all locks into cancelable locks.

If the developer doesn't catch the LockWithError() error, then Unlock will result in a runtime error, so that's okay.
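A hedged sketch of the guard being discussed here, building on the same idea as the earlier sketches (hypothetical names, not shipped pkg/locker code): the per-name lockCtr remembers which style it was first used with, plain Lock panics when it meets a cancelable lock (the mismatch is treated symmetrically here), and a caller that ignores LockWithError's error and still unlocks hits a runtime error, exactly like unlocking an unlocked sync.Mutex.

package locker

import "sync"

// lockCtr is the per-name state: cancelable records which API style first
// claimed this name; err is the error handed out by CancelWithError.
type lockCtr struct {
	mu         sync.Mutex
	cancelable bool
	err        error
}

type Locker struct {
	mu    sync.Mutex
	locks map[string]*lockCtr
}

func New() *Locker {
	return &Locker{locks: make(map[string]*lockCtr)}
}

// get fetches or creates the lockCtr for name. The first use fixes the style;
// a mismatch panics, so a developer cannot mix Lock and LockWithError on the
// same name by accident.
func (l *Locker) get(name string, wantCancelable bool) *lockCtr {
	l.mu.Lock()
	defer l.mu.Unlock()
	c, ok := l.locks[name]
	if !ok {
		c = &lockCtr{cancelable: wantCancelable}
		l.locks[name] = c
	}
	if c.cancelable != wantCancelable {
		panic("locker: mixed Lock/LockWithError usage on " + name)
	}
	return c
}

// Lock is the classic blocking API for non-cancelable names.
func (l *Locker) Lock(name string) {
	l.get(name, false).mu.Lock()
}

// LockWithError is the cancelable variant; callers must check the error and
// must NOT Unlock when an error is returned. Ignoring that rule means
// unlocking an unlocked mutex, which is a runtime error (the safety net
// mentioned above).
func (l *Locker) LockWithError(name string) error {
	c := l.get(name, true)
	c.mu.Lock()
	if c.err != nil {
		err := c.err
		c.mu.Unlock()
		return err
	}
	return nil
}

// Unlock releases the named lock for either style.
func (l *Locker) Unlock(name string) {
	l.mu.Lock()
	c, ok := l.locks[name]
	l.mu.Unlock()
	if !ok {
		return // the real pkg/locker returns ErrNoSuchLock in this case
	}
	c.mu.Unlock()
}

// CancelWithError sets err and releases the lock so the waiters fail fast.
func (l *Locker) CancelWithError(name string, err error) {
	c := l.get(name, true)
	c.err = err
	c.mu.Unlock()
}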

@ohmystack
Contributor Author

ohmystack commented Aug 4, 2016

My whole fix, "cancel a lock", might be a bad idea.
If we use CancelWithError(), there are 2 problems:

  • Remove in volume/store/store.go: its error has more details (the refs), which cannot be reported to the other lock waiters, so the other Remove waiters cannot return the same full error message as the first one.
  • A lock may be used by different methods at the same time. The cancel may take effect on a different method and lead to a bad result, e.g.:
 s.Remove              | s.Get
-----------------------+------------------------------------------------
 Lock                  |
                       | Lock
 Cancel:errVolumeInUse | wait...
 Unlock                | wait...
                       | gets the lock, but returns "errVolumeInUse"   <- canceled by a different method's error

We need to find another way to fix this bug.
@anusha-ragunathan What do you think? Maybe it's time to close this PR. Thanks for your support!

@ohmystack ohmystack closed this Aug 5, 2016
@thaJeztah
Member

Thanks @ohmystack, sorry that it didn't work
