Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcsim: Synchronize calls to OptionsManager internal data structures #2907

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

brakthehack
Copy link
Contributor

Description

As described in #2906, OptionsManager is racy as calls to its internal
data structures are unsynchronized. When multiple threads are making
calls to OptionsManager, they can race against each other.

This change protects these data structures with a mutex, to avoid the
race conditions.

Closes: #2906

Type of change

Please mark options that are relevant:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update
  • Build related change

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide
instructions so we can reproduce. If applicable, please also list any relevant
details for your test configuration.

Wrote a simple test and ran it with the -race flag. I consistently hit race conditions without my change and none with my change:

Snip:

	vcsim := vcsimulator.NewSimulator()
	defer vcsim.Stop()

	vcsim2 := vcsimulator.NewSimulator()
	defer vcsim2.Stop()

	t.Run("test", func(t *testing.T) {
		client := vcsim.Client.Client
		client2 := vcsim2.Client.Client
		m := object.NewOptionManager(client, client.ServiceContent.Setting.Reference())
		m2 := object.NewOptionManager(client2, client2.ServiceContent.Setting.Reference())
		ctx, cancel := context.WithTimeout(context.Background(), time.Second*2)
		defer cancel()
		for {
			select {
			case <-ctx.Done():
				return
			default:
			}
			// adding the new options to the simulated VC
			opt := fmt.Sprintf("config.%d", rand.Int())
			go m.Update(ctx, []vimtypes.BaseOptionValue{
				&vimtypes.OptionValue{Key: opt, Value: "bar"},
			})

			// adding the new options to the simulated VC
			opt2 := fmt.Sprintf("config.%d", rand.Int())
			go m2.Update(ctx, []vimtypes.BaseOptionValue{
				&vimtypes.OptionValue{Key: opt2, Value: "bar"},
			})

			m.Query(ctx, opt)
		}

Checklist:

  • My code follows the CONTRIBUTION
    guidelines of
    this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged

As described in vmware#2906, `OptionsManager` is racy as calls to its internal
data structures are unsynchronized. When multiple threads are making
calls to `OptionsManager`, they can race against each other.

This change protects these data structures with a mutex, to avoid the
race conditions.

Closes: vmware#2906
Copy link
Member

@dougm dougm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brakthehack . Each vcsim ManagedObject already has its own lock by default, where the lock is held during any API invoked method call here:

ctx.Map.WithLock(ctx, handler, func() {
res = m.Call(args)
})

For example, the m.Call() above would be a method such as UpdateOptions. And WithLock() does a Mutex Lock() + Unlock() around the call.

I noticed the test in your issue creates 2 vcsim instances:

        vcsim := vcsimulator.NewSimulator()
	defer vcsim.Stop()

	vcsim2 := vcsimulator.NewSimulator()
	defer vcsim2.Stop()

Are you able to reproduce the issue with a single vcsim instance? If not, I wonder if we actually need to deep copy here:

m.Setting = vpx.Setting

As the 2 vcsim instances my actually be racing on vpx.Setting ? I will take a closer look.

@github-actions
Copy link
Contributor

This Pull Request is stale because it has been open for 90 days with
no activity. It will automatically close after 30 more days of
inactivity. Mark as fresh by adding the comment /remove-lifecycle stale.

Copy link
Member

@akutz akutz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @brakthehack!

@@ -30,6 +31,7 @@ import (

type OptionManager struct {
mo.OptionManager
lock sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite this as:

type OptionManager struct {
    sync.Mutex
    mo.OptionManager
}

That way:

func (m *OptionManager) init(r *Registry) {
    m.Lock()
    defer m.Unlock()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, having read the rest of the change, please rewrite further:

type OptionManager struct {
    sync.RWMutex
    mo.OptionManager
}

There's zero reason this should be an exclusive write lock. It's far more performant to rely on shared locks for read-only operations.

@@ -55,6 +60,9 @@ func (m *OptionManager) QueryOptions(req *types.QueryOptions) soap.HasFault {
body := &methods.QueryOptionsBody{}
res := &types.QueryOptionsResponse{}

m.lock.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Move the lock calls the top of this function. The only thing currently above the calls are non-expensive variable declarations, so it is well-worth moving the lock entrance to the top of the function where it's more evident.
  2. Rewrite to use the RLock/RUnlock methods:
    m.RLock()
    defer m.RUnlock()

@@ -70,6 +78,7 @@ func (m *OptionManager) QueryOptions(req *types.QueryOptions) soap.HasFault {
return body
}

// find requires holding the lock.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either:

  • Update the GoDoc to be more clear:
    // find is not thread-safe and requires callers to already hold either
    // a shared or exclusive lock from the OptionsManager

or

  • Move the function into UpdateOptions so only it can invoke find, ex:
    func (m *OptionManager) UpdateOptions(req *types.UpdateOptions) soap.HasFault {
    	m.Lock()
    	defer m.Unlock()
    
    	find := func(key string) *types.OptionValue {
    		for _, opt := range m.Setting {
    			setting := opt.GetOptionValue()
    			if setting.Key == key {
    				return setting
    			}
    		}
    		return nil
    	}
    
    	// ...

@akutz
Copy link
Member

akutz commented Dec 15, 2022

For some reason @dougm 's comments didn't show up when I read this PR. GH must have been brain-farting, because I swore this PR showed up as "new."

Anyway, Doug is correct in that each MO has its own lock domain, so my suggestions should not actually be necessary. Still, they're worth considering as general guidelines for handling concurrency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] vcsim: OptionsManager is racy
4 participants