Performance issues & some fixes #375

dbiollo · 2015-12-15T01:21:15Z

I've been testing out zulip with large data sets (1000's of users, streams, etc). While some of the performance problems require UI design changes, there are a few optimizations that can be made directly on the backend. Here the small ones I've found so far:

1. Case-insensitive index on UserProfile.email.

Code like the following does a O(n) lookup on the UserProfile table:

def get_user_profile_by_email(email):
    return UserProfile.objects.select_related().get(email__iexact=email.strip())

All it needs for indexed-based DB look up is an index like what is in 0003_custom_indexes.py:

        migrations.RunSQL("CREATE INDEX upper_user_email_idx ON zerver_userprofile ((upper(email)));",
                          reverse_sql="DROP INDEX upper_user_email_idx;"),

There might be other tables (prereg user?) that need this index. Alternatively, Postgres has a CITEXT column type for Case-Insensitive text.

2. Realm.get_admin_users*

Turns out that this authorization check is rather expensive when you have a lot of users. Primarily slow due to an OR with the group_permissions table. I could be wrong but it appears that Zulip doesn't use the group permissions of guardian. Best to check, but adding the "with_group_users=False" made this an order of magnitude faster:

     def get_admin_users(self):
         # This method is kind of expensive, due to our complex permissions model.
-        candidates = get_users_with_perms(self, only_with_perms=['administer'])
+        candidates = get_users_with_perms(self, only_with_perms=['administer'], with_group_users=False)
         return candidates

3. Filtering UserProfile on is_action, is_bot

Indexing these columns helps performance in a few cases, as it is a common filter. Needs corresponding migration file.

 class UserProfile(AbstractBaseUser, PermissionsMixin):
     # Fields from models.AbstractUser minus last_name and first_name,
     # which we don't use; email is modified to make it indexed and unique.
     email = models.EmailField(blank=False, db_index=True, unique=True)
     is_staff = models.BooleanField(default=False)
-    is_active = models.BooleanField(default=True)
-    is_bot = models.BooleanField(default=False)
+    is_active = models.BooleanField(default=True, db_index=True)
+    is_bot = models.BooleanField(default=False, db_index=True)
     date_joined = models.DateTimeField(default=timezone.now)
     is_mirror_dummy = models.BooleanField(default=False)

4. Caching entire company directory can break memcache

Memcache has a reasonable 1MB size limit per cached object. When you have a company with 1000's of users, the following caching code breaks, getting error 37 from Memcache. It's more scalable to refactor the code to not fetch all company users in one go, but at least to fix the broken memcache, the cache on this function could simply be disabled:

-@cache_with_key(active_user_dicts_in_realm_cache_key, timeout=3600*24*7)
+# @cache_with_key(active_user_dicts_in_realm_cache_key, timeout=3600*24*7)
 def get_active_user_dicts_in_realm(realm):
      return UserProfile.objects.filter(realm=realm, is_active=True) \
                                .values('id', 'full_name', 'short_name', 'email', 'is_bot')

5. alert_words_in_realm fetches too much data

The default behaviour of this function is to fetch all records for the realm, which is a lot of data for large companies. Assuming that usage of Alert Words is sparse, then filtering to exclude users who don't have any results in a much faster query.

 import ujson
+from django.db.models import Q

 @cache_with_key(realm_alert_words_cache_key, timeout=3600*24)
 def alert_words_in_realm(realm):
-    users = zerver.models.UserProfile.objects.filter(realm=realm, is_active=True).values('id', 'alert_words')
+    users = zerver.models.UserProfile.objects.filter(~Q(alert_words=ujson.dumps([]))).filter(realm=realm, is_active=True).values('id', 'alert_words')
     all_user_words = dict((user['id'], ujson.loads(user['alert_words'])) for user in users)

6. DevAuth feature makes Login page too slow

DevAuthBackend is pretty handy, but the login page will be pretty much unusable once you get a few thousand UserProfiles in the DB. Makes sense to cap the query at 100? Also sorting by email can be done on the DB as a minor improvement.

 def login_page(request, **kwargs):
     extra_context = kwargs.pop('extra_context', {})
     if dev_auth_enabled():
-        users = UserProfile.objects.filter(is_bot=False, is_active=True)
-        extra_context['direct_admins'] = sorted([u.email for u in users if u.is_admin()])
-        extra_context['direct_users'] = sorted([u.email for u in users if not u.is_admin()])
+        MAX_DEV_BACKEND_USERS = 100
+        users = UserProfile.objects.select_related('realm').filter(is_bot=False, is_active=True).order_by('email')[:MAX_DEV_BACKEND_USERS]
+        extra_context['direct_admins'] = [u.email for u in users if u.is_admin()]
+        extra_context['direct_users'] = [u.email for u in users if not u.is_admin()]

7. json_events_register - unused?

This function might not be used? At least that's what my old comment says (in my local copy).

zulip/zerver/views/__init__.py

Line 1157 in e957399

@authenticated_json_post_view

8. realm_user_count - replace python object count with DB count query

This will be much faster for large realms.

 def realm_user_count(realm):
-    user_dicts = get_active_user_dicts_in_realm(realm)
-    return len([user_dict for user_dict in user_dicts if not user_dict["is_bot"]])
+    return UserProfile.objects.filter(realm=realm, is_active=True, is_bot=False).count()

The text was updated successfully, but these errors were encountered:

timabbott · 2015-12-15T01:28:26Z

Cool, thanks for doing this investigation @dbiollo ! Are you intending to develop and submit PRs for any of these issues? I'm happy to provide advice on how to best address some of these...

timabbott · 2015-12-15T01:56:01Z

Here's some notes on these various items:

(1) Case-insensitive index on UserProfile.email.

I believe that the index in 0003_custom_indexes.py is already used for the query you reference; is the issue that we may need similar indexes elsewhere or are you seeing that index not being used?

(2) Realm.get_admin_users*

Zulip is not using the groups support of Guardian, so I think that would be a fine patch to add.

(3) Filtering UserProfile on is_active, is_bot

Those seem like reasonable indexes to add.

(4) Caching entire company directory can break memcache

Roughly how many users are required for this to become an issue? I think the places where this is used do require all the users in the company, so we'd either need to change the UI for larger realms or do something else reasonable (e.g. if this were the only issue we could get a multiplicative factor by compressing or encoding in something more efficient than JSON strings)

(5) alert_words_in_realm fetches too much data

alert_words is not a heavily used feature, so that approach would be reasonable.

(6) DevAuth feature makes Login page too slow

Capping the query at 100 seems very reasonable here.

(7) json_events_register - unused?

json_events_register is used as the handler for /json/get_events (see zproject/urls.py). However, we are in the process of removing the remaining pre-API format views, and it could be easily replaced with /json/events (I merged some similar changes to remove e.g. /json/send_message a few days ago).

(8) realm_user_count - replace python object count with DB count query

I think we may want to be thoughtful about this one -- the current code doesn't talk to the database because of the memcached entry discussed in (4). I think you're right that regardless the current check is inefficient, but count queries can be slow in postgres, so I'd definitely want to test this a bit before changing this to a count query.

timabbott · 2015-12-15T02:03:34Z

Also, @dbiollo if the way you're testing for these issues is something we could script and include in the automated tests that we run on every Zulip change, that would be awesome -- I always find it's best with performance work to have tools making sure issues don't regress in the future :)

dbiollo · 2015-12-16T00:50:34Z

Hey @timabbott, my plan is to eventually plugin this into performance tests. No Pull Requests expected for a while; though I have submitted the CLA for my company to approve. Hopefully by January :/ Feel free to incorporate any of the non-controversial changes.

RE: (3) - The indexes in 0003_custom_indexes are great but only apply to zerver_message & zerver_stream. The same case-insensitive index is needed for zerver_userprofile and probably zerver_realm, zerver_preregistrationuser given that there are case-insensitive queries of the form: "xxxx__iexact".

RE: (4) - It seems to break around 10-12K users.

Was wondering if you think that Streams & Realms should have some configurable or hard coded limits to subscription size / membership, at least until such time as the scale issues are all fixed? For example, a Realm with 10K users and some public streams with all of those 10K users will kill the system ... I've been adding LIMIT 1000 in a lot of places, which makes it responsive enough.

timabbott · 2015-12-16T06:47:02Z

OK, good to know; time permitting I'll probably get to several of these before January.

Regarding limits, I definitely think having (configurable) limits would be a valuable feature to help prevent people from accidentally getting themselves into serious trouble!

timabbott · 2015-12-26T01:41:49Z

@dbiollo I opened #392 with versions of the fixes for issues 1, 2, 3, 5, 6, 7, and 8; I'd love it if you could take a quick look at it and test it in your environment!

For (4), I think we may need that cache for performance reasons in an actually active realm. It could be valuable to look at whether it makes sense to e.g. gzip-compress the content; I should also note that if we merge #378, we won't have memcached's 1MB limit of object sizes.

timabbott · 2016-01-24T02:02:40Z

@dbiollo so looking at issue 4, I think there's a few things that would need to happen if we wanted to remove that cache:

(A) I pushed a small commit to https://github.com/timabbott/zulip/tree/tabbott-user-dicts removing some of the users of get_active_user_dicts_for_realm

(B) With that change, most of the zerver/lib/actions.py accesses can just use active_user_ids(realm), which we can reimplement to just doing a database query or having its own cache which would be fairly efficient as just as list of IDs.

(C) The bugdown use of it is for matching users by short or long name in @-mentions. When we solve #374, that should resolve this issue since we shouldn't need to be doing a lookup through all full names.

(D) The call in streams.py is legacy code predating the notifications_stream feature (instead sending a PM to every user!) that we should probably just remove at this point.

(E) We'd still need to use the cache for get_realm_user_dicts used in fetch_initial_state_data.

I've posted #442 to handle (A) and (D) since they're unconditional improvements to the code that don't involve a complex transition.

brainwane · 2016-08-13T18:35:33Z

@dbiollo How are things going? :)

In case you'd like to talk synchronously about this: We're hosting a livechat "office hour" at the open Zulip realm https://zulip.tabbott.net on Monday, August 15th, 10am-11am Pacific Time / 1pm-2pm Eastern Time / 10:30-11:30pm India time. Please come by if you want to talk about these fixes in realtime -- or about the experimental new desktop app, helping us figure out what new users need to know, replacing memcached with Redis, etc. We'd be happy to see you.

timabbott · 2017-02-17T07:39:32Z

Closing the remaining piece in favor of #3709

Thanks again for this testing work @dbiollo!

timabbott mentioned this issue Dec 26, 2015

Several small perf improvements for realms with large numbers of users #392

Closed

timabbott mentioned this issue Dec 26, 2015

Zulip for large environments #370

Closed

timabbott added the enhancement label Dec 28, 2015

timabbott mentioned this issue Jan 24, 2016

Clean up some code paths using get_active_user_dicts #442

Closed

timabbott added the area: misc label Oct 14, 2016

timabbott mentioned this issue Feb 17, 2017

Clean up use of get_active_user_dicts_for_realm #3709

Closed

timabbott closed this as completed Feb 17, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issues & some fixes #375

Performance issues & some fixes #375

dbiollo commented Dec 15, 2015

timabbott commented Dec 15, 2015

timabbott commented Dec 15, 2015

timabbott commented Dec 15, 2015

dbiollo commented Dec 16, 2015

timabbott commented Dec 16, 2015

timabbott commented Dec 26, 2015

timabbott commented Jan 24, 2016

brainwane commented Aug 13, 2016

timabbott commented Feb 17, 2017

Performance issues & some fixes #375

Performance issues & some fixes #375

Comments

dbiollo commented Dec 15, 2015

1. Case-insensitive index on UserProfile.email.

2. Realm.get_admin_users*

3. Filtering UserProfile on is_action, is_bot

4. Caching entire company directory can break memcache

5. alert_words_in_realm fetches too much data

6. DevAuth feature makes Login page too slow

7. json_events_register - unused?

8. realm_user_count - replace python object count with DB count query

timabbott commented Dec 15, 2015

timabbott commented Dec 15, 2015

timabbott commented Dec 15, 2015

dbiollo commented Dec 16, 2015

timabbott commented Dec 16, 2015

timabbott commented Dec 26, 2015

timabbott commented Jan 24, 2016

brainwane commented Aug 13, 2016

timabbott commented Feb 17, 2017