Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues & some fixes #375

Closed
dbiollo opened this issue Dec 15, 2015 · 9 comments
Closed

Performance issues & some fixes #375

dbiollo opened this issue Dec 15, 2015 · 9 comments

Comments

@dbiollo
Copy link

dbiollo commented Dec 15, 2015

I've been testing out zulip with large data sets (1000's of users, streams, etc). While some of the performance problems require UI design changes, there are a few optimizations that can be made directly on the backend. Here the small ones I've found so far:

1. Case-insensitive index on UserProfile.email.

Code like the following does a O(n) lookup on the UserProfile table:

def get_user_profile_by_email(email):
    return UserProfile.objects.select_related().get(email__iexact=email.strip())

All it needs for indexed-based DB look up is an index like what is in 0003_custom_indexes.py:

        migrations.RunSQL("CREATE INDEX upper_user_email_idx ON zerver_userprofile ((upper(email)));",
                          reverse_sql="DROP INDEX upper_user_email_idx;"),

There might be other tables (prereg user?) that need this index. Alternatively, Postgres has a CITEXT column type for Case-Insensitive text.

2. Realm.get_admin_users*

Turns out that this authorization check is rather expensive when you have a lot of users. Primarily slow due to an OR with the group_permissions table. I could be wrong but it appears that Zulip doesn't use the group permissions of guardian. Best to check, but adding the "with_group_users=False" made this an order of magnitude faster:

     def get_admin_users(self):
         # This method is kind of expensive, due to our complex permissions model.
-        candidates = get_users_with_perms(self, only_with_perms=['administer'])
+        candidates = get_users_with_perms(self, only_with_perms=['administer'], with_group_users=False)
         return candidates

3. Filtering UserProfile on is_action, is_bot

Indexing these columns helps performance in a few cases, as it is a common filter. Needs corresponding migration file.

 class UserProfile(AbstractBaseUser, PermissionsMixin):
     # Fields from models.AbstractUser minus last_name and first_name,
     # which we don't use; email is modified to make it indexed and unique.
     email = models.EmailField(blank=False, db_index=True, unique=True)
     is_staff = models.BooleanField(default=False)
-    is_active = models.BooleanField(default=True)
-    is_bot = models.BooleanField(default=False)
+    is_active = models.BooleanField(default=True, db_index=True)
+    is_bot = models.BooleanField(default=False, db_index=True)
     date_joined = models.DateTimeField(default=timezone.now)
     is_mirror_dummy = models.BooleanField(default=False)

4. Caching entire company directory can break memcache

Memcache has a reasonable 1MB size limit per cached object. When you have a company with 1000's of users, the following caching code breaks, getting error 37 from Memcache. It's more scalable to refactor the code to not fetch all company users in one go, but at least to fix the broken memcache, the cache on this function could simply be disabled:

-@cache_with_key(active_user_dicts_in_realm_cache_key, timeout=3600*24*7)
+# @cache_with_key(active_user_dicts_in_realm_cache_key, timeout=3600*24*7)
 def get_active_user_dicts_in_realm(realm):
      return UserProfile.objects.filter(realm=realm, is_active=True) \
                                .values('id', 'full_name', 'short_name', 'email', 'is_bot')

5. alert_words_in_realm fetches too much data

The default behaviour of this function is to fetch all records for the realm, which is a lot of data for large companies. Assuming that usage of Alert Words is sparse, then filtering to exclude users who don't have any results in a much faster query.

 import ujson
+from django.db.models import Q

 @cache_with_key(realm_alert_words_cache_key, timeout=3600*24)
 def alert_words_in_realm(realm):
-    users = zerver.models.UserProfile.objects.filter(realm=realm, is_active=True).values('id', 'alert_words')
+    users = zerver.models.UserProfile.objects.filter(~Q(alert_words=ujson.dumps([]))).filter(realm=realm, is_active=True).values('id', 'alert_words')
     all_user_words = dict((user['id'], ujson.loads(user['alert_words'])) for user in users)

6. DevAuth feature makes Login page too slow

DevAuthBackend is pretty handy, but the login page will be pretty much unusable once you get a few thousand UserProfiles in the DB. Makes sense to cap the query at 100? Also sorting by email can be done on the DB as a minor improvement.

 def login_page(request, **kwargs):
     extra_context = kwargs.pop('extra_context', {})
     if dev_auth_enabled():
-        users = UserProfile.objects.filter(is_bot=False, is_active=True)
-        extra_context['direct_admins'] = sorted([u.email for u in users if u.is_admin()])
-        extra_context['direct_users'] = sorted([u.email for u in users if not u.is_admin()])
+        MAX_DEV_BACKEND_USERS = 100
+        users = UserProfile.objects.select_related('realm').filter(is_bot=False, is_active=True).order_by('email')[:MAX_DEV_BACKEND_USERS]
+        extra_context['direct_admins'] = [u.email for u in users if u.is_admin()]
+        extra_context['direct_users'] = [u.email for u in users if not u.is_admin()]

7. json_events_register - unused?

This function might not be used? At least that's what my old comment says (in my local copy).

@authenticated_json_post_view

8. realm_user_count - replace python object count with DB count query

This will be much faster for large realms.

 def realm_user_count(realm):
-    user_dicts = get_active_user_dicts_in_realm(realm)
-    return len([user_dict for user_dict in user_dicts if not user_dict["is_bot"]])
+    return UserProfile.objects.filter(realm=realm, is_active=True, is_bot=False).count()
@timabbott
Copy link
Member

Cool, thanks for doing this investigation @dbiollo ! Are you intending to develop and submit PRs for any of these issues? I'm happy to provide advice on how to best address some of these...

@timabbott
Copy link
Member

Here's some notes on these various items:

(1) Case-insensitive index on UserProfile.email.

I believe that the index in 0003_custom_indexes.py is already used for the query you reference; is the issue that we may need similar indexes elsewhere or are you seeing that index not being used?

(2) Realm.get_admin_users*

Zulip is not using the groups support of Guardian, so I think that would be a fine patch to add.

(3) Filtering UserProfile on is_active, is_bot

Those seem like reasonable indexes to add.

(4) Caching entire company directory can break memcache

Roughly how many users are required for this to become an issue? I think the places where this is used do require all the users in the company, so we'd either need to change the UI for larger realms or do something else reasonable (e.g. if this were the only issue we could get a multiplicative factor by compressing or encoding in something more efficient than JSON strings)

(5) alert_words_in_realm fetches too much data

alert_words is not a heavily used feature, so that approach would be reasonable.

(6) DevAuth feature makes Login page too slow

Capping the query at 100 seems very reasonable here.

(7) json_events_register - unused?

json_events_register is used as the handler for /json/get_events (see zproject/urls.py). However, we are in the process of removing the remaining pre-API format views, and it could be easily replaced with /json/events (I merged some similar changes to remove e.g. /json/send_message a few days ago).

(8) realm_user_count - replace python object count with DB count query

I think we may want to be thoughtful about this one -- the current code doesn't talk to the database because of the memcached entry discussed in (4). I think you're right that regardless the current check is inefficient, but count queries can be slow in postgres, so I'd definitely want to test this a bit before changing this to a count query.

@timabbott
Copy link
Member

Also, @dbiollo if the way you're testing for these issues is something we could script and include in the automated tests that we run on every Zulip change, that would be awesome -- I always find it's best with performance work to have tools making sure issues don't regress in the future :)

@dbiollo
Copy link
Author

dbiollo commented Dec 16, 2015

Hey @timabbott, my plan is to eventually plugin this into performance tests. No Pull Requests expected for a while; though I have submitted the CLA for my company to approve. Hopefully by January :/ Feel free to incorporate any of the non-controversial changes.

RE: (3) - The indexes in 0003_custom_indexes are great but only apply to zerver_message & zerver_stream. The same case-insensitive index is needed for zerver_userprofile and probably zerver_realm, zerver_preregistrationuser given that there are case-insensitive queries of the form: "xxxx__iexact".

RE: (4) - It seems to break around 10-12K users.

Was wondering if you think that Streams & Realms should have some configurable or hard coded limits to subscription size / membership, at least until such time as the scale issues are all fixed? For example, a Realm with 10K users and some public streams with all of those 10K users will kill the system ... I've been adding LIMIT 1000 in a lot of places, which makes it responsive enough.

@timabbott
Copy link
Member

OK, good to know; time permitting I'll probably get to several of these before January.

Regarding limits, I definitely think having (configurable) limits would be a valuable feature to help prevent people from accidentally getting themselves into serious trouble!

@timabbott
Copy link
Member

@dbiollo I opened #392 with versions of the fixes for issues 1, 2, 3, 5, 6, 7, and 8; I'd love it if you could take a quick look at it and test it in your environment!

For (4), I think we may need that cache for performance reasons in an actually active realm. It could be valuable to look at whether it makes sense to e.g. gzip-compress the content; I should also note that if we merge #378, we won't have memcached's 1MB limit of object sizes.

@timabbott
Copy link
Member

@dbiollo so looking at issue 4, I think there's a few things that would need to happen if we wanted to remove that cache:

(A) I pushed a small commit to https://github.com/timabbott/zulip/tree/tabbott-user-dicts removing some of the users of get_active_user_dicts_for_realm

(B) With that change, most of the zerver/lib/actions.py accesses can just use active_user_ids(realm), which we can reimplement to just doing a database query or having its own cache which would be fairly efficient as just as list of IDs.

(C) The bugdown use of it is for matching users by short or long name in @-mentions. When we solve #374, that should resolve this issue since we shouldn't need to be doing a lookup through all full names.

(D) The call in streams.py is legacy code predating the notifications_stream feature (instead sending a PM to every user!) that we should probably just remove at this point.

(E) We'd still need to use the cache for get_realm_user_dicts used in fetch_initial_state_data.

I've posted #442 to handle (A) and (D) since they're unconditional improvements to the code that don't involve a complex transition.

@brainwane
Copy link
Contributor

@dbiollo How are things going? :)

In case you'd like to talk synchronously about this: We're hosting a livechat "office hour" at the open Zulip realm https://zulip.tabbott.net on Monday, August 15th, 10am-11am Pacific Time / 1pm-2pm Eastern Time / 10:30-11:30pm India time. Please come by if you want to talk about these fixes in realtime -- or about the experimental new desktop app, helping us figure out what new users need to know, replacing memcached with Redis, etc. We'd be happy to see you.

@timabbott
Copy link
Member

Closing the remaining piece in favor of #3709

Thanks again for this testing work @dbiollo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants