Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #733 by refactoring the tagged query #744

Merged
merged 3 commits into from
Oct 1, 2019
Merged

Conversation

lachie
Copy link
Contributor

@lachie lachie commented Sep 20, 2019

Also adds a simple spec for .tagging.

Also adds a simple spec for .tagging.
Copy link
Contributor

@thomasdziedzic thomasdziedzic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, and thank you for adding a test!

I'm guessing the sql for this will probably look something like:

select distinct s.title
from stories s
join taggings t
on s.id = t.story_id
where s.score >= 0
and t.tag_id in (1, 2)

But another way of writing the above would be:

select s.title
from stories s
where s.score >= 0
and s.id in (select story_id from taggings where tag_id in (1, 2))

Do you know how the performance comparison would look like?

@pushcx
Copy link
Member

pushcx commented Sep 21, 2019

I'm thrilled to see the spec, thanks especially for adding a test. And to add on to @thomasdziedzic's comment, I've slowly learned to be suspicious when I see distinct - when I've used it, I often later learn the hard way that I was papering over a misunderstanding in what the query was doing. I think the important thing about the second query is that it expresses a slightly different meaning. The first one is "give me the stories and their taggings", the second is "give me the stories in the set of stories with these tags". We have dupes because the first one is not exactly what we mean to do.

@lachie
Copy link
Contributor Author

lachie commented Sep 22, 2019

Hey,
Thanks for the feedback, I've switched it to the subquery version.

I didn't do any irl benchmarking, but here's a quick explain of both queries (the current query generated via #to_sql)

MariaDB [lobsters_dev]> explain SELECT DISTINCT `stories`.* FROM `stories` INNER JOIN `taggings` ON `taggings`.`story_id` = `stories`.`id` WHERE `stories`.`merged_story_id` IS NULL AND `stories`.`is_expired` = FALSE AND ((CAST(upvotes AS signed) - CAST(downvotes AS signed)) >= 0) AND `taggings`.`tag_id` IN (1, 2) ORDER BY `stories`.`created_at` DESC;
+------+-------------+----------+--------+-------------------------------------------------------------------------------+--------------------+---------+--------------------------------+------+--------------------------------------------------------+
| id   | select_type | table    | type   | possible_keys                                                                 | key                | key_len | ref                            | rows | Extra                                                  |
+------+-------------+----------+--------+-------------------------------------------------------------------------------+--------------------+---------+--------------------------------+------+--------------------------------------------------------+
|    1 | SIMPLE      | taggings | range  | story_id_tag_id,taggings_tag_id_fk                                            | taggings_tag_id_fk | 8       | NULL                           | 4    | Using index condition; Using temporary; Using filesort |
|    1 | SIMPLE      | stories  | eq_ref | PRIMARY,is_idxes,index_stories_on_is_expired,index_stories_on_merged_story_id | PRIMARY            | 8       | lobsters_dev.taggings.story_id | 1    | Using where                                            |
+------+-------------+----------+--------+-------------------------------------------------------------------------------+--------------------+---------+--------------------------------+------+--------------------------------------------------------+

MariaDB [lobsters_dev]> explain select s.title from stories s where (s.upvotes-s.downvotes) >= 0 and s.id in (select story_id from taggings where tag_id in (1, 2));
+------+--------------+-------------+--------+------------------------------------+--------------------+---------+--------------------------------+------+-----------------------+
| id   | select_type  | table       | type   | possible_keys                      | key                | key_len | ref                            | rows | Extra                 |
+------+--------------+-------------+--------+------------------------------------+--------------------+---------+--------------------------------+------+-----------------------+
|    1 | PRIMARY      | <subquery2> | ALL    | distinct_key                       | NULL               | NULL    | NULL                           | 4    |                       |
|    1 | PRIMARY      | s           | eq_ref | PRIMARY                            | PRIMARY            | 8       | lobsters_dev.taggings.story_id | 1    | Using where           |
|    2 | MATERIALIZED | taggings    | range  | story_id_tag_id,taggings_tag_id_fk | taggings_tag_id_fk | 8       | NULL                           | 4    | Using index condition |
+------+--------------+-------------+--------+------------------------------------+--------------------+---------+--------------------------------+------+-----------------------+
3 rows in set (0.001 sec)

I haven't used mysql in a minute so don't really have an intuition on which is better from eyeballing.

The subquery one is more elegant in any case, so here it is!

@lachie lachie changed the title Fixes #733 by adding .distinct to the tagged query Fixes #733 by refactoring the tagged query Sep 23, 2019
@pushcx
Copy link
Member

pushcx commented Sep 24, 2019

Yep, they're basically equivalent... except the big difference is that distinct prompts MySQL to do an extra sort pass, that's what the Using filesort note is about.

This looks like it'll execute two queries, so there's an extra round-trip to the server. Can you express this in one, perhaps with ActiveRecord#merge?

@lachie
Copy link
Contributor Author

lachie commented Sep 24, 2019

Hi,
I did a #to_sql to verify that its one query which resembles the subquery suggestion:

# assuming submitter, tag1, tag2
repo = StoryRepository.new(submitter)
tagged = repo.tagged([tag1, tag2])
puts tagged.to_sql
#=>
SELECT `stories`.* FROM `stories` 
WHERE `stories`.`merged_story_id` IS NULL 
AND `stories`.`is_expired` = FALSE 
AND ((CAST(upvotes AS signed) - CAST(downvotes AS signed)) >= 0) 
AND `stories`.`id` IN (SELECT `taggings`.`story_id` FROM `taggings` WHERE `taggings`.`tag_id` IN (3, 4)) 
ORDER BY `stories`.`created_at` DESC

Sorry, is that what you meant by one query?

@pushcx
Copy link
Member

pushcx commented Oct 1, 2019

Huh. Yeah, that's exactly what I meant, and I pulled your branch locally to test it out. I'm surprised that the Tagging.select doesn't trigger an independent query. Rails must be doing something very clever under the covers to recognize the query hasn't been evaluated yet and that it can fold it in.

Well, wonderful! Merging. Thanks so much for running this all the way down with a test.

@pushcx pushcx merged commit 225900e into lobsters:master Oct 1, 2019
pushcx pushed a commit that referenced this pull request Jan 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants