I have these two requests which should perform basically the same:
SELECT f.repository_id, COUNT(*) as n
FROM files AS f
JOIN commit_files cf ON
f.repository_id=cf.repository_id AND
f.file_path=cf.file_path AND
f.blob_hash=cf.blob_hash AND
f.tree_hash=cf.tree_hash
JOIN refs ON
cf.repository_id = refs.repository_id AND
cf.commit_hash = refs.commit_hash
WHERE ref_name = 'HEAD'
GROUP BY f.repository_id
ORDER BY n DESC
and its NATURAL JOIN
equivalent
SELECT f.repository_id, COUNT(*) as n
FROM files AS f
NATURAL JOIN commit_files cf
NATURAL JOIN refs
WHERE ref_name = 'HEAD'
GROUP BY f.repository_id
ORDER BY n DESC
Unfortunately, while the first one finishes after a couple of seconds, the second one takes double that.
I analyzed their EXPLAIN
output and saw there's a tiny difference and wonder whether this could be the culprit.
For the first JOIN ON
version, the plan is:
Sort(n DESC)
└─ Project(files.repository_id, COUNT(*) as n)
└─ GroupBy
├─ Aggregate(files.repository_id, COUNT(*))
├─ Grouping(files.repository_id)
└─ Exchange(parallelism=96)
└─ SquashedTable(refs, commit_files, files)
├─ Columns
│ ├─ Column(repository_id, TEXT, nullable=false)
│ ├─ Column(file_path, TEXT, nullable=false)
│ ├─ Column(blob_hash, TEXT, nullable=false)
│ ├─ Column(tree_hash, TEXT, nullable=false)
│ ├─ Column(tree_entry_mode, TEXT, nullable=false)
│ ├─ Column(blob_content, BLOB, nullable=false)
│ ├─ Column(blob_size, INT64, nullable=false)
│ ├─ Column(repository_id, TEXT, nullable=false)
│ ├─ Column(commit_hash, TEXT, nullable=false)
│ ├─ Column(file_path, TEXT, nullable=false)
│ ├─ Column(blob_hash, TEXT, nullable=false)
│ ├─ Column(tree_hash, TEXT, nullable=false)
│ ├─ Column(repository_id, TEXT, nullable=false)
│ ├─ Column(ref_name, TEXT, nullable=false)
│ └─ Column(commit_hash, TEXT, nullable=false)
└─ Filters
├─ commit_files.repository_id = refs.repository_id
├─ commit_files.commit_hash = refs.commit_hash
├─ files.repository_id = commit_files.repository_id
├─ files.file_path = commit_files.file_path
├─ files.blob_hash = commit_files.blob_hash
├─ files.tree_hash = commit_files.tree_hash
└─ refs.ref_name = "HEAD"
While for the one with NATURAL JOIN
:
Sort(n DESC)
└─ Project(files.repository_id, COUNT(*) as n)
└─ GroupBy
├─ Aggregate(files.repository_id, COUNT(*))
├─ Grouping(files.repository_id)
└─ Exchange(parallelism=96)
└─ Project(files.repository_id, commit_files.commit_hash, files.file_path, files.blob_hash, files.tree_hash, files.tree_entry_mode, files.blob_content, files.blob_size, refs.ref_name)
└─ Filter(files.repository_id = refs.repository_id)
└─ SquashedTable(refs, commit_files, files)
├─ Columns
│ ├─ Column(repository_id, TEXT, nullable=false)
│ ├─ Column(file_path, TEXT, nullable=false)
│ ├─ Column(blob_hash, TEXT, nullable=false)
│ ├─ Column(tree_hash, TEXT, nullable=false)
│ ├─ Column(tree_entry_mode, TEXT, nullable=false)
│ ├─ Column(blob_content, BLOB, nullable=false)
│ ├─ Column(blob_size, INT64, nullable=false)
│ ├─ Column(repository_id, TEXT, nullable=false)
│ ├─ Column(commit_hash, TEXT, nullable=false)
│ ├─ Column(file_path, TEXT, nullable=false)
│ ├─ Column(blob_hash, TEXT, nullable=false)
│ ├─ Column(tree_hash, TEXT, nullable=false)
│ ├─ Column(repository_id, TEXT, nullable=false)
│ ├─ Column(ref_name, TEXT, nullable=false)
│ └─ Column(commit_hash, TEXT, nullable=false)
└─ Filters
├─ commit_files.commit_hash = refs.commit_hash
├─ files.repository_id = commit_files.repository_id
├─ files.file_path = commit_files.file_path
├─ files.blob_hash = commit_files.blob_hash
├─ files.tree_hash = commit_files.tree_hash
└─ refs.ref_name = "HEAD"
Is it possible that the extra Project
and Filter
right above the SquashedTable
can cause such a change in performance?