SQL interface to git repositories, written in Go. https://docs.sourced.tech/gitbase

Last update: Dec 25, 2022

Comments: 16

gitbase

gitbase is a SQL database interface to Git repositories.

This project is now part of source{d} Community Edition, which provides the simplest way to get started with a single command. Visit https://docs.sourced.tech/community-edition for more information.

It can be used to perform SQL queries about the Git history and about the Universal AST of the code itself. gitbase is being built to work on top of any number of git repositories.

Because gitbase implements the MySQL wire protocol, it can be accessed using any MySQL client or library from any language.

src-d/go-mysql-server is the SQL engine implementation used by gitbase.

Status

The project is currently in alpha stage, meaning it's still lacking performance in a number of cases but we are working hard on getting a performant system able to process thousands of repositories in a single node. Stay tuned!

Examples

You can see some query examples in the gitbase documentation.

Motivation and scope

gitbase was born to ease the analysis of git repositories and their source code.

Also, by making it MySQL-compatible, we provide maximum compatibility with existing languages and tools.

It comes as a single self-contained binary and it can be used as a standalone service. The service is able to process local repositories and integrates with existing tools and frameworks to simplify source code analysis on a large scale. The integration with Apache Spark is planned and is currently under active development.

License

Apache License Version 2.0, see LICENSE

Owner

source{d}

https://github.com/src-d/gitbase

Comments

In-memory caching leads to crash
Issue

In the context of doing topic modeling experiments, @m09 and I tried to use Gitbase to parse all blobs in tagged references of a given repository, in order to extract all identifiers, comments and literals. However, we have not been able to do so successfully with Gitbase, and have had to switch to doing the parsing client side.

The reason for that is that, when querying Gitbase, we see the following behavior:

An increase in memory usage.

No decrease after time goes by.

When all available memory is consumed, an increase in block I/O and a quasi stagnation of the memory consumed by Gitbase at 99.999 ... %, indicating heavy use of Swap memory.

Server crash if the query goes on for too long past that point.

We still see the same behavior when retrieving only the blob contents from Gitbase; however, the memory consumed is then not an issue, as it is much less than when parsing UASTs. We inferred that some caching was going on, and after talking about the issue on the dev-processing channel, we tried to disable the caching. However, it changed nothing. Javi told us that the caching we had disabled was the go-git cache, so it is probably something else.

What we don't understand is why we cannot get rid of the behavior, i.e. why once a blob has been parsed and returned client side it seemingly remains in memory.

Steps to reproduce

Launch gitbase and babelfish containers:

docker run -d --rm --name bblfshd --privileged -p 9432:9432 -m 4g bblfsh/bblfshd:v2.14.0-drivers
docker run -d --rm --name gitbase -p 3306:3306 --link bblfshd:bblfshd -e BBLFSH_ENDPOINT=bblfshd:9432 -m 2g -v /path/to/repos:/opt/repos srcd/gitbase:latest

With /path/to/repos pointing to a repository, for instance pytorch. Then, open two more terminals to monitor what's happening with docker stats, and run queries like this one for pytorch, using for example the MySQL client:

SELECT cf.file_path,
       cf.blob_hash,
       LANGUAGE(cf.file_path) as lang,
       uast_extract(uast(f.blob_content, LANGUAGE(cf.file_path), '//uast:String'), "Value")
FROM repositories r
NATURAL JOIN refs rf
NATURAL JOIN commit_files cf
NATURAL JOIN files f
WHERE r.repository_id = 'pytorch'
  AND is_tag(rf.ref_name)
  AND lang = 'Python'

You should see the memory usage of the gitbase container increase sharply until hitting 2 GB, then a heavy increase in BLOCK I/O, and finally the container will crash.

Empty results on a seemingly correct query

My goal: only see files from the ~equivalent~ of HEAD of PGA siva files (PGA's head is fuzzy, I know).

Download the same dataset:

pga list -l java -f json | head -n 100 | jq -r '.sivaFilenames[]' | pga get -i -o repositories

Index creation is very fast (there are about 125k rows in refs on my 185 siva repos):

CREATE INDEX refs_name_substr ON refs USING pilosalib (SUBSTRING(refs.ref_name,1,15));

My query (runs for 35 seconds and then returns empty results):

SELECT 
    files.repository_id,
    files.file_path
FROM files
NATURAL JOIN commit_files
NATURAL JOIN commits
NATURAL JOIN refs
WHERE 
    SUBSTRING(refs.ref_name,1,15) = 'refs/heads/HEAD';

See if results are returned on my WHERE clause:

 SELECT * FROM refs WHERE SUBSTRING(refs.ref_name,1,15) = 'refs/heads/HEAD';
+---------------+----------+-------------+
| repository_id | ref_name | commit_hash |
+---------------+----------+-------------+
| /home/mthek/projects/demo-vt/repositories/siva/latest/03a0faf87e411ee894be474ac0ebd8e48652df69.siva | refs/heads/HEAD/01612921-7835-16ee-b6a3-e3381810c049 | 7824ae7845d63d5dfae4165f75b14f71d476248f |
| /home/mthek/projects/demo-vt/repositories/siva/latest/03a0faf87e411ee894be474ac0ebd8e48652df69.siva | refs/heads/HEAD/016129f9-edb3-15eb-2d16-e7dac4cd41f6 | 7824ae7845d63d5dfae4165f75b14f71d476248f |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0573f918d9b0822e9ce30b8a8f8a92bbab17300f.siva | refs/heads/HEAD/01612921-765b-ff7c-a5f3-2e12701794fc | 71afe993bd14ee3232caf92b64c05b8514235890 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0573f918d9b0822e9ce30b8a8f8a92bbab17300f.siva | refs/heads/HEAD/016129f9-ebd0-630f-2ae8-8ab9d76198ca | 71afe993bd14ee3232caf92b64c05b8514235890 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/09eccb718faf3ac3d2ac08eeb3deb3d5a403d5fa.siva | refs/heads/HEAD/01612921-7787-3bdf-bbce-e4e525a410ab | c3c7b957295cb8b7d61acf53060bddff4a317505 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/09eccb718faf3ac3d2ac08eeb3deb3d5a403d5fa.siva | refs/heads/HEAD/016129f9-ed12-034f-948f-d7c3a78a727e | c3c7b957295cb8b7d61acf53060bddff4a317505 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/09eccb718faf3ac3d2ac08eeb3deb3d5a403d5fa.siva | refs/heads/HEAD/016129fa-1dd1-944c-44d4-344b03342aad | 0119b6f175a57d57501e4e94ba5f9eafe32a9359 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0a78a10ff25754b510a2423fb40a00cb02f1a44d.siva | refs/heads/HEAD/01612921-76b0-8517-7dd5-1a6745a234e0 | 0a979d145683b62eff62796acbc21ac8766088a0 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0a78a10ff25754b510a2423fb40a00cb02f1a44d.siva | refs/heads/HEAD/016129f9-ec2f-0c49-2bed-25f216edf2c3 | 0a979d145683b62eff62796acbc21ac8766088a0 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0a78a10ff25754b510a2423fb40a00cb02f1a44d.siva | refs/heads/HEAD/016129fb-9cf2-2fc6-1654-045352989fd1 | 5c1da606814d97eaba28d4b7206d126bc23627b3 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0a78a10ff25754b510a2423fb40a00cb02f1a44d.siva | refs/heads/HEAD/016129fd-c3a2-140d-41ab-fca5cc71f6cf | 0a979d145683b62eff62796acbc21ac8766088a0 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0a78a10ff25754b510a2423fb40a00cb02f1a44d.siva | refs/heads/HEAD/01612a00-10e8-3ff7-896f-a970071eecbc | 5c1da606814d97eaba28d4b7206d126bc23627b3 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0be86af052a0368c65020a248d5d13efd1ec74f9.siva | refs/heads/HEAD/016129f9-eb5d-9c72-652b-9d6fa13ad169 | e73dab44213ffa76b3d6a853aecb109929c3e2b5 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0be86af052a0368c65020a248d5d13efd1ec74f9.siva | refs/heads/HEAD/016129fe-9027-203d-aeb5-116fcb9bbf95 | e73dab44213ffa76b3d6a853aecb109929c3e2b5 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0c3197aa444d192843ccb62b8eb80b04eddd2322.siva | refs/heads/HEAD/016129fb-03d5-c928-c444-da8daa361e21 | a6c0d95184c8985423331fe916edee59378f61fe |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0e06624a50bd6d3c46c611c24c8a419e995ad81b.siva | refs/heads/HEAD/01612921-7654-a83e-9a25-bbfb30d9d9ef | 21df6c124e90c6312301bf4fdd61ae98c5486109 |
| /home/mthek/projects/demo-vt/repositories/siva/latest/0e06624a50bd6d3c46c611c24c8a419e995ad81b.siva | refs/heads/HEAD/016129f9-ebc8-ec4b-12c6-ef80c5d902c9 | 21df6c124e90c6312301bf4fdd61ae98c5486109 |
+---------------+----------+-------------+
17 rows in set (0.04 sec)

(number of rows seems low but that is probably because I need to check another pattern for head or master, but that isn't relevant to this bug)

Extra:

I went and checked this query on our staging environment with gitbase-playground, and it works perfectly:

SELECT 
    files.repository_id,
    files.file_path
FROM files
NATURAL JOIN commit_files
NATURAL JOIN commits
NATURAL JOIN refs
WHERE refs.ref_name = 'HEAD'

panic: runtime error: invalid memory address or nil pointer dereference

gitbase v0.19.0-beta4

query:

SELECT 
r.repository_id, SUM(ARRAY_LENGTH(SPLIT(b.blob_content, '\n'))) as lines_count
FROM refs r
NATURAL JOIN commit_blobs ct
NATURAL JOIN blobs b
WHERE r.ref_name = 'HEAD'
GROUP BY r.repository_id
ORDER BY lines_count DESC

Traceback:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x87fe52]

goroutine 6076 [running]:
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/plumbing/cache.(*ObjectLRU).Put(0xc000aa4ff0, 0x14440c0, 0xc01391da40)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/plumbing/cache/object_lru.go:64 +0x352
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/storage/filesystem.(*ObjectStorage).getFromUnpacked(0xc017bec868, 0xa9112fc6650d62b5, 0x83f2209191640cdc, 0xa7d9180f, 0x14440c0, 0xc01391da40, 0x0, 0x0)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/storage/filesystem/object.go:344 +0x39a
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/storage/filesystem.(*ObjectStorage).EncodedObject(0xc017bec868, 0x112fc6650d62b503, 0xf2209191640cdca9, 0xa7d9180f83, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/storage/filesystem/object.go:254 +0x3eb
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/plumbing/object.GetBlob(0x1441f00, 0xc017bec850, 0xa9112fc6650d62b5, 0x83f2209191640cdc, 0xa7d9180f, 0x650d62b5000081a4, 0x91640cdca9112fc6, 0xa7d9180f83f22091)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/plumbing/object/blob.go:23 +0x4e
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/plumbing/object.(*FileIter).Next(0xc01e6b4b40, 0x4211e8, 0xc00498dd10, 0x7efccf3a1cb3)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-git.v4/plumbing/object/file.go:100 +0x136
github.com/src-d/gitbase.(*squashCommitBlobsIter).Advance(0xc017becbd0, 0xc, 0xc00113fb01)
	/go/src/github.com/src-d/gitbase/squash_iterator.go:2754 +0x7c
github.com/src-d/gitbase.(*squashCommitBlobBlobsIter).Advance(0xc02245c5a0, 0x5c6fc722, 0x27c2150a)
	/go/src/github.com/src-d/gitbase/squash_iterator.go:3051 +0x49
github.com/src-d/gitbase.(*chainableRowIter).Next(0xc00b3d0370, 0x5777dccfb2, 0x2118820, 0x20, 0x25, 0xc00113fc80)
	/go/src/github.com/src-d/gitbase/squash.go:150 +0x37
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql.(*spanIter).Next(0xc02245c5f0, 0xc000063290, 0xc000062000, 0xc00113fcb0, 0x414d10, 0xc0044c4b90)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/session.go:346 +0x5d
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*trackedRowIter).Next(0xc0227dad20, 0x50, 0x44b4f8, 0x52307915bd55c, 0x27c2138b, 0x27c2138b0113fd68)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/process.go:145 +0x37
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*FilterIter).Next(0xc00bb17e00, 0x5777dcce26, 0x2118820, 0x3, 0x3, 0xc00113feb2)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/filter.go:105 +0x38
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql.(*spanIter).Next(0xc02245c780, 0xc00113fef8, 0x44b4f8, 0x52307915bd4b6, 0xc027c212db, 0x27c212db0113fe20)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/session.go:346 +0x5d
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*iter).Next(0xc0227dad40, 0x5777dccd80, 0x2118820, 0x4dd96c, 0xc02226c440, 0xc026896120)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/project.go:129 +0x38
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql.(*spanIter).Next(0xc02245c7d0, 0xc00113feac, 0x3, 0x2, 0x0, 0x0)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/session.go:346 +0x5d
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*exchangeRowIter).iterPartition(0xc01b52a5a0, 0x142d620, 0xc00b3d01b0)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/exchange.go:245 +0x251
github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*exchangeRowIter).start.func1(0xc01b52a5a0, 0xc0217396b0, 0x142d620, 0xc00b3d01b0)
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/exchange.go:170 +0x3f
created by github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan.(*exchangeRowIter).start
	/go/src/github.com/src-d/gitbase/vendor/gopkg.in/src-d/go-mysql-server.v0/sql/plan/exchange.go:169 +0x10d

Increase go-git cache size

Right now there is no way of changing the default cache size in go-git, and its size is too small (96 MiB). I've been doing tests changing this value, and performance improved a lot.

Repositories: linux (2013), numpy, tensorflow
Number of rows: 395709
Query: SELECT count(*) FROM commits c NATURAL JOIN ref_commits r WHERE r.ref_name = 'HEAD';

Default cache: 1 row in set (54 min 22.17 sec)
Cache size * 8: 1 row in set (20 min 43.69 sec)

Memory consumption is also not too big. gitbase used 1.3 GiB in this query.

We should add an option to go-git Open to select cache size.
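To illustrate why the cache size can matter this much, here is a minimal sketch of a byte-bounded LRU cache replayed over a repeating access pattern. It is a toy model, not go-git's ObjectLRU: the type byteLRU, the function replay, and the workload (200 objects of 1 MiB read twice in the same order) are all hypothetical, and a cyclic scan is a worst case for LRU, so real queries will likely behave less dramatically.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached object: its key and its size in bytes.
type entry struct {
	key  string
	size int64
}

// byteLRU is a toy LRU cache bounded by total bytes.
// Hypothetical type for illustration; it is not go-git's ObjectLRU.
type byteLRU struct {
	maxSize, used int64
	order         *list.List // front = most recently used
	items         map[string]*list.Element
	hits, misses  int
}

func newByteLRU(maxSize int64) *byteLRU {
	return &byteLRU{maxSize: maxSize, order: list.New(), items: map[string]*list.Element{}}
}

// access records a lookup of key. On a miss it stores the object,
// evicting least recently used entries until it fits.
func (c *byteLRU) access(key string, size int64) {
	if el, ok := c.items[key]; ok {
		c.hits++
		c.order.MoveToFront(el)
		return
	}
	c.misses++
	if size > c.maxSize {
		return // larger than the whole cache: never stored
	}
	for c.used+size > c.maxSize {
		oldest := c.order.Back()
		e := oldest.Value.(entry)
		c.order.Remove(oldest)
		delete(c.items, e.key)
		c.used -= e.size
	}
	c.items[key] = c.order.PushFront(entry{key, size})
	c.used += size
}

// replay reads n objects of objSize bytes twice, in the same order,
// and returns the number of cache hits.
func replay(maxSize int64, n int, objSize int64) int {
	c := newByteLRU(maxSize)
	for pass := 0; pass < 2; pass++ {
		for i := 0; i < n; i++ {
			c.access(fmt.Sprintf("obj-%d", i), objSize)
		}
	}
	return c.hits
}

func main() {
	const mib = 1 << 20
	fmt.Println("hits with 96 MiB cache: ", replay(96*mib, 200, mib))
	fmt.Println("hits with 768 MiB cache:", replay(768*mib, 200, mib))
}
```

In this toy workload the working set (200 MiB) does not fit in 96 MiB, so every object is evicted before it is read again; once the cache is large enough to hold the whole working set, the second pass is served entirely from memory.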

Parsing C# doesn't work

Run this query on any C# repository:

SELECT UAST(f.blob_content, LANGUAGE(f.file_path, f.blob_content)) AS uast
FROM refs AS r
NATURAL JOIN commit_files
NATURAL JOIN files AS f
WHERE r.ref_name = 'HEAD' AND f.file_path REGEXP('.*.cs')
LIMIT 5

It will return empty UASTs.
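Separately from the parsing problem itself, the pattern '.*.cs' in the query above is looser than it looks: the dot before "cs" is unescaped and the pattern is not anchored. The sketch below uses Go's regexp package to show which paths such a pattern accepts. It assumes that REGEXP in gitbase performs an unanchored match with similar semantics, which is not confirmed by the report; the helper matchesPath, the stricter pattern `\.cs$`, and the sample paths are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesPath reports whether pattern matches anywhere in path
// (an unanchored match, as regexp.MatchString does).
func matchesPath(pattern, path string) bool {
	return regexp.MustCompile(pattern).MatchString(path)
}

func main() {
	loose := `.*.cs`  // pattern used in the report
	strict := `\.cs$` // hypothetical stricter alternative: literal dot, anchored at the end

	// Hypothetical sample paths.
	paths := []string{"src/Program.cs", "web/site.css", "docs/topics.md"}
	for _, p := range paths {
		fmt.Printf("%-16s loose=%v strict=%v\n", p, matchesPath(loose, p), matchesPath(strict, p))
	}
}
```

If the loose pattern also selects non-C# files, filtering on the detected language or on a stricter pattern may help to separate "wrong files selected" from "C# driver returns nothing".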

A couple of questions / issues on the standalone installation of gitbase

I have done a standalone installation of the gitbase server. I started the server, providing the git directory path. When I launch the mysql client using the following command:

mysql -q -u root -h 127.0.0.1

I get the mysql prompt. When I execute the following query:

mysql> select * from repositories;
+---------------+
| repository_id |
+---------------+
+---------------+

I get an empty result. I'm not sure how to troubleshoot this issue, as there is not much documentation on how to stop or purge the gitbase server.

Make mysqldump work with gitbase

Right now, if you try to run mysqldump, you get the following error:

mysqldump --all-databases --port=3306 --host=localhost --protocol=tcp --user=root

mysqldump: Couldn't execute '/*!40100 SET @@SQL_MODE='' */': unknown error: syntax error at position 30 (1105)

Negation Expression on indexed column not working correctly

Executing the following query:

mysql> select count(*) from commits where commit_author_email='[email protected]' group by repository_id;
+----------+
| COUNT(*) |
+----------+
|     1213 |
+----------+
1 row in set (0,37 sec)

All appears to be good, but if we want the count of commits that don't have that commit author email:

mysql> select count(*) from commits where commit_author_email!='[email protected]' group by repository_id;
+----------+
| COUNT(*) |
+----------+
|     6569 |
+----------+
1 row in set (1,00 sec)

The result appears to be incorrect. The complete row count, from the indexing log:

DEBU[2114] finished pilosa indexing                      duration=34m42.734502461s id=commits_author_email_idx mapping=30.782425411s pilosa=7.745289374s rows=2567829

So I suppose the second query should return 2567829 - 1213 = 2566616.
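The expectation above rests on a simple invariant: for a column without NULLs, the rows matching `=` and the rows matching `!=` partition the table, so the two counts must add up to the total. A minimal sketch of that check follows; the helper countEqNeq and the sample e-mail addresses are hypothetical, and it assumes the rows= figure in the log covers the same rows as the queries.

```go
package main

import "fmt"

// countEqNeq counts the values equal to target and the values different from it.
func countEqNeq(values []string, target string) (eq, neq int) {
	for _, v := range values {
		if v == target {
			eq++
		} else {
			neq++
		}
	}
	return eq, neq
}

func main() {
	// Hypothetical sample data, not taken from the report.
	emails := []string{"a@example.com", "b@example.com", "a@example.com", "c@example.com"}
	eq, neq := countEqNeq(emails, "a@example.com")
	fmt.Println("eq:", eq, "neq:", neq, "partition holds:", eq+neq == len(emails))

	// Figures from the report: total indexed rows and rows matching the e-mail.
	total, matching := 2567829, 1213
	fmt.Println("expected non-matching rows:", total-matching)
}
```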

Low perf on NATURAL JOINs

I have these two requests which should perform basically the same:

SELECT f.repository_id, COUNT(*) as n
FROM   files AS f
       JOIN commit_files cf ON
            f.repository_id=cf.repository_id AND
            f.file_path=cf.file_path AND
            f.blob_hash=cf.blob_hash AND
            f.tree_hash=cf.tree_hash
       JOIN refs ON
            cf.repository_id = refs.repository_id AND
            cf.commit_hash = refs.commit_hash
WHERE  ref_name = 'HEAD'
GROUP BY f.repository_id
ORDER BY n DESC

and its NATURAL JOIN equivalent

SELECT f.repository_id, COUNT(*) as n
FROM   files AS f
       NATURAL JOIN commit_files cf
       NATURAL JOIN refs
WHERE  ref_name = 'HEAD'
GROUP BY f.repository_id
ORDER BY n DESC

Unfortunately, while the first one finishes after a couple of seconds, the second one takes double that. I analyzed their EXPLAIN output and saw there's a tiny difference and wonder whether this could be the culprit.

For the first JOIN ON version, the plan is:

Sort(n DESC)
 └─ Project(files.repository_id, COUNT(*) as n)
     └─ GroupBy
         ├─ Aggregate(files.repository_id, COUNT(*))
         ├─ Grouping(files.repository_id)
         └─ Exchange(parallelism=96)
             └─ SquashedTable(refs, commit_files, files)
                 ├─ Columns
                 │   ├─ Column(repository_id, TEXT, nullable=false)
                 │   ├─ Column(file_path, TEXT, nullable=false)
                 │   ├─ Column(blob_hash, TEXT, nullable=false)
                 │   ├─ Column(tree_hash, TEXT, nullable=false)
                 │   ├─ Column(tree_entry_mode, TEXT, nullable=false)
                 │   ├─ Column(blob_content, BLOB, nullable=false)
                 │   ├─ Column(blob_size, INT64, nullable=false)
                 │   ├─ Column(repository_id, TEXT, nullable=false)
                 │   ├─ Column(commit_hash, TEXT, nullable=false)
                 │   ├─ Column(file_path, TEXT, nullable=false)
                 │   ├─ Column(blob_hash, TEXT, nullable=false)
                 │   ├─ Column(tree_hash, TEXT, nullable=false)
                 │   ├─ Column(repository_id, TEXT, nullable=false)
                 │   ├─ Column(ref_name, TEXT, nullable=false)
                 │   └─ Column(commit_hash, TEXT, nullable=false)
                 └─ Filters
                     ├─ commit_files.repository_id = refs.repository_id
                     ├─ commit_files.commit_hash = refs.commit_hash
                     ├─ files.repository_id = commit_files.repository_id
                     ├─ files.file_path = commit_files.file_path
                     ├─ files.blob_hash = commit_files.blob_hash
                     ├─ files.tree_hash = commit_files.tree_hash
                     └─ refs.ref_name = "HEAD"

While for the one with NATURAL JOIN:

Sort(n DESC)
 └─ Project(files.repository_id, COUNT(*) as n)
     └─ GroupBy
         ├─ Aggregate(files.repository_id, COUNT(*))
         ├─ Grouping(files.repository_id)
         └─ Exchange(parallelism=96)
             └─ Project(files.repository_id, commit_files.commit_hash, files.file_path, files.blob_hash, files.tree_hash, files.tree_entry_mode, files.blob_content, files.blob_size, refs.ref_name)
                 └─ Filter(files.repository_id = refs.repository_id)
                     └─ SquashedTable(refs, commit_files, files)
                         ├─ Columns
                         │   ├─ Column(repository_id, TEXT, nullable=false)
                         │   ├─ Column(file_path, TEXT, nullable=false)
                         │   ├─ Column(blob_hash, TEXT, nullable=false)
                         │   ├─ Column(tree_hash, TEXT, nullable=false)
                         │   ├─ Column(tree_entry_mode, TEXT, nullable=false)
                         │   ├─ Column(blob_content, BLOB, nullable=false)
                         │   ├─ Column(blob_size, INT64, nullable=false)
                         │   ├─ Column(repository_id, TEXT, nullable=false)
                         │   ├─ Column(commit_hash, TEXT, nullable=false)
                         │   ├─ Column(file_path, TEXT, nullable=false)
                         │   ├─ Column(blob_hash, TEXT, nullable=false)
                         │   ├─ Column(tree_hash, TEXT, nullable=false)
                         │   ├─ Column(repository_id, TEXT, nullable=false)
                         │   ├─ Column(ref_name, TEXT, nullable=false)
                         │   └─ Column(commit_hash, TEXT, nullable=false)
                         └─ Filters
                             ├─ commit_files.commit_hash = refs.commit_hash
                             ├─ files.repository_id = commit_files.repository_id
                             ├─ files.file_path = commit_files.file_path
                             ├─ files.blob_hash = commit_files.blob_hash
                             ├─ files.tree_hash = commit_files.tree_hash
                             └─ refs.ref_name = "HEAD"

Is it possible that the extra Project and Filter right above the SquashedTable can cause such a change in performance?

surprising performance issue

I just ran this query on top of github.com/golang/go:

  SELECT
  	LANGUAGE(t.tree_entry_name, b.blob_content) as lang,
	t.tree_entry_name as name,
       b.blob_content as code
  FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash

This finishes in 0.65s :tada:

Unfortunately this other request takes forever:

  SELECT
	t.tree_entry_name as name,
        b.blob_content as code
  FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash
   WHERE LANGUAGE(t.tree_entry_name, b.blob_content) = 'go'

Trying to see whether I could find a workaround I wrote this second query:

SELECT name, code
FROM 
(
  SELECT
	LANGUAGE(t.tree_entry_name, b.blob_content) = 'go' as lang,
	t.tree_entry_name as name,    
        b.blob_content as code
  FROM refs r 
       JOIN commits c ON r.commit_hash = c.commit_hash
       JOIN commit_trees ct ON c.commit_hash = ct.commit_hash
       JOIN tree_entries t ON ct.tree_hash = t.tree_hash
       JOIN blobs b ON t.blob_hash = b.blob_hash
) as blobs
WHERE lang = 'go'

Both of these requests take too long for me to wait.

Gitbase doesn't work on Windows with mounted directory for indexes

Gitbase 0.19.0.

Screenshots because it's hard to copy-paste from a remote Windows console.

Client:

Server:

The error:

unable to save the index open file: open ...: invalid argument

But the file is actually created on host file system:

I tried to read the source code and see where the error can come from. I found this line: https://github.com/pilosa/pilosa/blob/f2994736585a8aafc2f2c47c3698b7acd3b95373/fragment.go#L199 which looks like the right place.

I tried to reproduce it by creating simple script and running it inside docker with mounted directory:

package main

import (
	"fmt"
	"os"
)

func main() {
	asPilosa := "/mounted/asPilosa"
	_, err := os.OpenFile(asPilosa, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0666)
	if err != nil {
		fmt.Printf("open file as pilosa: %s\n", err)
	}
}

but no luck. It creates the file without error.

It's super inconvenient to debug inside a remote desktop, so I didn't go further by rebuilding gitbase with extra debug messages.

The community-edition link is broken

The community-edition link in the README is broken and I can't reach it. Did something go wrong and the documentation was not updated, or is it my unstable network?

Java JDBC Connection Error: unknown error: expecting "EOF" but got 'V' instead

Hi, I start gitbase with the command below:

docker run -itd --name git_base --env GITBASE_PASSWORD=root -p 3344:3306 -v /Users/code/test:/opt/repos srcd/gitbase:latest

and connect to it from Java over JDBC:

Connection connection=JDBCUtils.getConnection("jdbc:mysql://127.0.0.1:3344/gitbase","root","root");

but it prints this error:

Exception in thread "main" java.sql.SQLException: unknown error: expecting "EOF" but got 'V' instead
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:996)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
	at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
	at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
	at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
	at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2526)
	at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2484)
	at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1446)
	at com.mysql.jdbc.ConnectionImpl.loadServerVariables(ConnectionImpl.java:3828)
	at com.mysql.jdbc.ConnectionImpl.initializePropsFromServer(ConnectionImpl.java:3268)
	at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2278)
	at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2064)
	at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:790)
	at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:44)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
	at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:395)

Is there any way to solve it? Thanks.

Missing function in Gitbase DB (MariaDB)

Hey,

when I try to run the following example in the db

SELECT repository_id, file_path,
       JSON_UNQUOTE(JSON_EXTRACT(bl, "$.linenum")),
       JSON_UNQUOTE(JSON_EXTRACT(bl, "$.author")),
       JSON_UNQUOTE(JSON_EXTRACT(bl, "$.text"))
FROM   (SELECT repository_id, file_path,
               EXPLODE(BLAME(repository_id, commit_hash, file_path)) AS bl
        FROM   ref_commits
               NATURAL JOIN blobs
               NATURAL JOIN commit_files
        WHERE  ref_name = 'HEAD'
               AND NOT IS_BINARY(blob_content)
        ) as p
WHERE  JSON_EXTRACT(bl, "$.text") LIKE '%// TODO%';

I get the following error

ERROR 1105 (HY000): unknown error: A function: 'blame' not found.

I'm new to source{d} and using the community edition. Could you guys point me in the right direction? For some reason SHOW FUNCTION STATUS isn't working either, so I'm having problems debugging this.

Natural join seems to eliminate rows which it shouldn't

MySQL [gitbase]> select blob_hash, repository_id from blobs natural join repositories where blob_hash in ('93ec5b4525363844ddb1981adf1586ebddbc21c1', 'aad34590345310fe813fd1d9eff868afc4cea10c', 'ed82eb69daf806e521840f4320ea80d4fe0af435');
+------------------------------------------+-------------------------------------+
| blob_hash                                | repository_id                       |
+------------------------------------------+-------------------------------------+
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/javascript-driver |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/enry               |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/python-driver     |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/go-mysql-server    |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/ruby-driver       |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/gitbase            |
+------------------------------------------+-------------------------------------+
6 rows in set (14.90 sec)

MySQL [gitbase]> select blob_hash, repository_id from blobs where blob_hash in ('93ec5b4525363844ddb1981adf1586ebddbc21c1', 'aad34590345310fe813fd1d9eff868afc4cea10c', 'ed82eb69daf806e521840f4320ea80d4fe0af435');
+------------------------------------------+-------------------------------------+
| blob_hash                                | repository_id                       |
+------------------------------------------+-------------------------------------+
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/python-driver     |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/javascript-driver |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/enry               |
| aad34590345310fe813fd1d9eff868afc4cea10c | github.com/bblfsh/ruby-driver       |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/gitbase            |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/gitbase            |
| 93ec5b4525363844ddb1981adf1586ebddbc21c1 | github.com/src-d/go-mysql-server    |
| ed82eb69daf806e521840f4320ea80d4fe0af435 | github.com/src-d/go-mysql-server    |
+------------------------------------------+-------------------------------------+
8 rows in set (0.13 sec)

Also note that removing the natural join makes things go much faster. It was my understanding that normally we want to join with repositories to benefit from some specific optimizations (although I'm guessing that filtering with blob_hash makes those optimizations moot).
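For reference, the expected semantics can be sketched outside the database. Since repositories shares only the repository_id column with blobs, a natural join between them should keep every blobs row whose repository_id exists in repositories. The sketch below replays the eight rows from the second result above through such a join; it assumes that the repositories table contains exactly the six repository IDs seen in the output, and the names blobRow, naturalJoinOnRepo, reportedBlobs and reportedRepos are hypothetical.

```go
package main

import "fmt"

// blobRow holds the two columns selected in the report.
type blobRow struct{ hash, repo string }

// naturalJoinOnRepo keeps each blob row whose repository_id appears in repos,
// which is what a natural join on the single shared column should produce.
func naturalJoinOnRepo(blobs []blobRow, repos []string) []blobRow {
	known := make(map[string]bool, len(repos))
	for _, r := range repos {
		known[r] = true
	}
	var out []blobRow
	for _, b := range blobs {
		if known[b.repo] {
			out = append(out, b)
		}
	}
	return out
}

// reportedBlobs returns the eight rows of the query without the join.
func reportedBlobs() []blobRow {
	const (
		h93 = "93ec5b4525363844ddb1981adf1586ebddbc21c1"
		haa = "aad34590345310fe813fd1d9eff868afc4cea10c"
		hed = "ed82eb69daf806e521840f4320ea80d4fe0af435"
	)
	return []blobRow{
		{haa, "github.com/bblfsh/python-driver"},
		{haa, "github.com/bblfsh/javascript-driver"},
		{hed, "github.com/src-d/enry"},
		{haa, "github.com/bblfsh/ruby-driver"},
		{h93, "github.com/src-d/gitbase"},
		{hed, "github.com/src-d/gitbase"},
		{h93, "github.com/src-d/go-mysql-server"},
		{hed, "github.com/src-d/go-mysql-server"},
	}
}

// reportedRepos lists the repository IDs seen in the output (assumed complete).
func reportedRepos() []string {
	return []string{
		"github.com/bblfsh/javascript-driver",
		"github.com/bblfsh/python-driver",
		"github.com/bblfsh/ruby-driver",
		"github.com/src-d/enry",
		"github.com/src-d/gitbase",
		"github.com/src-d/go-mysql-server",
	}
}

func main() {
	joined := naturalJoinOnRepo(reportedBlobs(), reportedRepos())
	fmt.Println("rows after join:", len(joined))
}
```

Under these assumptions the join should return all eight rows, so the six rows returned by gitbase point to rows being dropped rather than to a property of the data.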

Schema introspection (SHOW FULL COLUMNS ...) became very slow

In gitbase 0.20 schema introspection is fast and full.

The MySQL Connector/J JDBC metadata call that gets all columns for all tables at once, metaData.getColumns("gitbase", "", "%", "%"), is converted to calls like the following for each table: SHOW FULL COLUMNS FROM `commit_trees` FROM `gitbase` LIKE '%'

In 0.23 and 0.24-rc the above queries are very slow (several minutes) and even fail for some tables completely in 0.23 (0.24 seems to fix that).

This prevents gitbase from being used in DB tools such as JetBrains DataGrip.