Identifying Repositories

February 18, 2018

A natural first step is to use a git repository’s clone uri to uniquely identify it — but don’t!

What is a clone uri? It’s what you pass to the git clone command. So if you’re using github and copying directly out of the UI, you’ll end up with one of these:

$ git clone https://github.com/sghill/ci-samples.git
$ git clone git@github.com:sghill/ci-samples.git

I went down this painful road once when building a repository-aware system — now I like having a repository key instead. Here’s an example RepositoryKey in java:

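public class RepositoryKey {
    private final String hostKey;
    private final String owner;
    private final String name;

    // The final fields must be assigned here, or the class will not compile.
    public RepositoryKey(String hostKey, String owner, String name) {
        this.hostKey = hostKey;
        this.owner = owner;
        this.name = name;
    }

    // Renders as hostKey/owner/name, for example gh/sghill/ci-samples
    @Override public String toString() {
        return hostKey + "/" + owner + "/" + name;
    }
}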

Why Bother?

Let’s look at a few scenarios where a system can easily end up with duplicate data by using a repository’s clone uri as the unique identifier.

Scheme Differences

Git repository hosts generally let users clone over https or ssh. Teams typically stick to one format, but if nothing enforces that, any repository-aware system is going to end up with distinct records that refer to the exact same repository:

id uri
1 https://github.com/sghill/ci-samples.git
2 git@github.com:sghill/ci-samples.git
3 ssh://git@github.com/sghill/ci-samples.git

One solution to this is to translate all types of clone uri into the preferred scheme.
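
As a rough sketch of that translation, here is one way it could look in Java. The CloneUriTranslator class and its regular expressions are my own illustration, not part of any library. They only cover the three forms in the table above, and they reject uris that carry a port or a nested path:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CloneUriTranslator {
    // URL-style: https://host/owner/name.git or ssh://[user@]host/owner/name.git
    private static final Pattern URL_STYLE = Pattern.compile(
            "^(?:https|ssh)://((?:[^@/]+@)?)([^/:]+)/([^/]+)/([^/]+?)(?:\\.git)?/?$");
    // scp-style: [user@]host:owner/name.git
    private static final Pattern SCP_STYLE = Pattern.compile(
            "^((?:[^@/]+@)?)([^/:]+):([^/]+)/([^/]+?)(?:\\.git)?/?$");

    /** Rewrites a recognized clone uri into the ssh:// form, keeping any user as given. */
    static String toSsh(String cloneUri) {
        String input = cloneUri.trim();
        for (Pattern pattern : new Pattern[] {URL_STYLE, SCP_STYLE}) {
            Matcher m = pattern.matcher(input);
            if (m.matches()) {
                // groups: 1 = "user@" or "", 2 = host, 3 = owner, 4 = name
                return "ssh://" + m.group(1) + m.group(2) + "/"
                        + m.group(3) + "/" + m.group(4) + ".git";
            }
        }
        throw new IllegalArgumentException("Unrecognized clone uri: " + cloneUri);
    }
}
```

Rows 2 and 3 both come out as ssh://git@github.com/sghill/ci-samples.git. Row 1 never had a user, so it becomes ssh://github.com/sghill/ci-samples.git.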

User Included?

If we’ve decided to translate every clone uri to the ssh scheme, we can still end up with duplicate data because of differences in the user portion of the uri:

id uri
1 ssh://git@github.com/sghill/ci-samples.git
2 ssh://github.com/sghill/ci-samples.git

For this problem, we could add one more step to our translation, ensuring there is always a user present.
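
That extra step could be sketched with java.net.URI. The SshUsers class is a made-up name for this example, and the default user is passed in by the caller because a real host may expect something other than git:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SshUsers {
    /** Returns the uri unchanged if it already has a user, otherwise adds defaultUser. */
    static String ensureUser(String sshUri, String defaultUser) {
        URI uri = URI.create(sshUri);
        if (uri.getUserInfo() != null) {
            return sshUri;
        }
        try {
            // Rebuild the uri from its parts with the user filled in.
            return new URI(uri.getScheme(), defaultUser, uri.getHost(), uri.getPort(),
                    uri.getPath(), uri.getQuery(), uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot add user to: " + sshUri, e);
        }
    }
}
```

Calling SshUsers.ensureUser on either row above with a default user of git gives ssh://git@github.com/sghill/ci-samples.git.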

Casing

Now let’s imagine someone who uses our system has a script that always capitalizes the first letter of the owner’s name. Git happily cloned the same repository, but we’ve ended up with duplicate data again:

id uri
1 ssh://git@github.com/sghill/ci-samples.git
2 ssh://git@github.com/Sghill/ci-samples.git

We can avoid this by adding a downcase step to our translation layer or querying our datastore case-insensitively.
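
A minimal sketch of the downcase step follows. It assumes the host treats owner and repository names case-insensitively, which won't hold for every git host, and the Casing class name is made up for this example. Passing Locale.ROOT keeps the result independent of the machine's default locale:

```java
import java.util.Locale;

final class Casing {
    /** Trims and lowercases one part of a clone uri (hostname, owner, or name). */
    static String normalize(String part) {
        return part.trim().toLowerCase(Locale.ROOT);
    }
}
```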

Cnames

We host our open-source stuff on github, but our internal code lives on a different git host. It was hosted at scm-internal.buildlab.io, but now the preferred host is scm.internal.buildlab.io. Of course breaking backwards compatibility wasn’t in anyone’s interest, so the old one is a cname to the new one and both continue to function.

Using a clone uri works quite nicely for supporting multiple git hosts, but we still end up with duplicate data when users access the same repository through two different hostnames:

id uri
1 ssh://git@scm-internal.buildlab.io/sghill/ci-samples.git
2 ssh://git@scm.internal.buildlab.io/sghill/ci-samples.git

We’re going to need another step in the translation layer to map old hostnames to the preferred hostname.
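
One simple way to sketch that step is a lookup table. The PreferredHosts class is hypothetical, and a real system would probably load the aliases from configuration instead of hard-coding them:

```java
import java.util.Map;

final class PreferredHosts {
    // Illustrative alias table: old hostname -> preferred hostname.
    private static final Map<String, String> ALIASES = Map.of(
            "scm-internal.buildlab.io", "scm.internal.buildlab.io");

    /** Returns the preferred hostname, or the input unchanged if it has no alias. */
    static String preferred(String hostname) {
        return ALIASES.getOrDefault(hostname, hostname);
    }
}
```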

Testing Internal Deploys

Our internal git hosting infrastructure gets updated from time to time, and we need to see how the new version will perform before deploying it to production. Luckily our repository-aware system can give us uris to real-world repositories, but they’ll all be pointing to the production environment - and we definitely don’t need to load test production while it’s under production load.

To get around this we can replace all the hostnames of the given clone uris before we clone them in our load test to ensure they act on the correct environment.
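
Here is a sketch of that replacement, assuming the uris are already in a URL-style form (ssh:// or https://). The HostSwap class and the staging hostname in the usage note are hypothetical:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class HostSwap {
    /** Returns the same ssh:// or https:// uri pointed at a different host. */
    static String withHost(String cloneUri, String newHost) {
        URI uri = URI.create(cloneUri);
        try {
            // Keep scheme, user, port, and path; only the host changes.
            return new URI(uri.getScheme(), uri.getUserInfo(), newHost, uri.getPort(),
                    uri.getPath(), uri.getQuery(), uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rewrite host for: " + cloneUri, e);
        }
    }
}
```

For example, HostSwap.withHost("ssh://git@scm.internal.buildlab.io/sghill/ci-samples.git", "scm.staging.buildlab.io") gives ssh://git@scm.staging.buildlab.io/sghill/ci-samples.git.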

Help

Lastly, sometimes we’ll end up with a link to the web ui, and it’d be helpful to easily get more context about this repository from our system:

Hey, I’m having some trouble with sghill/ci-samples on github

If everything we have is based off of clone uris, we’re going to have to translate this web ui link into our preferred clone uri before querying our system about it.
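
A sketch of pulling the pieces out of a web ui link might look like this. It assumes the host puts the owner and repository name in the first two path segments, as github does, and the WebLink class is a made-up name for this example:

```java
import java.net.URI;

final class WebLink {
    /** Returns {host, owner, name} from a web ui link, ignoring any deeper path. */
    static String[] parse(String link) {
        URI uri = URI.create(link.trim());
        String path = uri.getPath() == null ? "" : uri.getPath();
        String[] segments = path.replaceAll("^/+", "").split("/");
        if (uri.getHost() == null || segments.length < 2 || segments[1].isEmpty()) {
            throw new IllegalArgumentException("Not a repository link: " + link);
        }
        String name = segments[1];
        if (name.endsWith(".git")) {
            name = name.substring(0, name.length() - ".git".length());
        }
        return new String[] {uri.getHost(), segments[0], name};
    }
}
```

From there we would still have to format those pieces into the preferred clone uri before we could query anything.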

A Better Way

By using the repository’s clone uri as a unique identifier for it, we’ve ended up with a translation layer that looks like this:

  1. detect format (https, ssh, none, user included, web ui)
  2. locate parser & parse input into hostname, owner, name
  3. downcase hostname, owner, name
  4. lookup preferred hostname by given hostname
  5. locate preferred formatter & format new clone uri
  6. store the results

That works, but suppose we replaced step 5 with looking up a host key by host name - something like gh for github.com and internal for scm.internal.buildlab.io. We get a lot of benefits from this:

  • We’re saving data the way we talk about the data - gh/sghill/ci-samples is able to uniquely identify the thing we’re talking about in conversation and in our system
  • We can construct uris trivially for any internal environment we have
  • It’s simple to find all the repositories at a host key
  • It’s simple to find all the repositories owned by someone
  • If we decided to switch schemes, hosts, ports or anything else for any reason, that’d be a very simple change
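
To make the host key idea concrete, here is a small sketch. The RepositoryKeys class, both lookup tables, and the staging hostname are assumptions for illustration. Only the gh and internal keys come from the text above:

```java
import java.util.Locale;
import java.util.Map;

final class RepositoryKeys {
    // Hostname -> host key. Old and new internal hostnames share one key.
    private static final Map<String, String> HOST_KEYS = Map.of(
            "github.com", "gh",
            "scm.internal.buildlab.io", "internal",
            "scm-internal.buildlab.io", "internal");

    // Hypothetical environments for the "internal" host key.
    private static final Map<String, String> INTERNAL_HOSTS = Map.of(
            "production", "scm.internal.buildlab.io",
            "staging", "scm.staging.buildlab.io");

    /** Builds a key such as gh/sghill/ci-samples from already-parsed parts. */
    static String key(String hostname, String owner, String name) {
        String hostKey = HOST_KEYS.get(hostname.toLowerCase(Locale.ROOT));
        if (hostKey == null) {
            throw new IllegalArgumentException("Unknown host: " + hostname);
        }
        return hostKey + "/" + owner.toLowerCase(Locale.ROOT)
                + "/" + name.toLowerCase(Locale.ROOT);
    }

    /** Formats an ssh clone uri for an internal repository in the given environment. */
    static String internalSshUri(String key, String environment) {
        String[] parts = key.split("/");
        String host = INTERNAL_HOSTS.get(environment);
        if (parts.length != 3 || !parts[0].equals("internal") || host == null) {
            throw new IllegalArgumentException("Cannot format " + key + " for " + environment);
        }
        return "ssh://git@" + host + "/" + parts[1] + "/" + parts[2] + ".git";
    }
}
```

Both internal hostnames collapse into internal/sghill/ci-samples, and pointing a load test at another environment is just a different lookup when formatting the uri.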

Written by @sghill, who works on build, automated change, and continuous integration systems.