GCube Git Migration

From Gcube Wiki

During the BlueBRIDGE project, we started an activity to migrate the gCube source code from Subversion (SVN) to Git. We analysed the requirements and designed a process and tools to govern the migration. However, due to resource constraints, the actual migration has not been implemented.

This document collects the output of all the work done and is intended as a helpful starting point for the future implementation of the migration of the gCube source code from SVN to Git.


The migration will affect the tools and the procedures currently used in the gCube development, integration and release. We started our analysis by considering some basic requirements that the final solution should address.

Gradual implementation

The migration should be designed so that gCube components can be migrated gradually, allowing the normal development, integration and release activities to continue even when some components have already been migrated and others have not. In this way, no major delays will be introduced in the gCube release schedule. To address this requirement, the procedures and the tools used in the integration of gCube must support both SVN and Git repositories and allow seamless integration between components stored in SVN and components stored in Git.

History of files

The migration should keep the history of the source code. This means that the new Git repository should contain not only the latest version of each source file, but also every individual commit to each source file together with, where possible, the associated metadata (date, author, commit message).

Branching schema

In the current SVN repository, each component has multiple branches: a) the development branch (i.e. trunk), b) one release branch for each release and, optionally, c) one or more private branches (or feature branches). This allows developers to organize their code better and to maintain multiple versions of their component in parallel. The same functionality should be preserved when migrating to Git.

GitHub publication

Currently the source code of each gCube release (composed of the source code of all the components that constitute the release) is replicated in a repository on GitHub. The current solution takes the source code to publish on GitHub from the source packages built by Maven during the builds and published on Nexus. This was the only viable solution to publish the source code from SVN, but it has the disadvantage that the history and the authors of the source files are lost when the files are copied to GitHub. After the migration, since the gCube source code repositories and GitHub will use the same technology (Git), it should also be possible to publish the authors and the history of each source file on GitHub.

Git repositories management software

The software that will be used to manage the Git repositories should guarantee at least the following functionalities:

  • integration with the research infrastructures LDAP: this will maintain the developers identity in the new Git repositories;
  • REST interface to create new repositories: this will increase the automation of the migration operation;
  • management of roles: the source code should be readable to everybody without authentication, but only authorized users should be able to make changes;
  • web interface for browsing the source code and the commits.

Additionally, the tool should provide collaboration functionalities such as repository forks, pull requests and code review. This would greatly improve the collaboration between gCube developers and external community developers.

Git Repository Layout

Currently the gCube source code is maintained in a single SVN repository available at http://svn.research-infrastructures.eu/public/d4science/gcube/. The repository has the following layout:

  • trunk for the development version of components
  • branches for the release branches of components
  • private for feature branches of components
  • tags to keep a copy of gCube source code for each gCube release

More information on the current SVN layout can be found in the gCube wiki.

When switching to Git, the single-repository layout will be abandoned in favor of a per-component repository layout. Each component will have its own Git repository that will contain all the branches for that component. This approach is not only more in line with the Git philosophy but, most importantly, more practical.

In fact, while SVN allows checking out and committing single files or folders, Git (by design) does not provide this functionality. This means that each operation (e.g. checking out the source code to execute a build) involves the entire repository. Given the size of the gCube system (hundreds of components and more than 1 GB of source code), it is not practical to keep all the source code in a single repository. In addition, permissions on Git repositories can be assigned only at repository level, so a per-component repository layout is useful if we decide to assign permissions on a per-component basis.

The following table maps the current SVN layout to the newly proposed Git layout, also considering the workflow that we want to adopt (see Updates to Release Procedure).

Type | SVN | Git
development branch | all development is located under the /trunk directory, organized per component (e.g. /trunk/common/common-client) | the development branch
release branch | the releases are all located under the /branches directory, organized per component and per version (e.g. /branches/common/common-client/1.1) | a branch named after the version (e.g. 1.1)
feature branch | users place their feature branches in the /private directory, organized per component (e.g. /private/common/common-client/my-feature-1) | a branch named after the feature (e.g. my-feature-1)
tags | tags are copies of the source code under the /tags directory, organized per gCube release, component and version (e.g. /tags/gcube-4.5.0/common/common-client/1.1) | a tag on the most recent commit in the release branch, named with the full version of the release (e.g. 1.1.0-4.5.0)
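
The mapping in the table can be illustrated with a small shell function. This is only a sketch under stated assumptions: the function name svn_path_to_git_ref is hypothetical, the Git development branch is assumed to be called develop, and the component name is assumed to be the last directory before the version or feature name. For tags, the SVN path does not contain the patch number, so the function only reports the release branch to tag and the gCube release; the full tag name (e.g. 1.1.0-4.5.0) would have to be completed from other sources.

```shell
#!/bin/sh
# svn_path_to_git_ref SVN_PATH
# Prints "<component> branch <name>" for trunk, release and feature paths,
# and "<component> tag <release-branch> <gcube-release>" for tag paths.
svn_path_to_git_ref() {
    path=${1%/}                       # drop a trailing slash, if any
    case $path in
        /trunk/*)
            # "develop" is an assumed name for the Git development branch
            echo "${path##*/} branch develop" ;;
        /branches/*|/private/*)
            name=${path##*/}          # version or feature name
            parent=${path%/*}
            echo "${parent##*/} branch $name" ;;
        /tags/*)
            version=${path##*/}       # component version, e.g. 1.1
            parent=${path%/*}
            rest=${path#/tags/}
            release_dir=${rest%%/*}   # e.g. gcube-4.5.0
            echo "${parent##*/} tag $version ${release_dir#gcube-}" ;;
        *)
            return 1 ;;               # path outside the known layout
    esac
}
```

For example, svn_path_to_git_ref /branches/common/common-client/1.1 reports that the content belongs in branch 1.1 of the common-client repository.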

Import of source code from SVN

A strong requirement of the migration to Git is to maintain the history of the source code for each component. To achieve this, it is not sufficient to simply create an empty Git repository and copy the latest version of files from SVN.

Since this is a very common requirement when migrating from SVN to Git, a command-line tool that addresses this problem is distributed directly with Git. This tool was the basis for the import tests described below. A few alternatives are reported at the end of the section.

git-svn is a tool that, given an SVN repository URL, is able to discover all the files and all the commits in the repository and replicate them in a new Git repository. It is very complete and highly customizable. It supports different repository layouts, can distinguish between branches and tags, and tries to follow the history of each file even if the file has been copied or moved to a different location at some point in the repository history. It is also able to keep user identities (by means of a file that maps SVN users to Git users) and can ignore specific directories and/or revisions. It is distributed under the open source GPLv2 license.

The tests we executed highlighted that, given the size of the gCube SVN repository (in terms of both number of files and number of revisions), the import can take several hours even for a single component, because the tool tries to scan the entire repository looking for branches and tags of that component. However, using some options, it is possible to limit the scan to the folders where we know the source files are and to the revisions in which we know the source files exist. This drastically reduces the import time (from hours to minutes). It has to be considered, however, that the import time depends heavily on the "age" of the component and on the number of existing commits and releases. Old components, for which more revisions need to be scanned, will be slower to import than new components (with fewer revisions to scan).

The following command imports the code of the home-library component (trunk and branches) from revision 82000 to the current one:

git svn clone http://svn.research-infrastructures.eu/public/d4science/gcube/trunk/Common/home-library/ \
    --trunk http://svn.research-infrastructures.eu/public/d4science/gcube/trunk/Common/home-library/ \
    --branches http://svn.research-infrastructures.eu/public/d4science/gcube/branches/common/home-library/ \
    --authors-file=users.txt --no-metadata --no-minimize-url \
    --revision 82000:HEAD

This test took about 10 minutes (tried in July 2017).

This command can be used (with the correct parameters) to import all gCube components. To make it easier for users, it can be wrapped in a higher-level script that is able to determine (or, at least, to guess) the values of almost all the command parameters:

  • the minimum and maximum revisions can be found by running svn log and extracting the revisions of the first and the latest commits;
  • the branches and tags directories can be guessed from the trunk directory using the gCube naming conventions;
  • the authors-file can be generated upfront (see here) and it is the same for all the components.

In addition, the wrapper script can create and publish the new Git repository using the APIs of the repository management software.
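
A minimal sketch of the helper functions such a wrapper script could contain is given below. It is not an existing gCube tool: all function names are hypothetical, the e-mail domain passed to authors_file is a placeholder, and the rule that turns a trunk URL into a branches URL (replacing /trunk/ with /branches/) is only a guess that must be checked against the real repository, since directory names can differ in case between trunk and branches. The functions read the output of svn log -q from standard input, and clone_command only prints the git svn command instead of running it.

```shell
#!/bin/sh
# Read `svn log -q` output on stdin and print "FIRST:LAST" revision numbers.
revision_range() {
    awk '/^r[0-9]+ \|/ {
             rev = substr($1, 2) + 0
             if (min == "" || rev < min) min = rev
             if (max == "" || rev > max) max = rev
         }
         END { if (min != "") print min ":" max }'
}

# Read `svn log -q` output on stdin and print a git-svn authors file.
# $1 is a placeholder e-mail domain; real names and addresses would
# have to come from another source (e.g. the LDAP).
authors_file() {
    domain=$1
    awk '/^r[0-9]+ \|/ { print $3 }' | sort -u |
        while read -r user; do
            printf '%s = %s <%s@%s>\n' "$user" "$user" "$user" "$domain"
        done
}

# Guess the branches URL from the trunk URL (assumed naming convention).
guess_branches_url() {
    printf '%s\n' "$1" | sed 's#/trunk/#/branches/#'
}

# Print (without executing) the import command for a trunk URL and a
# revision range such as 82000:HEAD.
clone_command() {
    printf 'git svn clone %s --trunk %s --branches %s --authors-file=users.txt --no-metadata --no-minimize-url --revision %s\n' \
        "$1" "$1" "$(guess_branches_url "$1")" "$2"
}
```

A wrapper could then combine them along these lines: compute the range with svn log -q TRUNK_URL | revision_range, pass it to clone_command, review the printed command and only then execute it.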

Alternative tools

git-svn-migrate is just a wrapper script around git-svn and does not add any functionality useful for us.

SubGit is a Java-based importer that promises to be faster than git-svn. It is not an open source product, but the import functionality can be used for free (while the mirror functionality has to be purchased). However, the small-scale tests we executed on a few components showed no relevant improvement over the git-svn import times.

Integration with release tools

Git is very popular and is already supported by all the development tools used by gCube developers (e.g. Eclipse, IntelliJ IDEA) on all platforms. Its usage is also very well documented, and several tutorials and instructions on how to use Git can be found online.

Support for Git has also been added and tested in ETICS. The commands to check out the code from Git are quite straightforward. Supposing that the URL of the repository is kept in the vcsroot property and the branch name in tag, the command to fetch the latest version of the code can be:

git clone --depth 1 ${vcsroot} --branch ${tag} ${moduleName}

while for checking out a specific commit:

git clone ${vcsroot} --branch ${tag} ${moduleName} && cd ${moduleName} && git reset 104491f60c14f09124806272b88a605d5a324735 --hard

However, ETICS only partially supports Git in the module for the automatic synchronization of the model. This support should be completed to fully exploit the automation that ETICS offers.

The only other tool that depends on SVN is the distribution script used to create a tag on SVN with all the software of each new gCube release. However, this tool will be abandoned because, once the code is on Git, there will be no need to create a tag on SVN: this functionality will be fully replaced by the publication of the release on GitHub.

Updates to Release Procedure

gCube follows a strict release procedure that also defines how developers should use SVN to develop and release their software (e.g. creation and naming of release branches). This procedure will, of course, be updated to match the (slightly) different Git functionalities.

Of particular interest in this context are the different workflows that developers use with Git. Several of them are documented: Centralized Workflow, Integration-Manager Workflow, Dictator and Lieutenants Workflow, Feature Branch Workflow, GitFlow, GitHub Flow and others.

From a first analysis comparing these workflows with the one currently used (with SVN) by gCube, GitFlow is the closest one and is therefore the one most likely to match the gCube requirements.

The essential points of this workflow (in comparison with our current workflow) are:

  • there is a development branch where the new developments are done, like in our current workflow;
  • a new branch is created for each new release, like in our current workflow;
  • integration issues are solved in the release branch, like in our current workflow;
  • the release branch is deleted after the release is closed, differently from our current workflow where the release branch remains and is used for patches and future releases;
  • when a patch is required, it is done in a new branch, differently from our current workflow where the patches are done in the release branch. However, in both cases, the patches are merged back into the development branch;
  • a tag is created after each release/patch before deleting the branch. This is equivalent to (has the same objective as) the tags created at the end of each release in our current workflow.

The diagram below describes the main steps of GitFlow.


We are not obliged to adopt GitFlow in toto. We will probably build our own workflow, based on GitFlow with some minor modifications such as:

  • release branches are not deleted: they are used to “generate” different component releases for each gCube release (e.g. 1.0.0-3.10.0, 1.0.0-3.11.0, …) and for patch releases;
  • before tagging a release a new commit with “resolved” pom.xml and distro files is done. This commit is not merged back to develop;
  • patches (hotfix) are branched from the release branch;
  • all components tags are merged into the gCube repository (this would realize the GitHub publication we do with ad-hoc scripts now).
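
The first three modifications can be sketched with plain Git commands wrapped in shell functions, to be run inside the repository of a component. This is an illustration of the idea, not the final gCube procedure: the function names are hypothetical, the development branch is assumed to be called develop, and the step that actually "resolves" pom.xml and the distro files is left to the caller because it depends on the build tools.

```shell
#!/bin/sh
# start_release_branch VERSION
# Switch to the release branch (e.g. 1.1), creating it from develop if
# it does not exist yet. The branch is kept for later gCube releases.
start_release_branch() {
    if git show-ref --verify --quiet "refs/heads/$1"; then
        git checkout --quiet "$1"
    else
        git checkout --quiet -b "$1" develop
    fi
}

# tag_release VERSION FULL_VERSION
# Commit the already "resolved" files on the release branch, tag that
# commit with the full version (e.g. 1.1.0-4.5.0) and go back to develop.
# The resolved commit is deliberately not merged into develop.
tag_release() {
    git checkout --quiet "$1" &&
    git commit --quiet -a -m "Resolve pom.xml and distro files for $2" &&
    git tag -a "$2" -m "Release $2" &&
    git checkout --quiet develop
}

# start_hotfix VERSION NAME
# Create a patch (hotfix) branch starting from the release branch.
start_hotfix() {
    git checkout --quiet -b "$2" "$1"
}
```

A release would then consist of start_release_branch 1.1, the resolution of pom.xml and the distro files in the working tree, and tag_release 1.1 1.1.0-4.5.0. The next gCube release of the same component version would reuse branch 1.1 and produce a new tag such as 1.1.0-4.6.0.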

A more in-depth analysis will be done before defining the final workflow that gCube will adopt.

Migration Strategy

The migration of the entire gCube code base from SVN to Git is a very big and time-consuming task, given the size of gCube. Migrating the entire gCube system in a single shot would stop the development of the system, since all developers would be busy migrating their own components (plus the orphan ones).

A more sensible strategy is to split the migration into phases, distributing the effort of migrating the components over a long period (months). A possible four-phase approach could be:

  1. Initially, the migration will be done on a voluntary basis. If a developer wants to migrate her component, she can do it following the new procedures and using the new tools. This group of early adopters will also be useful to collect feedback on the tools and procedures before applying them on a large scale;
  2. Then, developers will be required to use Git for new components only, while for new releases of old components they can continue to use SVN;
  3. Then, developers will be required to use Git for any component they release, including old ones;
  4. Finally, developers will be required to migrate to Git any remaining component in the gCube release (even if there is no need to release it).

Of course, the migration can start only after all the tools and procedures are ready to manage Git-based components. In addition, all the old (SVN-based) tools and procedures will remain in place and co-exist with the new ones until all the gCube components have been migrated.