Rewriting Git History with Signed Commits
by Dan Manges

When we prepared the open source releases of ABQ and Captain, we decided to rewrite our git history.

We didn’t want to make the existing issues and pull requests public, because the discussions and notes in them were never intended to be public. However, many commit messages referenced pull requests, and we didn’t want those references to point at nothing. For example, this commit references pull request #111, which wouldn’t exist in our new repository.

commit f28181c271ff82f65a6fbad370187f7cd6852faa
Author: Ayaz <[email protected]>
Date:   Thu Sep 8 09:19:01 2022 -0500

    Adds TLS support (#111)

Additionally, we had been using a squash and merge strategy on our repository. At one point, the default for that strategy was to include the pull request description in the commit message, which made some of our commit messages verbose and poorly formatted.

We decided we’d rewrite the commits to look like this:

commit a1fd708249a7c72baf16ae70784804124aedaf07
Author: Ayaz <[email protected]>
Date:   Thu Sep 8 09:19:01 2022 -0500

    Adds TLS support

    Original PR: 111
    Original Commit: f28181c271ff82f65a6fbad370187f7cd6852faa

The script to rewrite the commit history was fairly straightforward, but our team members have commit signature verification enabled, which posed a challenge. A commit only shows as verified when it is signed with its author’s own key, so each person on our team needed to rewrite their own commits. We ended up writing a multiplayer git history rewrite script that produces signed commits. Each person ran the script like this:

# dan
$ ruby rewrite_git_history.rb abq [email protected]

# tommy
$ ruby rewrite_git_history.rb abq [email protected]

We all ran the script at the same time during standup one day, and a few minutes later, we had a completely new git history including signed commits!

If you want to skip to the completed code, here’s the full script as a gist.

High Level Process

The high level process we followed was to:

  • iterate over the original commit history in reverse order (oldest commit first)
  • build a new commit message by parsing and reformatting the original commit message
  • use the git tree from the original commit with git commit-tree to make a new commit
    • the git tree specifies the file contents of the commit – using it ensured the contents of the commits were identical to the original commits
  • maintain a mapping of old commit hashes to new commit hashes
    • to make sure the script could resume, we also wrote some code to parse the mapping from any commits that had already been rewritten
  • use the mapping of commit hashes to map the parent commits for each commit
    • with the combination of consistent git trees and consistent commit parents, our commit history and graph would remain identical other than the new commit messages and signatures
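The parent-mapping step can be sketched without touching a real repository. This is a minimal illustration, not the script itself: the hashes are made-up placeholders, and fake_commit is a hypothetical stand-in for git commit-tree.

```ruby
# Toy history, oldest first: each entry is [old_hash, parent_hashes].
# The hashes are made-up placeholders, not real commits.
HISTORY = [
  ["a1", []],
  ["b2", ["a1"]],
  ["c3", ["a1"]],
  ["d4", ["b2", "c3"]] # a merge commit with two parents
].freeze

# Hypothetical stand-in for `git commit-tree`: derives a fake new hash
# from the old hash and the already-rewritten parents.
def fake_commit(old_hash, new_parents)
  "new-#{old_hash}(#{new_parents.join(',')})"
end

def rewrite(history)
  mapping = {}
  history.each do |old_hash, old_parents|
    # fetch raises KeyError if a parent has not been rewritten yet,
    # which would mean the commits were not processed parents-first
    new_parents = old_parents.map { |parent| mapping.fetch(parent) }
    mapping[old_hash] = fake_commit(old_hash, new_parents)
  end
  mapping
end

MAPPING = rewrite(HISTORY)
```

Because every new commit is built only from parents that are already in the mapping, the shape of the graph carries over unchanged, including merge commits.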

To let all of us run the script at the same time, the script:

  • took the email addresses associated with each person’s commits as arguments
    • one of us added the bot email addresses to our input
  • made a new commit using git commit-tree with the -S option, producing a signed commit, if the next commit to be rewritten belonged to the person running it
  • otherwise waited for somebody else to rewrite the commit, periodically running git fetch to see whether the new commit had been pushed yet
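The turn-taking can be illustrated with a small sketch. The commit hashes and email addresses below are hypothetical placeholders, and plan is an illustrative helper rather than part of our script.

```ruby
# Hypothetical history, oldest first: [old_hash, author_email].
COMMITS = [
  ["a1", "dan@example.com"],
  ["b2", "tommy@example.com"],
  ["c3", "dan@example.com"]
].freeze

# Returns the action each commit requires from the person whose
# email addresses are given in my_emails.
def plan(commits, my_emails)
  commits.map do |hash, author|
    action = my_emails.include?(author) ? :rewrite_and_push : :wait_and_fetch
    [hash, action]
  end
end

DAN_PLAN = plan(COMMITS, ["dan@example.com"])
TOMMY_PLAN = plan(COMMITS, ["tommy@example.com"])
```

As long as every author email is claimed by exactly one participant, exactly one person rewrites each commit while everyone else waits for it to appear on the remote.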

Here’s some commentary on the script implementation details.

Args and Global State

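require "securerandom"
require "set"
require "tmpdir"

# assumption: these excerpts never define TMP_DIR (used when writing commit
# message files), so a fresh temporary directory stands in for it here
TMP_DIR = Dir.mktmpdir

# pass in the repository as the first argument
REPO = ARGV[0] || raise("need repo")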

# pass in a list of email addresses for
# commits this person should rewrite
EMAILS = ARGV[1...]

# we had to map some old email addresses to new email addresses
# to ensure the signing keys matched the email address on the commit
EMAIL_MAP = {
  "[email protected]" => "[email protected]"
}

# maintain a mapping of old->new commits
MAPPED_COMMITS = {}

# a set of new commits
NEW_COMMITS = Set.new

The Main Loop

Dir.chdir("../#{REPO}") do
  # make sure the repository is up to date!
  `git fetch`
  raise "fetch failed" unless $?.success?

  # get a list of commit hashes (%H) with oldest first
  commits = `git log --reverse --pretty='%H' origin/main`.split("\n")
  raise "git log command failed" unless $?.success?

  # check to see if the main-oss branch has been created yet
  # we have to push the first commit using a different command syntax
  `git branch -r | grep main-oss`
  first_commit = !$?.success?

  # import commits that have already been rewritten
  # so that this script can be re-executed and resume where it left off
  unless first_commit
    import_new_commits
  end

  # the main loop!
  commits.each do |commit|
    # if this commit has already been rewritten, move on
    if MAPPED_COMMITS.include?(commit)
      first_commit = false
      next
    end

    # get the author of the next commit to rewrite
    author = `git show --no-patch --pretty=%ae #{commit}`.strip

    # check if the next commit is for
    # the person who ran the script
    if EMAILS.include?(author)
      # rewrite the commit
      new_commit = import_commit(commit)

      # we have to push differently for the first commit
      if first_commit
        `git push origin main:main-oss`
        raise "push failed" unless $?.success?

        `git push -f origin #{new_commit}:main-oss`
        raise "push failed" unless $?.success?
      else
        `git push origin #{new_commit}:main-oss`
        raise "push failed" unless $?.success?
      end
    # waiting for somebody else to rewrite this commit
    else
      puts "Waiting for #{author} to rewrite commit #{commit}..."
      loop do
        sleep 3
        puts "Fetching..."
        # check to see if the main-oss branch has been updated
        `git fetch 2>&1 | grep main-oss`
        break if $?.success?
      end
      import_new_commits
      # make sure the commit we were waiting on got imported!
      unless MAPPED_COMMITS.include?(commit)
        raise "expected #{commit} to be imported"
      end
    end
    first_commit = false
  end
end

Committing

# %H = commit hash
# %h = abbreviated commit hash
# %T = tree hash
# %P = parent hashes
# %an = author name
# %ae = author email
# %aD = author date, rfc2822 style
# %cn = committer name
# %ce = committer email
# %cD = committer date, rfc2822 style
# %s = subject
# %b = body
def import_commit(original_commit_hash)
  #            0  1  2  3   4   5   6   7   8   9 10 11 12
  format = %w[%H %h %T %P %an %ae %aD %cn %ce %cD %s %H %b]
  # fetch these attributes for the given commit
  # separating the attributes with newlines
  results = `git show --no-patch --pretty=#{format.join("%n")} #{original_commit_hash}`.split("\n")
  raise "failed to show #{original_commit_hash.inspect}" unless $?.success?

  # if results[11] isn't the expected original commit, something went wrong
  unless results[11] == original_commit_hash
    raise "format error: #{results}"
  end

  # take the parents for this commit, and
  # determine the new commit hashes
  mapped_parents = results[3].split(" ").map do |commit|
    MAPPED_COMMITS.fetch(commit) # raises KeyError if the parent is not mapped yet
  end

  # passing the new commit message as a file makes shell syntax easier
  commit_message_file = build_commit_message(results[10], results[0], results[12...].join("\n"))

  # map the author email to a new email if necessary
  email = EMAIL_MAP[results[5]] || results[5]

  command = [
    "env",
    "GIT_AUTHOR_NAME='#{results[4]}'",
    "GIT_AUTHOR_EMAIL='#{email}'",
    "GIT_AUTHOR_DATE='#{results[6]}'",
    "GIT_COMMITTER_NAME='#{results[4]}'",
    "GIT_COMMITTER_EMAIL='#{email}'",
    "GIT_COMMITTER_DATE='#{results[9]}'",
    "git commit-tree",
    mapped_parents.map { |parent| "-p #{parent}" },
    # don't sign bot commits
    # we had one person on one team add the bot emails to their ARGV
    (results[5].include?("[bot]") ? "" : "-S"),
    "-F #{commit_message_file}",
    # results[2] is the original commit tree, which stays the same
    results[2]
  ].flatten.join(" ")
  STDERR.puts "Rewriting #{original_commit_hash}"
  puts command
  new_commit = `#{command}`.strip
  puts new_commit
  raise "commit failed" unless $?.success?
  MAPPED_COMMITS[original_commit_hash] = new_commit
  NEW_COMMITS << new_commit
  new_commit
end
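The command above is assembled as a shell string, so a value containing a single quote (an author name like O'Brien, say) would break the quoting. One possible alternative, sketched here under our own assumptions and not taken from the script, is to pass the environment as a hash and the arguments as an array so no shell is involved. commit_tree_invocation is a hypothetical helper; the -p, -S and -F options are the same ones used above.

```ruby
require "open3"

# Build the environment and argument list for `git commit-tree` without
# going through a shell, so values such as "O'Brien" need no quoting.
def commit_tree_invocation(author_name:, email:, author_date:, committer_date:,
                           parents:, tree:, message_file:, sign: true)
  env = {
    "GIT_AUTHOR_NAME" => author_name,
    "GIT_AUTHOR_EMAIL" => email,
    "GIT_AUTHOR_DATE" => author_date,
    "GIT_COMMITTER_NAME" => author_name,
    "GIT_COMMITTER_EMAIL" => email,
    "GIT_COMMITTER_DATE" => committer_date
  }
  argv = ["git", "commit-tree"]
  parents.each { |parent| argv.push("-p", parent) }
  argv << "-S" if sign
  argv.push("-F", message_file, tree)
  [env, argv]
end

# Running it inside a repository would look like:
#   env, argv = commit_tree_invocation(...)
#   new_commit, status = Open3.capture2(env, *argv)
#   raise "commit failed" unless status.success?
```

Separating the construction of the invocation from its execution also makes the argument order easy to check without creating any commits.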

Building New Commit Messages

def build_commit_message(subject, original_commit, body)
  # strip the PR number from the commit message
  # this is the default format in "squash and merge" commits
  if subject =~ /^(.+) \(#(\d+)\)$/
    sanitized_subject = $1
    original_pr = $2
  # we had a few normal merge commits too
  elsif subject =~ /^Merge pull request #(\d+) (.+)$/
    original_pr = $1
    sanitized_subject = "Merge pull request #{$2}"
  else
    original_pr = nil
    sanitized_subject = subject
  end
  # rewrite any remaining "#123" references so they don't link to anything
  sanitized_subject = sanitized_subject.gsub(/#(\d+)/) { "PR #{$1}" }

  # maintain credit to co-authors!
  co_authors = body.split("\n").select { |line| line.start_with?("Co-authored-by") }

  result = []
  result << "#{sanitized_subject}\n"
  result << "\n"
  result << "Original PR: #{original_pr}\n" if original_pr
  result << "Original Commit: #{original_commit}\n"
  if co_authors.any?
    result << "\n"
    # one trailer per line, otherwise multiple co-authors run together
    co_authors.each { |co| result << "#{co}\n" }
  end

  "#{TMP_DIR}/#{SecureRandom.uuid}".tap do |file|
    File.open(file, "w") { |f| f.write result.join }
  end
end
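The subject parsing is easy to try in isolation. Here is a minimal sketch, assuming the squash and merge subject format Title (#123) and the standard merge commit subject; split_subject is a hypothetical helper that covers only the splitting step, not the later rewriting of remaining #123 references.

```ruby
# Split a commit subject into [sanitized_subject, original_pr].
# original_pr is nil when the subject carries no pull request number.
def split_subject(subject)
  if (match = subject.match(/\A(.+) \(#(\d+)\)\z/))
    # squash and merge format: "Title (#123)"
    [match[1], match[2]]
  elsif (match = subject.match(/\AMerge pull request #(\d+) (.+)\z/))
    # regular merge commit: "Merge pull request #123 from owner/branch"
    ["Merge pull request #{match[2]}", match[1]]
  else
    [subject, nil]
  end
end

SQUASHED = split_subject("Adds TLS support (#111)")
```

For the commit shown at the top of this post, the subject splits into "Adds TLS support" and the PR number "111", which is what ends up on the Original PR line.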

Importing Existing Mapped Commits

def import_new_commits
  new_commits = `git log --reverse --pretty='%H' origin/main-oss`.split("\n")
  raise "git log command failed" unless $?.success?

  new_commits.each do |new_commit|
    next if NEW_COMMITS.include?(new_commit)

    # because we maintain a reference to the original commit
    # in the new commit message, we can parse the git log for
    # the new branch to fetch the existing mapping
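    # to_s guards against a nil when grep finds no "Original Commit" line,
    # so the raise below reports the problem instead of a NoMethodError
    original_commit = `git show --no-patch --pretty=%b #{new_commit} | grep 'Original Commit'`.split(" ")[2].to_s.strip
    raise "failed to get original commit" if !$?.success? || original_commit.empty?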
    next if MAPPED_COMMITS.include?(original_commit)

    puts "#{original_commit} rewritten to #{new_commit}"
    MAPPED_COMMITS[original_commit] = new_commit
    NEW_COMMITS << new_commit
  end
end
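The lookup of the original hash can also be done in plain Ruby rather than with grep. The following is a sketch under that assumption, not the script's own code: original_commit_from is a hypothetical helper that expects the message body as a string, in the format produced by build_commit_message.

```ruby
# Extract the original commit hash from a rewritten commit's message body.
# Returns nil when no "Original Commit:" line is present.
def original_commit_from(body)
  prefix = "Original Commit: "
  line = body.each_line.find { |candidate| candidate.start_with?(prefix) }
  return nil unless line

  hash = line.delete_prefix(prefix).strip
  hash.empty? ? nil : hash
end

BODY = "Original PR: 111\nOriginal Commit: f28181c271ff82f65a6fbad370187f7cd6852faa\n"
FOUND = original_commit_from(BODY)
```

Returning nil for a missing line lets the caller decide whether that is an error, for example for a commit on the new branch that the script did not create.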

Full Script

Here’s the full script as a gist.

Alternative Approaches

We could also have handled the missing pull requests by creating placeholder issues on the new repository, each noting that the repository had been migrated and that, for example, #111 refers to a historical, private pull request. Since we also wanted to clean up our commit messages, we decided to rewrite the history rather than create those placeholders.

Connect with our team

We spend most of our time at RWX solving problems related to builds and tests. We publish open source tools for Captain and ABQ and are happy to chat anytime. Say hello on Discord or reach out at [email protected]