Has anyone made a High Availability setup for TeamCity?

Basically, I'm thinking of this:

  • Provision at least 2 hosts from a working TC master image, both configured in a "deactivated" state (what's the best way to do that?).
  • Run a daemon using the following pseudocode:
      loop {
        rsync();
        if (master()) {
            activate();
        } else {
            deactivate();
        }
        sleep();
      }
  • The idea is to replicate the data store via rsync, use DNS to test whether this host is the active master or a slave, and reconfigure TeamCity accordingly.
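
The daemon above could be sketched in shell roughly as follows. Everything here is a placeholder assumption: the DNS alias `tc-master.example.com`, the standby host name, the data directory path, and using the bundled `bin/teamcity-server.sh` wrapper to activate/deactivate the server.

```shell
#!/bin/sh
# Failover daemon sketch. Hostnames, paths, and the service wrapper are
# placeholder assumptions, not a tested recipe.
MASTER_NAME="tc-master.example.com"    # DNS alias pointing at the active node
STANDBY_HOST="tc-standby.example.com"  # where the active node pushes data
DATA_DIR="/var/lib/teamcity-data"
TEAMCITY_HOME="/opt/TeamCity"
SLEEP_SECS=300

# True when this host's primary IP matches what the master DNS alias resolves to.
is_master() {
    master_ip=$(getent hosts "$MASTER_NAME" | awk '{print $1}')
    my_ip=$(hostname -I | awk '{print $1}')
    [ -n "$master_ip" ] && [ "$master_ip" = "$my_ip" ]
}

activate()   { "$TEAMCITY_HOME/bin/teamcity-server.sh" start; }
deactivate() { "$TEAMCITY_HOME/bin/teamcity-server.sh" stop;  }

failover_loop() {
    while true; do
        if is_master; then
            # The active node pushes its data directory to the standby.
            rsync -a --delete "$DATA_DIR/" "$STANDBY_HOST:$DATA_DIR/"
            activate
        else
            deactivate
        fi
        sleep "$SLEEP_SECS"
    done
}

# Only enter the loop when invoked as "failover.sh start", so the file can
# be sourced for inspection without blocking.
case "${1:-}" in start) failover_loop ;; esac
```

A real daemon would also want to remember its current state (so it doesn't call start/stop on every iteration) and to handle the case where the DNS lookup fails entirely.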


Has anyone implemented such a scheme? Is it even sound?

2 comments

I understand a passive node to be an instance where the TeamCity process is not running, and specifically one that does not modify the data directory or the SQL database.

rsync would work, but it doesn't guarantee 100% data safety. Consider this sequence:

  • nodes are synchronized
  • sleep interval is started
  • active node modifies data on disk
  • disk on active node becomes corrupted

In that case the latest changes are lost, because they were not copied to the passive node before the failover.
Usually this is solved by putting the data directory on RAID storage shared between the cluster nodes.

Also, is your SQL DB clustered?


I prefer rsync over a shared cluster or mirroring for two reasons:

  • If something corrupts data, I have a shot at fixing it if I catch it prior to the next rsync. The rsync is effectively a backup.
  • I can use the "upgrade is a failover" model safely, with a reversion path. If the new version of TeamCity makes incompatible changes to the data store and for some reason I need to roll back, I can simply switch back to the old service and let rsync fix the "corruption" caused by the upgrade.


The downside is indeed that in case of a failure, there might be data loss for any transaction that happened between the last rsync and the failure. I'm not sure how you would guard against that even on shared storage...

So I think we're going to try a solution that uses a combination of LVM snapshots and rsync: quiesce TeamCity, take an LVM snapshot, restart TeamCity, rsync the snapshot to the standby, remove the snapshot, repeat.
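
That cycle might look something like the sketch below. The volume group, logical volume, mount point, and standby host are all hypothetical placeholders, and it assumes the data directory lives on its own logical volume and that the bundled `bin/teamcity-server.sh` wrapper is used to stop and start the server.

```shell
#!/bin/sh
# Snapshot-based replication sketch; every name here is a placeholder.
VG=vg0; LV=teamcity; SNAP=tc-snap
DATA_DIR=/var/lib/teamcity-data
SNAP_MNT=/mnt/tc-snap
STANDBY=standby.example.com
TEAMCITY_HOME=/opt/TeamCity

snapshot_and_sync() {
    # 1. Quiesce: stop the server so the data directory is consistent on disk.
    "$TEAMCITY_HOME/bin/teamcity-server.sh" stop &&
    # 2. Snapshot while quiesced; lvcreate -s is fast, so downtime stays short.
    lvcreate --snapshot --size 5G --name "$SNAP" "/dev/$VG/$LV" &&
    # 3. Restart immediately; the snapshot preserves the quiesced state.
    "$TEAMCITY_HOME/bin/teamcity-server.sh" start &&
    # 4. Mount the snapshot and rsync it to the standby at leisure.
    mount "/dev/$VG/$SNAP" "$SNAP_MNT" &&
    rsync -a --delete "$SNAP_MNT/" "$STANDBY:$DATA_DIR/" &&
    # 5. Clean up so the snapshot's copy-on-write space is released.
    umount "$SNAP_MNT" &&
    lvremove -f "/dev/$VG/$SNAP"
}
```

Note that stopping the server is the crude form of quiescing; the open question below about a script/API-level quiesce is exactly about avoiding step 1.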

So two questions:

  • Is there a way to quiesce TeamCity via a script or API call without having to shut it down completely?
  • Is the minimal set of data to be rsync'ed documented someplace? (Most of what I've seen seems to imply I should simply copy the whole data directory; I'd rather not have to replicate the Mercurial clones in the cache.)
