TeamCity Disaster Recovery

Created March 22, 2019 13:51

We're implementing a DR solution consisting of HA MySQL servers using Always On availability groups, along with an NFS file share to host our data folders. Our DR SQL server is currently configured for asynchronous-commit mode, so there's always a chance of lag. The NFS file share is not real-time, which is an issue we're addressing, so it's possible the secondary share could lag an hour behind the primary. My question is: what's the fault tolerance between the database and the file share? Would TeamCity experience catastrophic failures if there's an hour's difference between the database and the share? I'm working on the assumption that we'd see strange behavior, but I'd like to know specifically what to expect between the database and the data folder.

1 comment

Denis Lapuente

Created April 02, 2019 17:14

Hi Josh,

The data directory holds projects, build configurations and some caches. Issues that might arise include corrupted caches (if both TeamCity and the copy process attempt to write at the same time), deleted data (if changes are made to projects or configurations after the servers are hot-swapped but before the directory syncs up), and problems saving and loading artifacts (which are stored in that folder by default).

Our recommendation here would be to transition via the read-only mode: https://confluence.jetbrains.com/display/TCD18/Configuring+Secondary+Node

We are working on providing a write-enabled HA service, but as of now we provide HA via a secondary read-only node, which only lets you see the server's status, not make changes to it. This would allow builds to continue and users to review what they need; new actions simply will not be started. You could enable this node while the data directory sync happens, then bring back the regular node once the sync is finished.
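
As a rough illustration of how you might decide that "the sync is finished" before bringing the regular node back, here is a minimal Python sketch that compares the primary and secondary copies of the data directory. It is not part of TeamCity: the function name find_unsynced, the mtime_tolerance parameter and the size/modification-time comparison are all assumptions made for this example, and a real setup might rely on the sync tool's own reporting instead.

```python
import os
from pathlib import Path
from typing import List


def find_unsynced(primary: str, secondary: str, mtime_tolerance: float = 2.0) -> List[str]:
    """Return relative paths that are missing or stale on the secondary copy.

    Hypothetical helper, not a TeamCity API. A file counts as stale when its
    size differs or when the primary copy is newer than the secondary copy by
    more than mtime_tolerance seconds.
    """
    primary_root = Path(primary)
    secondary_root = Path(secondary)
    unsynced = []
    for dirpath, _dirnames, filenames in os.walk(primary_root):
        for name in filenames:
            src = Path(dirpath) / name
            rel = src.relative_to(primary_root)
            dst = secondary_root / rel
            if not dst.is_file():
                # The file has not reached the secondary share yet.
                unsynced.append(rel.as_posix())
                continue
            src_stat, dst_stat = src.stat(), dst.stat()
            size_differs = src_stat.st_size != dst_stat.st_size
            primary_newer = src_stat.st_mtime - dst_stat.st_mtime > mtime_tolerance
            if size_differs or primary_newer:
                unsynced.append(rel.as_posix())
    return sorted(unsynced)
```

Calling find_unsynced with the two mount points of the shares (whatever they are in your environment) returns an empty list once every file on the primary exists on the secondary with the same size and a sufficiently recent modification time. It only checks the primary-to-secondary direction and does not compare file contents, so treat it as a sanity check rather than proof of consistency.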
