We've encountered an odd error in the last couple of days affecting all of the agents in a particular pool. At some point on the 14th they all shut down and were essentially corrupted. When recovery failed, the agents were installed fresh, and they ran into the same issue a day later. Before this, the agents had been running fine in that configuration for around a year.
We are running TeamCity 8.1 (build 29879), and the affected pool has 8 agents. These agents run on two separate machines (4 on each). The agents and server are running on Windows 2008 R2 SP1. The agents are installed as services running under user accounts.
What we've found: On 14th April, all agents dropped off with what appears to be an upgrade, and they then refused to start. This is because all of the configuration information, logs, and several binaries (like the launchers) are now missing. When starting the agent using the launcher, it complains that wrapper.conf is missing. I recovered some of the files using Recuva and replaced them. After wrapper.conf was replaced, it moved on to a missing conf in JRL/Lib/i386. After this was replaced, it stalled on a missing file in the <root>/conf folder. Once all the config files were recovered and replaced, the agent did start up. It shows another issue in the logs with directory.map in the work folder, but moves past this. Some builds still fail on VCS checkout, which I can't resolve (I tried changing from SSH to username/password etc). The checkout works fine on the box when tested manually. This is the case for all of the agents.
As I couldn't recover the agents, I replaced all of them with new ones on the same machines. These worked until the next day, when the same thing happened. This time there is some log information which implies that an update was attempted. The log stops after these lines:
[2015-04-15 13:19:29,382] INFO - ent.impl.upgrade.AgentExitCode - Agent exited. Restart agent, buildAgent.properties file has been changed
[2015-04-15 13:19:29,397] INFO - buildServer.agent.AgentMain2$2 - Closing jetbrains.buildServer.agent.AgentMain2$2@ee336f: startup date [Tue Apr 14 23:09:18 BST 2015]; root of context hierarchy
Looking through the logs, some of the agents are showing what looks like odd behavior. We are seeing agents referencing temp and work folders that belong to agents on other machines (and that do not match the settings in the agent's own conf file). We've also seen repeated update attempts happening every half second or so.
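For anyone wanting to measure how tightly spaced those repeated log entries are, here is a rough sketch that reads the timestamps in the agent log format shown above and prints the gap between consecutive matching lines. The function name and the `needle` filter are made up for illustration; the exact message text that marks an update attempt isn't quoted in this post, so the default matches every timestamped line.

```python
import re
from datetime import datetime

# Matches the leading "[2015-04-15 13:19:29,382]" stamp seen in the agent log.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")

def gaps_between(lines, needle=""):
    """Return seconds between consecutive stamped lines containing `needle`.

    `needle` is a placeholder filter; an empty string keeps every stamped line.
    """
    stamps = []
    for line in lines:
        match = STAMP.match(line)
        if match and needle in line:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]

# The two log lines quoted above, used as sample input.
sample = [
    "[2015-04-15 13:19:29,382] INFO - ent.impl.upgrade.AgentExitCode - Agent exited. "
    "Restart agent, buildAgent.properties file has been changed",
    "[2015-04-15 13:19:29,397] INFO - buildServer.agent.AgentMain2$2 - Closing "
    "jetbrains.buildServer.agent.AgentMain2$2@ee336f",
]
print(gaps_between(sample))
```

Passing a distinctive substring from the real update message as `needle` would narrow the output to just those entries.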
As far as I'm aware, there have been no changes to the servers that TeamCity or the agents sit on. I did notice that some of the new agents incorrectly shared a port on the same machine. This wasn't the case with the original agents.
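The port clash could be checked for with something like the sketch below, which compares the `ownPort` value across the `buildAgent.properties` contents of each agent on one machine. The agent labels, sample values and function names are invented for illustration, and the parser only handles plain `key=value` lines. It could be pointed at `workDir` or `tempDir` as well, but relative paths would need resolving against each agent's install directory first, which this does not do.

```python
from collections import defaultdict

def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

def find_clashes(agents, keys=("ownPort",)):
    """Return {(key, value): [agent labels]} for values used by more than one agent.

    `agents` maps a label of your choosing to the text of that agent's
    buildAgent.properties file.
    """
    seen = defaultdict(list)
    for label, text in agents.items():
        props = parse_properties(text)
        for key in keys:
            if key in props:
                seen[(key, props[key])].append(label)
    return {pair: labels for pair, labels in seen.items() if len(labels) > 1}

# Made-up sample contents, not taken from the real agents.
sample_agents = {
    "agent1": "name=agent1\nownPort=9090\n",
    "agent2": "name=agent2\nownPort=9090\n",
    "agent3": "# third agent\nname=agent3\nownPort=9091\n",
}
print(find_clashes(sample_agents))
```

In practice the dictionary would be filled by reading each agent's conf file from disk rather than from inline strings.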
Googling etc hasn't shown me anything that is close to the problem. Anyone have any ideas?