TeamCity cancelling very long test runs and causing chaos
Hi all,
About a month ago we moved our TeamCity setup into AWS, and since then our test runs keep getting cancelled.
Our tests run all night, starting at midnight and ending around 9 AM, across several agents. Some tests change the system date/time, and some run for 30 minutes or more. When each test finishes, various clean-up has to be done to get the machine ready for the next test.
At around 5:00 AM the TeamCity server cancels one or more of the runs. No clean-up is done, so the machine is left in a crappy state, which then kills subsequent runs/builds.
Looking at the agent log, we can see that the agent loses its connection to the server (or is unable to connect), and at some point decides that the build agent is being shut down and that it needs to stop the build.
[2013-02-19 16:43:25,712] INFO - r.agent.impl.AgentLogProxyImpl - Failed to perform remote command 'pingAndReregister' for build with id 152077, error: java.lang.Exception: Unable to register on server, not pingable: java.lang.Exception: Unable to register on server, not pingable (enable debug to see stacktrace)
[2013-02-19 16:43:30,723] WARN - jetbrains.buildServer.AGENT - Ping problem: jetbrains.buildServer.xmlrpc.RemoteCallException: Call 'http://10.200.8.12/RPC2', method 'buildServer.ping' failed: java.net.NoRouteToHostException: No route to host: connect
[2013-02-19 16:43:30,723] INFO - r.agent.impl.AgentLogProxyImpl - Failed to perform remote command 'pingAndReregister' for build with id 152077, error: java.lang.Exception: Unable to register on server, not pingable: java.lang.Exception: Unable to register on server, not pingable (enable debug to see stacktrace)
We see lots of these, then eventually:
[2013-02-19 16:43:34,924] INFO - jetbrains.buildServer.AGENT - Received "stop force" command from launcher. Will stop currently running build and exit
[2013-02-19 16:43:34,925] INFO - nt.impl.BuildRunAgentStateImpl - Stopping build on agent. Reason: agent shutdown (Build agent shutdown)
[2013-02-19 16:43:34,925] INFO - uildStages.BuildStagesExecutor - There is no currently running stage to interrupt
[2013-02-19 16:43:34,925] INFO - .agent.impl.BuildRunActionImpl - Interrupting build finish stage jetbrains.buildServer.agent.impl.buildStages.finishStages.PublishArtifactsFStage
[2013-02-19 16:43:34,925] INFO - jetbrains.buildServer.AGENT - Starting agent shutdown sequence, reason: Stop command called
At this point the server presumably restarts the run while we still have components running. The date/time is not reset either, so the machine is left running at some random time.
NOTE: I've turned off hanging build detection
I was wondering if anyone else has seen anything similar or knows something useful that may help?
Many thanks.
Losing network connectivity should not cause the agent to stop. However, the `Received "stop force" command from launcher` message means that the agent process was stopped by an external process. This could be a user or an automated process terminating the agent, or a JVM crash.
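To see how the connectivity errors and the stop command line up in time, it can help to pull both kinds of events out of the agent log. The following is only a rough sketch in Python, not a TeamCity tool: it assumes the log lines look like the ones quoted above, the labels "network" and "stop" are made up for this example, and the log file path is whatever you pass on the command line.

```python
import re
import sys
from datetime import datetime

# Matches the "[2013-02-19 16:43:25,712] INFO - ..." prefix seen in the quoted log.
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]\s+\w+\s+-\s+(.*)$"
)


def classify(line):
    """Return (timestamp, kind) for a line of interest, or None.

    kind is "network" for ping/registration failures and "stop" for the
    launcher's stop command. These labels are this sketch's own.
    """
    match = LINE_RE.match(line)
    if not match:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    message = match.group(2)
    if 'Received "stop force" command from launcher' in message:
        return timestamp, "stop"
    if "Ping problem" in message or "not pingable" in message:
        return timestamp, "network"
    return None


def summarize(lines):
    """Count network failures and list the times of stop commands."""
    events = [event for event in map(classify, lines) if event is not None]
    network = [ts for ts, kind in events if kind == "network"]
    stops = [ts for ts, kind in events if kind == "stop"]
    return {
        "network_failures": len(network),
        "first_network_failure": min(network) if network else None,
        "last_network_failure": max(network) if network else None,
        "stop_commands": stops,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_agent_log.py <path to agent log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        print(summarize(log_file))
```

If the stop commands show up at times with no nearby network failures (or the other way round), that supports treating them as separate problems.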
So it looks like there are two issues: a networking issue and the build agent termination/crash. Please investigate the networking issue as you would any other networking problem. Regarding the termination/crash, check whether a user is stopping the agent process. If that's not the case, feel free to upload all agent logs, any hs_err_pid* files from the agent, and all teamcity-server.log* files from the server side to https://uploads.jetbrains.com/ and specify the upload ID. I will have a look and see if I can pinpoint the issue.
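For the networking side, one low-tech option is to leave a small probe running on the agent machine overnight and see when the server stops being reachable. The sketch below is only an illustration under assumptions: the host 10.200.8.12 is taken from the log above, port 80 is assumed because the URL in the log has no explicit port, and the function names are invented for this example. It only checks that a TCP connection can be opened, which is the step failing with `NoRouteToHostException` in the log.

```python
import socket
import time
from datetime import datetime


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and "no route to host".
        return False


def watch(host, port, interval=10.0, checks=None):
    """Print a timestamped line each time reachability changes.

    Runs forever when checks is None, otherwise performs that many checks.
    """
    previous = None
    done = 0
    while checks is None or done < checks:
        reachable = can_connect(host, port)
        if reachable != previous:
            state = "reachable" if reachable else "UNREACHABLE"
            now = datetime.now().isoformat(timespec="seconds")
            print(f"{now} {host}:{port} {state}")
            previous = reachable
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)


if __name__ == "__main__":
    # Server address from the agent log; port 80 is an assumption.
    watch("10.200.8.12", 80)
```

Comparing the timestamps it prints with the times of the cancelled builds should show whether the outages are regular (for example, a scheduled job around 5:00 AM) or random.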