TeamCity cancelling very long test runs and causing chaos

Hi all, 

About a month ago we moved our TeamCity setup into AWS, and since then our test runs keep getting cancelled.

Our tests run all night, starting at midnight and ending around 9AM, across several agents. Some tests change the system date/time, and some run for 30 minutes or more. When each test finishes, various clean-up steps have to run to get the machine ready for the next test.

At around 5:00AM the TeamCity server cancels one or more of the runs. No clean-up is done, so the machine is left in a bad state, which then kills subsequent runs/builds.

Looking at the agent log, we can see that the agent loses its connection to the server (or is unable to connect), and then at some point decides that the build agent is being shut down and that it needs to stop the build.

[2013-02-19 16:43:25,712]   INFO - r.agent.impl.AgentLogProxyImpl - Failed to perform remote command 'pingAndReregister' for build with id 152077, error: java.lang.Exception: Unable to register on server, not pingable: java.lang.Exception: Unable to register on server, not pingable (enable debug to see stacktrace)
[2013-02-19 16:43:30,723]   WARN -    jetbrains.buildServer.AGENT - Ping problem: jetbrains.buildServer.xmlrpc.RemoteCallException: Call 'http://10.200.8.12/RPC2', method 'buildServer.ping' failed: java.net.NoRouteToHostException: No route to host: connect
[2013-02-19 16:43:30,723]   INFO - r.agent.impl.AgentLogProxyImpl - Failed to perform remote command 'pingAndReregister' for build with id 152077, error: java.lang.Exception: Unable to register on server, not pingable: java.lang.Exception: Unable to register on server, not pingable (enable debug to see stacktrace)

We see lots of these, then eventually:

[2013-02-19 16:43:34,924]   INFO -    jetbrains.buildServer.AGENT - Received "stop force" command from launcher. Will stop currently running build and exit
[2013-02-19 16:43:34,925]   INFO - nt.impl.BuildRunAgentStateImpl - Stopping build on agent. Reason: agent shutdown (Build agent shutdown)
[2013-02-19 16:43:34,925]   INFO - uildStages.BuildStagesExecutor - There is no currently running stage to interrupt
[2013-02-19 16:43:34,925]   INFO - .agent.impl.BuildRunActionImpl - Interrupting build finish stage jetbrains.buildServer.agent.impl.buildStages.finishStages.PublishArtifactsFStage
[2013-02-19 16:43:34,925]   INFO -    jetbrains.buildServer.AGENT - Starting agent shutdown sequence, reason: Stop command called

 

At this point the server presumably restarts the run while we still have components running, and the date/time is not reset, so the machine is left running at some arbitrary time.

NOTE: I've turned off hanging build detection.

I was wondering if anyone else has seen anything similar, or knows something that may help?

Many thanks.

 

 

1 comment
Hi! The error `java.net.NoRouteToHostException: No route to host: connect` indicates a networking problem between the build agent and the TeamCity server. If the network connection drops temporarily, the agent should be able to reconnect. 

Losing network connectivity should not cause the agent to stop. But the `Received "stop force" command from launcher` means that the agent process was stopped by an external process. This could be a user or an automated process terminating the agent, or a JVM crash.
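
One way to check how closely the forced stop follows the ping failures is to pull both kinds of entries out of the agent log and compare their timestamps. The following is only a rough sketch: it assumes the log lines start with a timestamp in the same format as the excerpts in the question, and the function names `scan_agent_log` and `seconds_since_last_ping_failure` are made up for this illustration, not part of TeamCity.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log excerpts from the question,
# e.g. "[2013-02-19 16:43:30,723]". Milliseconds are matched but ignored.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")


def scan_agent_log(lines):
    """Return (ping_failures, forced_stops) as lists of datetimes.

    Hypothetical helper: it only looks for the two message texts
    quoted in this thread and skips lines without a timestamp.
    """
    ping_failures, forced_stops = [], []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if "Ping problem" in line:
            ping_failures.append(when)
        elif 'Received "stop force" command' in line:
            forced_stops.append(when)
    return ping_failures, forced_stops


def seconds_since_last_ping_failure(ping_failures, forced_stops):
    """For each forced stop, seconds since the latest earlier ping failure.

    The entry is None when no ping failure precedes that stop.
    """
    gaps = []
    for stop in forced_stops:
        earlier = [p for p in ping_failures if p <= stop]
        gaps.append((stop - max(earlier)).total_seconds() if earlier else None)
    return gaps
```

Passing an open file object for the agent log to `scan_agent_log` works, because it only iterates over lines. If forced stops also show up at times when there were no ping failures, that would suggest the two problems are unrelated.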

So it looks like there are two issues: a networking issue and the build agent termination/crash. Please investigate the networking issue as you would any other networking problem. Regarding the termination/crash, check whether a user is stopping the agent process. If that's not the case, feel free to upload all agent logs + any hs_err_pid* files on the agent + all teamcity-server.log* files from the server side to https://uploads.jetbrains.com/ and specify the upload ID. I will have a look and see if I can pinpoint the issue.
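
For the networking side, a small probe left running overnight on the agent machine can show when the server stops being reachable. This is a minimal sketch rather than a TeamCity tool: `can_connect` and `probe` are hypothetical names, and the host and port would be taken from the URL in the log (`http://10.200.8.12/RPC2`, so port 80 is assumed).

```python
import socket
import time
from datetime import datetime


def can_connect(host, port, timeout=5.0):
    """Try a plain TCP connection; return (True, "") or (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts and "No route to host".
        return False, str(exc)


def probe(host, port, attempts, interval_seconds=30.0):
    """Probe repeatedly, print one line per attempt, and return all results."""
    results = []
    for attempt in range(attempts):
        ok, error = can_connect(host, port)
        now = datetime.now()
        results.append((now, ok, error))
        status = "ok" if ok else "FAILED: " + error
        print(f"{now:%Y-%m-%d %H:%M:%S} {host}:{port} {status}")
        if attempt < attempts - 1:
            time.sleep(interval_seconds)
    return results
```

For example, `probe("10.200.8.12", 80, attempts=1200)` would check every 30 seconds for roughly ten hours. A TCP connect only shows that the port is reachable, not that the server is healthy, but it is enough to see whether the failures cluster around a particular time of night.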