EC2 cloud agent stopped mid-build
I saw a similar post made a few years back: https://teamcity-support.jetbrains.com/hc/en-us/community/posts/8158871090450-Teamcity-Shutting-Agents-down-with-a-build-running
But this morning, in the middle of a build on a cloud agent, the agent was stopped and a warning appeared on the build: “The build agent has become disconnected”. It had been shut down, as shown in the logs. I started it back up and TC automatically restarted the build for me.
The timings in the log make me think that the build was not registering with the agent in TC so it thought it was idle for 30 minutes causing the shutdown. The only thing I did that might be unusual is that the build was started with the 3 dots button so I could select the cloud agent specifically to test a build (rather than it being triggered automatically).
On a related note, I disabled it for maintenance because I assumed that would stop it being shut down by TeamCity while I am testing, but this didn't help either. It seems like an agent that is disabled for maintenance should not be shut down automatically.
See the info here:
It shows idle for 30 minutes even though a build is in-progress. See here too:
It says no agents running builds even though you can clearly see a build running. Maybe because the agent is disabled, it is not checking but this is wrong.
Same thing just happened again even though it is not disabled any more so can't be that. Still showed “No agents running builds” like the screenshot.
By the way, this is a precreated instance that should be started or stopped on-demand. I have previously seen it start when needed so that seems to work OK.
Could you kindly provide the following log files and a screenshot of your cloud profile, as mentioned in the ticket here: https://teamcity-support.jetbrains.com/hc/en-us/community/posts/8158871090450-Teamcity-Shutting-Agents-down-with-a-build-running
- Full build log
- teamcity-server.log
- teamcity-clouds.log
- teamcity-agent.log (covering the timeframe when the issue occurred)
- Screenshot of your cloud profile configuration
You can upload the files via https://uploads.jetbrains.com/. Once uploaded, please share the exact ID with us.
Best Regards,
Tom
Hi Tom: 2024_09_09_zoiBvWNvCGqt5YPYX7ehB7
Note that I have disabled the idle timeout on the agent but it still shuts down 30 minutes after it was started. I started it at 10:46 on the 9th September. After it had started up, it picked up a random build (my mistake) rather than the one I was going to run. That build started at 10:57 and finished at 11:03. In the agent log, you can see that it received the stop command at 11:16. In this case, the build was quick enough to finish before the stop command was issued, whereas in the original problem the stop happened while a build was still running.
I can't find the teamcity-clouds.log. If you need it, please let me know where to find it.
Thanks
Hi Luke,
I haven't found the teamcity-server.log in the log files you provided. However, I found information in the teamcity-clouds.log indicating that the agents were stopped due to an "Idle timeout reached," which is puzzling since you have disabled the related settings.
By the way, could you please restart the TeamCity server and check it again? Could you please share the teamcity-server.log with us for further investigation? Additionally, after uploading, please provide us with the exact ID.
[2024-09-09 10:46:11,293] INFO [prio executor 5] - .server.impl.CloudEventsLogger - Cloud instance start succeeded: profile 'AWS EC2'{id=amazon-1, projectId=_Root}, Agent 1
[2024-09-09 10:46:15,639] INFO [uled executor 2] - .instances.StoppedInstanceTask - Instance has changed status from stopped to Starting: Amazon Instance{instanceId=Agent 1, imageId=Agent 1, amazonImageId=i-00fd3bf508xxx, amazonInstanceId=i-00fd3bf5086xxx, status: Starting}, profile 'AWS EC2'{id=amazon-1, projectId=_Root}
[2024-09-09 10:46:20,640] INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud instance entered 'starting' state, profile 'AWS EC2'{id=amazon-1, projectId=_Root}, Agent 1
[2024-09-09 10:46:55,649] INFO [uled executor 1] - .server.impl.CloudEventsLogger - Cloud instance entered 'running' state, profile 'AWS EC2'{id=amazon-1, projectId=_Root}, Agent 1
[2024-09-09 11:16:48,556] INFO [uled executor 2] - te.IdleTimeoutTerminateFactory - Will stop instance Amazon Instance{instanceId=Agent 1, imageId=Agent 1, amazonImageId=i-00fd3bf50xxx, amazonInstanceId=i-00fd3bf5086xxx, status: Running}, profile 'AWS EC2'{id=amazon-1, projectId=_Root}, because no agent came for 30 minutes (instance started '2024-09-09 10:46:13.000')
[2024-09-09 11:16:48,556] INFO [uled executor 2] - mpl.CloudInstancesProviderImpl - Instance 'Agent 1' is marked for termination after current build finishes
[2024-09-09 11:16:48,556] INFO [uled executor 2] - l.instances.StopInstanceAction - Terminating instance 'Agent 1', profile 'AWS EC2'{id=amazon-1, projectId=_Root}, reason: "Idle timeout reached"
[2024-09-09 11:16:48,556] INFO [uled executor 2] - r.impl.DBCloudStateManagerImpl - Image: AmazonImageInstance{id=Agent 1, amazonId=i-00fd3bxxxxx}, profile: profile 'AWS EC2'{id=amazon-1, projectId=_Root} was marked to CONTAIN agent
[2024-09-09 11:16:49,157] INFO [8 Stop Instance] - r.impl.DBCloudStateManagerImpl - Image: Agent 1, Instance: Agent 1, profile=amazon-1 is marked with state: stopped.
[2024-09-09 11:16:51,215] INFO [uled executor 4] - .server.impl.CloudEventsLogger - Cloud instance entered 'scheduled to stop' state, profile 'AWS EC2'{id=amazon-1, projectId=_Root}, Agent 1
[2024-09-09 11:16:56,216] INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopping' state, profile 'AWS EC2'{id=amazon-1, projectId=_Root}, Agent 1
[2024-09-09 11:17:56,238] INFO [uled executor 3] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'AWS EC2'{id=amazon-1, projectId=_Root}, Agent 1
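To make the timing in the excerpt above concrete, here is a minimal Python sketch that parses the timestamps of the "start succeeded" and "Will stop instance" lines and computes the gap. The helper name `parse_ts` and the shortened log lines are illustrative only and are not part of TeamCity. The stop message itself says "no agent came for 30 minutes", and the gap matches a timer measured from instance start rather than from the end of the last build.

```python
from datetime import datetime

# Timestamp layout used at the start of each teamcity-clouds.log line.
FMT = "%Y-%m-%d %H:%M:%S,%f"

def parse_ts(line):
    """Return the timestamp found between the first '[' and ']' of a log line."""
    stamp = line[line.index("[") + 1 : line.index("]")]
    return datetime.strptime(stamp, FMT)

# Shortened copies of two lines from the excerpt (illustrative).
start_line = "[2024-09-09 10:46:11,293] INFO [prio executor 5] - Cloud instance start succeeded"
stop_line = "[2024-09-09 11:16:48,556] INFO [uled executor 2] - Will stop instance"

elapsed = parse_ts(stop_line) - parse_ts(start_line)
minutes = elapsed.total_seconds() / 60
print(f"Stopped {minutes:.1f} minutes after start")
```

The build in this run finished at 11:03, only about 13 minutes before the stop, so the 30 minutes cannot have been counted from the end of that build.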
Best Regards,
Tom
Hi Tom,
I restarted (from the diagnostics menu) and it still behaves the same. New upload: 2024_09_10_iH7q8k2VRfq8qEFQnAFD11
It has done it a few times today. I keep starting it up and forgetting, and then starting it again to get the logs. I think the most recent start with a shutdown was 15:06/15:36. I have since started it again, but it is still running and inside the 30-minute window.
Thanks
Luke
Hi Luke,
In the teamcity-server.log file, I found an issue indicating that the cloud agent is not properly registered.
[2024-09-10 15:08:59,907] INFO - jetbrains.buildServer.AGENT - Failed to resolve agent host name: java.net.UnknownHostException: xxxx:49908: invalid IPv6 address literal
[2024-09-10 15:09:09,046] INFO - jetbrains.buildServer.AGENT - Failed to resolve agent host name: java.net.UnknownHostException:xxxx:49908: invalid IPv6 address literal
[2024-09-10 15:09:09,154] INFO - jetbrains.buildServer.AGENT - Failed to resolve agent host name: java.net.UnknownHostException: xxxx:49908: invalid IPv6 address literal
[2024-09-10 15:09:21,159] INFO - jetbrains.buildServer.AGENT - Failed to resolve agent host name: java.net.UnknownHostException: xxx:49908: invalid IPv6 address literal
[2024-09-10 15:10:24,840] WARN - jetbrains.buildServer.SERVER - Agent 'Agent 1-i-00fd3bf5xxxx' attempts to register build runner(s) [octopus.generic] which are not registered on the server. Check server installation is not corrupted or related plugins register build runners (run types) on the server correctly.
[2024-09-10 15:10:24,841] INFO - jetbrains.buildServer.AGENT - Agent has been registered: "Agent 1-i-00fd3bf50xxxx" {id=68, protocol=unidirectional, host=[xxxxxxx:49907]:9090, version: 160695/160695-md5-2864844448a6c8f4382e7aee13c8bdf4, agentTypeId=70, pool=Default, registered since 2024-09-10 15:10:24.815}, not running a build
[2024-09-10 15:14:55,215] INFO - jetbrains.buildServer.AGENT - Failed to resolve agent host name: java.net.UnknownHostException: xxxxxx:49907: invalid IPv6 address literal
Please try the following steps to create a new AMI:
1. Start an instance from the AMI manually.
2. Connect the agent to the server (you might need to specify serverUrl in buildAgent.properties file).
3. Authorize the agent and make sure it has upgraded and reconnected back to the server.
4. Go through steps 5 and 6 here: https://www.jetbrains.com/help/teamcity/setting-up-teamcity-for-amazon-ec2.html#Create+an+AMI
Check the teamcity-server.log to ensure that there are no "Failed to resolve agent host name: java.net.UnknownHostException" errors at this time.
Hi Tom, I'm not sure what you're asking me to do here since this is what I already did. This is not from an AMI. It is a fresh installation of Windows + TC Build Agent, which I planned to create an AMI from. I already installed the agent, connected it to the server, waited for it to upgrade, rebooted, and it all appeared fine as per the instructions.
I can see the error you posted, but I don't really understand what it means or how doing the same thing again would resolve it.
Sorry for any confusion. I just want to ensure that your cloud agent is properly registered.
Best Regards,
Tom
Hi Tom, I tried that and I get the same thing.
I think it was different this time, because I'm sure that before, the instance was listed as a separate agent as well as a cloud instance. That might be because I didn't clean it up properly: those instructions are in the “Create AMI” section, which I wasn't following. I think the confusion is that it says something like “If you only want to use this as an instance, finish now”, but that actually applies if you want it as a build agent in the normal list. If you want it as an “instance” as opposed to an “AMI”, you do have to connect, update, clean up and then shut it down.
Anyhow, I get the same error in the server log, although the message is confusing (I didn't notice it in your post because you masked the IP addresses). It says, “Failed to resolve agent host name: java.net.UnknownHostException: 1.2.3.4:49908: invalid IPv6 address literal”. That is not correct because, of course, it is not an IPv6 address, although it does contain a port number after a colon, which might be confusing something. Since I don't set this and the cloud integration picks it up, it might be a bug in the code that starts the instance and registers it. Perhaps it assumes the address is IPv6?
I disabled IPv6 on the VM just to make sure it wasn't getting confused, but it gives the same error.
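As an illustration of the colon theory, here is a minimal Python sketch, not TeamCity code. It shows that a combined `host:port` string is not a valid IP literal until the port is split off. The function names `split_host_port` and `is_ip_literal` are hypothetical. Whether the server really passes the combined string to its resolver is an assumption and is not confirmed anywhere in this thread.

```python
import ipaddress

def split_host_port(value):
    """Split 'host:port' or '[ipv6]:port' into (host, port); port is None if absent."""
    if value.startswith("["):
        # Bracketed IPv6 form, e.g. "[::1]:9090"
        host, _, rest = value[1:].partition("]")
        port = rest[1:] if rest.startswith(":") else ""
    elif value.count(":") == 1:
        # Exactly one colon: IPv4 address or hostname followed by a port
        host, _, port = value.partition(":")
    else:
        # No colon, or a bare IPv6 address without a port
        host, port = value, ""
    return host, int(port) if port else None

def is_ip_literal(value):
    """True if value parses as an IPv4 or IPv6 address literal."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

raw = "1.2.3.4:49908"                 # same shape as the value in the log message
host, port = split_host_port(raw)
print(is_ip_literal(raw), is_ip_literal(host), port)
```

If something similar happens on the server side, disabling IPv6 on the VM would not change the outcome, because the colon comes from the port rather than from the network stack.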
Thanks
It seems that you have manually authorized the agent. Since it's an on-demand instance, that affects our system: the cloud profile doesn't get to be involved in the authorization process and therefore doesn't know whether the instance has managed to connect back to the system.
Could you please try to deauthorize and manually remove the agent? Upon the next startup, it should register through the normal process.
Best Regards,
Tom
Hi Tom,
I already did that. I completely removed it from TC, deleted the auth code and added it back to the cloud profile. It authorized without being manually authorized, but it still isn't working.
I've put enough time into this now, so I will need to park it. I don't know why it isn't working, and I did follow the instructions, but I will have to try something else now. Thanks for the help.
Sorry that it didn’t work out. If you decide to revisit it or need further assistance in the future, feel free to reach out.
Best Regards,
Tom