Why does TeamCity terminate an EC2 instance with an actively running build after "idle" time expires?

We have one portion of our build process that performs an intensive database load process that can take anywhere from 30 minutes to 2+ hours depending on which client's data we're loading for analysis.  I was working on trying to offload this processing to the cloud.

I have configured an EC2 image to run these builds, however I believe that the "Terminate Instance Idle Time" setting is misleading, and I believe the sensible approach to EC2 integration is competely missing from TeamCity.  

First, because of TeamCity's behavior of terminating the instance when the "idle" time expires, I have had to bump up the setting to exceed our longest anticipate load time (this is not so great for our 30 minutes loads that now will have to incur the cost of our longest load). For this particular run, I bumped the setting up to 1h:50m because I felt that should be enough time for the process to finish and I wanted to avoid a 3rd hour of EC2 charges.  However, I was watching the build agent server as it made it all the way through our extended load process and then was in the stage where it was backing up the database after which is was to copy the image to an S3 bucket.  Once the build exceeded the "idle" timeout period, TeamCity unceremoniously terminated the server instance right in the middle of the build!  Honestly, this makes absolutely no sense to me.  Why would you shut down a server on which a build is currently running because the "idle" time expired?  You could control EC2 charges by using a "hanging build timeout" setting.  That way I could set it to a large value (like 4h) to protect myself from excessive EC2 charges.  In any case, my build agent was not "idle" by any reasonable interpretation.

Furthermore, since Amazon charges partial hours as full hours, what sense does it make even to have an "idle" timeout period?  It seems to me that TeamCity knows when it fired up the server and out of the box it should optimize server up time around each 1 hour period.  If it fires up an EC2 instance to run a build that finishes in 15 minutes, it shouldn't shut it down at all, even after 30 minutes of true idle time.  It should leave that server up until 55 minutes have expired, and then terminate it ONLY if there is no other build running.  It should never terminate a server on which a non-hanging build is currently running.  If it is running a build during the expiration of a 1 hour time period, it should leave the server running until it approaches the end of the next hour (at 1h 55m, for example). Only then if the server is truly idle should the instance be terminated.

Also, I saw it mentioned elsewhere, but it would have saved me a ton of time if I could have created EC2 instances to be used as Cloud Build Agents and then told TeamCity about the instances instead of the images.  This forced me down a relatively painful path of needing to createa separate persistent volume for the build agent's work/temp volumes, and to write a script that executes at startup to attach the volume (and then troubleshoot a variety of problems such as the volume not coming online at all, or coming online in Windows as "read only").

Geoff

3 comments
Comment actions Permalink

Hello,

Please attach server/EC2-running-agent logs to analyze why agents are being disconnected.

There is an issue to make TeamCity EC2 support be EC2 billing aware, please vote for it:
http://youtrack.jetbrains.net/issue/TW-9680

Makeing TeamCity take an running EC2 machine as agent. Could you please elaborate it a little bit. Do you mean EBS images?

0
Comment actions Permalink

I had the "Terminate Instance Idle Time" value set to 110 minutes (i.e. 1h 50m).

Here's a combined sequential view of ALL of the log entries for the event I'm describing, taken from the build log and the teamcity-clouds.log file.  I've not removed any log entries other than where identified (for brevity).



Entries from teamcity-clouds.log

[2011-10-17 23:37:46,954]   INFO [inProfile.html ] - .server.impl.CloudEventsLogger - Profiles reloaded: count=1, integration is enabled
[2011-10-17 23:43:07,003]   INFO [louds/tab.html ] - .instances.StartInstanceAction - Starting cloud instance: profile 'Amazon Build Agents'{id=cp1}, AmazonImage{id=ami-abc12345}, hash=**hashvalue**, reason=User Geoff M started agent from web UI
[2011-10-17 23:43:07,003]   INFO [louds/tab.html ] - .server.impl.CloudEventsLogger - Cloud instance entered 'scheduled to start' state, profile 'Amazon Build Agents'{id=cp1}, Amazon Instance{instanceId=z-ami-abc12345-5, imageId=ami-abc12345, amazonInstanceId=null}
[2011-10-17 23:43:07,003]   INFO [louds/tab.html ] - .server.impl.CloudEventsLogger - Cloud instance start succeeded: profile 'Amazon Build Agents'{id=cp1}, Amazon Instance{instanceId=z-ami-abc12345-5, imageId=ami-abc12345, amazonInstanceId=null}
[2011-10-17 23:43:07,820]   INFO [ents'{id=cp1} 1] - clouds.amazon.AmazonCloudImage - Amazon instance started: instanceId=z-ami-abc12345-5, amazonInstanceId=i-16a8a176, imageId=ami-abc12345, userData=CloudInstanceUserData{myAgentName=<none>, myAuthToken='cloud', myServerAddress='http://deerwood:8080', myProfileDesctription='profile 'Amazon Build Agents'{id=cp1}', myIdleTimeout=110 min, myCustomAgentConfigurationParameters={cloud.amazon.agent-name-prefix=EC2, system.cloud.profile_id=cp1, teamcity.cloud.instance.hash=**hashvalue**}}
[2011-10-17 23:43:07,820]   INFO [ents'{id=cp1} 1] - r.impl.DBCloudStateManagerImpl - Image: ami-abc12345, Instance: i-16a8a176 is marked with state: running.
[2011-10-17 23:43:11,618]   INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud instance entered 'starting' state, profile 'Amazon Build Agents'{id=cp1}, Amazon Instance{instanceId=z-ami-abc12345-5, imageId=ami-abc12345, amazonInstanceId=i-16a8a176}
[2011-10-17 23:43:46,621]   INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'running' state, profile 'Amazon Build Agents'{id=cp1}, Amazon Instance{instanceId=z-ami-abc12345-5, imageId=ami-abc12345, amazonInstanceId=i-16a8a176}

(Amazon instance starts, attaches an EC2 volume containing our working copy, connects to our build server via VPN, and then starts communicating with Team City, and the build starts)


Build Log Entries

[23:48:53]: Checking for changes

[23:48:57]: Clearing temporary directory: D:\BuildAgent\temp\buildTmp
[23:48:57]: Checkout directory: D:\BuildAgent\work\998fa8475e1b493b
[23:48:57]: Updating sources: server side checkout... (18s)
[23:48:55]: [Updating sources: server side checkout...] Building incremental patch for VCS root: TSDS (trunk)
[23:49:15]: [Updating sources: server side checkout...] Repository sources transferred: 2.56Mb total
[23:49:15]: [Updating sources: server side checkout...] Average transfer speed: 215.26Kb per second
[23:49:15]: [Updating sources: server side checkout...] Updating D:\BuildAgent\work\998fa8475e1b493b
[23:49:15]: [Updating sources: server side checkout...] Removing D:\BuildAgent\work\998fa8475e1b493b\src\Application\Tsds.Presentation.Services.UI.AspNet\Student
[23:49:16]: [Updating sources: server side checkout...] Removing D:\BuildAgent\work\998fa8475e1b493b\src\Application\Tsds.Presentation.Model.UI.AspNet\Student
[23:49:16]: Publishing internal artifacts (1s)
[23:49:17]: [Publishing internal artifacts] Sending build.start.properties file
[23:49:16]: Starting: C:\Windows\system32\cmd.exe /c C:\Windows\SysWOW64\WindowsPowerShell\v1.0\powershell.exe -NonInteractive -NonInteractive -NoProfile -ExecutionPolicy Unrestricted -Command - <D:\BuildAgent\temp\buildTmp\powershell5173731198278977365.ps1 && exit /b %ERRORLEVEL%
[23:49:16]: in directory: D:\BuildAgent\work\998fa8475e1b493b\src\scripts\activities\build

... (Log entries removed for brevity -- a lot of SSIS packages being executed here)...


[01:31:25]: Executing task: ExecuteLongitudinalPackages
[01:31:25]: Executing package D:\BuildAgent\work\998fa8475e1b493b\src\Etl\Tsds.Etl.Tsds\CampusLongitudinalMetrics.dtsx ...
[01:33:14]: Microsoft (R) SQL Server Execute Package Utility
[01:33:14]: Version 10.50.1600.1 for 32-bit
[01:33:14]: Copyright (C) Microsoft Corporation 2010. All rights reserved.
[01:33:14]: Started:  1:31:25 AM
[01:33:14]: DTExec: The package execution returned DTSER_SUCCESS (0).
[01:33:14]: Started:  1:31:25 AM
[01:33:14]: Finished: 1:33:14 AM
[01:33:14]: Elapsed:  109.638 seconds
[01:33:14]: Package executed successfully.
[01:33:14]: Executing task: BackupAndArchiveDatabase
[01:33:14]: Backing up database 'XYZ' to C:\MSSQLBackup ...


... and then 22 seconds after starting the backup of the database I just spent the better part of 2 hours loading, and exactly 1h 50m since the instance started...  BAM!  Goodbye!  Apparently, my server was "idle" despite being in the middle of a build.


Entries from teamcity-clouds.log

[2011-10-18 01:33:36,139]   INFO [uled executor 2] - l.instances.StopInstanceAction - Terminating instance: Amazon Instance{instanceId=z-ami-abc12345-5, imageId=ami-abc12345, amazonInstanceId=i-16a8a176}, profile 'Amazon Build Agents'{id=cp1}, reason: idle timeout of 110 minutes reached


[2011-10-18 01:33:36,139]   INFO [uled executor 2] - r.impl.DBCloudStateManagerImpl - Image: AmazonImage{id=ami-abc12345}, profile: profile 'Amazon Build Agents'{id=cp1} was marked to NOT_CONTAIN agent
[2011-10-18 01:33:37,202]   INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'scheduled to stop' state, profile 'Amazon Build Agents'{id=cp1}, Amazon Instance{instanceId=z-ami-abc12345-5, imageId=ami-abc12345, amazonInstanceId=i-16a8a176}
[2011-10-18 01:33:42,203]   INFO [uled executor 4] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopping' state, profile 'Amazon Build Agents'{id=cp1}, Amazon Instance{instanceId=z-ami-abc12345-5, imageId=ami-abc12345, amazonInstanceId=i-16a8a176}
[2011-10-18 01:34:42,209]   INFO [uled executor 3] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'Amazon Build Agents'{id=cp1}, Amazon Instance{instanceId=z-ami-abc12345-5, imageId=ami-abc12345, amazonInstanceId=i-16a8a176}
[2011-10-18 02:34:36,062]   INFO [ents'{id=cp1} 1] - r.impl.DBCloudStateManagerImpl - Image: ami-abc12345, Instance: i-16a8a176 is marked with state: stopped.

... then finally, the first person in in the morning cancelled the build because the EC2 instance was long gone and the build was hung.



[06:36:41]: Build cancelled



As for your second question about having Team City use an EC2 instance, I was referring to how much easier it would be to integrate Team City with EC2 if it didn't terminate the instance every time.  It forced me to go down a path of building in a custom startup script that would make use of the EC2 web services to attach and detach my TeamCity build agent working directory from a separate volume that is not terminated with the instance (which in itself was a tricky thing to figure out).  Our working copy is quite large, and it makes no sense to transfer all that data over to the EC2 build agent every time we go to do a build.  If we could just tell Team City that we have these 2 or 3 build agent instances (instead of AMI images), and have Team City then start up an existing (but Stopped) instance, that would be SOOO much easier.  Then when Team City was done, just have it STOP the instance, not TERMINATE it.  Terminating an EC2 instance destroys any state change fro the image.  By merel stopping the instance, all the data on that instance would then be persisted for the next time it was started and we wouldn't have to jump through any hoops to achieve that effect.  That's all it would be.  An option to take an EC2 Instance and to Start and Stop the instance, rather than taking an EC2 Image and Starting and Terminating it.  Really, it doesn't seem like it would be too difficult to implement that feature, and it would really lower the barrier to entry for EC2 integration!

Geoff

0
Comment actions Permalink

TeamCity does not treat your build agent as EC2 agent. You may see it by message that instande does not contain agent. In that case EC2 instance will be terminated after timeout.

In the build agent logs (please attach them) you may find lines from Amazon plugin that must be bundled with the build agent.
Build agent uses "http://169.254.169.254/2009-04-04" url to fetch amazon instance metadata parameters, please check your VPN does not blocks this EC2 service.

0

Please sign in to leave a comment.