Stuck Build Queue

Permanently deleted user

Created May 05, 2011 19:32

This is related to an earlier post.

Our build queue is starting to get "stuck" for long periods of time when developers check in code or trigger builds with snapshot dependencies. A tell-tale sign that the queue is stuck is when the "Time to Start" says "Delayed" instead of a time. By "stuck" I mean that builds are queued, there are idle agents available, and the queued build could run in parallel with the other builds in progress.
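The definition of "stuck" above can be written down as a small check. This is only a sketch in Python with hypothetical names (`Agent`, `QueuedBuild`, `is_stuck`, and the `pool` and `blocked_by` fields); it does not call any TeamCity API and simply restates the three conditions.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    pool: str           # hypothetical pool label, e.g. "build" or "production"
    idle: bool = True


@dataclass
class QueuedBuild:
    config: str
    pool: str           # pool whose agents are allowed to run this build
    # Configurations that must finish before this build may start (assumption).
    blocked_by: set = field(default_factory=set)


def is_stuck(build, agents, running_configs):
    """True when a queued build could start right now but is still waiting."""
    has_idle_agent = any(a.idle and a.pool == build.pool for a in agents)
    can_run_in_parallel = not (build.blocked_by & set(running_configs))
    return has_idle_agent and can_run_in_parallel
```

A build that is waiting on a running dependency, or whose pool has no idle agent, is merely queued; only a build that passes both checks and still does not start counts as stuck in the sense used here.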

The dependency tree in our main "Build" project is fairly complex and a bit tangled. Below is a simplified example (the same graph as in the post above):

In our setup, all build configurations are in the same "build" project, and we only have VCS build triggers on the top level build configurations. In this example, that would be I, K, G and L. If code is checked into a low level dependency, in this example A and/or C, many builds are queued. In our current production setup, this can result in 200 or so builds being queued up. Individual build configurations may be queued more than 10 times for a single code checkin. In our "build" project, all snapshot dependencies have the "Do not run new build if there is a suitable one" and "Only use successful builds from suitable ones" options checked.
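To illustrate how one low level checkin can fan out into many queue entries, here is a rough Python sketch. Only the names A, C, I, K, G and L come from the example; the intermediate configurations B and D, every edge in `DEPENDS_ON`, and the function names are made up for illustration. It assumes each affected top level trigger initially queues its whole snapshot chain, before TeamCity reuses any suitable builds, which may not match what the server does internally.

```python
from collections import Counter

# Hypothetical snapshot dependencies: config -> configs it depends on.
# The edges are invented; they are not the real graph from the post.
DEPENDS_ON = {
    "A": [], "C": [],
    "B": ["A"], "D": ["A", "C"],
    "G": ["B"], "I": ["B", "D"], "K": ["D"], "L": ["C"],
}
TOP_LEVEL = ["I", "K", "G", "L"]  # the only configs with VCS triggers


def chain(config):
    """Return config plus everything it transitively depends on."""
    seen = set()
    stack = [config]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DEPENDS_ON[current])
    return seen


def queued_requests(changed):
    """Count queue requests per config when every affected top level
    trigger queues its entire chain (assumed behaviour, see above)."""
    counts = Counter()
    for top in TOP_LEVEL:
        members = chain(top)
        if members & set(changed):
            counts.update(members)
    return counts


if __name__ == "__main__":
    requests = queued_requests(["A"])
    print(dict(requests), "total:", sum(requests.values()))
```

With this toy graph, a single checkin to A produces 12 queue requests across 7 configurations, with A itself requested 3 times. A second checkin arriving before the first batch is resolved would add its own requests on top.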

When only one code checkin occurs, TeamCity resolves build dependencies fairly quickly. However, if someone else makes a code checkin shortly after the first one, the queue explodes again. The more often this happens, the longer the queue appears to be "stuck" while TeamCity figures out what to build next. This is further complicated by the fact that we use TeamCity to deploy our software to our production and testing environments.

In our setup we have dedicated build agents running as specially privileged users for each of our environments. For example, our Production Environment has 6 dedicated build agents for deploying software, which run as a user with production level privileges. These dedicated agents cannot be used by anything in our main "Build" project since they have production level access.

To maximize the number of agents available for our main "build" project and Development environment (our most open testing environment), these projects share the same agents. Therefore, if a development ("dev") environment deployment is triggered (it is scheduled for every two hours) and someone makes a code checkin, all of those queued builds will be allowed to run on the same 8 agents. The dev deployment can happen independently of the main code builds (no snapshot dependencies), but often deployments are left sitting in the queue even though there are available agents. Our Production Deploy, however, has a completely separate and dedicated pool of agents.

So today, we had 5 code checkins in about 5 minutes (three of them were to lower level projects). This was followed by a scheduled dev deployment. Then one of our sysadmins triggered a production deployment to a desktop machine he was setting up. Very little happened for almost two hours. Builds ran occasionally, but very infrequently. During this two hour period, there were several other code checkins, which made the queue even larger and slower. Even though the production deploy had 6 idle dedicated agents available and had the highest priority in the build queue, it took 42 minutes before it started. Most of the time, the agents for our build and dev deployment were idle.

During this two hour period, load on our quad core Windows Server 2008 R2 TeamCity web server machine was around 30%, almost all of it due to the TeamCity Tomcat instance. Tomcat was using about 750MB of RAM. The machine is 64-bit with 16GB of RAM, and Tomcat is running as a 32-bit process. The queue had over 300 items in it. I tried to archive the build project, but after 5 minutes there were still 280 items in the queue and the archive had not finished. I ended up restarting the TeamCity web service, after which the archive completed much faster. I then re-triggered the builds in our "build" project. Luckily, code checkins were at a minimum following the restart, so TeamCity pushed through the queue in an adequate amount of time. Restarting seemed to help, but it did not eliminate the problem of items being stuck in the queue.

I uploaded thread dumps, debug log files, and memory snapshots to ftp.intellij.net/.upload. The filename is: Chatham-TeamCityServerData-20110505.zip

Any help in improving the performance of our snapshot dependencies would be greatly appreciated!
