Sporadic one minute stall after 'Agent is running under JRE'

Permanently deleted user

Created September 03, 2015 13:30

Hi,

When running any job in our project, there is roughly a 20% chance that there will be a one minute stall near the start of the job. This presents in the logs like this:

[13:19:10] : TeamCity server version is 8.1.5 (build 30240)
[13:19:10] : Skip checking for changes - changes are already collected
[13:19:10] : Agent time zone: Europe/London
[13:19:10] : Agent is running under JRE: 1.7.0_51-b13
[13:20:10] : Publishing internal artifacts
[13:20:10] : [Publishing internal artifacts] Sending using ArtifactsCachePublisher
[13:20:10] : [Publishing internal artifacts] Sending using WebPublisher
[13:20:10] : Clearing temporary directory: D:\Teamcity\temp\buildTmp

Notice the jump from 13:19:10 to 13:20:10 after the "Agent is running under JRE" log line.

The same job, run against the same changelist, on the same agent shortly afterwards produces the following log:

[13:27:13] : TeamCity server version is 8.1.5 (build 30240)
[13:27:13] : Skip checking for changes - changes are already collected
[13:27:13] : Agent time zone: Europe/London
[13:27:13] : Agent is running under JRE: 1.7.0_51-b13
[13:27:14] : Publishing internal artifacts
[13:27:14] : [Publishing internal artifacts] Sending using ArtifactsCachePublisher
[13:27:14] : [Publishing internal artifacts] Sending using WebPublisher
[13:27:14] : Clearing temporary directory: D:\Teamcity\temp\buildTmp

As you can see, there is only 1s between "Agent is running" and "Publishing internal artifacts".

This obviously impacts us on short-running jobs since this 1m dead time can eclipse the actual running time of the job itself.

Some observations:

I've not managed to identify a pattern to what is causing the stall, other than if it does occur it is always exactly one minute.
It doesn't appear to be only related to moving from the spin-up onto publishing internal assets, I have other examples of logs where the next thing after "Agent is running" is "NUnit report watcher" for some of our test jobs.
It happens on personal and non-personal builds.
It isn't just the first run of a job on an agent, I've seen the same job (same CL, same agent) run OK first time, then stall second, then OK third.
A quick straw-poll of one job over the past two days shows 23 of the 40 agents in the pool exhibiting the problem at some point.

This feels like a timeout of some sort, does TC query a web connection or otherwise immediately after outputting the log message? More generally, is this a known issue? I couldn't find any reference to anything related to this.

Thanks,
David

2 comments

Alina Mishina

Created September 04, 2015 11:05

Hi David,

To investigate this case we need additional information. Could you collect agent thread dumps while the build is in hanging state? Please attach collected thread dumps together with teamcity-agent.log file.

Permanently deleted user

Created September 17, 2015 07:46

The problem was due to the "Performance Monitor" build feature. Removing this from the jobs removed all the stalls. On closer inspection, for jobs with pauses, we got no PerfMon data anyway.

JetBrains support suggested we upgrade to the latest TC (9.1 at time of writing) and see if the problem has been fixed. The issue for TC 8.1.5 is being tracked here: https://youtrack.jetbrains.com/issue/TW-42478

Please sign in to leave a comment.