Multiple builds stuck after publishing artifacts

Hi,

We are experiencing an intermittent issue with a subset of our builds: every so often, a build hangs indefinitely after running all of its steps successfully.

The last entries in the build log always read:

[09:06:21][Publishing artifacts] Publishing using [ArtifactsCachePublisher]
[09:06:21][Publishing artifacts] Publishing using [WebPublisher]
[09:06:22][Publishing artifacts] Publishing using [ArtifactsCachePublisher]
[09:06:22][Publishing artifacts] Publishing using [WebPublisher]

The artifacts we've specified for collection are published slightly earlier and are available as expected (even before the build is marked as finished).

We've only experienced this on one specific agent; other machines have not exhibited the problem, and our efforts to work out why have so far been in vain.

We're unable to retrieve a thread dump from the agent as the co-ordinator marks it as idle after all build steps are complete. However, a server thread dump reveals the following:

"TC: 09:58:59 Finishing build: 191738; 09:58:59 Task started; Normal executor 3" group="main" prio=5 tid=119 nid=119 waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@220d2489

   by "TC: 09:31:56 Task started; Scheduled executor 3"

   java.lang.Thread.State: WAITING

    at sun.misc.Unsafe.park(Native Method)

    at java.util.concurrent.locks.LockSupport.park(Unknown Source)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)

    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)

    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(Unknown Source)

    at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)

    at jetbrains.buildServer.buildFailures.BuildFailurePeriodicalChecker.stopPeriodicalCheck(BuildFailurePeriodicalChecker.java:12)

    at jetbrains.buildServer.serverSide.impl.RunningBuildImpl.finish(RunningBuildImpl.java:207)

    at jetbrains.buildServer.serverSide.impl.RunningBuildsManagerImpl.finishBuild(RunningBuildsManagerImpl.java:123)

    at jetbrains.buildServer.serverSide.impl.RunningBuildsManagerImpl$2$1.run(RunningBuildsManagerImpl.java)

    at jetbrains.buildServer.util.NamedThreadFactory.executeWithNewThreadName(NamedThreadFactory.java:89)

    at jetbrains.buildServer.serverSide.impl.RunningBuildsManagerImpl$2.run(RunningBuildsManagerImpl.java)

    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)

    at java.util.concurrent.FutureTask.run(Unknown Source)

    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

    at java.lang.Thread.run(Unknown Source)

"TC: 09:06:25 Finishing build: 191734; 09:06:25 Task started; Normal executor 2" group="main" prio=5 tid=117 nid=117 waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@59ea022a

   by "TC: 08:43:15 Task started; Scheduled executor 1"

   (same call stack)

"TC: 09:31:05 Finishing build: 191736; 09:31:05 Task started; Normal executor 8" group="main" prio=5 tid=314 nid=314 waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@227c29c2

   by "TC: 09:07:15 Task started; Scheduled executor 2"

   (same call stack)

This tallies with what we see on the projects page (in this case, 3 running builds belonging to the same agent):

builds.png

Sometimes the builds will complete after several hours of overtime; sometimes we have to manually remove them from the "running" database table as they 'cannot stop'. The latter is not enjoyable!
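
If I'm reading the dump correctly, each "Finishing build" thread is parked in ReentrantLock.lock() inside BuildFailurePeriodicalChecker.stopPeriodicalCheck(), waiting for a lock that a long-running "Scheduled executor" thread is holding. Here's a minimal standalone sketch of that pattern, just to illustrate what we think we're seeing (my own code with made-up method bodies, not TeamCity's):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustration only: a scheduled checker holds a ReentrantLock while it runs an
// expensive check, so the "Finishing build" thread blocks in stopPeriodicalCheck().
public class LockContentionSketch {
    private final ReentrantLock lock = new ReentrantLock();

    // Runs on the "Scheduled executor" thread and holds the lock for a long time.
    void periodicCheck() {
        lock.lock();
        try {
            TimeUnit.SECONDS.sleep(30); // stand-in for a slow periodic check
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock();
        }
    }

    // Called from the finish path; parks in lock() until the checker lets go.
    void stopPeriodicalCheck() {
        lock.lock();
        try {
            // cancel the scheduled task here
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockContentionSketch s = new LockContentionSketch();
        new Thread(s::periodicCheck, "Scheduled executor 3").start();
        Thread.sleep(100); // let the checker grab the lock first
        // A thread dump of this process now shows this thread WAITING on the ReentrantLock.
        new Thread(s::stopPeriodicalCheck, "Finishing build: 191738").start();
    }
}

That would at least explain why the stuck builds sometimes finish on their own hours later: whatever the scheduled thread is doing while it holds the lock eventually completes and releases it.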

A complete reinstall of the agent software unfortunately did not help.

We are running TeamCity Enterprise 9.0.4 (build 32407). Any help would be much appreciated.

Thanks,

Jon


Hi Jon,

Please attach the teamcity-server.log and the teamcity-agent.log from the affected agent, as well as several thread dumps of the hanging agent process. You can use the jstack utility: https://confluence.jetbrains.com/display/TCD9/Reporting+Issues#ReportingIssues-Threaddumptakingapproaches.
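
If jstack cannot attach to the agent process for some reason, an equivalent dump can also be taken from inside a JVM with the standard java.lang.management API (a generic Java sketch, not anything TeamCity-specific):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints a thread dump of the current JVM, including lock information,
// roughly equivalent to what jstack reports for the same process.
public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        }
    }
}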


Hi Alina,

Here are the agent (named 'PROST') and server logs starting from around the time I triggered the build in question. Some data has been anonymised (replaced with "...") but the build we're debugging is labelled 'build_of_interest'.

I've also attached the public build log along with several thread dumps from the agent, which were taken once the build appeared to be finished and the artifacts were available (see timestamps). I suspect the issue is server-side, so I'm not sure they'll reveal much, but I hope I'm wrong!

Note the 2-hour delay before "Build finished" appears. It's worth pointing out that the agent was disabled, so no other builds were permitted to run on it during this time.

Thanks,
Jon



Attachment(s):
thread_dumps.zip
teamcity-server.log.zip
teamcity-agent.log.zip
build_log.txt.zip

Jon,

Indeed, this is a server-side issue related to build failure conditions.

Could you please attach a full server thread dump and tell us which build failure conditions are configured for the build?


Here's the full dump from my first post, where 3 builds were not completing. I don't have one for last night's repro, but I can take a new one if necessary.

The failure conditions are:

fail.png

Attachment(s):
server_threads.txt.zip


Did this ever go anywhere? I'm seeing this exact behavior.


In the end we disabled and removed that agent; no solution was found.


Hello James,

Could you please attach the agent and server logs? You can use the "Submit a request" button at the top to send the files privately.


Hi Alina, I submitted a separate ticket and was able to resolve my issue: there was a rogue .* wildcard at the beginning of a regex in one of my failure conditions that was causing my server CPU to max out. Once I removed it, all was well.
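
In case anyone else runs into this, here's a rough standalone Java sketch of why that leading .* is so expensive (generic regex behaviour, nothing TeamCity-specific; the pattern and line length are invented for the demo). Matcher.find() already retries from every offset in the line, and a greedy leading .* makes each of those retries re-walk most of the remaining line when there's no match:

import java.util.regex.Pattern;

// Compares a failure-condition-style pattern with and without a leading ".*"
// against one long log line that does not contain the searched-for text.
public class LeadingWildcardDemo {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append('x');
        }
        String line = sb.toString();

        time("with leading .*:   ", Pattern.compile(".*BUILD FAILED"), line);
        time("without leading .*:", Pattern.compile("BUILD FAILED"), line);
    }

    private static void time(String label, Pattern pattern, String line) {
        long start = System.nanoTime();
        boolean found = pattern.matcher(line).find(); // find() retries from every offset
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%s found=%b, took %d ms%n", label, found, millis);
    }
}

On a non-matching line the first pattern does roughly quadratic work, which, multiplied across all the log lines the server checks, is more than enough to peg a CPU core; the second stays effectively linear.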
