Builds not assigned to agents

Hello,

I have been using TeamCity for around a year, and in the last month I have started experiencing problems. I am running TeamCity Enterprise 2019.1 (build 65998) in Docker (the official Docker image from JetBrains), with AWS RDS Aurora as the database server. Agents are dynamic: they are spawned on AWS on demand and then terminated.

When the problem occurs, all builds stay pending in the Build Queue. TeamCity shows plenty of available agents, plus one agent that is processing a job with failures (or a job that was aborted). In reality, those idle agents do not exist: no agent servers are running in AWS, and the old agents are already down. The only way I have found to unblock builds is a server restart. What is also interesting is that the Java process is completely stuck and cannot be killed, so I have to reboot the server.

There are multiple messages in the Docker log such as:

jetbrains.buildServer.AGENT - Agent "builder-XXXXX"{id=0} tries to register while it is already registered as "builder-XXXXXX" {id=YYYY}

jetbrains.buildServer.AGENT - Agent "builder-XXXXX"{id=0} tries to register while it is already registered as "builder-XXXXXX" {id=YYYY+1}

and so on.

There are also some errors logged earlier, such as:

jetbrains.buildServer.commitPublisher.PublisherException: Bitbucket Server publisher has failed to connect to https://git.xxxxxx/ repository

Has anyone experienced such problems? What should I look for?


Hi Mateusz,

There seem to be a few different issues here. They might be connected, but I think we should approach them one by one rather than as a whole, since they may just as well be unrelated.

-All builds are pending with available agents: These builds might be set to run only on some agents, or might only be compatible with those agents. If compatible agents are available, the builds should be assigned to them. Of course, if the agents aren't really connected they can't pick up builds, so this is probably tied to the next point.

-"Ghost" agents, shown as connected to the server but with no machines running for them: This is a very strange situation. The server regularly checks the agents and marks as disconnected those that don't reply to the heartbeat.

-The server's Java process being locked and requiring a full server reboot: This is the first time I've seen this behavior. It is probably linked to some of the other issues here, and is most likely just a symptom.

-Agents not spawning: There are a number of possible reasons for this. If the number of enabled, active agents on the server has already reached the number of licensed agents (3 by default), then new agents won't be spawned until at least one of the licenses is freed. If the limit hasn't been reached, there might be other problems spawning the agents, which would probably be reflected in the server logs, teamcity-server.log and teamcity-cloud.log.

With this in mind, the next step is to start looking for culprits. I'd recommend taking the following steps once the server reaches the point where it reports connected agents that are really "ghost" agents whose machines are already down.

-Check the agent's page on the server and try to take a thread dump from the agent (you can do this from the server UI). If it succeeds, there is actually a live agent there. If the machines are already down, that shouldn't happen.

-Check the server logs for error messages. This seems to be a server-centric issue, so the agent logs are probably not relevant. The teamcity-server.log and teamcity-cloud.log will probably contain information from the server side that relates to what the agents are doing.

-If the Java process on the server is blocked, please take some thread dumps from the server and forward them to us: https://www.jetbrains.com/help/teamcity/reporting-issues.html#ReportingIssues-ServerThreadDump.
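As a rough aid for the log check above, the following Python sketch counts how often each agent name appears in the "tries to register while it is already registered" warning. The regular expression is modelled only on the message quoted in the original post, so the exact spacing and any prefix in a real teamcity-server.log are assumptions, and the function name count_duplicate_registrations is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the warning quoted in this thread. Whitespace before
# the {id=...} parts is matched loosely because the real layout is an assumption.
DUPLICATE_REGISTRATION = re.compile(
    r'Agent "(?P<name>[^"]+)"\s*\{id=\d+\} tries to register '
    r'while it is already registered as "(?P<existing>[^"]+)"\s*\{id=(?P<existing_id>\d+)\}'
)


def count_duplicate_registrations(lines):
    """Count duplicate-registration warnings per registering agent name."""
    counts = Counter()
    for line in lines:
        match = DUPLICATE_REGISTRATION.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


def scan_log(path):
    """Apply the counter to a log file, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_duplicate_registrations(handle)
```

Calling scan_log on a copy of the server log and printing the result of most_common() would show which agent names re-register most often, which may help narrow down when the "ghost" agents first appear.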


Hello Denis,

Thanks for assessing this problem. The issue started to recur more frequently, so I have already created a support issue (TW-60831). It seems that it might be caused by large log files from jobs combined with failure conditions. I hope it will be fixed soon in 2019.1.1, but I am not 100% confident, as these blocked processes look really strange to me.

As far as I understand it, all the TeamCity threads responsible for certain functionality got blocked for some reason and are waiting for a system call to return. Because of this, the agent list does not get updated: agents are removed but are still shown as working. I tried to strace those processes, but as I suspected, nothing is happening; they just wait for an answer from the system. The only way to see exactly what they are calling is to run the whole TeamCity server under strace, which seems like a big undertaking, so I will wait for the patch and check whether it helps.
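For anyone who wants to confirm the same symptom without strace, here is a minimal Python sketch that reads the per-thread state from /proc on Linux. It assumes the stuck threads show up in state "D" (uninterruptible sleep), which is typical for processes that cannot be killed but has not been verified for this case. The function names are hypothetical.

```python
import os
import re


def parse_thread_state(status_text):
    """Return the one-letter state code from a /proc/<pid>/task/<tid>/status file."""
    match = re.search(r"^State:\s+(\S)", status_text, re.MULTILINE)
    return match.group(1) if match else None


def list_thread_states(pid):
    """Map thread id to state code for every thread of a process (Linux only)."""
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "status")) as handle:
                states[int(tid)] = parse_thread_state(handle.read())
        except OSError:
            continue  # the thread exited while we were reading
    return states


def uninterruptible_threads(pid):
    """Return the ids of threads currently in state D (uninterruptible sleep)."""
    return sorted(tid for tid, state in list_thread_states(pid).items() if state == "D")
```

Passing the PID of the server's Java process to uninterruptible_threads would list the threads that are stuck in the kernel. The same thread ids, converted to hexadecimal, can then be matched against the nid values in a Java thread dump.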


Hi Mateusz,

Thanks for sharing. I've just checked your issue on the tracker and seen that work has been done there, so unless you think there is anything else to discuss here, I think we can leave it at that. Any further comments on this specific problem are better added to the issue in the tracker, so that the developers see them directly.


Sure,

Thank you again for looking at it.
