Is there an internal property to increase the number of build messages an agent sends to the server per request?

Answered

Created November 17, 2020 20:52

Typically an agent sends 100 build messages per server update request. This means that if more than 100 messages are generated in the interval between two such requests, the server lags behind, and at the end of the build the agent is stalled until the server has received the full build log.

We now have an agent that is very slow in handling these requests (~5 sec interval), but has no throughput issue (I can transfer files at a rate of >50 MB/s).
As some of our builds produce far more than 100 messages per second, I would like to know if there is an internal property (teamcity.buildMessagesProcessing.? that works on 2017.1.5) to increase this for this specific agent.

To give an idea: a build that takes 20 min to build is idle for >40 min because the build log transfer takes so long.
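
As a rough back-of-the-envelope sketch of these numbers: assuming the agent sends one batch of 100 messages, waits for the server's reply, and only then sends the next batch, a ~5 sec round trip caps the transfer at about 20 messages per second. The model below (strictly sequential batches, a constant production rate during the build) is an illustrative assumption, not a description of TeamCity internals, and the function names are made up for this example.

```python
def drain_rate(batch_size: int, round_trip_s: float) -> float:
    """Messages per second handed over when batches are sent strictly one after another."""
    return batch_size / round_trip_s


def stall_after_build(build_s: float, produced_per_s: float, rate: float) -> float:
    """Seconds the agent keeps flushing its backlog after the build itself is done."""
    backlog = max(0.0, produced_per_s - rate) * build_s
    return backlog / rate


rate = drain_rate(batch_size=100, round_trip_s=5.0)
# A 20 min build followed by a 40 min stall means 60 min of flushing in total,
# so the average production rate during the build would have been:
implied_rate = rate * (20 + 40) / 20
stall_min = stall_after_build(20 * 60, implied_rate, rate) / 60
print(rate, implied_rate, stall_min)  # prints "20.0 60.0 40.0"
```

Under these assumptions an average of only about 60 messages per second during the build is enough to explain the 40 min stall.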

6 comments

Fedor Rumyantsev

Created November 18, 2020 16:01

Hello Hans,

No, the message batch size is not configurable; however, even higher rates (like 100 messages per second) should generally be handled normally. Similar behavior (delayed message processing) could be observed if there is an issue with the agent (e.g. high CPU/memory usage, so the queue is slow to be handled) or with the server (which is slow to confirm that the message batch has been processed). Is the agent in question the only one affected? If so, could you please:

1) enable the XML RPC logging (https://www.jetbrains.com/help/teamcity/viewing-build-agent-logs.html#Specific+Debug+Logging);
2) rerun the build so that the logs are populated;
3) upload the agent logs to https://uploads.jetbrains.com and share the upload ID with me?

Hans Van Leemputten

Created November 18, 2020 16:34

Hi Fedor,

Yes, this is the only one that I know of out of a pool of 30 similar machines. Most machines are located in the same room (this one is) and connected to one of the two (gigabit) switches in that room. Note that for testing purposes I installed a second agent on the PC; it too suffered from the same problem.

Upload id: 2020_11_18_5kS7Efm4utm2VK3A (file: TC_logs.zip)

A build just finished. Besides the logs you requested (the XML RPC logging was already on), I included the build log I downloaded from the build configuration and added a screenshot of the build duration history graph of that config.

Just look in the teamcity-build.log at 2020-11-18 16:05:37,572 and you will see the massive stall.

Let me know if something is missing.

Fedor Rumyantsev

Created November 18, 2020 21:09

Hello Hans,

Thank you, the issue is perfectly illustrated. I see there is a delay of 5 seconds on average between the log upload request and the response from the server. Given this is the only agent affected, I would suggest starting with the agent side: when the build is started again and the message processing is still ongoing while the build step itself is already over, could you please collect a set of thread dumps on the agent machine in question (please see this instruction for the details: https://www.jetbrains.com/help/teamcity/reporting-issues.html#Agent+Thread+Dump)? A set of 5-10 dumps taken within 10 seconds of each other would be sufficient to see what the agent is busy with.
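
For illustration only, a series of dumps like this could be scripted roughly as follows. This is a minimal sketch that assumes a JDK with the `jstack` tool is on the PATH of the agent machine and that the script runs as the same user as the agent JVM; the helper names and the output file naming are made up for this example, and the linked instruction remains the reference for the supported ways of taking a dump.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List


def jstack_dump(pid: int) -> str:
    """Return a thread dump of the JVM with the given PID via the JDK's jstack tool."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


def collect_dumps(
    pid: int,
    out_dir: Path,
    count: int = 8,
    interval_s: float = 10.0,
    take_dump: Callable[[int], str] = jstack_dump,
) -> List[Path]:
    """Write `count` thread dumps into out_dir, pausing interval_s between dumps."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        path = out_dir / f"agent-threaddump-{i + 1:02d}.txt"
        path.write_text(take_dump(pid))
        written.append(path)
        if i < count - 1:
            time.sleep(interval_s)  # no pause needed after the last dump
    return written
```

With a hypothetical agent PID of 12345, `collect_dumps(12345, Path("threadDumpsAgent"))` would leave eight numbered files that can be zipped and uploaded.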

To have a full picture, could you please do the same on the server side while the issue is present (https://www.jetbrains.com/help/teamcity/reporting-issues.html#Server+Thread+Dump)? TeamCity will automatically take thread dumps when certain threads are delayed; if there are logs/threadDumps-<date> folders on the server side which cover the time of the issue, please feel free to send them instead.

I see you are using 2017.1.5, which is outdated; while chances are this is unrelated, I will check if there were any improvements to the build log upload logic in newer versions (or known issues relevant to this version) and will get back to you.

Hans Van Leemputten

Created November 19, 2020 19:22

Hi Fedor

I uploaded a series of agent side thread dumps I took while the issue occurred some days ago: Upload id: 2020_11_19_KroQkWL5wwiQohNp (file: threadDumpsAgent.zip)
If you diff them, you will see that only the first and second dumps have real differences.

I don't have the server-side dumps, but if they are of value to you I could redo the test.
However, I don't think it is needed, as I can trigger the same behavior by browsing our TeamCity server using Chrome. In Chrome's built-in debugger I can see that the TTFB is 5 sec for each page that needs to be fetched from the server, but only if the URL includes the domain name, not when I use, for example, the IP address. So for now I reconfigured the agent to use the fixed IP until our IT network department can investigate the issue further.
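
The same comparison can be made outside the browser. The sketch below is only an illustration: it times the name lookup separately from a plain HTTP GET, which can help tell whether the extra seconds are spent resolving the hostname or waiting for the response. The function name is made up, and `ttfb_s` here covers everything from opening the connection to receiving the response headers, so it is not identical to the figure Chrome reports.

```python
import http.client
import socket
import time


def measure(host: str, port: int = 80, path: str = "/") -> dict:
    """Time the name lookup and the time to first byte for one plain HTTP GET."""
    t0 = time.perf_counter()
    socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)  # name lookup only
    dns_s = time.perf_counter() - t0

    conn = http.client.HTTPConnection(host, port, timeout=30)
    try:
        t1 = time.perf_counter()
        conn.request("GET", path)       # connects, then sends the request
        response = conn.getresponse()   # returns once status line and headers arrive
        ttfb_s = time.perf_counter() - t1
        response.read()
        return {"host": host, "status": response.status, "dns_s": dns_s, "ttfb_s": ttfb_s}
    finally:
        conn.close()
```

Calling `measure("teamcity.example.internal", 8111)` and `measure("10.0.0.5", 8111)` (both values are placeholders, not our real server) and comparing the two results should show the same hostname vs IP gap.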

I did see one thing that I hope is fixed in newer versions of TeamCity: in a case like this, where the server is running far behind in processing log messages, stopping the build always causes problems. The build is unable to stop until it is back in sync with message processing, or it just hangs forever (both cases are very frustrating; if you press stop, you expect it to abort within a few minutes). We even had a case where the stop-on-max-duration feature plainly did not work, which led to a situation where only a manual reboot of the system helped.

Fedor Rumyantsev

Created November 21, 2020 15:40

Hello Hans,

Yes, the agent is barely busy resource-wise, so this does look like a network access issue. Are there any proxies between the server and the agents? With the behavior you have described (TTFB = 5s for any HTTP(S) call), I would expect all agents to be affected, though. Is the issue specific to this agent machine or does it reproduce elsewhere?

Speaking of the build waiting to flush the messages prior to cancellation: I believe this behavior was still present in 2019.2 (e.g. if the agent was close to OOM and a build cancel was issued, the agent would still wait to flush). Then again, the alternatives are either to confirm the build cancellation and proceed with the build message flush in the background (but if the flush itself is the cause of the issue, this can be disruptive) or to lose the unreported messages. As the agent is generally quick enough to flush the data, such behavior is considered a warning to check the agent state.

That being said, if you could share an approach that could work for you, I would gladly register a feature request regarding the matter (you can also make one at any time on our issue tracker: https://youtrack.jetbrains.com/issues/TW).

Hans Van Leemputten

Created December 14, 2020 11:28

The big delays are caused by a combination of internet connectivity and the virus scanner. The machine in question is not allowed to access the internet, only the internal network, but it has a virus scanner. The virus scanner is out of date as it can't download updates from the internet, which seems to cause it to delay incoming requests until, I guess, it times out trying to connect to the internet, or something like that. Some modifications have been made and things improved a lot, but access via the full domain URL is still slower (500 ms TTFB vs 20 ms).
