Detailed explanation of cloud agents allocation strategy and license usage.

Created February 18, 2013 09:46

I tried to find information about the cloud agent allocation algorithm, but without success, so I am asking my questions here:

Can the agent allocation algorithm be published? A flow diagram would be great.
1. Under which conditions are cloud agents spawned?
2. Under which conditions are cloud agents shut down?
3. Does the queue affect the decision about which agent should be spawned? If so, can this part of the decision-making algorithm be published?
How do cloud agents affect license usage?

I'm considering writing a plugin for a custom Cloud Provider, but the lack of information about the allocation strategy prevents me from making a final decision about whether my idea is suitable.
My goal is to minimize license expenses and at the same time keep the time each build spends in the queue roughly constant. For example, if right now more agents of configuration A are needed, TeamCity would shut down agents with less-needed configurations and spawn more agents with configuration A.
Please clarify whether this is achievable.

5 comments

Eugene Petrenko

Created February 21, 2013 12:37

Hello,

Thank you for the question. I will first try to cover the questions here.

From the cloud plugin's side there is no need to know anything about the algorithm you asked about. The idea is that the TeamCity server knows how to do the task. The goal of the plugin is to provide the server with the necessary info and to implement the start/stop operations.

The question is how the TeamCity server decides to start/stop cloud agents. I need several definitions first.
- Let us call an agent starting if the server has called the plugin to start it, but the agent is not yet connected to the server and able to start a build (it is upgrading, the OS/VM is still starting, and so on).
- Every cloud plugin provides a number of agents that could be started in addition to the currently running agents. We call these startable agents.
- We call an agent available iff it is either a real agent that is connected or a cloud agent that is started and connected. Such an agent is ready to run a build (and is not running one).

There is a queue processing algorithm in TeamCity that assigns queued builds to available build agents with respect to agent requirements and constraints.

The decision algorithm is stateless. It is started every X seconds and does the following:
- first, we assign all queued builds to available agents.
- next, we assign the rest of the builds to starting agents.
- the remaining builds are assigned to startable agents (and TeamCity calls the plugin to start those agents with respect to the available number of agent licenses)
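
The three steps above can be sketched roughly as follows. This is only an illustration under assumptions, not TeamCity's actual code: the names Agent, Build, compatible, assign and decide are hypothetical, compatibility is reduced to a simple "all requirements are present" check, and matching is greedy.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    properties: set  # properties reported by the agent or stored per image


@dataclass
class Build:
    name: str
    requirements: set = field(default_factory=set)


def compatible(build, agent):
    # Assumption: a build fits an agent when every requirement is present.
    return build.requirements <= agent.properties


def assign(builds, agents):
    """Greedily pair builds with compatible agents; return (pairs, leftover builds)."""
    free = list(agents)
    pairs, leftover = [], []
    for build in builds:
        match = next((a for a in free if compatible(build, a)), None)
        if match is None:
            leftover.append(build)
        else:
            free.remove(match)
            pairs.append((build, match))
    return pairs, leftover


def decide(queue, available, starting, startable, free_licenses):
    """One stateless pass: returns (builds to run now, agents to start)."""
    run_now, rest = assign(queue, available)   # step 1: available agents
    _, rest = assign(rest, starting)           # step 2: wait for starting agents
    wanted, _ = assign(rest, startable)        # step 3: agents worth starting
    to_start = [agent for _, agent in wanted][:free_licenses]
    return run_now, to_start
```

Because the pass keeps no state, running it again X seconds later with the new queue and agent lists gives the next decision.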

TeamCity stores agent properties for startable/starting agents. To get the properties, TeamCity has to start an agent from each newly added CloudImage so that it reports all its properties back to the system; this is what lets the described algorithm work. TeamCity stores properties per CloudImage returned from the plugin. We assume a CloudImage represents same-image virtual machines that have mostly the same parameters.

For every started cloud agent TeamCity tracks idle time. Once the idle time exceeds the specified limit, the terminate command is issued to the plugin.
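
As a rough illustration of idle-time tracking (a sketch only; IdleTracker and its methods are hypothetical names, not TeamCity API), the server side could keep the moment each cloud agent became idle and report those that have been idle longer than the limit:

```python
import time


class IdleTracker:
    """Hypothetical helper: remembers when each agent became idle."""

    def __init__(self, idle_timeout_seconds, clock=time.monotonic):
        self.idle_timeout = idle_timeout_seconds
        self.clock = clock
        self.idle_since = {}  # agent name -> time it became idle

    def mark_idle(self, agent):
        # Keep the earliest idle timestamp if the agent is already idle.
        self.idle_since.setdefault(agent, self.clock())

    def mark_busy(self, agent):
        self.idle_since.pop(agent, None)

    def agents_to_terminate(self):
        now = self.clock()
        return [agent for agent, since in self.idle_since.items()
                if now - since > self.idle_timeout]
```

The clock is injectable so the behaviour can be checked without waiting for real time to pass.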

Note that we assume cloud plugin calls return immediately. In most cases this means the cloud plugin should implement internal background update/command logic.
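
One possible shape of such background logic, shown in Python only for brevity (a real plugin is written against the Java API, and every name below is hypothetical): the start/stop calls just record an interim status and enqueue a command, while a worker thread performs the slow cloud operation.

```python
import queue
import threading


class BackgroundCloudClient:
    """Hypothetical sketch: public calls return at once, a worker does the slow part."""

    def __init__(self, slow_start, slow_stop):
        self._commands = queue.Queue()
        self._handlers = {"start": slow_start, "stop": slow_stop}
        self._lock = threading.Lock()
        self.status = {}  # instance name -> last known status
        threading.Thread(target=self._worker, daemon=True).start()

    def start_instance(self, name):
        self._submit("start", name, "starting")

    def stop_instance(self, name):
        self._submit("stop", name, "stopping")

    def _submit(self, command, name, interim_status):
        with self._lock:
            self.status[name] = interim_status
        self._commands.put((command, name))  # returns immediately

    def _worker(self):
        while True:
            command, name = self._commands.get()
            try:
                self._handlers[command](name)  # slow call to the cloud
                final = "running" if command == "start" else "stopped"
            except Exception:
                final = "error"
            with self._lock:
                self.status[name] = final
            self._commands.task_done()

    def wait_idle(self):
        # Blocks until all queued commands are processed (useful in tests).
        self._commands.join()
```

The server can then poll the status map at any time without being blocked by a slow VM start.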

TeamCity automatically authorizes created cloud agents. Once a cloud agent is removed, TeamCity removes the associated build agent and thus frees the license. See jetbrains.buildServer.clouds.CloudInstanceUserData#setAgentRemovePolicy for more details.

We have published the LocalCloud and VMWare plugins with open sources as a reference. Please let me know if you would like me to cover some points in more detail.

Andrey Larionov

Created February 22, 2013 01:26

Thanks, Eugene, for the detailed explanation.

My point of interest is cutting license costs and providing equal (or almost equal) "in queue" time across all build configurations. From your response I learned that my goal may be achievable if I set the idle timeout to 1 second or so. That would guarantee that an idle agent does not "eat" a license for long before the timeout is reached, so the next build starts as soon as the previous build ends and the license is freed. But it may cost additional time and resources because agents are spawned constantly.
Alternatively, I could measure the time needed to spawn an agent and build a model that lets me find the most effective parameters, such as the idle timeout. Am I right?
What do you think, is it possible to implement such a statistics collector and analyzer to create a dynamically adjustable strategy for spawning/terminating cloud agents inside the TeamCity build scheduler/coordinator?
Technically it's almost a mathematical problem, so building a well-verified solution is only a question of effort. But such a feature would make TeamCity not only productive but also cost-effective compared to competitors.
This feature could also evolve into an advice-generation system. Once the model is evaluated, we could mutate the input parameters to find an optimal set of them (simply by minimax, or using heuristics). TeamCity could then suggest to the admin something like: "If you enlarge pool1 by 2 agents and reduce pool2 by 1 and pool3 by 1, you could decrease the average time in queue by approx. 2 min 45 sec."
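
To make the idle timeout versus spawn time trade-off concrete, here is a toy model. It is only a sketch under strong assumptions: one cloud agent, a fixed build duration, a license held while the agent is starting, busy or idle, and an agent terminated exactly when its idle time exceeds the timeout. The function name simulate and all parameters are hypothetical.

```python
def simulate(arrivals, build_time, spawn_time, idle_timeout):
    """Return (total time builds wait in queue, total time a license is held)."""
    total_wait = 0.0
    licensed = 0.0
    free_at = None  # None means no agent is currently running
    for t in sorted(arrivals):
        if free_at is not None and t > free_at + idle_timeout:
            licensed += idle_timeout  # sat idle until the timeout, then terminated
            free_at = None
        if free_at is None:
            start = t + spawn_time    # a new agent must be spawned first
            licensed += spawn_time
        else:
            start = max(t, free_at)
            licensed += max(0.0, t - free_at)  # idle gap with the license held
        total_wait += start - t
        licensed += build_time
        free_at = start + build_time
    if free_at is not None:
        licensed += idle_timeout      # final idle period before termination
    return total_wait, licensed
```

For example, with builds arriving at 0, 100 and 400 seconds, a 50 second build and a 30 second spawn, a 60 second timeout gives 60 seconds of waiting and 350 seconds of license use, while a 1 second timeout gives 90 seconds of waiting and 243 seconds of license use.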

What do you think of it? Maybe this idea will be received positively by some of the product guys?

Yegor Yarko

Created February 22, 2013 10:36

Andrey,

> My point of interest is cutting license costs

Do you mean TeamCity licensing costs? I am not sure you will be able to cut them dramatically if you compare your development and maintenance time with the cost of agent licenses plus convenience for the users.
If this is the case, could you please describe your case to make it clear why just paying the licensing fee is not the best approach?

Just a note: cutting licensing costs by working around the license-related limitations currently implemented in TeamCity might not be a reliable enough approach, as at some point we might change those limitations built into the product.

> What do you think, is it possible to implement such a statistics collector and analyzer to create a dynamically adjustable strategy for spawning/terminating cloud agents inside the TeamCity build scheduler/coordinator?

Such an ability does not seem to exist right now for external plugins. This is something that could be done by us, and it probably deserves a dedicated investigation, though with a questionable effort/effect ratio.

> Technically it's almost a mathematical problem, so building a well-verified solution is only a question of effort.

At the same time, the task may require too much computational power for a reliable solution to be of practical use.
We are conducting a research project related to the queue/agent assignment algorithm, and so far it does not seem an easy task given our actual problem domain.

Andrey Larionov

Created February 26, 2013 13:47

Let's imagine a situation in which I have many TeamCity servers and decide to merge them into one to simplify maintenance. The resulting instance has more than 1000 build configurations. If I use a static pool configuration, there is no way to dynamically and fully automagically redistribute license usage across different build configurations. So I must write some analytic tool that monitors "in queue time" for all the builds and notifies me when the dispersion of this value is too high.
Once it triggers, I have to open the TeamCity Administrative Console and rebalance the pools to reduce the dispersion of the "in queue time" value. All this requires my attention, and with more than 1000 build configurations and bursts in product development (especially now, when developers are starting to use DVCS branch builds), these events will occur across a variety of build configurations.
To me this is a waste of the TeamCity administrator's time, and maintaining this scenario requires an almost dedicated engineer.
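
The kind of monitoring tool described above could start from something as small as this. It is a sketch with hypothetical names; how the "in queue" times are collected from TeamCity is left out, and dispersion is taken here as the population standard deviation of the per-configuration averages.

```python
from statistics import mean, pstdev


def queue_time_dispersion(samples):
    """samples: mapping of build configuration -> list of in-queue times (seconds).

    Returns (average in-queue time per configuration, spread of those averages).
    """
    per_config = {name: mean(times) for name, times in samples.items() if times}
    averages = list(per_config.values())
    spread = pstdev(averages) if averages else 0.0
    return per_config, spread


def needs_rebalance(samples, max_spread_seconds):
    # Assumption: the admin is notified when the spread exceeds a fixed limit.
    _, spread = queue_time_dispersion(samples)
    return spread > max_spread_seconds
```

Configurations with no recorded builds are skipped rather than counted as zero wait.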

So, thinking about it, I realized that if I had a Private Cloud Provider, I could utilize licenses more effectively and keep the "in queue time" dispersion in a reasonable range almost automatically, without administrator intervention. Of course, only if TeamCity has a suitable algorithm behind cloud agent allocation.

Summary: Right now I am evaluating the possibility and outcome of merging different TeamCity servers into one, to improve license utilization and to make maintaining the "in queue time" dispersion automatic and effortless. So yes: even if there is no significant license cost reduction, I think we should at least get a reduction of "in queue time" and less TeamCity administrator attention spent on maintaining the SLA of the build process.

Sorry for my Pidgin English. I'm in a hurry.

Yegor Yarko

Created February 26, 2013 20:54

Andrey,

> If I use a static pool configuration, there is no way to dynamically and fully automagically redistribute license usage across different build configurations.

Could you please explain why you need that?
If you want to have shared agents, you get it out of the box.
If you want to have dedicated agents for projects, you can configure projects to use their own agent pools and assign agents to the pools accordingly.
You can also combine both approaches to have reserved agents and shared ones by assigning projects to several agent pools.
