VMWare plugin on TeamCity 8.0.4 does not work

Dear support

After the upgrade to TeamCity 8.0.4, the VMWare plugin no longer works.

We use

VMware cloud support
Allows to run build agents in VMware cloud
SNAPSHOT-201008101252


For the error log, see the attached log file.

On your TeamCity VMWare plugin site (http://teamcity.jetbrains.com/project.html?projectId=TeamCityPluginsByJetBrains_VMWare), the "installer" for 8.0.X currently fails; are there any plans to fix this?

Kind regards
Michael

Attachment(s):
StartVMWareImages.log.zip
22 comments
Comment actions Permalink

I fixed compilation. Please try the new one available at http://teamcity.jetbrains.com/project.html?projectId=TeamCityPluginsByJetBrains_VMWare.

We are now considering improvements to this plugin to make it more valuable. It would help us if you could briefly describe why and how you are using the VMWare integration and what features you'd like to have.

Thanks.

Comment actions Permalink

Hi Sergey

Thank you for the fix and the fast reply.

We use this cool feature to test our application on different OSes without having to spend our TeamCity licenses on dedicated machines.

There are several tests we perform:

  • Install on a clean environment
  • Install on top of a previous release
  • Install under different versions of a "hosting software" (our software can be registered under this "hosting software")


A cool feature would be snapshot support. What I currently have to do every day before the tests run is clean up the VMWare files with a script so that they are pristine again.

So far, this approach has worked just fine, and since it is a proof of concept, I am more than happy with it.

Regards
Michael

Comment actions Permalink

Thank you for your feedback. It's helpful.

Comment actions Permalink

Hi Sergey,

I have been using this plugin extensively and have modified it to run images on VMware Workstation instead of Player, as well as a variant that runs images on vSphere/vCenter. For the latter, I had to modify the code extensively.
We have been using this on TeamCity 8.0.1 and have about 30 agent licenses.

I have been having a few issues with the way either this plugin works or TeamCity manages the VMs listed under the profile created by the plugin. These problems mainly concern the VM not closing after the build is finished.

In the cloud configuration, we have set it up to close the image when the build ends, and the idle timeout is set to the lowest value (1 min), as we don't want to run another build on the same VM unless the VM reverts to a snapshot (on power-off) so that we have a clean build environment again.

However, it seems that if we have a few queued builds that will use an image that is already running a build, the second build launches BEFORE the VM shuts down. The VM then shuts down in the middle of that build, and the plugin gets stuck in a loop of starting and stopping the VM until it eventually catches the VM in a stopped state, starts it, and finally runs the build, or it just gives up and reports odd build errors due to lost networking (because the VM was shutting down). We get errors such as: buildAgent.runBuild: java.net.SocketException: Connection reset

The problem mainly occurs when we launch several short builds that only last 50-60 seconds.

Is there a way for TeamCity to NOT launch a second build on a VM? What is surprising is that it seems to do this despite the fact that we have it configured to shut down the image upon completion of the first build.

I've been trying to study the code to see where I could modify it to fix this, but I am stumped.

Secondly, when I started modifying the code, I had to make some changes to get it to compile properly, mainly related to dependencies such as the xmlstream library. I'm wondering whether the changes you made to get it to compile for 8.0.4 will break my plugin, which has had many changes (especially the version that launches images on vSphere/vCenter), if we upgrade to 8.0.4.

Any help on this would be appreciated.

Amit M.

Comment actions Permalink

Hi Amit,

1) There's an option in the cloud configuration called "Terminate Instance: After first build finishes". However, how it behaves depends on the implementation. This checkbox enables calling jetbrains.buildServer.clouds.CloudClientEx#terminateInstance after the build finishes. If this method works synchronously (i.e. waits until the machine is actually terminated or, at least, becomes unavailable), then no other builds will be run on that VM. Otherwise, you might see the same behaviour you mentioned above.
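
The synchronous-versus-asynchronous distinction above can be sketched with a plain ExecutorService. All names below are illustrative, not the plugin's actual API; the point is only that blocking on the submitted task's Future is what prevents a new build from being routed to a VM that is still going down.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: contrast a fire-and-forget stop with a blocking one.
public class TerminateSketch {
    static final ExecutorService executor = Executors.newSingleThreadExecutor();
    static final AtomicBoolean vmRunning = new AtomicBoolean(true);

    // Asynchronous: queues the stop command and returns immediately,
    // so the server could still hand this agent a new build.
    static void terminateAsync() {
        executor.submit(TerminateSketch::stopVm);
    }

    // Synchronous: blocks until the stop command has actually finished,
    // so no new build can land on a half-terminated VM.
    static void terminateSync() throws Exception {
        executor.submit(TerminateSketch::stopVm).get(1, TimeUnit.MINUTES);
    }

    static void stopVm() {
        try { Thread.sleep(200); } catch (InterruptedException ignored) {} // simulated shutdown delay
        vmRunning.set(false);
    }

    public static void main(String[] args) throws Exception {
        terminateSync();
        System.out.println("vmRunning=" + vmRunning.get()); // prints vmRunning=false
        executor.shutdown();
    }
}
```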

2) My compilation changes are quite minor and only concern library changes, so there's nothing new in the code.

Thanks.

Comment actions Permalink

It seems your VMWare plugin does implement CloudClientEx in a class called VMWareCloudClientEx, and the terminateInstance method is indeed defined there. I have to say that I can't find any place where this method is actually called, but as you pointed out, if the configuration setting "Terminate Instance: ...." is checked, it should be called.

I have not modified how the instances are handled at all. Would you be kind enough to verify the code and see whether terminateInstance is actually implemented the way you would expect? It would be of great help to us if we can fix this bug, as we use this plugin extensively.

Thank you in advance for your help.


Amit M.

Comment actions Permalink

Amit, any TeamCity cloud plugin, such as the VMWare one, must contain implementations of certain API/adapter calls that the TeamCity core uses to manage cloud instances, images and agents.

It's no surprise that you can't find any usages of jetbrains.buildServer.clouds.vmware.VMWareCloudClientEx#terminateInstance, because the checkbox itself and the corresponding behaviour are core functionality.

Comment actions Permalink

Thank you again for your prompt replies, Sergey; it's great to hold a discussion with you. I wasn't surprised not to find any calls to jetbrains.buildServer.clouds.vmware.VMWareCloudClientEx#terminateInstance, as I figured TeamCity's core would be calling the method if the checkbox is selected. At the same time, I'm not sure whether the code as written is correct and allows us to get rid of the behaviour we are seeing at the moment.

To put it another way, the behaviour we see is a problem: the plugin most probably does not terminate the instance properly once a build is finished, allowing all kinds of errors to happen and allowing TeamCity to push builds onto this instance while it's terminating. How can we fix this? Could you have a look at the way this method is implemented in the plugin and see if there's something that can be changed so that we can avoid this behaviour?

Currently we are trying a workaround: disable the agent when the build finishes so that it terminates properly with whatever timeouts we set; once the instance is terminated, the agent will be re-enabled automatically when the next build starts and the instance boots up again. This is a workaround, but I would much rather see whether the plugin's code can be fixed so that we don't have to do this.


Thanks for all your help in fixing this issue.

Amit M.

Comment actions Permalink

You might want to take a look at jetbrains.buildServer.clouds.vmware.VMRemoteHost#terminateInstance. As you can see there, the TeamCity server doesn't wait until the stop command has executed completely. It just queues the command for execution and exits the method. That's why new builds can start on the agent while it's preparing for termination. I think that if you make the stop command synchronous, the problem should go away.

Thanks.

Comment actions Permalink

Hi Sergey,

First of all, thank you for continuously pointing me in the right direction. I see that the terminateInstance method in VMRemoteHost.java that you mentioned doesn't really terminate the machine right away, but rather submits the command to be executed after setting the status to SCHEDULED_TO_STOP. As my Java skills are not that great, I wanted to confirm this would work before implementing the change and testing it out:

So instead of scheduling the stop, I should simply call myVMRun.terminateMachine(), set the status, and dispose (that is, do the same thing as jetbrains.buildServer.clouds.vmware.VMRemoteHost#dispose, which disposes the image, clears and shuts down the executor).

Basically something like:
try {
    myVMRun.terminateMachine(instance.getImage().getInfo());
} finally {
    instance.setStatus(InstanceStatus.STOPPING);
    dispose();
}

instead of whatever is there right now.

Does this seem like the right change to make to the method?

Thank you for your response in advance,



Amit M.

Comment actions Permalink

I would recommend waiting for the Future result returned from the executor. Implementing it this way wouldn't break the existing functionality tied to error processing during the VIX API call.

Sorry, but I cannot provide any code here, since I'm not responsible for its behaviour on your side.

Comment actions Permalink

Hi Sergey,

Yet again, thank you for your prompt reply.  

I will try to figure out how to make the executor do this synchronously; I assume using .wait() at the end of the block can help me wait for the future event.

I do want to point out something though:
You said you can't provide any code here since you are not responsible for its behaviour on our side, but if I'm not mistaken, this code was written by TeamCity developers. So if this code was written by your colleagues (Eugene Petrenko?), why was it written this way, knowing that such an implementation of terminateInstance() can behave the way we are observing at the moment? I'm just curious whether there's a reason for it to be implemented as such.


Thanks again for the valuable assistance you have provided.

I'll try this:
try {
    myExecutor.submit(instance.wrapPermanentError("terminate instance", new Runnable() {
        public void run() {
            try {
                myVMRun.terminateMachine(instance.getImage().getInfo());
            } finally {
                instance.setStatus(InstanceStatus.STOPPING);
            }
        }
    })).wait();
} catch (InterruptedException e) {
    e.printStackTrace();
}
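
A side note on the sketch above, under the assumption that myExecutor is a standard java.util.concurrent ExecutorService: calling Object.wait() on the returned Future fails at runtime unless the caller holds the Future's monitor, and even then it would not be tied to task completion. A minimal, self-contained illustration (the class name is hypothetical, not plugin code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Demonstrates why Object.wait() is the wrong tool for waiting on a Future.
public class WaitPitfall {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(() -> { /* trivial task */ });
        try {
            future.wait(); // throws: the calling thread does not own the Future's monitor
        } catch (IllegalMonitorStateException e) {
            System.out.println("wait() failed: monitor not held");
        }
        future.get(); // the correct way to block until the task finishes
        System.out.println("done");
        executor.shutdown();
    }
}
```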

Comment actions Permalink

I think you should try this instead, because the one you mentioned won't work as expected:

    final Future<?> submit = myExecutor.submit(instance.wrapPermanentError("terminate instance", new Runnable() {
      public void run() {
        try {
          myVMRun.terminateMachine(instance.getImage().getInfo());
        } finally {
          instance.setStatus(InstanceStatus.STOPPING);
        }
      }
    }));
    try {
      submit.get(5, TimeUnit.MINUTES); // don't wait for too long, if something goes wrong.
    } catch (Exception e) {
      // process exception here
    }
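
As a usage note on the snippet above: Future.get with a timeout throws TimeoutException when the deadline passes, and the stuck command can then be cancelled with cancel(true). A small self-contained demo of that pattern (illustrative only, not the plugin's code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Demonstrates get-with-timeout: a slow task times out, a quick one completes.
public class FutureTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Future<?> slow = executor.submit(() -> {
            try { Thread.sleep(5_000); } catch (InterruptedException ignored) {} // simulated hang
        });
        try {
            slow.get(100, TimeUnit.MILLISECONDS); // deadline far shorter than the task
            System.out.println("slow: completed");
        } catch (TimeoutException e) {
            System.out.println("slow: timed out");
            slow.cancel(true); // interrupt the stuck command so the worker thread is freed
        }

        Future<?> quick = executor.submit(() -> { /* fast stop */ });
        quick.get(1, TimeUnit.MINUTES);
        System.out.println("quick: completed");

        executor.shutdownNow();
    }
}
```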

BTW, your case is tracked here: http://youtrack.jetbrains.com/issue/TW-33526.

Thanks.

Comment actions Permalink

Thank you again for your prompt reply and the code snippet.

I'll build the plugin again and test it to see how it behaves and get back to you with results.


Sincerely,

AM

Comment actions Permalink

So my tests show that the implemented changes work, in the sense that you no longer see new builds embarking on an image that's in the process of shutting down; however, you do see canceled builds.

So if I queue 6 small builds on one single agent:

- build config 1 runs
- build 2, 3, 4, 5 and 6 are still queued
- once build config 1 finishes, agent terminates
- build config 2 gets canceled from the queue and requeued automatically with the note:
     Canceled with comment: Agent removed
     Call http://teamcityserver:9090/RPC2 buildAgent.runBuild: java.net.SocketException: Connection reset
- build config 2 runs
- build config 3 runs without a problem
- build config 4 gets canceled and requeued automatically with the note:
     Canceled with comment: Agent removed
- build config 4 runs
- build config 5 gets canceled with the note: Canceled with comment: Agent removed
- build config 5 is requeued automatically and runs
- build config 6 gets canceled with the note: Canceled with comment: Agent removed
- build config 6 is requeued automatically and runs


So, one thing we no longer see after the plugin was updated with the new "terminateInstance()" method is builds starting on terminating machines. I guess the thread now actually waits for the runnable that's terminating the instance to finish; then, when it tries to continue, it realizes that the agent is no longer there, requeues the build, and restarts the agent again.

Is this new behaviour expected, and is it related to the new issue you created on YouTrack regarding TeamCity queuing builds on agents regardless of their status?

I have to say that while this may be a bit annoying when looking at the build history and seeing all those canceled builds, it's still better than having red builds because the agent shut down in the middle of a second build that started on an agent that was shutting down in the first place.


Thanks for your time & consideration.


AM

Comment actions Permalink

Unfortunately, there's no simple way in this method to wait until TeamCity realizes that the agent is stopped, so I can only recommend adding some sleep timeout there (like 1 minute) to let it happen.

Comment actions Permalink

I understand that this may be a limitation that can't be fixed through this method.

I do have another question for you: since I implemented the proposed code change and reinstalled the plugin, I have been noticing that TeamCity has slowed down a lot. Is it possible that the changes in the plugin are causing TeamCity to consume more CPU or memory?

Thanks

Comment actions Permalink

I just realized that adding any kind of delay to terminateInstance may cause TeamCity to slow down, so please revert any changes you've made.

It looks like there's no robust workaround for what you need, so I fixed the issue. Please try replacing cloud-server.jar with the attached one.



Attachment(s):
cloud-server.jar
Comment actions Permalink

Thanks for the fix. I think the problem with short builds may have been fixed, but I do have another problem.

1. I reverted the changes I made in the plugin and rebuilt it against TeamCity 8.0.1 (the version I'm using, build 27435)
2. Reinstalled the plugin, replaced the cloud-server.jar file in TeamCity and restarted TeamCity

Everything restarted fine. When I clicked on the "cloud" tab in the agents section, though, I got an error:

Unexpected Error

This was not supposed to happen. Please provide the error details to your TeamCity server maintainer.
If you maintain this TeamCity installation please report this error to JetBrains.

Error message: Property 'canTerminate' not found on type jetbrains.buildServer.clouds.server.web.beans.CloudTabFormInstanceInfo
TeamCity: 8.0.1 (build 27435)
Operating system: Linux (2.6.18-274.el5, amd64)
Java: 1.7.0_03-b04 (Oracle Corporation)
Servlet container: Apache Tomcat/7.0.25



==================
I'm not sure if this is due to my being on 8.0.1; I should probably upgrade TeamCity to 8.0.5 to see whether it works. Either way, I decided to queue up my small builds to see what happens despite not being able to access the cloud tab.

My 6 builds started running, and TeamCity did an excellent job on the first five: none of the "cancelled" builds I saw previously, and no builds that embarked on a machine that was terminating. When it reached the final build, though, it simply got stuck in a loop, booting the VM and turning it off again immediately, while the last build just stayed in the queue with "No compatible agents". After about 10-15 minutes of rebooting the VM again and again, TC finally ran the build on my agent, without polluting the build history with "cancelled" or red builds.

I will upgrade TeamCity and then try these changes again after rebuilding the plugin against TeamCity 8.0.5 as well. If the cloud tab doesn't give me this error, I'll push these changes to our production TeamCity instance.

The last strange thing I noticed was that my "Unauthorized" agents tab filled up with this agent every time it started and restarted. After my 6 builds had run, I had 6 unauthorized agents (even though the cloud VM agent was turned off), named
WM-name-of-the-agent-1 ..... WM-name-of-the-agent-6

Agent (IP)                              Authorize      Is connected   Last communication   Inactivity reason
Default pool
WM-BM-CS-Salt-20131016-1 (10.0.33.30)   Unauthorized   Disconnected   22 Nov 13 15:29      Cannot access agent
WM-BM-CS-Salt-20131016-2 (10.0.33.30)   Unauthorized   Disconnected   22 Nov 13 15:32      Cannot access agent
WM-BM-CS-Salt-20131016-3 (10.0.33.30)   Unauthorized   Disconnected   22 Nov 13 15:36      Cannot access agent
WM-BM-CS-Salt-20131016-4 (10.0.33.30)   Unauthorized   Disconnected   22 Nov 13 15:39      Cannot access agent
WM-BM-CS-Salt-20131016-5 (10.0.33.30)   Unauthorized   Disconnected   22 Nov 13 15:42      Cannot access agent
WM-BM-CS-Salt-20131016-6 (10.0.33.30)   Unauthorized   Disconnected   22 Nov 13 16:05      Cannot access agent


This had never happened before. Once the cloud instance was closed, it wouldn't end up in the "Unauthorized" agents tab. And of course, the 1, 2, 3, 4, 5, 6 suffixes wouldn't normally be there either, as it would run an instance, terminate it and rerun the same instance again without incrementing the integer.

Just wanted to share my findings. I will report back, possibly on Monday, after the upgrade is done.

Thanks again!

AM

Comment actions Permalink

You are correct in assuming that the problem with the cloud tab is fixed in 8.0.5; the property used in the code only appeared in 8.0.5, which is why you see that error.

I unauthorize agents after the build to free up agent licenses so that more builds can run. I slightly changed the behaviour and created a new package. Please try the new ones and let me know if it helps. I also updated other libraries, so please replace all the attached ones.

Thanks.



Attachment(s):
cloud-interface.jar
cloud-shared.jar
cloud-server.jar
Comment actions Permalink

Hi Sergey,

First of all, I want to thank you for continuing to support this issue and for creating a new patch for me.

Now, this is what was done:

- Upgraded TC to version 8.0.5
- Applied the patch (replaced all three libs in the TeamCity 8.0.5 directory with the ones you supplied)

After patching and restarting Teamcity:
- I was able to access the cloud page without an issue; no error occurred as it had in 8.0.1

- The VMware plugin communicated well with TeamCity.
- VMware plugin functionality tested well: start/stop/running a single build all worked, and the TeamCity agent upgrade went fine

The big test was to queue my 6 small builds with this patched version of 8.0.5:
- No cancelled builds appeared
- No red builds appeared
- The VM closed after each build, and the numbering was perfect (all 6 builds are based on a template, and the numbering showed nothing was skipped: they went from 30 to 35)
- No agents accumulated in the Unauthorized tab. I noticed that while a build was ending, the agent showed as Connected but Disabled, with a note saying "agent has been unauthorized".

Conclusion: all seems to be working with TC 8.0.5 patched with the modified libs that you provided. No red builds, no cancelled builds, and VMs open and close for each build.

Do you foresee this patch being included in 8.0.6? And do you have an ETA for the 8.0.6 release? Instead of upgrading our Live/Prod TC instance to 8.0.5 and then patching it, we would prefer upgrading straight to 8.0.6 if the wait is not too long.


Thank you for all your help again.

Sincerely,

AM

Comment actions Permalink

Hi Amit, good to hear that we were finally able to fix the issue. I will commit the change to the 8.0.6 branch. I'm not sure when it's going to be released, because so far we don't have enough fixed issues since 8.0.5 to justify a new minor release.
Your TeamCity installation should notify you when a new version is available (you'll see the "get new version" message in the page footer).

