Sudden change in build time and test results

We have been using TeamCity 7.1.2 hosted on EC2 for about a year with no problems since we set up our current configuration.  We have been building our Java 1.7 project with Ant with no changes to build process or configuration for over six months.  Everything has been very stable and predictable until the first build of October 24, when we started seeing these changes, very consistently, build after build:

* ~200 unit tests that check the double values produced by lengthy numerical calculations against hard-coded expected results started to fail
* Execution time for passing tests increased, sometimes dramatically: in the worst case, three of our tests went from ~45s to ~15min!
* The following sanity-check test has started to fail (although other similar tests of BigDecimal calculations still pass):
       BigDecimal exponent = BigDecimal.valueOf(10);
        BigDecimal result = BigDecimalMath.exp(exponent, 20);
        result = result.setScale(12, BigDecimal.ROUND_DOWN);
        result = BigDecimalMath.pow(result, 4);
        assertEquals(Math.exp(40), result.doubleValue(), 0);

These changes are very consistent from build to build: the failing tests have consistently calculated the same values all week, and while execution time varies a bit between builds, the same tests have turned up in top 20 sorted by execution time, and the variation in run time for each of these tests has been low.  We cannot reproduce these changes on our development machines.  Rolling back recent changes to our source code doesn't make any difference.  There have been to changes to our TeamCity, EC2, Ant, etc configurations for months!

Any ideas for isolating the problem and getting our build back to normal?


Comment actions Permalink

We've turned up one key difference between the older, passing builds and the newer failing ones: as of 10/24/2013, the build logs indicate that the TeamCity agent was run using java-1.7.0-openjdk- instead of java-1.7.0-openjdk-  It turns out RedHat recently released a security update which has been applied to our RedHat-based AMIs.  We haven't yet confirmed that this is the why our most calculation-intensive unit tests are either failing or taking ~20 times longer than usual to converge, but at this point it looks like the most likely candidate.

Comment actions Permalink

Thanks for the explanation, Audrey.

Such issues can be easely reproduced outside of TeamCity. That may help localizing the root cause faster.

Comment actions Permalink

We have indeed confirmed locally that Java causes certain of our tests to behave differently than under previous jvm versions.

Since our users are unlikely to upgrade their jvm in the near future, we'd prefer to handle this by building with the previous jvm version.  We have been experimenting with creating new AMIs with Oracle's jdk installed (to match the deployment environment), but have yet to get a successful build.  If we install 7u45 and use alternatives to set it as the default, build logs show it being used (unfortunately, Oracle has the same problem as OpenJVM).  When we use the same procedure to install 7u25, the build continues to use OpenJDK 7u45.  Note that in both cases, using java -version on the command line confirms that the Oracle jdk should be used as the default.

How can we explicitly configure which installed version of java to use for the build?

Comment actions Permalink

In particular, it looks like the build agent itself is started up using the correct instance of java, but it then fires off worker processes that are using the wrong instance:

1686 ?        Sl     0:00 /usr/bin/../bin/java -ea -cp ../launcher/lib/launcher.jar jetbrains.buildServer.agent.Launcher -ea -Xmx384m -Dteamcity_logs=../logs/ -Dlog4j.configuration=file:../conf/teamcity-agent-log4j.xml jetbrains.buildSe
1708 ?        Sl     1:03 /usr/java/jdk1.7.0_25/jre/bin/java -ea -Xmx384m -Dteamcity_logs=../logs/ -Dlog4j.configuration=file:../conf/teamcity-agent-log4j.xml -classpath /home/ec2-user/agent/lib/xpp3-1.1.4c.jar:/home/ec2-user/agent/lib/
2171 ?        Sl     0:54 /usr/lib/jvm/java-1.7.0-openjdk- -Dagent.home.dir=/home/ec2-user/agent -Dagent.ownPort=9090 -Dant.home=/home/ec2-us
2862 ?        Sl     4:20 /usr/lib/jvm/java-1.7.0-openjdk- -server -Xmx512m -Dagent.ownPort=9090
2884 ?        Sl     4:33 /usr/lib/jvm/java-1.7.0-openjdk- -server -Xmx512m -Dagent.ownPort=9090
2902 ?        Sl     4:30 /usr/lib/jvm/java-1.7.0-openjdk- -server -Xmx512m -Dagent.ownPort=9090
2949 ?        Sl     2:55 /usr/lib/jvm/java-1.7.0-openjdk- -server -Xmx512m -Dagent.ownPort=9090
3066 ?        Sl     4:13 /usr/lib/jvm/java-1.7.0-openjdk- -server -Xmx512m -Dagent.ownPort=9090
4379 ?        Sl     0:07 /usr/lib/jvm/java-1.7.0-openjdk- -server -Xmx512m -Dagent.ownPort=9090
4400 ?        Sl     0:08 /usr/lib/jvm/java-1.7.0-openjdk- -server -Xmx512m -Dagent.ownPort=9090
4424 ?        Sl     0:05 /usr/lib/jvm/java-1.7.0-openjdk- -server -Xmx512m -Dagent.ownPort=9090

The first two processes listed are correct; how do we configure the other nine?

Comment actions Permalink

We fixed that last issue by adding abuild parameter to our project called env.JAVA_HOME.  It feels like a hack (shouldn't the build pick up environment variables as configured on the agent machine?), but it gets the job done.

Comment actions Permalink

Some build runners have JDK home path setting to specify Java version explicitly.
But your solution works too.


Please sign in to leave a comment.