TeamCity agent 2024.07.2: Docker-in-Docker stopped working
After installing the latest TeamCity server update and upgrading Ubuntu 24.04 to the latest Docker version (27.2.0, build 3ab4256), our agents stopped working (we had them pinned to 2022.10.3-linux-sudo).
After upgrading the agents to 2024.07.2-linux-sudo, they can no longer access Docker.
The container starts fine:
teamcity-agent2-1 | /run-services.sh
teamcity-agent2-1 | /services/run-docker.sh
teamcity-agent2-1 | * Starting Docker: docker
teamcity-agent2-1 | ...done.
teamcity-agent2-1 | Docker daemon started
teamcity-agent2-1 | * Docker is running
teamcity-agent2-1 | /run-agent.sh
But at some point, the following errors appear:
teamcity-agent2-1 | [2024-09-03 08:00:51,881] INFO - ains.buildServer.util.FileUtil - Unable to remove directory /opt/buildagent/temp on unix platform, try to fix permissions and repeat delete operation
teamcity-agent2-1 | [2024-09-03 08:00:51,918] WARN - ains.buildServer.util.FileUtil - Unable to remove directory /opt/buildagent/temp using command line: rm -Rf /opt/buildagent/temp
teamcity-agent2-1 | execution code: 1
teamcity-agent2-1 | std out:
teamcity-agent2-1 | std err: rm: cannot remove '/opt/buildagent/temp': Device or resource busy
Docker config looks like this:
agent2:
  image: jetbrains/teamcity-agent:2024.07.2-linux-sudo
  networks:
    - teamcity
  volumes:
    - ./agent2_conf:/data/teamcity_agent/conf
    - agent2_docker_volume:/var/lib/docker
  environment:
    - SERVER_URL=http://server:8111
    - DOCKER_IN_DOCKER=start
  privileged: true
  restart: unless-stopped
Could this be related to the docker version, or any other obvious configuration issue?
Uploaded the full agent log with id: 2024_09_03_boq9dQAELRZ1eUTRAaTWMY
Eventually the agent registers with the server, but it is not compatible with our build configurations due to:
Unmet requirements:
Do you run TeamCity in a Kubernetes cluster by any chance?
Best regards,
Anton
Hi Anton,
No, the server and agents are configured in a docker-compose.yml file (the Docker config shared above is part of it).
The good news is that starting another agent with this command works properly:
I am now trying to find out what is missing from my Docker Compose config; I suspect it is the user configured to run the container.
Please let me know if you find anything.
Best regards,
Anton
I still don't have a clue how to solve this with Docker Compose. Adding user: 0 does not solve the problem.
Perhaps the handling of privileged changed in a recent Docker Compose release.
It looks like the problem was specifically in the volumes mounted at /var/lib/docker, which needed to be purged and recreated. Perhaps this was related to a Docker update on the host system.
Fully removing all volumes and recreating the containers solved the problem.
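For anyone who wants to try the same fix on a single agent rather than on every volume, a minimal sketch might look like the following. It is a narrower variant of what is described above, not the exact commands used in this thread. The function name reset_agent_volume is made up for illustration, and the full volume name is an assumption: Compose normally prefixes named volumes with the project name, so check docker volume ls for the real one.

```shell
#!/bin/sh
# Sketch: recreate one agent service with a fresh /var/lib/docker volume.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the commands only.

# $1 = Compose service name, $2 = full volume name as shown by `docker volume ls`
reset_agent_volume() {
  service="$1"
  volume="$2"
  ${DOCKER:-docker} compose stop "$service" &&    # stop the agent container
  ${DOCKER:-docker} compose rm -f "$service" &&   # remove it so the volume is no longer in use
  ${DOCKER:-docker} volume rm "$volume" &&        # delete the old Docker-in-Docker data
  ${DOCKER:-docker} compose up -d "$service"      # recreate the container and an empty volume
}
```

Called from the directory containing docker-compose.yml, for example as reset_agent_volume agent2 myproject_agent2_docker_volume, where myproject is a placeholder project name. Everything stored in that volume (images and caches pulled inside the agent) is lost.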
Best regards,
Anton
Apparently, the problems persist:
- I cannot SSH into agent containers that are running
- After restarting a container, docker is unavailable again in the container (docker.server.osType empty again)
- After deleting volumes and creating a new container, the problems are resolved.
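One way to check the second symptom directly, without waiting for the agent to report its parameters, is to ask the Docker daemon inside the container for its OS type. This is only a sketch: the function name dind_os_type is made up, and it assumes that docker.server.osType reflects what the inner daemon reports and that the container's default user is allowed to talk to that daemon.

```shell
#!/bin/sh
# Sketch: query the Docker daemon running inside an agent container.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the command only.

# $1 = Compose service name of the agent
dind_os_type() {
  service="$1"
  # -T disables TTY allocation so the call also works from scripts;
  # the inner `docker info` runs as the container's default user.
  ${DOCKER:-docker} compose exec -T "$service" docker info --format '{{.OSType}}'
}
```

Running dind_os_type agent2 before and after a container restart should show whether the inner daemon stops answering after the restart, or whether only the agent's detection of it fails.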
One thing we do is run a docker prune inside the build agents (to keep disk usage from piling up):
But when I do this manually, the container stays responsive and Docker-in-Docker keeps working (even after a restart).
Just to make sure we're on the same page: are the above symptoms reproduced only when using docker compose? If you start the agent with docker run, does it work correctly?
Best regards,
Anton
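To compare the two launch methods, the Compose service shared earlier can be translated into a plain docker run call. This is a rough sketch, not the command used earlier in the thread. The function name, the container name teamcity-agent-test, and the volume name teamcity_agent_test_docker are placeholders. The network name must be looked up with docker network ls, since Compose usually prefixes it with the project name. The conf bind mount is left out on purpose so the test agent starts with a fresh configuration.

```shell
#!/bin/sh
# Sketch: start a throwaway agent with `docker run`, mirroring the Compose service.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the command only.

# $1 = name of the Compose-created network the server is attached to
run_test_agent() {
  network="$1"
  ${DOCKER:-docker} run -d --name teamcity-agent-test \
    --privileged \
    --network "$network" \
    -e SERVER_URL=http://server:8111 \
    -e DOCKER_IN_DOCKER=start \
    -v teamcity_agent_test_docker:/var/lib/docker \
    jetbrains/teamcity-agent:2024.07.2-linux-sudo
}
```

If an agent started this way keeps Docker-in-Docker working across restarts while the Compose-managed one does not, that points at the Compose setup or its existing volumes rather than at the image.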