Aspiring Architect: Top 15 Docker Runtime Issues - Reproduce

In this post we will go through the top 15 runtime issues you might face when managing Docker containers. We will re-create each issue in a Docker environment, debug it step by step, and resolve it.

For this I have created a GitHub repo with 15 folders, one for each runtime issue. Each folder contains a docker-compose file that re-creates the issue and a README file that describes exactly what the issue is, gives a step-by-step guide to debugging it, and explains how to fix it.

GitHub repo: https://github.com/pratappilaka24/Top-15-Docker-Runtime-Issues

Below are the runtime issues we will be looking at.

OOM (Out-of-Memory) Killer
CPU Throttling
Disk IO Exhaustion
DNS Resolution Failure
Volume Permission Denied
Port Binding Conflicts
Health Check & Restart Loops
Docker Socket Exposure
Zombie Processes
Container Logs Exhaustion
Network MTU Mismatch
Environment Variable Secret Leakage
Timezone Mismatch Inside Containers
Open File Limit Exhaustion
SELinux Blocking Container Operations

I am using a RHEL 10 VM as the host, with the shared drive that holds all the samples mounted on it.

Let's start re-creating the runtime issues one by one.

01. OOM (Out-of-Memory) Killer — Container Killed by Kernel

The Linux OOM (Out-of-Memory) Killer terminates processes (or containers) when the system or cgroup memory limit is exhausted. In production, this manifests as containers suddenly disappearing with exit code **137** (SIGKILL). This is one of the most common and impactful production incidents.

Reproduce it by navigating to the corresponding folder and running "docker compose up".

You can see the container's memory usage grow gradually during a single data operation until the kernel kills the container. It exits with code 137, the well-known signature of the OOM Killer.

Resolution: Implement a memory soft limit on the container so that memory use by operations inside the container stays bounded. Please refer to the README file for best practices and more ways to handle the issue.
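Exit code 137 follows the common convention of 128 plus the signal number, and signal 9 is SIGKILL. The arithmetic can be sketched in a few lines of Python; the helper name `describe_exit_code` is hypothetical and not part of the repo.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

Note that 137 only says the process received SIGKILL. To confirm that the OOM Killer was responsible, check the `OOMKilled` field in the output of `docker inspect` for the container.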

02. CPU Throttling — Container Starved of CPU Causing Latency Spikes

CPU throttling occurs when a container's CPU usage exceeds its assigned quota (set via `cpus` or `cpu_period`/`cpu_quota`). The Linux CFS (Completely Fair Scheduler) throttles the container, causing tasks to queue up. In production this appears as **sudden latency spikes**, **slow API responses**, or **timeouts** even when the host has plenty of idle CPU.

We have implemented a good container and a bad container, with the CPU limit on the bad one being too restrictive. You can clearly see the difference in latency for the same batch operation in both.

The container without CPU bounds responded in 6 milliseconds, while the one with bounds responded in 98 milliseconds.

Resolution: Setting CPU bounds is good practice, but only once you understand the optimum value for the workload. Please refer to the README file for best practices and more ways to handle the issue.
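To judge whether a limit is too tight, it helps to look at how often the scheduler actually throttled the container. The sketch below assumes you have copied the contents of the container's cgroup `cpu.stat` file into a string; the sample numbers are made up for illustration, and the function names are hypothetical.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines such as those in a cgroup cpu.stat file."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: dict[str, int]) -> float:
    """Share of scheduler periods in which the cgroup hit its quota."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


def cfs_quota_us(cpus: float, period_us: int = 100_000) -> int:
    """Quota in microseconds for a given CPU share, assuming the default 100 ms period."""
    return round(cpus * period_us)


# Made-up sample values, for illustration only.
SAMPLE = """usage_usec 5000000
nr_periods 200
nr_throttled 150
throttled_usec 9000000
"""

stats = parse_cpu_stat(SAMPLE)
print(throttled_fraction(stats), cfs_quota_us(0.5))
```

A fraction close to 1 means the container spent most periods waiting for its next quota, which is a hint that the limit is below what the workload needs.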

03. Disk I/O Exhaustion — Container Saturating Storage

A container performing unbounded disk writes (log flooding, bulk data processing, database dumps) can saturate the storage subsystem, causing I/O wait to spike across the entire host. Other containers and host processes become slow or unresponsive. This is a classic "noisy neighbor" problem in shared production environments.

You can see the I/O spike push the response time in the io_hog container up to 1.31 seconds, while the normal container stays at 10 milliseconds.

Resolution: Set up block I/O limits when defining the container so that a spike does not impact guest or host responsiveness. Please refer to the README file for best practices and more ways to handle the issue.
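One simple way to observe the effect yourself is to time a small synced write from inside each container and compare the numbers. This is only a rough probe, not the measurement used in the repo, and the function name `timed_write` is hypothetical.

```python
import os
import tempfile
import time


def timed_write(directory: str, size_bytes: int = 1024 * 1024) -> float:
    """Write size_bytes to a temporary file in directory, fsync it, return elapsed seconds."""
    payload = b"\0" * size_bytes
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        start = time.perf_counter()
        handle.write(payload)
        handle.flush()
        # Force the data to storage so we measure the disk, not the page cache.
        os.fsync(handle.fileno())
        return time.perf_counter() - start


print(f"{timed_write(tempfile.gettempdir()):.4f} s")
```

Run it a few times while the noisy container is active and again while it is stopped; a large gap between the two sets of timings points at storage contention.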

04. DNS Resolution Failure Inside Containers

Containers fail to resolve hostnames — either external domains (e.g., `google.com`) or internal service names (e.g., `db`, `redis`). This causes connection errors, timeouts, and application failures. Common causes include: wrong DNS server configured, `/etc/resolv.conf` issues, broken Docker network DNS, SELinux/firewall blocking DNS traffic, or systemd-resolved conflicts.

The DNS settings are wrongly set in resolv.conf.

Issue 1: In the broken container the external nameserver is set to 192.0.2.1, which is a fake TEST-NET address, while in the working container it is set to 8.8.8.8, which is Google's public DNS server.

Issue 2: The search domain is set to internal.example.com in the broken container. This would have worked if we had a local domain set up with that name, but that is not the case here, so it does not work and the container cannot reach other apps such as the DB. DNS is correct in the working container, so it can reach google.com and the locally deployed DB instance successfully.
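Both problems can be spotted by reading the container's resolv.conf. As a small sketch, the code below parses the `nameserver` and `search` lines and flags nameservers that fall in the documentation-only TEST-NET ranges. The function names are hypothetical, and the sample text mirrors the broken settings described above rather than the exact file in the repo.

```python
import ipaddress

# Documentation-only ranges (TEST-NET-1, -2, -3); nothing real answers there.
TEST_NETS = [
    ipaddress.ip_network(net)
    for net in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]


def parse_resolv_conf(text: str) -> dict[str, list[str]]:
    """Collect nameserver and search entries from resolv.conf-style text."""
    config: dict[str, list[str]] = {"nameserver": [], "search": []}
    for line in text.splitlines():
        parts = line.split("#", 1)[0].split()
        if len(parts) < 2:
            continue
        if parts[0] == "nameserver":
            config["nameserver"].append(parts[1])
        elif parts[0] == "search":
            config["search"] = parts[1:]
    return config


def suspicious_nameservers(config: dict[str, list[str]]) -> list[str]:
    """Return nameservers that are unparsable or inside a TEST-NET range."""
    flagged = []
    for server in config["nameserver"]:
        try:
            address = ipaddress.ip_address(server)
        except ValueError:
            flagged.append(server)
            continue
        if any(address in net for net in TEST_NETS):
            flagged.append(server)
    return flagged


BROKEN = "nameserver 192.0.2.1\nsearch internal.example.com\n"
print(suspicious_nameservers(parse_resolv_conf(BROKEN)))
```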

05. Volume Permission Denied — Container Cannot Write to Mounted Volume

Containers fail with `Permission denied` when accessing bind-mounted directories. On RHEL 10, this is almost always caused by:

1. **SELinux label mismatch** — host directory has wrong security context

2. **UID/GID mismatch** — container process user doesn't match host directory owner

3. **Host directory permissions** — too restrictive (e.g., mode 700)

Step 1: Create the necessary mount points on the host.

Keep in mind that we create the mount points as root: nginx-data-nolabel gives users only read and execute (r-x) permissions, and app-data-root gives full access to root only.

Now run "docker compose up -d" and check the logs.

We run the containers app-uid-mismatch and app-correct with UID 1001, whereas the mount points give that user no write permission. Both therefore get a PERMISSION DENIED error when writing files.

Resolution: Correct the permissions on the mount points so that the container user (UID 1001) can write to them; modes 755 and 700 on root-owned directories do not allow that. Also make sure you did not mount them in the container with `:ro` (read-only) access. It is a best practice to mount volumes with `:z` for shared mounts and `:Z` if you want to isolate a volume for just one container. Please refer to the README file for best practices and more ways to handle the issue.
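The UID side of the problem comes down to the classic owner/group/other permission bits. The sketch below models that check for a directory; it deliberately ignores SELinux labels and ACLs, which are evaluated separately, and the function name `can_write_dir` is hypothetical.

```python
def can_write_dir(mode: int, owner_uid: int, owner_gid: int,
                  uid: int, gids: set[int]) -> bool:
    """Return True if uid may create files in a directory with this mode and owner."""
    if uid == 0:
        return True  # root bypasses the mode bits
    if uid == owner_uid:
        bits = (mode >> 6) & 0o7
    elif owner_gid in gids:
        bits = (mode >> 3) & 0o7
    else:
        bits = mode & 0o7
    # Creating a file needs both write (2) and execute/search (1) on the directory.
    return (bits & 0o3) == 0o3


# Root-owned mount points, container process running as UID 1001.
print(can_write_dir(0o755, 0, 0, 1001, {1001}))
print(can_write_dir(0o700, 0, 0, 1001, {1001}))
# Same mode, but the directory is owned by UID 1001.
print(can_write_dir(0o755, 1001, 1001, 1001, {1001}))
```

The first two calls correspond to the root-created mount points described above, and the third shows the same mode after ownership is handed to the container's UID.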

06. Port Binding Conflicts — Container Fails to Start (Address Already in Use)

A container fails to start because another process or container is already listening on the same host port. This produces the error: `bind: address already in use`. Extremely common in production when deploying multiple versions, after a failed deploy leaves a zombie container, or when system services use the same port.

Run "docker compose up -d" and you will see the `address already in use` error.

Instance 2 failed to start because its host port is already in use by one of the other instances.

Resolution: Make sure you use separate host ports in your container definitions. Please refer to the README file for best practices and more ways to handle the issue.
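Before deploying, you can check whether a host port is already taken by trying to bind it yourself. A minimal sketch, assuming a TCP port on IPv4; the function name `port_is_free` is hypothetical.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Try to bind host:port; a failure usually means something already listens there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use"; can also be a permission
            # error for privileged ports when not running as root.
            return False
    return True


print(port_is_free(8080))
```

Port 8080 is just an example value. The check is only a snapshot: another process can still grab the port between the check and the container start.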

07. Health Check Failures & Container Restart Loops

Containers enter a restart loop when health checks consistently fail. This can happen due to: incorrect health check commands, endpoints not responding, insufficient `start_period` for slow-starting apps, or actual application failures. Restart loops waste resources, generate noise in logs, and can cascade into broader failures.

Run "docker compose up" and observe the output.

We intentionally made the unhealthy_app container return a failure 8 times out of 10, which makes its health check fail. Slow_app starts after a delay of 30 seconds, but its health check is set to start after 5 seconds, so the checks that follow mark it as unhealthy even though the container is healthy.

As you can see, the health check was coming back with 200 for slow_start_app, but in the meantime the container had already been marked as unhealthy.

Resolution: Choose the start period and intervals appropriately, so that your health checks do not kick in before the container is ready to serve. Please refer to the README file for best practices and more ways to handle the issue.
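The interaction between startup delay and start period is easy to reason about with a small simulation. This is a simplified model of the health check state machine, not Docker's actual implementation: the first probe runs one interval after start, failures before the start period ends are ignored, and a fixed number of consecutive counted failures marks the container unhealthy. The 30 second delay and 5 second start period come from the example above; the interval of 5 seconds and 3 retries are assumed values.

```python
def first_health_state(ready_at: int, start_period: int, interval: int,
                       retries: int, horizon: int = 600) -> tuple[str, int]:
    """Return the first state reached after 'starting' and the time it was reached."""
    failures = 0
    t = interval  # first probe runs one interval after container start
    while t <= horizon:
        if t >= ready_at:
            return ("healthy", t)  # the app answers, so the probe succeeds
        if t >= start_period:
            failures += 1  # failures only count once the start period is over
            if failures >= retries:
                return ("unhealthy", t)
        t += interval
    return ("starting", horizon)


# App needs 30 s; the start period is too short in the first call.
print(first_health_state(ready_at=30, start_period=5, interval=5, retries=3))
print(first_health_state(ready_at=30, start_period=40, interval=5, retries=3))
```

Under these assumptions the first configuration is marked unhealthy well before the app is ready, while a start period longer than the startup delay lets the first successful probe mark it healthy.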

08. Docker Socket Exposure — Critical Container Security Vulnerability

Mounting `/var/run/docker.sock` into a container gives that container **full, unrestricted root-level access to the host**. Any process in the container can start new containers, mount host directories, execute commands as root on the host, and escape the container entirely. This is one of the most severe container security misconfigurations in production.

Move to the 08 folder and run "docker compose up -d"

and then follow the commands as shown below.

From the socket_exposed_app container, the guest has unrestricted access to the host's root namespace. It can inspect other containers, mount volumes at will, and perform any malicious action.

Resolution: Be very cautious about socket exposure. Alternatively, use the dind_alternative container, which runs a "Docker inside Docker" image, where even a privileged guest container cannot see the host's containers and environment.

Please refer to the README file for best practices and more ways to handle the issue.
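A quick way to audit for this misconfiguration is to scan the volume entries of your services for the socket path. The sketch below assumes short-syntax volume strings of the form `host_path:container_path[:mode]`; the function name and the sample entries are hypothetical.

```python
# /var/run is commonly a symlink to /run, so both spellings are checked.
SOCKET_PATHS = ("/var/run/docker.sock", "/run/docker.sock")


def socket_mounts(volumes: list[str]) -> list[str]:
    """Return the volume entries whose host side is the Docker socket."""
    flagged = []
    for entry in volumes:
        host_path = entry.split(":", 1)[0].rstrip("/")
        if host_path in SOCKET_PATHS:
            flagged.append(entry)
    return flagged


volumes = [
    "/var/run/docker.sock:/var/run/docker.sock",
    "./data:/data:ro",
]
print(socket_mounts(volumes))
```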

10. Log Flooding — Container Logs Exhausting Host Disk Space

By default, Docker stores container logs in JSON files at `/var/lib/docker/containers/<id>/<id>-json.log` with **no size limit**. A single verbose container can fill the entire host disk, causing all containers to fail, the host to become unresponsive, and data corruption. This is a silent killer in production — systems run fine for weeks then suddenly all fail simultaneously.

Change to the 10 folder, run "docker compose up -d log-flooder", wait for 2 minutes, and then run "docker exec log_flooder df -h /var".

You will see that 41% of the space is already filled with logs. If we do not stop it in time, it will completely fill /var, and that causes problems such as users being unable to log in or use sudo.

Resolution: Set up log rotation options in the container definition. Please refer to the README file for best practices and more ways to handle the issue.
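With the json-file driver, rotation is controlled by the `max-size` and `max-file` options, and the worst-case disk usage per container is roughly their product. The sketch below does that arithmetic. The values `10m` and `3` are example settings, not the ones from the repo, and treating the `k`/`m`/`g` suffixes as decimal multiples is an assumption you can change through the `base` parameter.

```python
def parse_size(text: str, base: int = 1000) -> int:
    """Convert a size such as '10m' to bytes; base=1000 (decimal) is an assumption."""
    units = {"k": base, "m": base ** 2, "g": base ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def max_log_bytes(max_size: str, max_file: int, base: int = 1000) -> int:
    """Upper bound on log disk usage for one container: size per file times file count."""
    return parse_size(max_size, base) * max_file


print(max_log_bytes("10m", 3))
```

Multiply the result by the number of containers on the host to estimate how much space to reserve for logs under /var.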