
Key: QB-4267
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Unassigned
Reporter: Ido Harel
Votes: 0
Watchers: 0

Race condition in Kubernetes NodePort agent connection causes sporadic "Connection refused" errors

Created: 14/Apr/26 11:07 AM   Updated: Tuesday 02:11 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 16.0.7

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown


Description
**Product:** QuickBuild 16.0.0
**Component:** Kubernetes Cloud Plugin (`com.pmease.quickbuild.plugin.cloud.kubernetes`)
**Severity:** High — causes sporadic build agent connection failures in Kubernetes environments

---

## Summary

When using the Kubernetes cloud profile with "Expose Service Via Node Port" enabled, agents sporadically fail to register with the server. The server attempts to ping the agent on the container port (8811) instead of the dynamically assigned NodePort, resulting in `java.net.ConnectException: Connection refused`.

---

## Steps to Reproduce

1. Configure a Kubernetes cloud profile with "Expose Service Via Node Port" enabled
2. Run multiple builds that trigger on-demand agent launches
3. Observe server logs — some agents fail with "Error connecting" on port 8811

---

## Error Log

```
ERROR com.pmease.quickbuild.RemotingSerializerFactory - Error invoking hessian method.
com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://<agent-ip>:8811/service/node'
Caused by: java.net.ConnectException: Connection refused
```

---

## Root Cause Analysis

The issue is a race condition between the `launchNode()` async thread and the agent's connect sequence.

### File: `KubernetesNodeLauncher.java` — method `launchNode()` (line ~316)

Current execution order:

1. Create Pod (`createResource(podDef, custom)`)
2. Wait for pod to get a hostIP (`while (getHostIP(agentName) == null)`)
3. Create NodePort Service (`createResource(serviceDef, custom)`)
4. Wait for nodePort (`while (getNodePort(agentName) == null)`)
5. Return `LaunchResult` with nodePort

### File: `DefaultBuildEngine.java` — method `launchNode()` (line ~1755)

The `launchNode()` call runs **asynchronously** via `Quickbuild.getInstance().getExecutor().execute(...)`. The Token is initially created with `port=0` and only updated after `launchNode()` returns (line ~1783):

```java
token.setPort(result.getPort()); // Only set after LaunchResult returns
TokenManager.instance.save(token);
```

### The Race

The agent pod starts at step 1 and makes its Phase 1 connect call (with `agentNodeId=null`) while steps 3-5 are still in progress. At that point, `token.getPort()` is still `0`.
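The timing above can be reproduced with a minimal, self-contained sketch. The `Token` class, latch, and port values here are illustrative stand-ins, not QuickBuild's actual classes; the point is that the agent's Phase 1 read of `token.getPort()` happens before the async launcher stores the NodePort, so it deterministically observes `0`:

```java
import java.util.concurrent.*;

public class TokenRaceDemo {
    // Illustrative stand-in for QuickBuild's Token: port starts at 0
    static class Token {
        volatile int port = 0;
    }

    public static void main(String[] args) throws Exception {
        Token token = new Token();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch agentConnected = new CountDownLatch(1);

        // Async launcher: stores the NodePort only after the "agent"
        // has already connected, mirroring DefaultBuildEngine.launchNode()
        // updating the token late.
        executor.execute(() -> {
            try {
                agentConnected.await();   // pod starts and connects first
                token.port = 31234;       // NodePort stored too late
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Agent's Phase 1 connect: reads the token before the update lands.
        int accessPort = token.port;      // still 0
        if (accessPort == 0)
            accessPort = 8811;            // fallback to container port
        agentConnected.countDown();
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);

        System.out.println("sessionToken=" + accessPort + ":false");
        // Prints "sessionToken=8811:false" — the unreachable container port
    }
}
```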

### File: `ConnectServlet.java` — method `connect()` (line ~262-270)

The Phase 1 connect handler falls back to the container port when `token.getPort()` is 0:

```java
int accessPort = token.getPort(); // 0 — not yet updated
if (accessPort == 0)
    accessPort = agentPort; // Falls back to 8811 (container port)
result.setSessionToken(String.valueOf(accessPort) + ":" + String.valueOf(accessOverSSL));
```

The agent receives `sessionToken="8811:false"` and uses port 8811 for Phase 2 connect. The server then tries to ping back `http://<host-ip>:8811/service/node` (line ~208), which is the container port — not reachable from outside the Kubernetes cluster. Only the NodePort (e.g., 31234) is externally accessible.

---

## Proposed Fix

In `KubernetesNodeLauncher.java`, reorder `launchNode()` to create the NodePort Service **before** creating the Pod:

1. Create NodePort Service
2. Wait for nodePort allocation
3. Create Pod
4. Wait for hostIP
5. Return `LaunchResult` with nodePort

This is safe because a Kubernetes Service does not require the target Pod to exist at creation time — it simply has no endpoints until the Pod matches the selector and becomes ready. This eliminates the race entirely because the nodePort is known before the agent can possibly start.
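To make the "Service before Pod" claim concrete, here is a hand-written sketch of a NodePort Service of the shape the plugin would create (the label key and port numbers are assumptions, not the plugin's exact template). Kubernetes allocates the `nodePort` at Service creation, even though the selector matches no Pod yet:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ${name}        # same agent name later used for the Pod
spec:
  type: NodePort
  selector:
    app: ${name}       # matches no Pod yet — the Service is still valid
  ports:
  - port: 8811         # agent container port
    targetPort: 8811
    # nodePort is assigned by Kubernetes immediately at creation time,
    # so it is known before the Pod (and the agent) ever starts
```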

### Changed Method (Pseudocode)

```java
public LaunchResult launchNode(String launchData, boolean testLaunch) {
    String agentName = "buildagent-" + UUID.randomUUID().toString();
    createNamespace(getAgentNamespace());

    // ... prepare podDef and custom ...

    try {
        // FIX: Create the Service FIRST so the NodePort is allocated
        // before the agent can possibly start
        String nodePort = null;
        if (getExposeServiceViaNodePort() != null) {
            // ... create serviceDef ...
            createResource(serviceDef, serviceCustom);
            while ((nodePort = getNodePort(agentName)) == null)
                Thread.sleep(1000);
        }

        // THEN create the Pod
        createResource(podDef, custom);
        while (getHostIP(agentName) == null)
            Thread.sleep(1000);

        if (nodePort != null)
            return new LaunchResult(agentName, null, Integer.parseInt(nodePort), false);
        else
            return new LaunchResult(agentName, null, 0, false);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted while launching agent " + agentName, e);
    }
}
```

**Note on `terminateNode()`:** No change needed — it deletes the Service by `nodeInstanceId` (the agentName), which remains the same.

---

## Workaround

Add an init container in the Kubernetes cloud profile Pod Customization to delay agent startup:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  initContainers:
  - name: wait-for-nodeport
    image: busybox
    command: ['sh', '-c', 'sleep 15']
```

> **Note:** This is not a reliable fix — it adds latency to every agent launch and may still race under slow cluster conditions.

---

## Affected Files

| File | Change Required |
|------|----------------|
| `com.pmease.quickbuild.plugin.cloud.kubernetes/.../KubernetesNodeLauncher.java` | **Yes** — reorder Service creation before Pod creation |
| `com.pmease.quickbuild/src/.../DefaultBuildEngine.java` | No — context for understanding the async launch |
| `com.pmease.quickbuild/src/.../ConnectServlet.java` | No — context for the fallback to port 8811 |

There are no comments yet on this issue.