**Product:** QuickBuild 16.0.0
**Component:** Kubernetes Cloud Plugin (`com.pmease.quickbuild.plugin.cloud.kubernetes`)
**Severity:** High — causes sporadic build agent connection failures in Kubernetes environments
---
## Summary
When using the Kubernetes cloud profile with "Expose Service Via Node Port" enabled, agents sporadically fail to register with the server. The server attempts to ping the agent on the container port (8811) instead of the dynamically assigned NodePort, resulting in `java.net.ConnectException: Connection refused`.
---
## Steps to Reproduce
1. Configure a Kubernetes cloud profile with "Expose Service Via Node Port" enabled
2. Run multiple builds that trigger on-demand agent launches
3. Observe server logs — some agents fail with "Error connecting" on port 8811
---
## Error Log
```
ERROR com.pmease.quickbuild.RemotingSerializerFactory - Error invoking hessian method.
com.caucho.hessian.client.HessianRuntimeException: Error connecting 'http://<agent-ip>:8811/service/node'
Caused by: java.net.ConnectException: Connection refused
```
---
## Root Cause Analysis
The issue is a race condition between the `launchNode()` async thread and the agent's connect sequence.
### File: `KubernetesNodeLauncher.java` — method `launchNode()` (line ~316)
Current execution order:
1. Create Pod (`createResource(podDef, custom)`)
2. Wait for pod to get a hostIP (`while (getHostIP(agentName) == null)`)
3. Create NodePort Service (`createResource(serviceDef, custom)`)
4. Wait for nodePort (`while (getNodePort(agentName) == null)`)
5. Return `LaunchResult` with nodePort
### File: `DefaultBuildEngine.java` — method `launchNode()` (line ~1755)
The `launchNode()` call runs **asynchronously** via `Quickbuild.getInstance().getExecutor().execute(...)`. The Token is initially created with `port=0` and only updated after `launchNode()` returns (line ~1783):
```java
token.setPort(result.getPort()); // Only set after LaunchResult returns
TokenManager.instance.save(token);
```
### The Race
The agent pod starts at step 1 and makes its Phase 1 connect call (with `agentNodeId=null`) while steps 3-5 are still in progress. At that point, `token.getPort()` is still `0`.
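The interleaving can be forced deterministically in a small model. This is a sketch only, not QuickBuild code: `TokenRaceSketch`, `observePorts`, and the latches are hypothetical names, and an `AtomicInteger` stands in for the token's port field. The background task plays the role of the async `launchNode()` call and is held back until the simulated Phase 1 connect has read the port.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class TokenRaceSketch {
    /** Returns {port seen by an early Phase 1 connect, port seen after launch completes}. */
    static int[] observePorts(int nodePort) throws InterruptedException {
        AtomicInteger tokenPort = new AtomicInteger(0);        // token is created with port=0
        CountDownLatch agentConnected = new CountDownLatch(1);
        CountDownLatch launchDone = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the asynchronous launchNode() call.
            executor.execute(() -> {
                try {
                    agentConnected.await();                    // models the Service/NodePort wait
                    tokenPort.set(nodePort);                   // models token.setPort(result.getPort())
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    launchDone.countDown();
                }
            });
            int early = tokenPort.get();                       // Phase 1 connect during the launch
            agentConnected.countDown();
            launchDone.await();
            int late = tokenPort.get();                        // a connect after the token was saved
            return new int[] {early, late};
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] ports = observePorts(31234);
        System.out.println("early=" + ports[0] + " late=" + ports[1]);
    }
}
```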
### File: `ConnectServlet.java` — method `connect()` (line ~262-270)
The Phase 1 connect handler falls back to the container port when `token.getPort()` is 0:
```java
int accessPort = token.getPort(); // 0 — not yet updated
if (accessPort == 0)
    accessPort = agentPort; // Falls back to 8811 (container port)
result.setSessionToken(String.valueOf(accessPort) + ":" + String.valueOf(accessOverSSL));
```
The agent receives `sessionToken="8811:false"` and uses port 8811 for Phase 2 connect. The server then tries to ping back `http://<host-ip>:8811/service/node` (line ~208), which is the container port and is not reachable from outside the Kubernetes cluster. Only the NodePort (e.g., 31234) is externally accessible.
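The effect of the fallback on the session token can be reproduced in isolation. The following is a minimal sketch that assumes only the logic quoted above; the class and method names (`AccessPortFallbackSketch`, `resolveAccessPort`, `sessionToken`) are hypothetical and do not exist in QuickBuild.

```java
class AccessPortFallbackSketch {
    /** Mirrors the quoted fallback: a token port of 0 is treated as "not set yet". */
    static int resolveAccessPort(int tokenPort, int agentPort) {
        return tokenPort != 0 ? tokenPort : agentPort;
    }

    /** Builds the "port:ssl" string that the agent receives as its session token. */
    static String sessionToken(int tokenPort, int agentPort, boolean accessOverSSL) {
        return resolveAccessPort(tokenPort, agentPort) + ":" + accessOverSSL;
    }

    public static void main(String[] args) {
        int containerPort = 8811;
        // Phase 1 connect arrives before launchNode() has returned: token port is still 0.
        System.out.println(sessionToken(0, containerPort, false));      // prints 8811:false
        // Phase 1 connect arrives after the token was saved with the NodePort.
        System.out.println(sessionToken(31234, containerPort, false));  // prints 31234:false
    }
}
```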
---
## Proposed Fix
In `KubernetesNodeLauncher.java`, reorder `launchNode()` to create the NodePort Service **before** creating the Pod:
1. Create NodePort Service
2. Wait for nodePort allocation
3. Create Pod
4. Wait for hostIP
5. Return `LaunchResult` with nodePort
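This is safe because a Kubernetes Service does not require the target Pod to exist at creation time: it simply has no endpoints until a Pod matches the selector and becomes ready. With this order, the nodePort is already allocated before the agent can possibly start, so the Service wait no longer overlaps with agent startup. The token port is still saved only after `launchNode()` returns, so the remaining window is limited to the hostIP wait in step 4.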
### Changed Method (Pseudocode)
```java
public LaunchResult launchNode(String launchData, boolean testLaunch) {
    String agentName = "buildagent-" + UUID.randomUUID().toString();
    createNamespace(getAgentNamespace());
    // ... prepare podDef and custom ...

    // FIX: Create Service FIRST to get NodePort before agent starts
    String nodePort = null;
    if (getExposeServiceViaNodePort() != null) {
        // ... create serviceDef ...
        createResource(serviceDef, serviceCustom);
        while ((nodePort = getNodePort(agentName)) == null) {
            Thread.sleep(1000);
        }
    }

    // THEN create the Pod
    createResource(podDef, custom);
    while (getHostIP(agentName) == null) {
        Thread.sleep(1000);
    }

    if (nodePort != null) {
        return new LaunchResult(agentName, null, Integer.parseInt(nodePort), false);
    } else {
        return new LaunchResult(agentName, null, 0, false);
    }
}
```
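The two wait loops in the pseudocode above have no upper bound, so a Service that never receives a nodePort would block the launch thread indefinitely. One possible way to bound them is sketched below. This is an illustration, not part of the proposed patch: `BoundedPollSketch` and `pollUntilPresent` are hypothetical names, and the fake probe in `main` merely stands in for `getNodePort(agentName)`.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class BoundedPollSketch {
    /**
     * Calls the probe until it returns a non-null value, sleeping between attempts.
     * Throws TimeoutException once the timeout has elapsed without a value.
     */
    static <T> T pollUntilPresent(Supplier<T> probe, Duration timeout, Duration interval)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = probe.get();
            if (value != null) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("no value within " + timeout);
            }
            Thread.sleep(interval.toMillis());
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fake probe standing in for getNodePort(agentName): null twice, then a port.
        String nodePort = pollUntilPresent(
                () -> ++calls[0] < 3 ? null : "31234",
                Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(nodePort + " after " + calls[0] + " probes");
    }
}
```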
**Note on `terminateNode()`:** No change needed — it deletes the Service by `nodeInstanceId` (the agentName), which remains the same.
---
## Workaround
Add an init container in the Kubernetes cloud profile Pod Customization to delay agent startup:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  initContainers:
    - name: wait-for-nodeport
      image: busybox
      command: ['sh', '-c', 'sleep 15']
> **Note:** This is not a reliable fix — it adds latency to every agent launch and may still race under slow cluster conditions.
---
## Affected Files
| File | Change Required |
|------|----------------|
| `com.pmease.quickbuild.plugin.cloud.kubernetes/.../KubernetesNodeLauncher.java` | **Yes** — reorder Service creation before Pod creation |
| `com.pmease.quickbuild/src/.../DefaultBuildEngine.java` | No — context for understanding the async launch |
| `com.pmease.quickbuild/src/.../ConnectServlet.java` | No — context for the fallback to port 8811 |