-
Notifications
You must be signed in to change notification settings - Fork 247
fix: better error message in windows CSE when API server can't be contacted #7828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cbe6b64 to
d426ca1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR updates Linux VHD/provisioning artifacts, SIG image version pinning, and CI pipelines (including E2E auth) alongside regenerating snapshot testdata.
Changes:
- Freeze MarinerV2/AzureLinuxV2 SIG image version and update related tests/testdata.
- Add disk queue tuning script/service and corresponding E2E validations.
- Switch CI tasks to
AzureCLI@2and update E2E auth to use Azure CLI credentials.
Reviewed changes
Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/agent/testdata/AKSUbuntu2204+CustomCloud/CustomData | Regenerated snapshot CustomData payloads. |
| pkg/agent/testdata/AKSUbuntu2204+China/CustomData | Regenerated snapshot CustomData payloads. |
| pkg/agent/datamodel/sig_config_test.go | Update expected SIG version constant for frozen V2 images. |
| pkg/agent/datamodel/sig_config.go | Introduce frozen V2 SIG image version constant and apply it to templates. |
| pkg/agent/datamodel/osimageconfig.go | Adjust cloud-to-image mapping (new cloud entry + key fixes). |
| pkg/agent/datamodel/linux_sig_version.json | Bump linux SIG version value to 202512.06.0. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Remove NVIDIA repo cleanup call from GPU cleanup path. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_helpers_ubuntu.sh | Add apt-get clean after successful installs. |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Remove NVIDIA repo cleanup call from GPU cleanup path. |
| parts/linux/cloud-init/artifacts/disk_queue.sh | Add disk queue tuning script selecting Azure root/os device link. |
| parts/linux/cloud-init/artifacts/disk_queue.service | Update ExecStart to run the new disk_queue.sh script. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Remove NVIDIA repo cleanup after GPU driver installation. |
| parts/common/components.json | Update multiple component versions and one download location path. |
| packer.mk | Remove az-login logic; target now only sets subscription. |
| e2e/validators.go | Add validators for aks-log-collector and disk_queue.service. |
| e2e/validation.go | Call new validators from common Linux validation suite. |
| e2e/scenario_test.go | Change secure TLS bootstrapping failure setup to use invalid user-assigned identity. |
| e2e/config/azure.go | Switch to Azure CLI credential for ARM clients. |
| .pipelines/templates/e2e-template.yaml | Run E2E scripts via AzureCLI@2 task. |
| .pipelines/templates/.builder-release-template.yaml | Wrap builder steps with AzureCLI@2. |
| .pipelines/templates/.builder-release-template-windows.yaml | Wrap Windows builder steps with AzureCLI@2 and tweak conditions. |
| .pipelines/templates/.build-and-test-windows-vhds-template.yaml | Run cleanup via AzureCLI@2. |
| .pipelines/scripts/windows_build_vhd.sh | Remove explicit make ... az-login invocation. |
| .pipelines/scripts/windows-sub-cleanup.sh | Remove explicit make ... az-login invocation. |
| .pipelines/scripts/e2e_run.sh | Remove managed identity login step; rely on Azure CLI session. |
| .pipelines/.vsts-vhd-builder-release.yaml | Disable MarinerV2/AzureLinuxV2 builds by default. |
| .pipelines/.vsts-vhd-builder-pr-windows.yaml | Run windows sub cleanup via AzureCLI@2. |
| .pipelines/.vsts-garabge-collection.yaml | Run GC script via AzureCLI@2. |
| .github/workflows/pr-lint.yaml | Trigger PR lint workflow on synchronize events. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
| try | ||
| { | ||
| $tcpClient = New-Object Net.Sockets.TcpClient | ||
| $tcpClient.SendTimeout = $ConnectTimeout*1000 | ||
| $tcpClient.ReceiveTimeout = $ConnectTimeout*1000 | ||
|
|
||
| Write-Log "Retry ${retryString}: Trying to connect to API server $MasterIP" | ||
| $tcpClient.Connect($MasterIP, 443) | ||
| if ($tcpClient.Connected) | ||
| { | ||
| $tcpClient.Close() | ||
| Write-Log "Retry ${retryString}: Connected to API server successfully" | ||
| return | ||
| } | ||
| $tcpClient.Close() | ||
| } catch [System.AggregateException] { | ||
| Write-Log "Retry ${retryString}: Failed to connect to API server $MasterIP. AggregateException: " + $_.Exception.ToString() | ||
| Logs-To-Event -TaskName "AKS.WindowsCSE.CheckAPIServerConnectivity" -TaskMessage "Retry ${retryString}: Failed to connect to API server $MasterIP. AggregateException: " + $_.Exception.ToString() | ||
| $LastException = $_.Exception.ToString() | ||
| } catch { | ||
| Write-Log "Retry ${retryString}: Failed to connect to API server $MasterIP. Error: $_" | ||
| } | ||
| Logs-To-Event -TaskName "AKS.WindowsCSE.CheckAPIServerConnectivity" -TaskMessage "Retry ${retryString}: Failed to connect to API server $MasterIP. Error: $_" | ||
| $LastException = "$_" | ||
| } |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the exception paths, $tcpClient isn't closed/disposed. With up to 60 retries this can transiently leak sockets/handles until GC runs. Add a finally block (or explicit dispose/close in each catch) so the TcpClient is always cleaned up even when Connect() throws.
| } while ($retryCount -lt $MaxRetryCount) | ||
|
|
||
| Set-ExitCode -ExitCode $global:WINDOWS_CSE_ERROR_CHECK_API_SERVER_CONNECTIVITY -ErrorMessage "Failed to connect to API server $MasterIP after $retryCount retries" | ||
| Set-ExitCode -ExitCode $global:WINDOWS_CSE_ERROR_CHECK_API_SERVER_CONNECTIVITY -ErrorMessage "Failed to connect to API server $MasterIP after $retryCount retries. Last exception: $LastException" |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Including $LastException via ToString() in the exit -ErrorMessage can introduce newlines/stack traces, which can break the single-line ExitCode: |...| completion-file format and will be aggressively truncated anyway. Prefer storing a one-line value (e.g., Exception.Message / innermost message) and normalize whitespace (replace CR/LF with spaces) before appending it.
| Set-ExitCode -ExitCode $global:WINDOWS_CSE_ERROR_CHECK_API_SERVER_CONNECTIVITY -ErrorMessage "Failed to connect to API server $MasterIP after $retryCount retries. Last exception: $LastException" | |
| $lastExceptionMessage = "" | |
| if ($LastException -ne $null) { | |
| if ($LastException.Exception -ne $null -and -not [string]::IsNullOrEmpty($LastException.Exception.Message)) { | |
| $lastExceptionMessage = $LastException.Exception.Message | |
| } else { | |
| $lastExceptionMessage = $LastException.ToString() | |
| } | |
| # Normalize any CR/LF in the exception message to spaces to keep ErrorMessage single-line. | |
| $lastExceptionMessage = $lastExceptionMessage -replace "(`r|`n)+", " " | |
| } | |
| Set-ExitCode -ExitCode $global:WINDOWS_CSE_ERROR_CHECK_API_SERVER_CONNECTIVITY -ErrorMessage "Failed to connect to API server $MasterIP after $retryCount retries. Last exception: $lastExceptionMessage" |
| Logs-To-Event -TaskName "AKS.WindowsCSE.CheckAPIServerConnectivity" -TaskMessage "Retry ${retryString}: Failed to connect to API server $MasterIP. AggregateException: " + $_.Exception.ToString() | ||
| $LastException = $_.Exception.ToString() | ||
| } catch { | ||
| Write-Log "Retry ${retryString}: Failed to connect to API server $MasterIP. Error: $_" | ||
| } | ||
| Logs-To-Event -TaskName "AKS.WindowsCSE.CheckAPIServerConnectivity" -TaskMessage "Retry ${retryString}: Failed to connect to API server $MasterIP. Error: $_" | ||
| $LastException = "$_" |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logs-To-Event writes a per-call JSON file (see parts/windows/windowscsehelper.ps1), so logging every failed retry here can create up to $MaxRetryCount extra files and add noise/IO during bootstrap. Consider keeping per-retry details in Write-Log and emitting a single Logs-To-Event entry on the final failure (or rate-limit to first/last attempt only).
What this PR does / why we need it:
Better error message when API server can't be contacted
Which issue(s) this PR fixes:
Fixes #