Skip to content

Conversation

@timmy-wright
Copy link
Contributor

What this PR does / why we need it:

Better error message when API server can't be contacted

Which issue(s) this PR fixes:

Fixes #

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Linux VHD/provisioning artifacts, SIG image version pinning, and CI pipelines (including E2E auth) alongside regenerating snapshot testdata.

Changes:

  • Freeze MarinerV2/AzureLinuxV2 SIG image version and update related tests/testdata.
  • Add disk queue tuning script/service and corresponding E2E validations.
  • Switch CI tasks to AzureCLI@2 and update E2E auth to use Azure CLI credentials.

Reviewed changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
pkg/agent/testdata/AKSUbuntu2204+CustomCloud/CustomData Regenerated snapshot CustomData payloads.
pkg/agent/testdata/AKSUbuntu2204+China/CustomData Regenerated snapshot CustomData payloads.
pkg/agent/datamodel/sig_config_test.go Update expected SIG version constant for frozen V2 images.
pkg/agent/datamodel/sig_config.go Introduce frozen V2 SIG image version constant and apply it to templates.
pkg/agent/datamodel/osimageconfig.go Adjust cloud-to-image mapping (new cloud entry + key fixes).
pkg/agent/datamodel/linux_sig_version.json Bump linux SIG version value to 202512.06.0.
parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh Remove NVIDIA repo cleanup call from GPU cleanup path.
parts/linux/cloud-init/artifacts/ubuntu/cse_helpers_ubuntu.sh Add apt-get clean after successful installs.
parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh Remove NVIDIA repo cleanup call from GPU cleanup path.
parts/linux/cloud-init/artifacts/disk_queue.sh Add disk queue tuning script selecting Azure root/os device link.
parts/linux/cloud-init/artifacts/disk_queue.service Update ExecStart to run the new disk_queue.sh script.
parts/linux/cloud-init/artifacts/cse_config.sh Remove NVIDIA repo cleanup after GPU driver installation.
parts/common/components.json Update multiple component versions and one download location path.
packer.mk Remove az-login logic; target now only sets subscription.
e2e/validators.go Add validators for aks-log-collector and disk_queue.service.
e2e/validation.go Call new validators from common Linux validation suite.
e2e/scenario_test.go Change secure TLS bootstrapping failure setup to use invalid user-assigned identity.
e2e/config/azure.go Switch to Azure CLI credential for ARM clients.
.pipelines/templates/e2e-template.yaml Run E2E scripts via AzureCLI@2 task.
.pipelines/templates/.builder-release-template.yaml Wrap builder steps with AzureCLI@2.
.pipelines/templates/.builder-release-template-windows.yaml Wrap Windows builder steps with AzureCLI@2 and tweak conditions.
.pipelines/templates/.build-and-test-windows-vhds-template.yaml Run cleanup via AzureCLI@2.
.pipelines/scripts/windows_build_vhd.sh Remove explicit make ... az-login invocation.
.pipelines/scripts/windows-sub-cleanup.sh Remove explicit make ... az-login invocation.
.pipelines/scripts/e2e_run.sh Remove managed identity login step; rely on Azure CLI session.
.pipelines/.vsts-vhd-builder-release.yaml Disable MarinerV2/AzureLinuxV2 builds by default.
.pipelines/.vsts-vhd-builder-pr-windows.yaml Run windows sub cleanup via AzureCLI@2.
.pipelines/.vsts-garabge-collection.yaml Run GC script via AzureCLI@2.
.github/workflows/pr-lint.yaml Trigger PR lint workflow on synchronize events.

Copilot AI review requested due to automatic review settings February 9, 2026 03:48
@timmy-wright timmy-wright changed the title Timmy/better error fix: better error message in windows CSE when API server can't be contacted Feb 9, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment on lines +189 to +210
try
{
$tcpClient = New-Object Net.Sockets.TcpClient
$tcpClient.SendTimeout = $ConnectTimeout*1000
$tcpClient.ReceiveTimeout = $ConnectTimeout*1000

Write-Log "Retry ${retryString}: Trying to connect to API server $MasterIP"
$tcpClient.Connect($MasterIP, 443)
if ($tcpClient.Connected)
{
$tcpClient.Close()
Write-Log "Retry ${retryString}: Connected to API server successfully"
return
}
$tcpClient.Close()
} catch [System.AggregateException] {
Write-Log "Retry ${retryString}: Failed to connect to API server $MasterIP. AggregateException: " + $_.Exception.ToString()
Logs-To-Event -TaskName "AKS.WindowsCSE.CheckAPIServerConnectivity" -TaskMessage "Retry ${retryString}: Failed to connect to API server $MasterIP. AggregateException: " + $_.Exception.ToString()
$LastException = $_.Exception.ToString()
} catch {
Write-Log "Retry ${retryString}: Failed to connect to API server $MasterIP. Error: $_"
}
Logs-To-Event -TaskName "AKS.WindowsCSE.CheckAPIServerConnectivity" -TaskMessage "Retry ${retryString}: Failed to connect to API server $MasterIP. Error: $_"
$LastException = "$_"
}
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the exception paths, $tcpClient isn't closed/disposed. With up to 60 retries this can transiently leak sockets/handles until GC runs. Add a finally block (or explicit dispose/close in each catch) so the TcpClient is always cleaned up even when Connect() throws.

Copilot uses AI. Check for mistakes.
} while ($retryCount -lt $MaxRetryCount)

Set-ExitCode -ExitCode $global:WINDOWS_CSE_ERROR_CHECK_API_SERVER_CONNECTIVITY -ErrorMessage "Failed to connect to API server $MasterIP after $retryCount retries"
Set-ExitCode -ExitCode $global:WINDOWS_CSE_ERROR_CHECK_API_SERVER_CONNECTIVITY -ErrorMessage "Failed to connect to API server $MasterIP after $retryCount retries. Last exception: $LastException"
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including $LastException via ToString() in the exit -ErrorMessage can introduce newlines/stack traces, which can break the single-line ExitCode: |...| completion-file format and will be aggressively truncated anyway. Prefer storing a one-line value (e.g., Exception.Message / innermost message) and normalize whitespace (replace CR/LF with spaces) before appending it.

Suggested change
Set-ExitCode -ExitCode $global:WINDOWS_CSE_ERROR_CHECK_API_SERVER_CONNECTIVITY -ErrorMessage "Failed to connect to API server $MasterIP after $retryCount retries. Last exception: $LastException"
$lastExceptionMessage = ""
if ($LastException -ne $null) {
if ($LastException.Exception -ne $null -and -not [string]::IsNullOrEmpty($LastException.Exception.Message)) {
$lastExceptionMessage = $LastException.Exception.Message
} else {
$lastExceptionMessage = $LastException.ToString()
}
# Normalize any CR/LF in the exception message to spaces to keep ErrorMessage single-line.
$lastExceptionMessage = $lastExceptionMessage -replace "(`r|`n)+", " "
}
Set-ExitCode -ExitCode $global:WINDOWS_CSE_ERROR_CHECK_API_SERVER_CONNECTIVITY -ErrorMessage "Failed to connect to API server $MasterIP after $retryCount retries. Last exception: $lastExceptionMessage"

Copilot uses AI. Check for mistakes.
Comment on lines +205 to +209
Logs-To-Event -TaskName "AKS.WindowsCSE.CheckAPIServerConnectivity" -TaskMessage "Retry ${retryString}: Failed to connect to API server $MasterIP. AggregateException: " + $_.Exception.ToString()
$LastException = $_.Exception.ToString()
} catch {
Write-Log "Retry ${retryString}: Failed to connect to API server $MasterIP. Error: $_"
}
Logs-To-Event -TaskName "AKS.WindowsCSE.CheckAPIServerConnectivity" -TaskMessage "Retry ${retryString}: Failed to connect to API server $MasterIP. Error: $_"
$LastException = "$_"
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logs-To-Event writes a per-call JSON file (see parts/windows/windowscsehelper.ps1), so logging every failed retry here can create up to $MaxRetryCount extra files and add noise/IO during bootstrap. Consider keeping per-retry details in Write-Log and emitting a single Logs-To-Event entry on the final failure (or rate-limit to first/last attempt only).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant