[00:01:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86686 and previous config saved to /var/cache/conftool/dbconfig/20251217-000109-marostegui.json [00:01:16] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:01:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:16:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P86687 and previous config saved to /var/cache/conftool/dbconfig/20251217-001617-marostegui.json [00:17:54] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [00:18:01] rolling some envoy updates, staging only [00:18:18] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [00:20:07] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [00:20:27] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [00:20:38] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [00:20:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:20:53] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [00:21:12] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [00:22:15] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [00:22:25] !log rzl@deploy2002 helmfile [staging] 
DONE helmfile.d/services/commons-impact-analytics: apply [00:22:32] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [00:22:54] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [00:23:22] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [00:23:34] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [00:23:44] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [00:23:54] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [00:24:20] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [00:24:30] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [00:24:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [00:24:48] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [00:25:24] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [00:25:37] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [00:25:43] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [00:25:55] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [00:26:15] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [00:26:27] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [00:26:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [00:26:49] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [00:27:01] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [00:27:12] !log rzl@deploy2002 helmfile [staging] 
DONE helmfile.d/services/eventgate-logging-external: apply [00:27:18] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [00:27:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [00:27:54] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [00:28:06] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [00:28:17] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [00:28:36] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [00:28:50] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [00:29:02] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [00:29:28] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [00:29:37] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [00:30:11] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [00:30:37] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [00:30:44] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [00:31:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P86688 and previous config saved to /var/cache/conftool/dbconfig/20251217-003126-marostegui.json [00:31:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [00:32:34] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [00:33:50] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [00:34:50] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [00:37:30] !log rzl@deploy2002 
helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [00:37:42] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply [00:38:09] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [00:38:29] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [00:39:25] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [00:39:33] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [00:39:47] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [00:40:10] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860 [00:40:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860 (owner: TrainBranchBot) [00:41:36] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [00:41:42] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [00:42:16] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [00:42:43] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [00:42:50] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [00:43:01] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [00:43:09] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [00:43:19] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [00:43:33] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [00:43:39] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [00:43:45] 
!log rzl@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [00:43:55] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [00:45:28] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [00:45:39] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [00:45:55] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [00:46:07] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [00:46:20] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [00:46:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86689 and previous config saved to /var/cache/conftool/dbconfig/20251217-004634-marostegui.json [00:46:36] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [00:46:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [00:46:40] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [00:46:45] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [00:46:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2220.codfw.wmnet with reason: Maintenance [00:46:57] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [00:47:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2220 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86690 and previous config saved to /var/cache/conftool/dbconfig/20251217-004659-marostegui.json [00:48:27] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [00:48:45] !log rzl@deploy2002 
helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [00:48:56] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [00:49:15] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [00:49:25] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [00:49:53] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [00:49:58] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [00:50:31] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [00:50:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [00:50:50] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [00:52:57] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218860 (owner: TrainBranchBot) [00:56:19] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [00:56:30] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [00:56:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [00:56:52] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [00:57:03] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [00:57:21] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [00:57:30] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:58:06] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [00:58:12] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [00:58:31] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [01:01:03] !log 
mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862 [01:10:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862 (owner: TrainBranchBot) [01:25:14] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 24m 10s) [01:34:32] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218862 (owner: TrainBranchBot) [01:44:06] SRE, Maps, Traffic: Allow Wikimedia Maps usage on Wikidata for Firefox (Browser extension) - https://phabricator.wikimedia.org/T398588#11466958 (Aklapper) Open→Declined [01:48:05] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11466974 (Papaul) a:Papaul→ayounsi @ayounsi assigned back to you since you are working on it. 
thanks [01:55:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86691 and previous config saved to /var/cache/conftool/dbconfig/20251217-015538-marostegui.json [01:55:44] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [01:55:45] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:10:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P86692 and previous config saved to /var/cache/conftool/dbconfig/20251217-021046-marostegui.json [02:13:10] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86693 and previous config saved to /var/cache/conftool/dbconfig/20251217-021310-ladsgroup.json [02:13:14] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [02:25:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P86694 and previous config saved to /var/cache/conftool/dbconfig/20251217-022554-marostegui.json [02:28:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86695 and previous config saved to /var/cache/conftool/dbconfig/20251217-022818-ladsgroup.json [02:30:24] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:41:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86696 and previous config 
saved to /var/cache/conftool/dbconfig/20251217-024103-marostegui.json [02:41:09] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:41:09] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:41:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance [02:41:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86697 and previous config saved to /var/cache/conftool/dbconfig/20251217-024127-marostegui.json [02:43:27] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P86698 and previous config saved to /var/cache/conftool/dbconfig/20251217-024326-ladsgroup.json [02:58:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86699 and previous config saved to /var/cache/conftool/dbconfig/20251217-025835-ladsgroup.json [02:58:40] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [02:58:52] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance [02:59:01] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2227 (T410589)', diff saved to https://phabricator.wikimedia.org/P86700 and previous config saved to /var/cache/conftool/dbconfig/20251217-025900-ladsgroup.json [03:41:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86701 and previous config saved to /var/cache/conftool/dbconfig/20251217-034143-marostegui.json [03:41:50] T411163: Drop ar_sha1 from archive table in wmf 
production - https://phabricator.wikimedia.org/T411163 [03:41:50] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [03:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:49:20] (CR) Dzahn: [C:-2] "this can go last after everything else, cleanup-only and it needs a typo fix" [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [03:54:19] PROBLEM - Host lsw1-e2-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [03:54:52] that is me [03:55:12] evening papaul :) thanks [03:55:31] rzl: hello [03:56:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P86702 and previous config saved to /var/cache/conftool/dbconfig/20251217-035651-marostegui.json [04:02:23] FIRING: GnmiTargetDown: lsw1-e2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [04:04:33] RECOVERY - Host lsw1-e2-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.67 ms [04:07:22] RESOLVED: GnmiTargetDown: lsw1-e2-codfw is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [04:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:12:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P86703 and previous config saved to /var/cache/conftool/dbconfig/20251217-041200-marostegui.json [04:17:26] (CR) Dzahn: ats: gerrit: don't validate TLS host for now (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1215684 (https://phabricator.wikimedia.org/T411895) (owner: CDanis) [04:27:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86704 and previous config saved to /var/cache/conftool/dbconfig/20251217-042708-marostegui.json [04:27:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:27:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:27:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1253.eqiad.wmnet with reason: Maintenance [04:27:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86705 and previous config saved to /var/cache/conftool/dbconfig/20251217-042733-marostegui.json [04:29:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86706 and previous config saved to /var/cache/conftool/dbconfig/20251217-042943-marostegui.json [04:44:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P86707 and previous config saved to /var/cache/conftool/dbconfig/20251217-044453-marostegui.json [04:51:29] SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts 
firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11467201 (Papaul) I took a quick look at this before getting the support ticket going on. On lsw1-e2-codfw we have ` Frame length statistics for m... [04:55:41] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 562521992 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:59:41] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:00:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253', diff saved to https://phabricator.wikimedia.org/P86708 and previous config saved to /var/cache/conftool/dbconfig/20251217-050001-marostegui.json [05:01:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [05:01:51] FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [05:01:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:02:07] PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:02:09] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [05:02:48] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:58] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [05:02:59] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Swift [05:02:59] RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.206 second response time https://wikitech.wikimedia.org/wiki/Swift [05:04:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [05:04:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [05:04:13] !incidents [05:04:14] 7196 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [05:04:14] 7197 (UNACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [05:04:14] 7198 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [05:04:14] 7199 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [05:04:15] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [05:04:24] !ack 7196 [05:04:24] 7196 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [05:04:28] !ack 7197 [05:04:29] 7197 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [05:04:33] !ack 7198 [05:04:34] 7198 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [05:04:37] !ack 7199 [05:04:37] 7199 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [05:06:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [05:06:51] RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - 
cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [05:06:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:08:18] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467204 (Marostegui) [05:08:25] !incidents [05:08:25] 7199 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [05:08:25] 7200 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [05:08:25] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [05:08:26] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [05:08:26] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [05:08:26] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [05:08:32] !ack 7200 [05:08:32] 7200 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [05:09:11] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [05:09:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [05:11:32] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467205 (Marostegui) [05:12:18] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875#11467206 (Marostegui) p:Triage→Medium [05:15:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1253 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86709 and previous config saved to /var/cache/conftool/dbconfig/20251217-051509-marostegui.json [05:15:15] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [05:15:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:15:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:17:25] (PS5) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [05:21:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T411163 T411164)', diff 
saved to https://phabricator.wikimedia.org/P86710 and previous config saved to /var/cache/conftool/dbconfig/20251217-052117-marostegui.json [05:21:23] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [05:21:23] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [05:24:32] (PS6) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [05:24:57] SRE, LDAP-Access-Requests, Data-Platform-SRE (2025.11.07 - 2025.11.28): Access Admin menu in Airflow - https://phabricator.wikimedia.org/T412119#11467222 (Marostegui) Open→Resolved I believe this is all done - please reopen if not. Thanks Ben for handling this. [05:25:20] !incidents [05:25:20] 7200 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [05:25:20] 7199 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [05:25:21] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [05:25:21] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [05:25:21] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [05:25:21] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [05:25:23] (PS7) Pppery: Handle languages with nonstandard plural rules [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1217845 (https://phabricator.wikimedia.org/T412422) [05:27:57] (PS1) Marostegui: es2028: Add note [puppet] - https://gerrit.wikimedia.org/r/1218872 [05:29:00] (CR) 
10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1218872 (owner: 10Marostegui) [05:29:01] (03CR) 10Marostegui: [C:03+2] es2028: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1218872 (owner: 10Marostegui) [05:30:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2222.codfw.wmnet with reason: schema change [05:33:24] (03PS4) 10Pppery: Add an internal translation file for this repo's own strings [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1217873 (https://phabricator.wikimedia.org/T412651) [05:34:11] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:36:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P86711 and previous config saved to /var/cache/conftool/dbconfig/20251217-053625-marostegui.json [05:51:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P86712 and previous config saved to /var/cache/conftool/dbconfig/20251217-055133-marostegui.json [06:06:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86713 and previous config saved to /var/cache/conftool/dbconfig/20251217-060641-marostegui.json [06:06:48] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [06:06:48] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [06:06:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2221.codfw.wmnet with reason: Maintenance [06:07:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86714 and previous config saved to /var/cache/conftool/dbconfig/20251217-060706-marostegui.json [06:07:45] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:07:59] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.082 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.088 second response time https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.156 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:07:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:00] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.212 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:01] PROBLEM - Swift https backend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:01] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:01] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:02] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1017.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:08:02] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.103 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:03] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1013.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1020.eqiad.wmnet, ms-fe1009.eqiad.wmnet, ms-fe1012.eqiad.wmnet, ms-fe1019.eqiad.wmnet, ms-fe1016.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:08:07] PROBLEM - Swift 
https frontend on ms-fe1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:07] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:07] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:07] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:07] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:09] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:09] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:08:57] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:57] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:57] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.149 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:59] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Swift [06:08:59] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Swift [06:09:05] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes 
in 6.278 second response time https://wikitech.wikimedia.org/wiki/Swift [06:09:11] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:12] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [06:09:13] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [06:09:57] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Swift [06:09:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Swift [06:10:01] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [06:11:03] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.246 second response time https://wikitech.wikimedia.org/wiki/Swift [06:11:59] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.093 second response time https://wikitech.wikimedia.org/wiki/Swift [06:11:59] PROBLEM - Swift https backend on ms-fe1012 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.105 second response time https://wikitech.wikimedia.org/wiki/Swift [06:12:05] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.555 second response time https://wikitech.wikimedia.org/wiki/Swift [06:12:51] FIRING: [2x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:12:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift [06:12:59] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.580 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:01] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.089 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:05] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.473 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:05] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.914 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:05] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.131 second response time https://wikitech.wikimedia.org/wiki/Swift [06:13:07] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Swift [06:14:01] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Swift [06:14:05] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.509 second response time https://wikitech.wikimedia.org/wiki/Swift [06:14:11] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:14:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:14:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2018.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:14:35] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:14:37] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:14:43] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:14:57] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Swift [06:14:59] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift 
[06:14:59] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:01] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.173 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:01] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.257 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:11] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:15:35] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.610 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:35] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:15:35] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:15:37] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:15:59] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:59] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.080 second response time https://wikitech.wikimedia.org/wiki/Swift [06:15:59] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:01] PROBLEM 
- Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:01] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:07] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.886 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:11] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:16:25] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:16:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:16:27] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.835 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:35] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.189 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:51] FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [06:16:59] PROBLEM - Swift https frontend on 
ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.093 second response time https://wikitech.wikimedia.org/wiki/Swift [06:16:59] PROBLEM - Swift https backend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:03] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.611 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:07] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 5.713 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:27] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:29] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.720 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:35] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.715 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:35] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.800 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:51] RESOLVED: [5x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:17:57] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:17:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.061 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:59] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.518 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:01] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.080 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:01] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.189 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:01] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.191 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:03] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.694 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:03] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.548 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:07] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.571 second response time https://wikitech.wikimedia.org/wiki/Swift [06:18:11] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:18:24] !incidents [06:18:25] 7201 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:18:25] 7202 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:18:25] 7203 (UNACKED) HaproxyUnavailable cache_upload global sre 
(thanos-rule@main) [06:18:25] 7205 (UNACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:18:26] 7204 (RESOLVED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) [06:18:26] 7200 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:18:26] 7199 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:18:26] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:18:26] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:18:27] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [06:18:27] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [06:18:42] !ack 7205 [06:18:43] 7205 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:18:49] !ack 7203 [06:18:49] 7203 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:18:51] !ack 7202 [06:18:52] 7202 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:18:53] !ack 7201 [06:18:54] 7201 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:18:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:03] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.176 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:05] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP 
OK: HTTP/1.1 200 OK - 297 bytes in 4.495 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:05] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.998 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:11] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.864 second response time https://wikitech.wikimedia.org/wiki/Swift [06:19:11] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:19:11] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:19:35] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:19:35] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:20:01] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:05] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.132 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:11] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:20:24] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:20:25] RECOVERY - Swift https backend on 
ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:25] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:43] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:20:59] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:01] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:01] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:01] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.181 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:01] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:33] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.778 second response time https://wikitech.wikimedia.org/wiki/Swift [06:21:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in drmrs #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:21:57] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift [06:22:05] RECOVERY - Swift https backend 
on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 5.789 second response time https://wikitech.wikimedia.org/wiki/Swift [06:23:59] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.158 second response time https://wikitech.wikimedia.org/wiki/Swift [06:24:01] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.484 second response time https://wikitech.wikimedia.org/wiki/Swift [06:24:01] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.523 second response time https://wikitech.wikimedia.org/wiki/Swift [06:24:05] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 5.819 second response time https://wikitech.wikimedia.org/wiki/Swift [06:24:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.068 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.190 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:01] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.246 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:25:05] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.392 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:11] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:25:11] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:25:24] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:25:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:25:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2012.codfw.wmnet, ms-fe2018.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet, ms-fe2015.codfw.wmnet, ms-fe2016.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:25:27] PROBLEM - Swift https backend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.214 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:27] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 
bytes in 1.303 second response time https://wikitech.wikimedia.org/wiki/Swift [06:25:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:26:01] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.198 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:03] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.157 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:11] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 9.374 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:27] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.217 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:29] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.263 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:59] PROBLEM - Swift https frontend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:59] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/Swift [06:26:59] PROBLEM - Swift https backend on ms-fe1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:01] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: 
HTTP/1.1 200 OK - 506 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:01] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.388 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:01] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.192 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:05] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.945 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:07] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.116 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:11] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:27:27] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.713 second response time https://wikitech.wikimedia.org/wiki/Swift [06:27:31] RECOVERY - Swift https backend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.561 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:01] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.139 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:07] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.291 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:25] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[06:28:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:28:29] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.686 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:33] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.184 second response time https://wikitech.wikimedia.org/wiki/Swift [06:28:43] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:29:01] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Swift [06:29:01] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 2.501 second response time https://wikitech.wikimedia.org/wiki/Swift [06:29:11] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:29:31] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [06:29:37] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.096 second response time https://wikitech.wikimedia.org/wiki/Swift [06:29:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.232 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:01] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.457 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:03] 
RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.052 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:24] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:30:24] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:30:25] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 5.155 second response time https://wikitech.wikimedia.org/wiki/Docker [06:30:35] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.380 second response time https://wikitech.wikimedia.org/wiki/Swift [06:30:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:31:01] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.714 second response time https://wikitech.wikimedia.org/wiki/Swift [06:31:11] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:31:35] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:31:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:32:03] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.936 second response time https://wikitech.wikimedia.org/wiki/Swift [06:32:11] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:32:18] !incidents [06:32:18] 7201 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:32:18] 7202 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:32:18] 7203 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:32:19] 7205 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:32:19] 7206 (UNACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [06:32:19] 7204 (RESOLVED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) [06:32:19] 7200 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:32:19] 7199 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:32:20] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:32:20] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:32:21] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [06:32:21] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [06:32:27] !ack 7206 [06:32:28] 7206 (ACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [06:32:29] PROBLEM - Swift https frontend on ms-fe2018 is 
CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.289 second response time https://wikitech.wikimedia.org/wiki/Swift [06:32:29] RECOVERY - Swift https backend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.569 second response time https://wikitech.wikimedia.org/wiki/Swift [06:33:01] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:33:07] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.363 second response time https://wikitech.wikimedia.org/wiki/Swift [06:33:09] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:33:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2019.codfw.wmnet, ms-fe2014.codfw.wmnet, ms-fe2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:33:29] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.566 second response time https://wikitech.wikimedia.org/wiki/Swift [06:33:43] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:33:59] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Swift [06:34:01] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.149 second response time https://wikitech.wikimedia.org/wiki/Swift [06:34:03] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.362 second response time https://wikitech.wikimedia.org/wiki/Swift 
[06:34:11] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:34:11] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:34:11] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:34:39] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 7.030 second response time https://wikitech.wikimedia.org/wiki/Swift [06:34:59] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.371 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:01] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:01] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:01] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.775 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:05] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.618 second response time https://wikitech.wikimedia.org/wiki/Swift [06:35:11] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:35:24] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:35:37] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:35:41] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 525440072 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:35:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:36:01] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.601 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:01] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.206 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:03] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.599 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:05] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.825 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:09] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.312 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2013.codfw.wmnet, ms-fe2019.codfw.wmnet, ms-fe2011.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:36:27] PROBLEM - Swift https frontend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:33] PROBLEM - Swift 
https backend on ms-fe2018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift [06:36:37] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:36:59] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.067 second response time https://wikitech.wikimedia.org/wiki/Swift [06:37:11] PROBLEM - Swift https frontend on ms-fe2015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:37:11] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:37:41] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:37:43] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:37:59] PROBLEM - Swift https frontend on ms-fe1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [06:37:59] PROBLEM - Swift https frontend on ms-fe1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.096 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:01] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:38:29] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.789 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:35] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.960 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:35] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.036 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:35] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:38:37] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 4.096 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:59] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.301 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:01] PROBLEM - Swift https backend on ms-fe2016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.241 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:03] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.469 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:03] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.498 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:39:11] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:39:11] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:39:25] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:57] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:01] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 1.762 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:01] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.043 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:01] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.328 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:03] PROBLEM - Swift https backend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:07] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 9.444 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:07] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 9.270 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:11] PROBLEM - Swift https frontend on ms-fe2013 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:40:24] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:40:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:40:25] RECOVERY - Swift https frontend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Swift [06:40:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:41:01] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:01] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:03] RECOVERY - Swift https frontend on ms-fe2015 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.567 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:07] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.007 second response time https://wikitech.wikimedia.org/wiki/Swift [06:41:43] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:42:01] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift [06:42:35] RECOVERY - Swift https backend on ms-fe2018 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 1.780 second response time 
https://wikitech.wikimedia.org/wiki/Swift [06:42:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:42:59] PROBLEM - Swift https frontend on ms-fe1015 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.090 second response time https://wikitech.wikimedia.org/wiki/Swift [06:42:59] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.280 second response time https://wikitech.wikimedia.org/wiki/Swift [06:42:59] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:01] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.297 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:03] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:03] RECOVERY - Swift https backend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.238 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:07] PROBLEM - Swift https frontend on ms-fe1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:09] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.957 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:43:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2009.codfw.wmnet, ms-fe2011.codfw.wmnet, ms-fe2020.codfw.wmnet, ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:43:25] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [06:43:35] PROBLEM - Swift https frontend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:37] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:37] PROBLEM - Swift https frontend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:43] PROBLEM - Swift https backend on ms-fe2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:43:57] RECOVERY - Swift https frontend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Swift [06:43:57] RECOVERY - Swift https frontend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] RECOVERY - Swift https backend on ms-fe1020 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.877 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.896 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:01] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.475 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:11] PROBLEM - Swift https frontend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:44:11] PROBLEM - Swift https frontend on ms-fe2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [06:44:11] FIRING: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:33] RECOVERY - Swift https frontend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.122 second response time https://wikitech.wikimedia.org/wiki/Swift [06:44:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:59] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.933 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:01] RECOVERY - Swift https frontend on ms-fe2019 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 1.191 second 
response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 4.317 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 3.579 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https frontend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 5.822 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:03] RECOVERY - Swift https backend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 3.293 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:07] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 6.961 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:25] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:45:25] RECOVERY - Swift https frontend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:25] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:45:27] RECOVERY - Swift https backend on ms-fe2017 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Swift [06:45:33] RECOVERY - Swift https backend on ms-fe2020 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Swift [06:46:01] RECOVERY - Swift https frontend on ms-fe2016 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Swift [06:46:02] !incidents [06:46:03] 7202 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) 
[06:46:03] 7203 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:46:03] 7205 (ACKED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:46:03] 7206 (ACKED) [4x] ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet) [06:46:03] 7207 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:46:04] 7201 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:46:04] 7204 (RESOLVED) [2x] TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 gnmi codfw) [06:46:04] 7200 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule@main) [06:46:04] 7199 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule@main) [06:46:05] 7198 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:46:05] 7197 (RESOLVED) CoreOutboundSaturation network sre (cr1-eqiad:9804 Core: asw2-b-eqiad:xe-2/0/45 {#3457} xe-3/2/3 gnmi eqiad) [06:46:06] 7196 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-eqiad:9804 Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009} xe-3/0/6 gnmi eqiad) [06:46:06] 7195 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [06:46:13] !ack 7207 [06:46:13] 7207 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [06:46:51] RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [06:47:01] PROBLEM - Swift https backend on ms-fe1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 2.121 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Swift [06:47:59] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.161 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:07] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 8.586 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:12] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:48:53] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.932 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https frontend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 
200 OK - 295 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:57] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.386 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https backend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https frontend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 2.087 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https backend on ms-fe1015 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Swift [06:48:59] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.320 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:00] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:01] RECOVERY - Swift https backend on ms-fe1017 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.955 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:01] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:49:01] RECOVERY - PyBal backends health check on 
lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [06:49:05] RECOVERY - Swift https frontend on ms-fe1019 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 7.799 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:05] RECOVERY - Swift https backend on ms-fe1018 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 6.908 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:07] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 7.496 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:11] RESOLVED: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:49:51] FIRING: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [06:49:59] RECOVERY - Swift https backend on ms-fe1016 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.237 second response time https://wikitech.wikimedia.org/wiki/Swift [06:51:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:52:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) #page - https://w.wiki/Gbyf - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:54:12] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [06:54:13] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [06:54:51] RESOLVED: CoreOutboundSaturation: Core link outbound traffic above 90% capacity - cr1-eqiad:xe-3/2/3 (Core: asw2-b-eqiad:xe-2/0/45 {#3457}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreOutboundSaturation [06:56:51] RESOLVED: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [06:57:51] RESOLVED: [3x] TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [06:59:11] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0700) [07:04:11] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:07:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:07:48] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:17:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:22:43] FIRING: [22x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:43] FIRING: [21x] 
CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:27:53] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:32:43] FIRING: [21x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:37:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:42:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:42:53] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:45:24] FIRING: PuppetFailure: Puppet 
has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:47:43] FIRING: [11x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:47:48] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [07:52:43] RESOLVED: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh [08:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:02:57] (CR) TrainBranchBot: [C:+2] "Approved by akosiaris@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) (owner: Alexandros Kosiaris) [08:03:46] (Merged) jenkins-bot: Update fc-list to point to fc-list Tool [mediawiki-config] - https://gerrit.wikimedia.org/r/1216750 (https://phabricator.wikimedia.org/T280718) (owner: Alexandros Kosiaris) [08:04:41] !log akosiaris@deploy2002 Started scap sync-world: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]] [08:04:45] T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice - https://phabricator.wikimedia.org/T280718 [08:07:36] !log akosiaris@deploy2002 akosiaris: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:08:24] !log akosiaris@deploy2002 akosiaris: Continuing with sync [08:09:41] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 8160.08 ms [08:10:03] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 0%, RTA = 1393.21 ms [08:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:13:03] !log akosiaris@deploy2002 Finished scap sync-world: Backport for [[gerrit:1216750|Update fc-list to point to fc-list Tool (T280718)]] (duration: 08m 22s) [08:13:07] T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice - https://phabricator.wikimedia.org/T280718 [08:26:37] !log installing jq security updates [08:26:39] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:43] (PS1) Elukey: scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524) [08:27:49] (CR) Dpogorzelski: [C:+1] scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524) (owner: Elukey) [08:29:05] (CR) Elukey: [V:+2 C:+2] scap: add ml-build1001 to the scap targets [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219114 (https://phabricator.wikimedia.org/T412524) (owner: Elukey) [08:32:13] (PS1) Muehlenhoff: debdeploy: Remove buster from list of supported releases [puppet] - https://gerrit.wikimedia.org/r/1219115 [08:40:43] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@4533f76]: Deploy docker-pkg [08:41:39] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@4533f76]: Deploy docker-pkg (duration: 01m 08s) [08:42:51] (PS1) Alexandros Kosiaris: Remove scap_proxy profile [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) [08:45:50] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris) [08:45:55] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris) [08:48:44] (CR) Elukey: [C:+1] debdeploy: Remove buster from list of supported releases [puppet] - https://gerrit.wikimedia.org/r/1219115 (owner: Muehlenhoff) [08:50:08] (CR) Muehlenhoff: [C:+2] admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - https://gerrit.wikimedia.org/r/1218737 (owner: Elukey) [08:50:58] (PS1) KartikMistry: Update
cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119 [08:54:21] (CR) Ayounsi: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: Elukey) [08:58:04] SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11467826 (ayounsi) My guess is that SR-Linux < 25 doesn't have stats for mgmt0 (either not implemented yet or a bug), with the upgrade we've started... [09:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0900) [09:00:08] (PS1) Elukey: images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 [09:01:14] (PS2) Elukey: images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 [09:03:13] (CR) Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 (owner: Elukey) [09:04:28] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:05:34] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:06:35] !log jelto@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:07:21] !log jelto@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:09:38] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:10:04] (CR) Alexandros Kosiaris: "Adding Blake and Jasmine per comments in https://phabricator.wikimedia.org/T411508 for review (also feel free to deploy)" [puppet] - https://gerrit.wikimedia.org/r/1219117 (https://phabricator.wikimedia.org/T411508) (owner: Alexandros Kosiaris) [09:10:28] (CR) Alexandros Kosiaris: [C:+2] images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 (owner: Elukey) [09:12:05] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:12:37] (CR) KartikMistry: [C:+2] Update cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119 (owner: KartikMistry) [09:13:02] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:13:39] (PS1) Daniel Kinzler: rest-gateway: log x-wmf- headers [deployment-charts] - https://gerrit.wikimedia.org/r/1219123 [09:13:55] !log installing nginx security updates [09:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:19] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:14:28] (Merged) jenkins-bot: Update cxserver to 2025-12-15-140202-production [deployment-charts] - https://gerrit.wikimedia.org/r/1219119 (owner: KartikMistry) [09:17:35] jouncebot: nowandnext [09:17:35] For the next 1 hour(s) and 42 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T0900) [09:17:35] In 1 hour(s) and 42 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1100) [09:18:01] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [09:18:53] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [09:23:54] (CR) Jelto: [C:+1] "lgtm, I deployed this on all wikikube clusters" [deployment-charts] - https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [09:26:35] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11467881 (MoritzMuehlenhoff) [09:28:50] !log depool and disable puppet on cp7009 for haproxy qos testing (T412785) [09:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:54] T412785: Enable QoS for upload video files - https://phabricator.wikimedia.org/T412785 [09:32:05] !log fabfur@cumin1003 conftool action : set/pooled=yes; selector: name=cp7009.* [09:32:12] !log fabfur@cumin1003 conftool action : set/pooled=no; selector: name=cp7009.* [09:36:02] (PS3) STran: Enable v2 non-emergency workflow by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) [09:37:25] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:38:25] RECOVERY -
PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:39:00] (PS4) STran: Enable v2 non-emergency workflow by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) [09:40:47] SRE-SLO, Observability-Metrics, Patch-For-Review: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521#11467954 (tappof) [09:46:42] (CR) Elukey: [V:+2] images: add python3-build-trixie image [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219121 (owner: Elukey) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:54:35] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [09:55:10] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [09:55:41] (PS4) Elukey: DNM - Reimage: dup-uefi after the first puppet run [cookbooks] - https://gerrit.wikimedia.org/r/1218731 [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:37] (PS1) Muehlenhoff: kartotherian: Bump version to include latest libpng [deployment-charts] - https://gerrit.wikimedia.org/r/1219126 [09:59:15] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[09:59:50] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [10:02:38] (CR) Muehlenhoff: [C:+2] debdeploy: Remove buster from list of supported releases [puppet] - https://gerrit.wikimedia.org/r/1219115 (owner: Muehlenhoff) [10:05:54] (CR) Mszwarc: [C:+1] Enable v2 non-emergency workflow by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: STran) [10:07:18] !log Updated cxserver to 2025-12-15-140202-production [10:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:50] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [10:09:28] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207845 (https://phabricator.wikimedia.org/T410512) (owner: STran) [10:19:58] (PS1) Muehlenhoff: Remove puppetmaster::backend role and related Hiera settings [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) [10:22:52] (PS1) Elukey: Rework Makefile.build to ease additional distributions [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219130 [10:22:52] (PS1) Elukey: Add Trixie artifacts [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1219131 [10:25:18] (CR) Elukey: [C:+1] kartotherian: Bump version to include latest libpng [deployment-charts] - https://gerrit.wikimedia.org/r/1219126 (owner: Muehlenhoff) [10:25:28] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [10:25:34] (CR) Elukey: [C:+2] sre.hosts.provision: fix retry logic for the Supermicro BMC password [cookbooks] -
https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: Elukey) [10:25:39] SRE, Infrastructure-Foundations, netops, Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#11468080 (ayounsi) We can set the rule now as non-paging to start collecting data and test it. So we can gain trust in it before... [10:26:54] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [10:26:57] ops-codfw, ops-eqiad, SRE, DC-Ops, and 2 others: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11468090 (elukey) @VRiley-WMF @Jclark-ctr the new code is merged, so you can test it once you have servers ready (I don't want to rush you). Please r... [10:27:17] (PS1) Filippo Giunchedi: metricsinfra: enable space-based retention up to 85% [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) [10:27:46] (CR) CI reject: [V:-1] metricsinfra: enable space-based retention up to 85% [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: Filippo Giunchedi) [10:30:23] (CR) Muehlenhoff: [C:+2] kartotherian: Bump version to include latest libpng [deployment-charts] - https://gerrit.wikimedia.org/r/1219126 (owner: Muehlenhoff) [10:30:24] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:30:52] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7829/co" [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: Filippo
Giunchedi) [10:33:03] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [10:33:41] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [10:34:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [10:34:29] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: apply [10:35:33] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply [10:35:56] (CR) Elukey: [C:+1] Capirca: only show diff when running in "non-commit" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) (owner: Ayounsi) [10:36:16] (CR) Elukey: [C:+1] Capirca: look for the most recent completed run [software/homer] - https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: Ayounsi) [10:36:19] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: apply [10:37:44] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply [10:42:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86716 and previous config saved to /var/cache/conftool/dbconfig/20251217-104240-marostegui.json [10:42:46] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [10:42:47] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [10:44:23] (PS1) Muehlenhoff: Add missing stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1219133 [10:45:17] (CR) Muehlenhoff: [V:+2 C:+2] Add missing stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1219133 (owner: Muehlenhoff) [10:45:53] (PS2) Muehlenhoff: Remove puppetmaster::backend role and related
Hiera settings [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) [10:47:35] (PS1) Filippo Giunchedi: typos: match .wmet [puppet] - https://gerrit.wikimedia.org/r/1219134 [10:50:06] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [10:51:38] !log installing libssh security updates [10:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P86717 and previous config saved to /var/cache/conftool/dbconfig/20251217-105748-marostegui.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251217T1100) [11:04:05] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie [11:09:20] (CR) Majavah: [C:+1] metricsinfra: enable space-based retention up to 85% [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: Filippo Giunchedi) [11:12:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2221', diff saved to https://phabricator.wikimedia.org/P86718 and previous config saved to /var/cache/conftool/dbconfig/20251217-111257-marostegui.json [11:14:15] (CR) Filippo Giunchedi: [V:+1 C:+2] "CI failure will be fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1219134 (only a typo)" [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927) (owner: Filippo Giunchedi) [11:14:20] (CR) Filippo Giunchedi: [V:+2 C:+2] metricsinfra: enable space-based retention up to 85% [puppet] - https://gerrit.wikimedia.org/r/1219132 (https://phabricator.wikimedia.org/T412927)
(owner: Filippo Giunchedi) [11:14:40] (CR) Muehlenhoff: "There's some noise in the PCC, which seems to be around stale PCC data, puppetmaster2002 is already gone e.g." [puppet] - https://gerrit.wikimedia.org/r/1219129 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff) [11:18:41] (CR) Silvan Heintze: [C:+1] "nice - now the symlinks are working in our local dev environment, too 👍" [dumps] - https://gerrit.wikimedia.org/r/1218317 (https://phabricator.wikimedia.org/T412726) (owner: Jakob)