[00:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0000) [00:05:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet2007-dev.codfw.wmnet [00:05:57] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [00:08:54] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:09:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:11:54] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:12:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [00:12:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet [00:14:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:14:54] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:15:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:18:33] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
cloudservices2005-dev.codfw.wmnet [00:18:54] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:19:54] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:21:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:28:59] (03CR) 10Xcollazo: [C:03+1] Bump Hadoop max container size to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1210744 (https://phabricator.wikimedia.org/T410966) (owner: 10Joal) [00:37:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11403718 (10RLazarus) @Milimetric @Ahoelzl Ping - can you approve for Data Engineering please? The requester is not a WMF or WMDE emp... 
[00:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:40:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1210762 [00:40:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1210762 (owner: 10TrainBranchBot) [00:41:18] (03PS8) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) [00:41:19] (03CR) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [00:42:33] 06SRE, 06Infrastructure-Foundations: Improve "reuse" feature for standard partman recipes - https://phabricator.wikimedia.org/T410601#11403723 (10RLazarus) [00:42:55] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11403724 (10RLazarus) [00:52:24] (03CR) 10Pmiazga: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [00:54:03] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1210762 (owner: 10TrainBranchBot) [00:56:22] (03PS2) 10RLazarus: all charts: Update mesh.configuration 1.14.1 to 1.15.1 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1208484 (https://phabricator.wikimedia.org/T409510) [00:57:45] (03CR) 10RLazarus: [C:03+2] admin: Move rzl pre-FIDO ssh key to buster only [puppet] - 10https://gerrit.wikimedia.org/r/1208451 (owner: 10RLazarus) [01:00:38] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:01:52] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 14s) [01:02:28] (03CR) 10Tim Starling: [metawiki] enable voting on entities with the 'Under review' status (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: 10MusikAnimal) [01:04:53] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs group for amastilovic - https://phabricator.wikimedia.org/T410972 (10amastilovic) 03NEW [01:06:36] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06Traffic: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11403767 (10ssingh) [01:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:10:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1210765 [01:10:05] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1210765 (owner: 10TrainBranchBot) [01:10:05] (03CR) 10RLazarus: [C:03+2] "PS2 just re-bumps the chart versions for charts that were touched in the meantime." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1208484 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [01:10:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [01:17:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:22:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:22:15] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.35.6 - https://phabricator.wikimedia.org/T410975 (10RLazarus) 03NEW [01:22:24] (03Merged) 10jenkins-bot: all charts: Update mesh.configuration 1.14.1 to 1.15.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1208484 (https://phabricator.wikimedia.org/T409510) (owner: 10RLazarus) [01:24:53] (03CR) 10Tim Starling: [C:03+2] admin: Remove my non-FIDO keys [puppet] - 10https://gerrit.wikimedia.org/r/1210224 (owner: 10Tim Starling) [01:27:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:27:21] churning out some envoy updates in staging, no production impact [01:28:13] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/apertium: apply [01:28:32] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/apertium: apply [01:30:05] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [01:30:17] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [01:30:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [01:30:49] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [01:31:04] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply [01:31:32] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [01:32:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:32:19] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [01:32:37] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [01:32:52] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [01:33:09] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [01:33:28] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [01:33:47] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [01:34:02] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [01:34:18] !log 
rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [01:34:29] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [01:34:42] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [01:35:04] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/device-analytics: apply [01:35:32] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [01:35:42] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1210765 (owner: 10TrainBranchBot) [01:35:45] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/echostore: apply [01:36:04] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/echostore: apply [01:36:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [01:37:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:37:06] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [01:37:27] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [01:37:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [01:37:55] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [01:38:35] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [01:39:03] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [01:39:40] !log rzl@deploy2002 helmfile [staging] DONE 
helmfile.d/services/eventgate-analytics-external: apply [01:40:34] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [01:41:13] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [01:41:27] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [01:41:52] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [01:42:39] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [01:43:11] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [01:44:36] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [01:44:45] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [01:44:50] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [01:45:15] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [01:46:37] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [01:46:55] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [01:47:43] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [01:48:00] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [01:48:16] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [01:48:28] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [01:48:43] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [01:49:02] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply 
[01:51:48] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [01:52:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:54:38] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [01:54:52] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mathoid: apply [01:55:05] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [01:55:35] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply [01:56:02] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [01:56:49] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [01:57:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [01:57:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [01:58:05] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [01:58:35] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [01:58:52] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [01:58:58] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [01:59:13] !log rzl@deploy2002 helmfile [staging] START 
helmfile.d/services/page-analytics: apply [01:59:29] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [01:59:47] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [01:59:58] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [02:00:21] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [02:00:43] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [02:01:07] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [02:01:13] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [02:01:46] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [02:02:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:02:05] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [02:02:22] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [02:02:31] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [02:02:53] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [02:03:12] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [02:04:53] (03PS3) 10MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) [02:06:16] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply 
[02:06:22] (03CR) 10MusikAnimal: [metawiki] enable voting on entities with the 'Under review' status (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208231 (https://phabricator.wikimedia.org/T409613) (owner: 10MusikAnimal) [02:06:44] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [02:07:08] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [02:07:22] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [02:08:18] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [02:08:34] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [02:09:20] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [02:09:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.46.0-wmf.4 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1210773 (https://phabricator.wikimedia.org/T408274) [02:09:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.46.0-wmf.4 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1210773 (https://phabricator.wikimedia.org/T408274) (owner: 10TrainBranchBot) [02:09:38] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [02:09:53] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [02:10:20] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [02:12:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:12:23] !log rzl@deploy2002 helmfile [staging] START 
helmfile.d/services/shellbox-video: apply [02:12:35] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [02:13:24] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [02:13:59] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [02:14:23] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [02:14:54] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [02:15:42] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [02:15:51] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [02:16:33] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [02:16:52] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [02:20:32] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [02:20:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [02:20:45] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [02:21:51] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [02:22:02] RESOLVED: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:22:13] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [02:22:32] !log rzl@deploy2002 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [02:23:01] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [02:23:38] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [02:23:56] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [02:24:34] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.4 [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1210773 (https://phabricator.wikimedia.org/T408274) (owner: 10TrainBranchBot) [02:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:24:47] and done [02:30:23] (03PS1) 10RLazarus: Update to v1.35.6 [debs/envoyproxy] (v1.35) - 10https://gerrit.wikimedia.org/r/1210776 (https://phabricator.wikimedia.org/T410975) [02:33:11] (03CR) 10RLazarus: [C:03+2] Update to v1.35.6 [debs/envoyproxy] (v1.35) - 10https://gerrit.wikimedia.org/r/1210776 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [02:36:05] !log rzl@apt1002:~$ sudo -i reprepro -C component/envoy-future include bullseye-wikimedia /home/rzl/envoyproxy_1.35.6-1_amd64.changes # T410975 [02:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:10] T410975: Upgrade Envoy to v1.35.6 - https://phabricator.wikimedia.org/T410975 [02:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0300) [03:05:12] (03PS1) 
10RLazarus: envoy-future: Update to v1.35.6 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1210806 (https://phabricator.wikimedia.org/T410975) [03:59:48] (03CR) 10Jforrester: [C:03+1] deployment_server: drop PHP 8.1 fallback in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1207979 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0400) [04:02:05] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210843 (https://phabricator.wikimedia.org/T408274) [04:02:08] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210843 (https://phabricator.wikimedia.org/T408274) (owner: 10TrainBranchBot) [04:02:57] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210843 (https://phabricator.wikimedia.org/T408274) (owner: 10TrainBranchBot) [04:03:28] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.4 refs T408274 [04:03:33] T408274: 1.46.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T408274 [04:10:04] (03PS1) 10C. Scott Ananian: Clone ParserOutput in Article before post-processing (take 2) [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1210844 (https://phabricator.wikimedia.org/T410923) [04:10:31] (03CR) 10C. Scott Ananian: [C:03+2] "Just missed the branch cut." [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1210844 (https://phabricator.wikimedia.org/T410923) (owner: 10C. 
Scott Ananian) [04:21:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:23:45] (03Merged) 10jenkins-bot: Clone ParserOutput in Article before post-processing (take 2) [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1210844 (https://phabricator.wikimedia.org/T410923) (owner: 10C. Scott Ananian) [04:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:58:28] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.4 refs T408274 (duration: 55m 00s) [04:58:32] T408274: 1.46.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T408274 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0500) [05:03:55] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.1 (duration: 03m 53s) [05:08:58] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:15:54] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 1 (gerrit2003), Fresh: 
141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:20:22] RECOVERY - MariaDB Replica Lag: s2 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:20:28] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:20:34] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:21:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T410589)', diff saved to https://phabricator.wikimedia.org/P85560 and previous config saved to /var/cache/conftool/dbconfig/20251125-052121-ladsgroup.json [05:21:27] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [05:24:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:31:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:33:58] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:36:02] RESOLVED: [2x] 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:36:30] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P85561 and previous config saved to /var/cache/conftool/dbconfig/20251125-053629-ladsgroup.json [05:39:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:51:36] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P85562 and previous config saved to /var/cache/conftool/dbconfig/20251125-055136-ladsgroup.json [06:06:44] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T410589)', diff saved to https://phabricator.wikimedia.org/P85563 and previous config saved to /var/cache/conftool/dbconfig/20251125-060643-ladsgroup.json [06:06:49] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [06:07:01] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [06:07:08] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1162 (T410589)', diff saved to https://phabricator.wikimedia.org/P85564 and previous config saved to /var/cache/conftool/dbconfig/20251125-060708-ladsgroup.json [06:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - 
https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [06:26:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2205.codfw.wmnet with reason: Maintenance [06:31:06] (03PS1) 10Marostegui: clouddb1024: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210907 [06:39:40] (03CR) 10Marostegui: [C:03+2] clouddb1024: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210907 (owner: 10Marostegui) [06:39:45] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install x1 host - https://phabricator.wikimedia.org/T407897#11404095 (10Marostegui) @RobH it was all clarified earlier at T407897#11354110 so this seems to be a loop :-) It is all good from our side. This host has been in production sinc... [06:44:34] (03PS1) 10Marostegui: installserver: Do not reimage clouddb102[45] [puppet] - 10https://gerrit.wikimedia.org/r/1210917 [06:46:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1159.eqiad.wmnet with reason: Maintenance [06:46:59] (03CR) 10Marostegui: [C:03+2] installserver: Do not reimage clouddb102[45] [puppet] - 10https://gerrit.wikimedia.org/r/1210917 (owner: 10Marostegui) [06:46:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85565 and previous config saved to /var/cache/conftool/dbconfig/20251125-064658-marostegui.json [06:47:04] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [06:49:19] (03PS1) 10Marostegui: clouddb1025: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210918 [06:50:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85566 and previous config saved to /var/cache/conftool/dbconfig/20251125-065026-marostegui.json 
[06:50:55] (03CR) 10Marostegui: [C:03+2] clouddb1025: Remove note [puppet] - 10https://gerrit.wikimedia.org/r/1210918 (owner: 10Marostegui) [06:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0700). [07:05:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P85567 and previous config saved to /var/cache/conftool/dbconfig/20251125-070534-marostegui.json [07:16:04] (03CR) 10Arnaudb: [C:03+2] apt-staging: logging and metrics [puppet] - 10https://gerrit.wikimedia.org/r/1205162 (https://phabricator.wikimedia.org/T409833) (owner: 10Arnaudb) [07:16:10] (03CR) 10Arnaudb: [C:03+2] apt-staging: error handling for reprepro [puppet] - 10https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) (owner: 10Arnaudb) [07:16:28] (03PS4) 10Arnaudb: apt-staging: error handling for reprepro [puppet] - 10https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) [07:18:36] (03CR) 10Arnaudb: [C:03+2] "ccing Moritz, I'll merge and test today and revert if it breaks something" [puppet] - 10https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) (owner: 10Arnaudb) [07:20:32] (03PS4) 10Muehlenhoff: test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) [07:20:42] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P85568 and previous config saved to /var/cache/conftool/dbconfig/20251125-072041-marostegui.json [07:22:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1210695 (https://phabricator.wikimedia.org/T410426) (owner: 10RLazarus) [07:25:33] (03PS1) 10Marostegui: clouddb1024: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1210944 (https://phabricator.wikimedia.org/T409557) [07:26:07] !log upgrade Envoy on puppet servers T405808 [07:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:12] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [07:26:23] (03CR) 10Marostegui: [C:03+2] clouddb1024: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1210944 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui) [07:35:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T410531)', diff saved to https://phabricator.wikimedia.org/P85569 and previous config saved to /var/cache/conftool/dbconfig/20251125-073549-marostegui.json [07:35:55] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:36:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1161.eqiad.wmnet with reason: Maintenance [07:36:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:36:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T410531)', diff saved to https://phabricator.wikimedia.org/P85570 and previous config saved to /var/cache/conftool/dbconfig/20251125-073634-marostegui.json [07:38:08] (03PS1) 10Arnaudb: apt-staging: log level 
bump [puppet] - 10https://gerrit.wikimedia.org/r/1210954 (https://phabricator.wikimedia.org/T409832) [07:40:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T410531)', diff saved to https://phabricator.wikimedia.org/P85571 and previous config saved to /var/cache/conftool/dbconfig/20251125-074002-marostegui.json [07:55:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P85572 and previous config saved to /var/cache/conftool/dbconfig/20251125-075509-marostegui.json [08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0800). [08:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org [08:00:54] I will start with the backports I have scheduled in about 30 minutes. 
[08:05:52] (03CR) 10Dpogorzelski: [C:03+1] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210643 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:06:32] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210643 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:07:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org [08:08:28] (03Merged) 10jenkins-bot: ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1210643 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:10:13] (03PS1) 10Muehlenhoff: Fix relforge Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1210962 [08:10:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P85573 and previous config saved to /var/cache/conftool/dbconfig/20251125-081017-marostegui.json [08:12:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1005.wikimedia.org [08:13:56] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[08:14:25] (03PS2) 10Muehlenhoff: Fix relforge Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1210962 [08:14:48] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1208370 (owner: 10Andrew Bogott) [08:16:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1005.wikimedia.org [08:20:21] (03PS1) 10Brouberol: data: add new usbC yubikey for brouberol [puppet] - 10https://gerrit.wikimedia.org/r/1211000 [08:21:32] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:24:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [08:25:08] (03Merged) 10jenkins-bot: hCaptcha: Adjust addurl config for zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210621 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [08:25:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T410531)', diff saved to https://phabricator.wikimedia.org/P85575 and previous config saved to /var/cache/conftool/dbconfig/20251125-082525-marostegui.json [08:25:27] (03CR) 10Brouberol: Report integrity metric from Wikidata dump scripts (033 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [08:25:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance [08:25:31] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [08:25:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T410531)', diff saved to 
https://phabricator.wikimedia.org/P85576 and previous config saved to /var/cache/conftool/dbconfig/20251125-082537-marostegui.json [08:25:56] FIRING: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun [08:27:56] (03CR) 10Ayounsi: [C:03+2] Remove maps from SKIP_V6_DNS_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208360 (owner: 10Ayounsi) [08:28:06] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210621|hCaptcha: Adjust addurl config for zhwiki and jawiki (T410354 T409957)]] [08:28:12] T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354 [08:28:12] T409957: hCaptcha: Adjust config and logic to not unset addurl rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [08:28:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T410531)', diff saved to https://phabricator.wikimedia.org/P85577 and previous config saved to /var/cache/conftool/dbconfig/20251125-082836-marostegui.json [08:30:00] (03Merged) 10jenkins-bot: Remove maps from SKIP_V6_DNS_PREFIXES [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1208360 (owner: 10Ayounsi) [08:32:37] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210621|hCaptcha: Adjust addurl config for zhwiki and jawiki (T410354 T409957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:33:39] (03PS2) 10Arnaudb: apt-staging: wrong error code in gitlab_package_puller_run_success [alerts] - 10https://gerrit.wikimedia.org/r/1211001 (https://phabricator.wikimedia.org/T409835) [08:33:39] (03CR) 10Arnaudb: [C:03+1] "I inverted 0 and 1 for a boolean alert, this swaps them back" [alerts] - 10https://gerrit.wikimedia.org/r/1211001 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [08:33:43] (03CR) 10Arnaudb: [C:03+2] apt-staging: wrong error code in gitlab_package_puller_run_success [alerts] - 10https://gerrit.wikimedia.org/r/1211001 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [08:34:29] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [08:34:59] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [08:35:21] (03Merged) 10jenkins-bot: apt-staging: wrong error code in gitlab_package_puller_run_success [alerts] - 10https://gerrit.wikimedia.org/r/1211001 (https://phabricator.wikimedia.org/T409835) (owner: 10Arnaudb) [08:35:54] !log kharlan@deploy2002 kharlan: Continuing with sync [08:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:41:24] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [08:41:55] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210621|hCaptcha: Adjust addurl config for zhwiki and jawiki (T410354 T409957)]] (duration: 13m 49s) [08:42:02] T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354 [08:42:02] T409957: hCaptcha: Adjust config and logic to not unset addurl 
rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [08:42:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210637 (https://phabricator.wikimedia.org/T409957) (owner: 10Kosta Harlan) [08:43:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P85578 and previous config saved to /var/cache/conftool/dbconfig/20251125-084344-marostegui.json [08:50:56] RESOLVED: GitlabPackagePullerFailedOnRun: Package puller has some run errors that needs investigation. - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DGitlabPackagePullerFailedOnRun [08:51:40] (03Merged) 10jenkins-bot: hCaptcha: Adjust addurl logic for 100% passive mode [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210637 (https://phabricator.wikimedia.org/T409957) (owner: 10Kosta Harlan) [08:52:16] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210637|hCaptcha: Adjust addurl logic for 100% passive mode (T409957)]] [08:52:21] T409957: hCaptcha: Adjust config and logic to not unset addurl rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [08:52:27] (03CR) 10Arnaudb: "sorry about the lack of context:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1210386 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:52:52] (03PS1) 10Kevin Bazira: ml-services: set llm model-server bnb dtype to none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211006 (https://phabricator.wikimedia.org/T410906) [08:53:10] (03CR) 10Fabfur: [C:03+1] "good job!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [08:54:11] (03CR) 10Dpogorzelski: [C:03+1] ml-services: set llm model-server bnb dtype to none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211006 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:54:40] (03CR) 10Kevin Bazira: [C:03+2] ml-services: set llm model-server bnb dtype to none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211006 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:55:45] 06SRE, 06Infrastructure-Foundations: wmf-auto-restart: Support a pre-defined restart time - https://phabricator.wikimedia.org/T410986 (10MoritzMuehlenhoff) 03NEW [08:55:51] 06SRE, 06Infrastructure-Foundations: wmf-auto-restart: Support a pre-defined restart time - https://phabricator.wikimedia.org/T410986#11404302 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:55:57] (03CR) 10Filippo Giunchedi: [C:03+1] interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [08:56:06] !log drain Arelion codfw transit - T401100 [08:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:13] (03CR) 10Filippo Giunchedi: [C:03+1] labs: add infra-tracing-nfs account [labs/private] - 10https://gerrit.wikimedia.org/r/1210591 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [08:56:30] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210637|hCaptcha: Adjust addurl logic for 100% passive mode (T409957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:56:38] (03Merged) 10jenkins-bot: ml-services: set llm model-server bnb dtype to none [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211006 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [08:56:52] (03CR) 10Majavah: [V:03+1 C:03+2] interface::tagged: Add strict typing [puppet] - 10https://gerrit.wikimedia.org/r/1208293 (owner: 10Majavah) [08:57:24] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1210582 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [08:58:02] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [08:58:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2044.codfw.wmnet [08:58:30] (03CR) 10Majavah: [C:03+2] P:openstack: neutron: Cleanup legacy_vlan_naming hiera key [puppet] - 10https://gerrit.wikimedia.org/r/1208306 (owner: 10Majavah) [08:58:46] !log kharlan@deploy2002 kharlan: Continuing with sync [08:58:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P85579 and previous config saved to /var/cache/conftool/dbconfig/20251125-085852-marostegui.json [09:00:00] (03CR) 10Fabfur: [C:03+2] P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [09:00:05] jnuche and brennen: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251125T0900) [09:01:09] (03PS3) 10Majavah: interface::tagged: Remove legacy_vlan_naming option [puppet] - 10https://gerrit.wikimedia.org/r/1208307 [09:02:04] (03CR) 10Fabfur: [C:03+2] "merged for swfrench to fully enable it later" [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [09:02:05] 👋 backports are still happening, the train will begin after that [09:03:17] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210637|hCaptcha: Adjust addurl logic for 100% passive mode (T409957)]] (duration: 11m 01s) [09:03:22] T409957: hCaptcha: Adjust config and logic to not unset addurl rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [09:03:43] jnuche: thanks, I still have a few more to go [09:03:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2044.codfw.wmnet [09:04:57] (03CR) 10Brouberol: [C:03+1] Bump Hadoop max container size to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1210744 (https://phabricator.wikimedia.org/T410966) (owner: 10Joal) [09:05:08] kostajh: how many more? will it take long? [09:05:31] !log convert Arelion codfw transit to LACP - T401100 [09:05:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:08] jnuche: after this one (which should be quick), it's one config and one wmf.3 patch. 
I could sync those two together [09:06:20] (03Merged) 10jenkins-bot: hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210622 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:06:38] kostajh: ack, thx [09:06:48] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11404346 (10fgiunchedi) [09:06:51] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210622|hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode (T405586)]] [09:06:56] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [09:09:23] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11404351 (10fgiunchedi) The logical side on the host side is done. Next up is deleting the interfaces from netbox for the hosts and unplug network cables. 
I'll file subtasks [09:09:37] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:10:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:10:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:11:03] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210622|hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:12:31] (03CR) 10Harroyo-wmf: [C:03+1] hCaptcha: Define valid SiteKeys for account creation and edit triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:13:04] PROBLEM - Juniper alarms on cr1-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.192 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:14:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T410531)', diff saved to https://phabricator.wikimedia.org/P85580 and previous config saved to /var/cache/conftool/dbconfig/20251125-091400-marostegui.json [09:14:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1200.eqiad.wmnet with reason: Maintenance [09:14:06] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:14:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T410531)', diff saved to https://phabricator.wikimedia.org/P85581 and previous config saved to /var/cache/conftool/dbconfig/20251125-091412-marostegui.json [09:15:45] !log kharlan@deploy2002 kharlan: Continuing with sync [09:15:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:15:56] RECOVERY - Juniper alarms on cr1-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [09:17:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T410531)', diff saved to https://phabricator.wikimedia.org/P85582 and previous config 
saved to /var/cache/conftool/dbconfig/20251125-091712-marostegui.json [09:17:57] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 39): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7697/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1208307 (owner: 10Majavah) [09:18:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:19:16] (03CR) 10Elukey: [C:03+1] test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [09:19:48] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210622|hCaptcha: Enable hCaptcha editing on frwiki in 100% passive mode (T405586)]] (duration: 12m 57s) [09:19:53] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [09:20:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:20:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210737 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:21:03] jnuche: syncing the last two now [09:21:06] (03PS1) 10Esanders: FlowMoveBoardsToSubpages: Skip moves that throw exceptions [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211008 (https://phabricator.wikimedia.org/T402552) [09:21:48] (03CR) 10ScheduleDeploymentBot: "Scheduled 
for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1211008 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [09:21:51] (03Merged) 10jenkins-bot: hCaptcha: Define valid SiteKeys for account creation and edit triggers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210627 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:24:32] (03CR) 10Muehlenhoff: [C:03+2] test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff) [09:26:13] (03CR) 10Brouberol: [C:03+1] Update documentation for rdf_functions.sh path in dumpwikibaserdf.sh [dumps] - 10https://gerrit.wikimedia.org/r/1204598 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [09:28:33] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1211000 (owner: 10Brouberol) [09:30:04] (03CR) 10Brouberol: [C:03+2] data: add new usbC yubikey for brouberol [puppet] - 10https://gerrit.wikimedia.org/r/1211000 (owner: 10Brouberol) [09:31:33] (03Merged) 10jenkins-bot: hCaptcha: Allow providing a set of valid keys for site verify per action [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1210737 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [09:32:09] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1210627|hCaptcha: Define valid SiteKeys for account creation and edit triggers (T410657)]], [[gerrit:1210737|hCaptcha: Allow providing a set of valid keys for site verify per action (T410657 T410863)]] [09:32:11] (03CR) 10Brouberol: [C:03+2] Bump Hadoop max container size to 128Gb [puppet] - 10https://gerrit.wikimedia.org/r/1210744 (https://phabricator.wikimedia.org/T410966) (owner: 
10Joal) [09:32:16] T410657: hCaptcha: Improve support for SiteKey verification - https://phabricator.wikimedia.org/T410657 [09:32:16] T410863: hCaptcha: SiteKey mismatch error on "always challenge" workflow - https://phabricator.wikimedia.org/T410863 [09:32:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P85583 and previous config saved to /var/cache/conftool/dbconfig/20251125-093219-marostegui.json [09:32:49] (03PS11) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [09:32:49] (03PS1) 10Majavah: P:wmcs::cloudgw: Remove unused conditions around IPv6 setup [puppet] - 10https://gerrit.wikimedia.org/r/1211010 [09:32:49] (03PS1) 10Majavah: P:wmcs::cloudgw: Convert raw nftables file to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1211011 [09:32:49] (03PS1) 10Majavah: P:wmcs::cloudgw: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1211012 [09:33:31] (03CR) 10CI reject: [V:04-1] P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 (owner: 10Majavah) [09:35:03] (03PS12) 10Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1196367 [09:35:03] (03PS2) 10Majavah: P:wmcs::cloudgw: Remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1211012 [09:36:24] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1210627|hCaptcha: Define valid SiteKeys for account creation and edit triggers (T410657)]], [[gerrit:1210737|hCaptcha: Allow providing a set of valid keys for site verify per action (T410657 T410863)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:36:51] (03CR) 10Muehlenhoff: P:wmcs::cloudgw: Convert raw nftables file to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211011 (owner: 10Majavah) [09:39:05] (03CR) 10Superpes15: [C:03+1] trwikisource: Create rollbacker user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1210681 (https://phabricator.wikimedia.org/T410931) (owner: 10Neriah) [09:39:11] !log kharlan@deploy2002 kharlan: Continuing with sync [09:43:12] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1210627|hCaptcha: Define valid SiteKeys for account creation and edit triggers (T410657)]], [[gerrit:1210737|hCaptcha: Allow providing a set of valid keys for site verify per action (T410657 T410863)]] (duration: 11m 03s) [09:43:18] T410657: hCaptcha: Improve support for SiteKey verification - https://phabricator.wikimedia.org/T410657 [09:43:19] T410863: hCaptcha: SiteKey mismatch error on "always challenge" workflow - https://phabricator.wikimedia.org/T410863 [09:43:57] (03PS9) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [09:44:06] (03CR) 10Majavah: P:wmcs::cloudgw: Convert raw nftables file to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1211011 (owner: 10Majavah) [09:44:14] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7698/console" [puppet] - 10https://gerrit.wikimedia.org/r/1211010 (owner: 10Majavah) [09:44:18] (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [09:44:21] kostajh: ok to go ahead with the train? 
[09:44:32] jnuche: yes, waiting for the patches to finish syncing [09:44:39] (PS1) Effie Mouzeli: sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - https://gerrit.wikimedia.org/r/1211016 [09:44:39] jnuche: ah they finished [09:44:41] yes, go ahead [09:44:45] thanks [09:45:34] (PS1) TrainBranchBot: group0 to 1.46.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1211017 (https://phabricator.wikimedia.org/T408274) [09:45:36] (CR) TrainBranchBot: [C:+2] "Initiated by jnuche@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1211017 (https://phabricator.wikimedia.org/T408274) (owner: TrainBranchBot) [09:45:47] (PS2) Majavah: P:wmcs::cloudgw: Convert raw nftables file to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1211011 [09:45:47] (PS13) Majavah: P:wmcs::cloudgw: Use interface::route wrapper [puppet] - https://gerrit.wikimedia.org/r/1196367 [09:45:47] (PS3) Majavah: P:wmcs::cloudgw: Remove absented resources [puppet] - https://gerrit.wikimedia.org/r/1211012 [09:46:33] (Merged) jenkins-bot: group0 to 1.46.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1211017 (https://phabricator.wikimedia.org/T408274) (owner: TrainBranchBot) [09:46:36] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7700/co" [puppet] - https://gerrit.wikimedia.org/r/1211011 (owner: Majavah) [09:47:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P85584 and previous config saved to /var/cache/conftool/dbconfig/20251125-094727-marostegui.json [09:47:40] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1211011 (owner: Majavah) [09:48:42] (CR) Muehlenhoff: [C:+2] Properly rename tilerator_pass variable [puppet] - 
https://gerrit.wikimedia.org/r/1204900 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [09:52:08] (CR) CI reject: [V:-1] sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - https://gerrit.wikimedia.org/r/1211016 (owner: Effie Mouzeli) [09:52:15] (PS1) Joal: Update hadoop max container memory to 128Gb [puppet] - https://gerrit.wikimedia.org/r/1211019 (https://phabricator.wikimedia.org/T410966) [09:52:15] (CR) Jgiannelos: [C:+1] profile::thanos::swift: add tegola account for staging [puppet] - https://gerrit.wikimedia.org/r/1210599 (https://phabricator.wikimedia.org/T409528) (owner: Elukey) [09:52:15] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7701/co" [puppet] - https://gerrit.wikimedia.org/r/1196367 (owner: Majavah) [09:52:49] (PS3) FNegri: toolsdb: increase innodb_log_file_size to 512M [puppet] - https://gerrit.wikimedia.org/r/1204472 (https://phabricator.wikimedia.org/T409922) [09:53:27] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198537 (https://phabricator.wikimedia.org/T278481) (owner: Jgiannelos) [09:53:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:54:19] (CR) Brouberol: [C:+1] Update hadoop max container memory to 128Gb [puppet] - https://gerrit.wikimedia.org/r/1211019 (https://phabricator.wikimedia.org/T410966) (owner: Joal) [09:54:25] (CR) 
Brouberol: [C:+2] Update hadoop max container memory to 128Gb [puppet] - https://gerrit.wikimedia.org/r/1211019 (https://phabricator.wikimedia.org/T410966) (owner: Joal) [09:54:31] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.4 refs T408274 [09:54:36] T408274: 1.46.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T408274 [09:54:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:59:23] (CR) Majavah: [V:+1 C:+2] P:wmcs::cloudgw: Remove unused conditions around IPv6 setup [puppet] - https://gerrit.wikimedia.org/r/1211010 (owner: Majavah) [09:59:30] (CR) Majavah: [V:+1 C:+2] P:wmcs::cloudgw: Convert raw nftables file to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1211011 (owner: Majavah) [09:59:58] (CR) Majavah: [V:+1 C:+2] P:wmcs::cloudgw: Convert raw nftables file to firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1211011 (owner: Majavah) [10:01:11] SRE, Cassandra, Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11404602 (elukey) Looping in also @BTullis and @brouberol for a quick high level discussion, since AQS will be probably the first cluster to target :) [10:02:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T410531)', diff saved to https://phabricator.wikimedia.org/P85585 and previous config saved to /var/cache/conftool/dbconfig/20251125-100235-marostegui.json [10:02:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1207.eqiad.wmnet with reason: 
Maintenance [10:02:40] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:02:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T410531)', diff saved to https://phabricator.wikimedia.org/P85586 and previous config saved to /var/cache/conftool/dbconfig/20251125-100247-marostegui.json [10:04:06] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:ae1 (Arelion transit) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:05:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:05:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T410531)', diff saved to https://phabricator.wikimedia.org/P85587 and previous config saved to /var/cache/conftool/dbconfig/20251125-100549-marostegui.json [10:08:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-codfw and Arelion (2001:2035:0:af4::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:11:36] (PS2) Muehlenhoff: Remove the new unused tilerator_pass [puppet] - https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) [10:11:52] (PS3) Muehlenhoff: Remove the now unused tilerator_pass [puppet] - https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) [10:17:29] (PS2) Effie Mouzeli: sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - https://gerrit.wikimedia.org/r/1211016 [10:18:03] (CR) 
Muehlenhoff: [C:+2] Remove the now unused tilerator_pass [puppet] - https://gerrit.wikimedia.org/r/1204914 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff) [10:20:29] (PS1) Kevin Bazira: ml-services: fix memory allocation failure in llm model-server [deployment-charts] - https://gerrit.wikimedia.org/r/1211028 (https://phabricator.wikimedia.org/T410906) [10:20:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P85588 and previous config saved to /var/cache/conftool/dbconfig/20251125-102057-marostegui.json