[00:01:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:07:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:38:50] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966840
[00:38:56] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966840 (owner: TrainBranchBot)
[00:55:08] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966840 (owner: TrainBranchBot)
[01:00:19] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:30:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads -
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:00:58] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:39] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:17] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:03:39] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:11:52] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:27:14] (PS1) Gergő Tisza: logging: Raise 'error' channel threshold to info [mediawiki-config] - https://gerrit.wikimedia.org/r/967679 (https://phabricator.wikimedia.org/T193472)
[05:00:19] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 -
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:04:28] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:06] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:17:51] (CR) Elukey: [C: +1] "I think we can proceed, we should be careful with readability but no blocker for the rest. Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/966199 (https://phabricator.wikimedia.org/T348647) (owner: Ilias Sarantopoulos)
[06:22:40] (Abandoned) Elukey: profile::prometheus::k8s: drop unused Istio labels [puppet] - https://gerrit.wikimedia.org/r/967140 (https://phabricator.wikimedia.org/T349072) (owner: Elukey)
[06:45:04] (CR) Elukey: Define environment variables to ease the use of prometheus-metricsfetcher (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967134 (https://phabricator.wikimedia.org/T349393) (owner: Brouberol)
[06:53:17] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:55:20] (CR) Majavah: [C: +1] acme_chief: Disable proxy buffering on nginx [puppet] - https://gerrit.wikimedia.org/r/967477 (https://phabricator.wikimedia.org/T349384) (owner: Vgutierrez)
[06:59:50] (PS1) Marostegui: install_server: Do not reimage db1226 [puppet] - https://gerrit.wikimedia.org/r/967762
[07:00:04] Amir1,
Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T0700).
[07:00:04] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:24] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:40] o/
[07:00:47] (PS1) Elukey: ml-services: update recommendation-api-ng's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/967763
[07:00:56] o/ looking
[07:01:15] (CR) Marostegui: [C: +2] install_server: Do not reimage db1226 [puppet] - https://gerrit.wikimedia.org/r/967762 (owner: Marostegui)
[07:03:39] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:03:40] aanzx: why is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/966574/ changing the logo from v2 to v1? the task linked seems unrelated to that
[07:05:29] I replaced the old image, since the v2 image was protected
[07:06:49] (PS4) Majavah: dewiktionary: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/966569 (https://phabricator.wikimedia.org/T348978) (owner: Anzx)
[07:07:01] (PS3) Majavah: hiwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/967213 (https://phabricator.wikimedia.org/T310961) (owner: Anzx)
[07:07:19] so in reality it's like v3 and not v1..
aha
[07:07:26] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/966574 (https://phabricator.wikimedia.org/T349036) (owner: Anzx)
[07:07:28] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/966569 (https://phabricator.wikimedia.org/T348978) (owner: Anzx)
[07:07:30] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/967213 (https://phabricator.wikimedia.org/T310961) (owner: Anzx)
[07:08:24] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:08:44] (Merged) jenkins-bot: knwiktionary: update logo [mediawiki-config] - https://gerrit.wikimedia.org/r/966574 (https://phabricator.wikimedia.org/T349036) (owner: Anzx)
[07:08:47] (Merged) jenkins-bot: dewiktionary: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/966569 (https://phabricator.wikimedia.org/T348978) (owner: Anzx)
[07:08:50] (Merged) jenkins-bot: hiwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/967213 (https://phabricator.wikimedia.org/T310961) (owner: Anzx)
[07:09:00] taavi: it should have been v1 because the original image had some text misaligned
[07:09:07] !log taavi@deploy2002 Started scap: Backport for [[gerrit:966574|knwiktionary: update logo (T349036)]], [[gerrit:966569|dewiktionary: add tagline (T348978)]], [[gerrit:967213|hiwikisource: Adjust width-height ratio of logo to fix display issue (T310961)]]
[07:09:15] T348978: New tagline in german wiktionary for vector-2022 skin - https://phabricator.wikimedia.org/T348978
[07:09:15] T349036: Several Wiktionary projects lose
taglines when switching to Vector 2022 - https://phabricator.wikimedia.org/T349036
[07:09:16] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961
[07:11:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:17:38] !log taavi@deploy2002 taavi and anzx: Backport for [[gerrit:966574|knwiktionary: update logo (T349036)]], [[gerrit:966569|dewiktionary: add tagline (T348978)]], [[gerrit:967213|hiwikisource: Adjust width-height ratio of logo to fix display issue (T310961)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:17:39] (CR) Ilias Sarantopoulos: [C: +1] ml-services: update recommendation-api-ng's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/967763 (owner: Elukey)
[07:17:44] T348978: New tagline in german wiktionary for vector-2022 skin - https://phabricator.wikimedia.org/T348978
[07:17:44] aanzx: please test
[07:17:44] T349036: Several Wiktionary projects lose taglines when switching to Vector 2022 - https://phabricator.wikimedia.org/T349036
[07:17:45] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961
[07:17:52] checking
[07:18:05] (CR) Elukey: [C: +2] ml-services: update recommendation-api-ng's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/967763 (owner: Elukey)
[07:20:11] taavi: looks good
[07:20:47] !log taavi@deploy2002 taavi and anzx: Continuing with sync
[07:21:18] (CR) Ayounsi: [C: +1] Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: Cathal
Mooney)
[07:21:21] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:21:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:22:05] !log elukey@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:22:36] !log elukey@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[07:26:07] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:966574|knwiktionary: update logo (T349036)]], [[gerrit:966569|dewiktionary: add tagline (T348978)]], [[gerrit:967213|hiwikisource: Adjust width-height ratio of logo to fix display issue (T310961)]] (duration: 16m 59s)
[07:27:21] taavi: thanks
[07:27:45] T348978: New tagline in german wiktionary for vector-2022 skin - https://phabricator.wikimedia.org/T348978
[07:27:46] T349036: Several Wiktionary projects lose taglines when switching to Vector 2022 - https://phabricator.wikimedia.org/T349036
[07:27:46] T310961: Site logo cropped/not fully displayed on some projects - https://phabricator.wikimedia.org/T310961
[07:28:55] (PS21) Brouberol: Define environment variables to ease the use of prometheus-metricsfetcher [puppet] - https://gerrit.wikimedia.org/r/967134 (https://phabricator.wikimedia.org/T349393)
[07:30:16] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 138 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:30:27] (CR) Brouberol: Define environment variables to ease the use of prometheus-metricsfetcher (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967134
(https://phabricator.wikimedia.org/T349393) (owner: Brouberol)
[07:33:14] (CR) Filippo Giunchedi: "tbh I would prefer if we could change scap to not use nrpe for checks, maybe we can e.g. deploy the script standalone and switch to type: " [puppet] - https://gerrit.wikimedia.org/r/967202 (owner: Hnowlan)
[07:34:20] jouncebot: now
[07:34:21] For the next 0 hour(s) and 25 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T0700)
[07:35:34] (PS7) Muehlenhoff: Add textfile exporter for nftables check [puppet] - https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499)
[07:36:03] !log Upgrading CI Jenkins # T349282
[07:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:24] SRE, Maps, Traffic: Allow Wikimedia Maps usage on Wikimedia Commons Android app - https://phabricator.wikimedia.org/T349280 (Marostegui)
[07:40:10] taavi: is it possible to run echo 'https://en.wikipedia.org/static/images/project-logos/XXwiki.png' | mwscript purgeList.php for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/966574 ? The old version is still appearing for me
[07:40:27] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/967134 (https://phabricator.wikimedia.org/T349393) (owner: Brouberol)
[07:40:41] oh yes, sorry I always forget that step
[07:41:32] (PS3) Muehlenhoff: Failover testreduce to testreduce1002 [dns] - https://gerrit.wikimedia.org/r/965163 (https://phabricator.wikimedia.org/T345220)
[07:42:07] !log mwscript purgeList.php enwiki <<< "https://en.wikipedia.org/static/images/project-logos/knwiktionary.png" (and for 1.5x and 2x variants)
[07:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:50] (CR) Muehlenhoff: [C: +2] Add textfile exporter for nftables check [puppet] - https://gerrit.wikimedia.org/r/965723
(https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[07:46:55] taavi: also needed to run echo for https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-wordmark-kn.svg
[07:47:41] done
[07:48:19] thanks, new logo visible now
[08:01:34] !log installing Linux kernel updates for Buster 5.10 backport
[08:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:26] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:07:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:08:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1004.eqiad.wmnet
[08:10:19] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966840 (owner: TrainBranchBot)
[08:13:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1004.eqiad.wmnet
[08:14:30] !log brouberol@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1006.eqiad.wmnet
[08:16:08] SRE, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, serviceops, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Joe) Sorry for the silence, I was first at a conference then in bed sick (and I'm still not in a great he...
[08:16:22] (CR) Vgutierrez: [C: +2] acme_chief: Disable proxy buffering on nginx [puppet] - https://gerrit.wikimedia.org/r/967477 (https://phabricator.wikimedia.org/T349384) (owner: Vgutierrez)
[08:16:22] I am shutting down the CI Jenkins for some minutes
[08:17:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1003.eqiad.wmnet
[08:18:07] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966841
[08:18:09] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966841 (owner: TrainBranchBot)
[08:19:46] !log brouberol@cumin1001 START - Cookbook sre.dns.netbox
[08:21:53] !log brouberol@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001"
[08:24:48] !log brouberol@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001"
[08:24:48] !log brouberol@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:24:49] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-jumbo1006.eqiad.wmnet
[08:25:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1003.eqiad.wmnet
[08:31:56] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:16] !log brouberol@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1005.eqiad.wmnet
[08:38:10] !log brouberol@cumin1001 START - Cookbook sre.dns.netbox
[08:41:14] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966841 (owner: TrainBranchBot)
[08:43:05] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/139/con" [puppet] - https://gerrit.wikimedia.org/r/965400 (owner: Majavah)
[08:43:46] (Traffic bill over quota) firing: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[08:44:06] (CR) Majavah: [V: +1 C: +2] P:toolforge: provision root sudo policy via here [puppet] - https://gerrit.wikimedia.org/r/965400 (owner: Majavah)
[08:44:39] (CR) Lucas Werkmeister (WMDE): [C: +1] "Soft-launch" iOS-compatible HLS video transcodes [mediawiki-config] - https://gerrit.wikimedia.org/r/967531 (https://phabricator.wikimedia.org/T68722) (owner: Brion VIBBER)
[08:48:04] (CR) Btullis: Fix issues with multiple spark shufflers specific to version 3.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[08:49:02] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966842
[08:49:04] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966842 (owner: TrainBranchBot)
[08:51:00] !log brouberol@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001"
[08:52:03] !log brouberol@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by
cookbooks.sre.dns.netbox: kafka-jumbo1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001"
[08:52:03] !log brouberol@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:52:04] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-jumbo1005.eqiad.wmnet
[08:55:26] !log brouberol@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1004.eqiad.wmnet
[08:58:11] SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro)
[08:58:36] (PS1) Muehlenhoff: firewall: Fix interval of metrics export [puppet] - https://gerrit.wikimedia.org/r/967862 (https://phabricator.wikimedia.org/T348499)
[09:00:00] (CR) Jelto: "thanks for adding cookbook locking to the cookbooks. Some answers in-line." [cookbooks] - https://gerrit.wikimedia.org/r/967629 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[09:00:07] SRE, Infrastructure-Foundations, netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[09:00:15] SRE, Infrastructure-Foundations, netops: Bring Juniper switches in eqiad racks E5-7 and F5-7 online and ready for servers - https://phabricator.wikimedia.org/T334230 (ayounsi)
[09:00:19] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:00:23] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (ayounsi) Open→Resolved Thanks! I'll open a different task if needed to debug any remaining issue, but looks like it's a Dell bug so far.
[09:00:24] !log brouberol@cumin1001 START - Cookbook sre.dns.netbox
[09:00:41] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:45] (Traffic bill over quota) resolved: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[09:07:03] (CR) JMeybohm: [C: -1] "This should use the value from common_images: Ie1ddb7b7e4de0a449161cdca94176630ae71b12a" [deployment-charts] - https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) (owner: Jelto)
[09:09:04] (CR) Cathal Mooney: [C: +2] Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: Cathal Mooney)
[09:09:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:10:17] (CR) Muehlenhoff: [C: +2] firewall: Fix interval of metrics export [puppet] - https://gerrit.wikimedia.org/r/967862 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[09:10:24] (Merged) jenkins-bot: Unify BGP policy on L3 and EVPN switches and adjust LVS backup pref [homer/public] - https://gerrit.wikimedia.org/r/966904 (https://phabricator.wikimedia.org/T344601) (owner: Cathal Mooney)
[09:12:37] PROBLEM - Kerberos KAdmin daemon on krb1001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/kadmind https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[09:12:53] PROBLEM - Kerberos KDC daemon on krb1001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/sbin/krb5kdc
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles
[09:13:06] (CR) Elukey: [C: +2] ml-services: deploy new Bullseye version [deployment-charts] - https://gerrit.wikimedia.org/r/966199 (https://phabricator.wikimedia.org/T348647) (owner: Ilias Sarantopoulos)
[09:13:25] ^ krb1001 is expected, ongoing experiment
[09:13:45] !log brouberol@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001"
[09:14:18] (PS1) Filippo Giunchedi: prometheus: switch alerts to cloud prometheus [puppet] - https://gerrit.wikimedia.org/r/967863 (https://phabricator.wikimedia.org/T336854)
[09:15:05] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:15:25] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:07] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:20] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:18:01] PROBLEM - Host cumin1001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:03] !log elukey@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:18:15] (PS1) JMeybohm: mw-debug: Switch to certmanager certificates [deployment-charts] - https://gerrit.wikimedia.org/r/967864 (https://phabricator.wikimedia.org/T300033)
[09:18:17] (PS1) JMeybohm: mw-web: Switch to certmanager certificates [deployment-charts] - https://gerrit.wikimedia.org/r/967865 (https://phabricator.wikimedia.org/T300033)
[09:18:19] (PS1) JMeybohm: mw-jobrunner: Switch to certmanager certificates [deployment-charts] - https://gerrit.wikimedia.org/r/967866 (https://phabricator.wikimedia.org/T300033)
[09:18:21] (PS1) JMeybohm: mw-api-ext: Switch to certmanager certificates [deployment-charts] - https://gerrit.wikimedia.org/r/967867 (https://phabricator.wikimedia.org/T300033)
[09:18:23] (PS1) JMeybohm: mw-api-int: Switch to certmanager certificates [deployment-charts] - https://gerrit.wikimedia.org/r/967868 (https://phabricator.wikimedia.org/T300033)
[09:18:44] !log elukey@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[09:18:47] RECOVERY - Host cumin1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[09:18:53] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:19:12] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:19:38] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[09:21:11] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:21:24] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[09:21:39] (KeyholderUnarmed) firing: 2 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:21:55] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:24:12] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/966842 (owner: TrainBranchBot)
[09:24:56] (PS1) Giuseppe Lavagetto: jobrunner: increase open files limit [puppet] - https://gerrit.wikimedia.org/r/967870 (https://phabricator.wikimedia.org/T344428)
[09:26:42] (PS4) Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213)
[09:27:56] (PS5) Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213)
[09:28:49] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:29:09] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:29:26] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/140/console" [puppet] - https://gerrit.wikimedia.org/r/967863 (https://phabricator.wikimedia.org/T336854) (owner: Filippo Giunchedi)
[09:29:39] (PS6) Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213)
[09:29:49] (PS7) Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213)
[09:30:46] (CR) Klausman: images: Update kserve/build to kserve v0.11.1 (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: Klausman)
[09:31:11] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:31:20] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:31:21] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:31:29] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:32:20] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:32:20] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:33:13] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/967863 (https://phabricator.wikimedia.org/T336854) (owner: Filippo
Giunchedi) [09:33:43] (03CR) 10JMeybohm: [C: 03+2] CI: Properly detect changes to link targets in helmfile.d/*services [deployment-charts] - 10https://gerrit.wikimedia.org/r/967445 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:34:05] 10SRE, 10Infrastructure-Foundations, 10netops: Bring Juniper switches in eqiad racks E5-7 and F5-7 online and ready for servers - https://phabricator.wikimedia.org/T334230 (10cmooney) 05Open→03Resolved All devices online and ready for servers in their racks. [09:34:13] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:00] (03CR) 10Majavah: [V: 03+1 C: 03+1] "pcc seems to be having some intermittent failures, but this looks good" [puppet] - 10https://gerrit.wikimedia.org/r/967863 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [09:35:05] !log brouberol@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001 - brouberol@cumin1001 - T336044" [09:35:10] T336044: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 [09:35:21] (03CR) 10Cathal Mooney: [C: 03+2] Remove host interface errors alert until ethtool stats exposed [alerts] - 10https://gerrit.wikimedia.org/r/964916 (https://phabricator.wikimedia.org/T347312) (owner: 10Cathal Mooney) [09:36:09] !log brouberol@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001 - brouberol@cumin1001 - T336044" [09:36:39] (KeyholderUnarmed) 
resolved: 2 unarmed Keyholder key(s) on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:37:29] !log brouberol@cumin1001 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: kafka-jumbo1004.eqiad.wmnet [09:37:30] !log brouberol@cumin1001 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: kafka-jumbo1004.eqiad.wmnet [09:38:29] 10SRE, 10Infrastructure-Foundations, 10netops: CRs ECMP traffic to LVS VIPs despite higher MED on backup route - https://phabricator.wikimedia.org/T348446 (10cmooney) 05Open→03Resolved a:03cmooney Closing this task. As mentioned internal traffic may not always go to the active LVS until we complete T... [09:41:20] (03CR) 10Brouberol: [C: 03+1] Fix issues with multiple spark shufflers specific to version 3.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [09:41:58] (03CR) 10Brouberol: [C: 03+1] Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [09:42:27] (03Merged) 10jenkins-bot: CI: Properly detect changes to link targets in helmfile.d/*services [deployment-charts] - 10https://gerrit.wikimedia.org/r/967445 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:44:13] (03CR) 10Elukey: "Did docker-pkg work locally etc..?" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [09:45:03] (03PS2) 10JMeybohm: mw-debug: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967864 (https://phabricator.wikimedia.org/T300033) [09:45:03] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:05] (03PS2) 10JMeybohm: mw-web: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967865 (https://phabricator.wikimedia.org/T300033) [09:45:07] (03PS2) 10JMeybohm: mw-jobrunner: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967866 (https://phabricator.wikimedia.org/T300033) [09:45:09] (03PS2) 10JMeybohm: mw-api-ext: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967867 (https://phabricator.wikimedia.org/T300033) [09:45:11] (03PS2) 10JMeybohm: mw-api-int: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967868 (https://phabricator.wikimedia.org/T300033) [09:45:13] (03PS1) 10JMeybohm: mw: Move extraFQDNs definition into a separate file [deployment-charts] - 10https://gerrit.wikimedia.org/r/967872 (https://phabricator.wikimedia.org/T300033) [09:45:15] (03CR) 10Klausman: images: Update kserve/build to kserve v0.11.1 (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [09:45:20] (03PS1) 10Filippo Giunchedi: prometheus: redirect prometheus-site homepage to 'ops' [puppet] - 10https://gerrit.wikimedia.org/r/967873 [09:45:58] (03CR) 10Hnowlan: [C: 03+2] media-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/967438 (https://phabricator.wikimedia.org/T347899) (owner: 10Hnowlan) 
[09:46:13] (03CR) 10Filippo Giunchedi: "A convenience to have, for example https://prometheus-eqiad.wikimedia.org, do the right thing and display prometheus instead of apache def" [puppet] - 10https://gerrit.wikimedia.org/r/967873 (owner: 10Filippo Giunchedi) [09:46:59] (03Merged) 10jenkins-bot: media-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/967438 (https://phabricator.wikimedia.org/T347899) (owner: 10Hnowlan) [09:47:01] (03CR) 10Lucas Werkmeister (WMDE): New stream for Android Patroller tasks feature (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816) (owner: 10Sharvaniharan) [09:47:13] RECOVERY - Kerberos KAdmin daemon on krb1001 is OK: PROCS OK: 1 process with args /usr/sbin/kadmind https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [09:47:36] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:47:39] RECOVERY - Kerberos KDC daemon on krb1001 is OK: PROCS OK: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [09:47:57] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [09:48:25] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [09:48:57] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [09:49:28] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [09:49:35] 10SRE, 10Math, 10RESTBase-API, 10Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (10Physikerwelt) 05Open→03Resolved a:03Physikerwelt @SalixAlba strange, we just improved the error logging. Maybe the restb... 
[09:49:41] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [09:49:41] brouberol: ok to remove kafka-jumbo1003 from dns? [09:49:56] it is [09:50:04] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:50:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Can confirm that a grep on deploy2002:mediawiki-staging only finds the setting in CommonSettings.php and in the HISTORY files 👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [09:50:13] actually I have an issue in my change [09:50:17] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [09:50:27] I'm currently decommissioning kafka-jumbo100[1-6] [09:50:38] what's the issue? [09:51:23] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [09:51:37] +lo0.lsw1-f8-eqiad 1H IN AAAA 2620:0:861:11b:: [09:51:49] technically correct, but cleaner to have a non 0 IP :) [09:51:53] (03PS8) 10Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) [09:52:01] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you Taavi!" [puppet] - 10https://gerrit.wikimedia.org/r/967863 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [09:53:16] (03CR) 10Klausman: images: Update kserve/build to kserve v0.11.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [09:53:41] (done) [09:54:12] !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-f8/ssw links - ayounsi@cumin1001" [09:54:17] XioNoX for my own edification: why are you running sre.dns.netbox yourself? 
My cookbook is attempting to acquire a lock that is taken by your run, meaning that it would have been executed by my sre.hosts.decommission run. Did we try to apply multiple changes at the same time? [09:54:26] ah, ^ seems like it [09:54:30] (03CR) 10Lucas Werkmeister (WMDE): Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [09:54:50] (03PS3) 10JMeybohm: mw-debug: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967864 (https://phabricator.wikimedia.org/T300033) [09:54:52] (03PS3) 10JMeybohm: mw-web: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967865 (https://phabricator.wikimedia.org/T300033) [09:54:54] (03PS3) 10JMeybohm: mw-jobrunner: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967866 (https://phabricator.wikimedia.org/T300033) [09:54:56] brouberol: I did a manual dns change in Netbox [09:54:57] (03PS3) 10JMeybohm: mw-api-ext: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967867 (https://phabricator.wikimedia.org/T300033) [09:54:59] (03PS3) 10JMeybohm: mw-api-int: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967868 (https://phabricator.wikimedia.org/T300033) [09:55:07] gotcha [09:55:13] so I needed to run it manually [09:55:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add A/PTR for lsw1-f8/ssw links - ayounsi@cumin1001" [09:55:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:55:25] but your changes got caught, in there, so I pushed them too [09:55:31] lock released :) [09:55:37] !log brouberol@cumin1001 START - Cookbook sre.dns.netbox [09:55:41] so it should be a noop for 
your change [09:56:09] thank you! [09:57:01] (03CR) 10Elukey: images: Update kserve/build to kserve v0.11.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [09:57:12] !log brouberol@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:57:13] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-jumbo1003.eqiad.wmnet [09:57:55] (03PS9) 10Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) [09:58:04] (03CR) 10Klausman: images: Update kserve/build to kserve v0.11.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [09:59:35] (03PS10) 10Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) [09:59:57] (03CR) 10Btullis: [C: 03+2] Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:00:06] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1000) [10:01:49] (03CR) 10Elukey: images: Update kserve/build to kserve v0.11.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [10:01:52] (03PS9) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) [10:01:54] (03PS8) 10Btullis: Enable the multiple spark shufflers on the test cluster [puppet] - 
10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) [10:02:22] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:02:33] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:02:45] (03PS1) 10Filippo Giunchedi: grafana: point prometheus/labs to prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/967874 (https://phabricator.wikimedia.org/T336854) [10:02:53] (03CR) 10Btullis: Fix issues with multiple spark shufflers specific to version 3.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:03:26] (03CR) 10Filippo Giunchedi: "There's ~14 days of data in the new instance, I'm ok to wait too if that's not enough" [puppet] - 10https://gerrit.wikimedia.org/r/967874 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [10:04:22] !log brouberol@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1002.eqiad.wmnet [10:05:04] 10SRE, 10Traffic, 10Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10cmooney) > Regarding UDP based protocols, DNS over UDP is usually capped at 512 bytes. While that was true at one stage, most DNS implementations now support [[ https://datatracke... [10:05:56] (03CR) 10Majavah: [C: 03+1] "thanks! 
I'm fine with moving forward with this now" [puppet] - 10https://gerrit.wikimedia.org/r/967874 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [10:08:18] (03CR) 10Btullis: [C: 03+2] Fix issues with multiple spark shufflers specific to version 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/967434 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:09:03] (03CR) 10Majavah: [C: 03+2] aptrepo: drop k8s 1.22 components [puppet] - 10https://gerrit.wikimedia.org/r/966865 (https://phabricator.wikimedia.org/T298005) (owner: 10Majavah) [10:09:29] (03CR) 10Btullis: [C: 03+2] Enable the multiple spark shufflers on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/967448 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:10:07] !log brouberol@cumin1001 START - Cookbook sre.dns.netbox [10:11:07] (03PS1) 10Arnaudb: add s6 replacement for db1131 (db1231) [puppet] - 10https://gerrit.wikimedia.org/r/966844 (https://phabricator.wikimedia.org/T344036) [10:11:37] !log reprepro: drop thirdparty/kubeadm-k8s-1-22 component [10:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:18] !log brouberol@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001" [10:13:24] (03CR) 10Marostegui: [C: 04-1] "You also need to remove it from the list of insetup, on site.pp line 748" [puppet] - 10https://gerrit.wikimedia.org/r/966844 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:13:29] !log brouberol@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001" [10:13:29] !log brouberol@cumin1001 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [10:13:30] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kafka-jumbo1002.eqiad.wmnet [10:14:21] (03Abandoned) 10JMeybohm: mw: Move extraFQDNs definition into a separate file [deployment-charts] - 10https://gerrit.wikimedia.org/r/967872 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:15:26] (03CR) 10Btullis: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/967409 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [10:15:57] (03PS1) 10Majavah: aptrepo: Import kubeadm 1.23 for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/967875 (https://phabricator.wikimedia.org/T284656) [10:16:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:19:27] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [10:20:11] (03CR) 10Brouberol: [C: 03+2] Monitor the expiration date of the skein x509 certificates [alerts] - 10https://gerrit.wikimedia.org/r/967409 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [10:20:30] (03PS1) 10Hashar: httpd: let Apache strip unavailable log fields [puppet] - 10https://gerrit.wikimedia.org/r/967877 (https://phabricator.wikimedia.org/T332672) [10:20:31] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, and 2 others: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10cmooney) Commands for change later on: ` wmcs-openstack port unset ca4cb8c7-bfb8-440b-8e41-74bb8e834717 --fixed-ip subnet=clo... 
[10:20:55] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:21:02] (03PS2) 10Arnaudb: mariadb: add s6 replacement for db1131 (db1231) [puppet] - 10https://gerrit.wikimedia.org/r/966844 (https://phabricator.wikimedia.org/T344036) [10:21:06] (03PS1) 10Giuseppe Lavagetto: kubernetes: update common images [puppet] - 10https://gerrit.wikimedia.org/r/967878 [10:21:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:21:23] (03Merged) 10jenkins-bot: Monitor the expiration date of the skein x509 certificates [alerts] - 10https://gerrit.wikimedia.org/r/967409 (https://phabricator.wikimedia.org/T329398) (owner: 10Brouberol) [10:22:32] (03CR) 10Marostegui: [C: 03+1] mariadb: add s6 replacement for db1131 (db1231) [puppet] - 10https://gerrit.wikimedia.org/r/966844 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:22:43] (03CR) 10Arnaudb: [C: 03+2] mariadb: add s6 replacement for db1131 (db1231) [puppet] - 10https://gerrit.wikimedia.org/r/966844 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:23:06] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [10:23:29] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/143/con" [puppet] - 10https://gerrit.wikimedia.org/r/967134 (https://phabricator.wikimedia.org/T349393) (owner: 10Brouberol) [10:24:19] (03CR) 10JMeybohm: [C: 03+2] mw-debug: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967864 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:25:18] !log jbond@cumin1001 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test - jbond@cumin1001" [10:25:28] (03Merged) 10jenkins-bot: mw-debug: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967864 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:26:09] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test - jbond@cumin1001" [10:26:09] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:26:16] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [10:28:19] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test - jbond@cumin1001" [10:29:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: test - jbond@cumin1001" [10:29:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:29:29] (03CR) 10Btullis: [C: 03+1] "I'm happy with this change in principle, but I'm still a little in the dark on how and when and why to use prometheus-metricsfetcher." 
[puppet] - 10https://gerrit.wikimedia.org/r/967134 (https://phabricator.wikimedia.org/T349393) (owner: 10Brouberol) [10:30:05] (03CR) 10Btullis: [V: 03+1 C: 03+2] Update the email address used for refine-test systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/967215 (owner: 10Btullis) [10:31:24] 10SRE, 10Infrastructure-Foundations, 10netops: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (10cmooney) 05Open→03Resolved [10:31:30] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [10:31:40] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:32:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Provision db1231 depooled as a candidate master for s6', diff saved to https://phabricator.wikimedia.org/P53024 and previous config saved to /var/cache/conftool/dbconfig/20231023-103202-arnaudb.json [10:32:09] 10SRE, 10Infrastructure-Foundations, 10netops: Add non-EVPN L3 Switch routing policy definitions to Homer - https://phabricator.wikimedia.org/T344601 (10cmooney) 05Open→03Resolved [10:32:17] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [10:32:28] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack::pdns::auth: make pdns web server listen on all IPs [puppet] - 10https://gerrit.wikimedia.org/r/966494 (https://phabricator.wikimedia.org/T336854) (owner: 10Majavah) [10:32:49] !log brouberol@cumin1001 START - Cookbook sre.hosts.decommission for hosts kafka-jumbo1001.eqiad.wmnet [10:34:11] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:34:44] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:34:57] !log jayme@deploy2002 helmfile [codfw] START 
helmfile.d/services/mw-debug: apply [10:35:22] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:35:31] (03CR) 10Brouberol: [V: 03+1] Define environment variables to ease the use of prometheus-metricsfetcher (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967134 (https://phabricator.wikimedia.org/T349393) (owner: 10Brouberol) [10:36:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-client1002.eqiad.wmnet [10:37:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [10:37:19] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: provisionning - T344036 [10:37:24] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [10:37:33] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: provisionning - T344036 [10:37:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: provisionning - T344036 [10:37:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: provisionning - T344036 [10:38:49] (03PS11) 10Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) [10:39:25] (03CR) 10Klausman: images: Update kserve/build to kserve v0.11.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [10:40:37] !log brouberol@cumin1001 START - Cookbook sre.dns.netbox [10:40:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
an-test-client1002.eqiad.wmnet [10:41:12] !log switched mw-debug (mw-on-k8s) to certmanager certificates - T300033 [10:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:20] T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 [10:42:52] (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove the need for the analytics-meta database to require java [puppet] - 10https://gerrit.wikimedia.org/r/965761 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [10:43:08] 10SRE-swift-storage, 10Move-Files-To-Commons, 10WMDE-TechWish-Maintenance, 10MW-1.42-notes (1.42.0-wmf.2; 2023-10-24), 10Wikimedia-production-error: FileBackendStore::ingestFreshFileStats: Could not stat file - https://phabricator.wikimedia.org/T348688 (10MatthewVernon) @Kizule if you want logs looked at... [10:44:09] (03PS2) 10Jbond: sre.gitlab.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967629 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [10:44:22] 10SRE-swift-storage, 10Move-Files-To-Commons, 10WMDE-TechWish-Maintenance, 10MW-1.42-notes (1.42.0-wmf.2; 2023-10-24), 10Wikimedia-production-error: FileBackendStore::ingestFreshFileStats: Could not stat file - https://phabricator.wikimedia.org/T348688 (10Kizule) >>! In T348688#9272055, @MatthewVernon wr... [10:44:28] (03CR) 10Elukey: "Two things and we are good to go!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [10:46:05] (03PS12) 10Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) [10:47:20] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) >>! 
In T344547#9108360, @ayounsi wrote: > ` > set policy-options policy-statement Switch_out term from_bgp from as-path core_and_local_LVS > set policy-... [10:48:51] (03PS13) 10Klausman: images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) [10:49:23] (03CR) 10Klausman: images: Update kserve/build to kserve v0.11.1 (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [10:49:29] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:50:27] !log brouberol@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001" [10:50:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depool db1131 T344036', diff saved to https://phabricator.wikimedia.org/P53025 and previous config saved to /var/cache/conftool/dbconfig/20231023-105036-arnaudb.json [10:50:49] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [10:51:19] !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1131.eqiad.wmnet onto db1231.eqiad.wmnet [10:51:27] !log brouberol@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-jumbo1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1001" [10:51:27] !log brouberol@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:51:28] !log brouberol@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts kafka-jumbo1001.eqiad.wmnet [10:53:17] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:56:34] (03CR) 10Samtar: [C: 03+1] Add LoginNotify cron job [puppet] - 10https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [10:57:49] (03PS3) 10Samtar: Enable LoginNotify seen subnets table [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [10:58:20] (03PS2) 10Hashar: httpd: ErrorLogFormat to strip fields with unavailable values [puppet] - 10https://gerrit.wikimedia.org/r/967877 (https://phabricator.wikimedia.org/T332672) [10:58:47] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967865 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:59:03] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:15:48] (03CR) 10Jbond: "lgtm will test properly shortly" [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [11:17:59] 10SRE-swift-storage, 10Move-Files-To-Commons, 10WMDE-TechWish-Maintenance, 10MW-1.42-notes (1.42.0-wmf.2; 2023-10-24), 10Wikimedia-production-error: FileBackendStore::ingestFreshFileStats: Could not stat file - https://phabricator.wikimedia.org/T348688 (10MatthewVernon) I find four PUTs around that time... 
[11:20:05] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966842 (owner: 10TrainBranchBot) [11:21:09] (03CR) 10Brouberol: "The hosts have now been decommissioned" [puppet] - 10https://gerrit.wikimedia.org/r/967240 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [11:22:14] (03CR) 10Brouberol: [V: 03+1 C: 03+2] Define environment variables to ease the use of prometheus-metricsfetcher [puppet] - 10https://gerrit.wikimedia.org/r/967134 (https://phabricator.wikimedia.org/T349393) (owner: 10Brouberol) [11:27:00] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T349495 (10Tobi.smt) [11:27:50] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966845 [11:27:56] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966845 (owner: 10TrainBranchBot) [11:28:09] (03CR) 10Jbond: [C: 04-1] "see inline" [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [11:30:06] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T349495 (10Tobi.smt) how do I delete this? 
[11:30:20] (03PS9) 10Hashar: contint: rename jenkins-slave to jenkins-agent [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) [11:30:40] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/922555 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [11:30:54] (03PS1) 10Jbond: puppet_compiler: add typing_extensions [puppet] - 10https://gerrit.wikimedia.org/r/967889 [11:31:03] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T349495 (10Tobi.smt) 05Open→03Resolved p:05Triage→03Low a:05Tobi.smt→03None [11:33:29] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Server not yet in productin use [11:33:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Server not yet in productin use [11:38:43] (03Abandoned) 10Hashar: wip [puppet] - 10https://gerrit.wikimedia.org/r/737100 (owner: 10Herron) [11:43:04] (03PS2) 10Jbond: puppet_compiler: add typing_extensions [puppet] - 10https://gerrit.wikimedia.org/r/967889 [11:45:45] (03CR) 10Jbond: Add a json representation for each host (031 comment) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479 (owner: 10Hashar) [11:46:30] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/966845 (owner: 10TrainBranchBot) [11:47:25] (03PS5) 10Hashar: gerrit: change scap user to gerrit-deploy [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) [11:47:37] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [11:49:02] !log added Balthazar to pwstore [11:49:02] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@054e07d] 
(releasing): (no justification provided) [11:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:45] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@054e07d] (releasing): (no justification provided) (duration: 00m 42s) [11:49:48] (03PS3) 10Jbond: puppet_compiler: add typing_extensions [puppet] - 10https://gerrit.wikimedia.org/r/967889 [11:52:37] (03PS6) 10Hashar: gerrit: change scap user to gerrit-deploy [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) [11:53:04] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [11:54:26] (03PS4) 10Jbond: puppet_compiler: add typing_extensions [puppet] - 10https://gerrit.wikimedia.org/r/967889 [12:00:52] (03CR) 10Majavah: [C: 03+2] P:cloudceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah) [12:06:38] (03CR) 10Muehlenhoff: [C: 04-1] prometheus::node_debian_version: Move to prometheus::node_textfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965732 (owner: 10Muehlenhoff) [12:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:10:41] (03CR) 10Filippo Giunchedi: [C: 03+2] "Ok thank you, it is easy to revert if we need to, until cloudmetrics hosts are around that is." 
[puppet] - 10https://gerrit.wikimedia.org/r/967874 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [12:14:50] (03CR) 10Muehlenhoff: [C: 03+2] Setup a prerouting chain in the base table to exempt traffic from conntrack [puppet] - 10https://gerrit.wikimedia.org/r/965659 (https://phabricator.wikimedia.org/T348735) (owner: 10Muehlenhoff) [12:16:38] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1131.eqiad.wmnet onto db1231.eqiad.wmnet [12:17:32] (03PS3) 10Muehlenhoff: nftables::service Write out notrack rules for services skipping conntrack [puppet] - 10https://gerrit.wikimedia.org/r/965664 (https://phabricator.wikimedia.org/T348735) [12:18:07] (03CR) 10Jelto: [C: 03+1] "lgtm thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/967629 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:18:32] (03PS1) 10Majavah: wmnet: drop cloudmetrics CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/967898 (https://phabricator.wikimedia.org/T326266) [12:20:30] (03PS1) 10Muehlenhoff: Add Tyler for approval of various release groups [puppet] - 10https://gerrit.wikimedia.org/r/967899 (https://phabricator.wikimedia.org/T276465) [12:20:43] (03CR) 10Ladsgroup: [C: 04-1] "Can you use virtual domains? See https://phabricator.wikimedia.org/T330590 and accompanying patches in URL shortener as an example." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989) (owner: 10Tim Starling) [12:23:40] (03CR) 10Samtar: [C: 04-2] "Self -2, awaiting T348487#9272333 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967900 (https://phabricator.wikimedia.org/T348487) (owner: 10Samtar) [12:24:21] (03CR) 10Muehlenhoff: "If any of those groups should no longer be actively used, let me know, when we can mark them as deprecated." 
[puppet] - 10https://gerrit.wikimedia.org/r/967899 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [12:27:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965664 (https://phabricator.wikimedia.org/T348735) (owner: 10Muehlenhoff) [12:27:18] (03PS2) 10Jelto: miscweb: remove the use of :latest image tag in httpd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) [12:27:56] (03CR) 10CI reject: [V: 04-1] miscweb: remove the use of :latest image tag in httpd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) (owner: 10Jelto) [12:33:27] !log installing libx11 security updates [12:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:10] (03CR) 10Jelto: [C: 03+2] kubernetes::deployment_server: add common_image for httpd exporter [puppet] - 10https://gerrit.wikimedia.org/r/967174 (https://phabricator.wikimedia.org/T348856) (owner: 10Jelto) [12:34:37] (03CR) 10JMeybohm: [C: 03+2] mw-web: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967865 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:35:11] (03PS1) 10Filippo Giunchedi: alertmanager: allow api access for alertmanagers hosts too [puppet] - 10https://gerrit.wikimedia.org/r/967904 (https://phabricator.wikimedia.org/T321579) [12:39:01] (03Merged) 10jenkins-bot: mw-web: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967865 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:39:44] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:40:10] (03CR) 10Btullis: [C: 03+1] Drop kafka-jumbo100[1-6].eqiad.wmnet from the puppet site [puppet] - 10https://gerrit.wikimedia.org/r/967240 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:40:11] !log 
jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:40:35] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:40:37] !log switched mw-web (mw-on-k8s) to certmanager certificates - T300033 [12:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:42] T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 [12:41:10] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967866 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:41:11] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [12:42:13] (03CR) 10Brouberol: [C: 03+2] Drop kafka-jumbo100[1-6].eqiad.wmnet from the puppet site [puppet] - 10https://gerrit.wikimedia.org/r/967240 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol) [12:44:57] (03CR) 10Jelto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) (owner: 10Jelto) [12:49:36] (03CR) 10Majavah: [C: 03+2] admin: hashar: update gdbinit from php 7.4.30 [puppet] - 10https://gerrit.wikimedia.org/r/941949 (owner: 10Hashar) [12:52:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/967898 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [12:55:08] (03PS3) 10Jelto: miscweb: remove the use of :latest image tag in httpd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/967147 (https://phabricator.wikimedia.org/T348856) [12:55:10] (03PS4) 10Sharvaniharan: New stream for Android Patroller tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816) [12:56:45] (03PS5) 10Sharvaniharan: New stream for Android Patroller tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816) [12:58:02] (03CR) 
10Sharvaniharan: "Fixed code review comments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816) (owner: 10Sharvaniharan) [12:58:11] (03PS6) 10Sharvaniharan: New stream for Android Patroller tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1300) [13:00:04] MichaelG_WMDE, MatmaRex, and sharvani__: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] 👋 [13:00:11] hi [13:00:17] a lot of patches! [13:00:19] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:00:22] o/ [13:00:23] Hi! here for deployment window. 
[13:00:36] i can deploy today, unless Lucas_WMDE wants to :) [13:00:43] was gonna say the same :P [13:00:49] :D [13:01:04] I looked at some of the patches earlier but no blockers from me I think [13:01:24] (03CR) 10Urbanecm: [C: 03+2] Remove 'currentProto'/'finalProto'/'proto' business [extensions/CentralAuth] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967195 (https://phabricator.wikimedia.org/T348852) (owner: 10Bartosz Dziewoński) [13:01:31] ack [13:01:47] thanks for reviewing, i've just noticed your comments [13:01:48] (03PS3) 10Cathal Mooney: Change eqiad cloudgw VIPs to /29 so system can ARP from them [puppet] - 10https://gerrit.wikimedia.org/r/965708 (https://phabricator.wikimedia.org/T348140) [13:02:00] (03CR) 10Urbanecm: [C: 03+2] wikidatawiki: Switch property for determining Lexeme language code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967429 (https://phabricator.wikimedia.org/T348923) (owner: 10Michael Große) [13:02:03] good point about this being a release blocker [13:02:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967429 (https://phabricator.wikimedia.org/T348923) (owner: 10Michael Große) [13:03:01] (03Merged) 10jenkins-bot: wikidatawiki: Switch property for determining Lexeme language code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967429 (https://phabricator.wikimedia.org/T348923) (owner: 10Michael Große) [13:03:28] MatmaRex: giving space to you to reply to Lucas's comments, but all patches have a +1, so we can definitely try and see :) [13:03:56] yeah, i think they're good to go [13:04:00] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:967429|wikidatawiki: Switch property for determining Lexeme language code (T348923)]] [13:04:03] (03CR) 10David Caro: [C: 03+1] ":shipit:!" 
[puppet] - 10https://gerrit.wikimedia.org/r/965708 (https://phabricator.wikimedia.org/T348140) (owner: 10Cathal Mooney) [13:04:08] ack [13:04:18] T348923: Switch Property that we use for determining available language codes - https://phabricator.wikimedia.org/T348923 [13:04:27] !log installing libxpm security updates on buster [13:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:14] MatmaRex: is it a good idea to try deploying all/some config patches you have at once, to save time? seems like 966915 is a no-op, so is 966919. any opinion? [13:05:16] !log urbanecm@deploy2002 migr and urbanecm: Backport for [[gerrit:967429|wikidatawiki: Switch property for determining Lexeme language code (T348923)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:30] * MichaelG_WMDE tests [13:05:31] MichaelG_WMDE: your patch is at mwdebug2001, please test :) [13:05:36] you're quicker! :) [13:05:45] (03PS7) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [13:05:50] (03CR) 10Cathal Mooney: [C: 03+2] Change eqiad cloudgw VIPs to /29 so system can ARP from them [puppet] - 10https://gerrit.wikimedia.org/r/965708 (https://phabricator.wikimedia.org/T348140) (owner: 10Cathal Mooney) [13:06:03] urbanecm: yeah, we probably could [13:06:10] okay, will do. 
[13:06:11] we definitely could [13:06:27] (03CR) 10CI reject: [V: 04-1] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [13:06:50] (03PS3) 10Urbanecm: Remove unused $wgIncludeLegacyJavaScript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [13:06:55] (03PS2) 10Urbanecm: Remove $wgApiFrameOptions override for enwiki and zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966919 (https://phabricator.wikimedia.org/T131183) (owner: 10Bartosz Dziewoński) [13:06:57] mh, I do not see the effect of it yet. @Lucas_WMDE can you have a look as well? [13:06:58] (03CR) 10Urbanecm: [C: 03+2] Remove unused $wgIncludeLegacyJavaScript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [13:07:02] * Lucas_WMDE looks [13:07:02] (03CR) 10Urbanecm: [C: 03+2] Remove $wgApiFrameOptions override for enwiki and zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966919 (https://phabricator.wikimedia.org/T131183) (owner: 10Bartosz Dziewoński) [13:07:07] (03Merged) 10jenkins-bot: Remove 'currentProto'/'finalProto'/'proto' business [extensions/CentralAuth] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967195 (https://phabricator.wikimedia.org/T348852) (owner: 10Bartosz Dziewoński) [13:07:46] (03Merged) 10jenkins-bot: Remove unused $wgIncludeLegacyJavaScript [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966915 (owner: 10Bartosz Dziewoński) [13:07:49] (03Merged) 10jenkins-bot: Remove $wgApiFrameOptions override for enwiki and zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966919 (https://phabricator.wikimedia.org/T131183) (owner: 10Bartosz Dziewoński) [13:08:09] MichaelG_WMDE: with ?debug=2, I get “This Item has an unrecognized language code. 
Please select one below.” as the result for British English [13:08:13] (rather than no message at all) [13:08:24] so I think it’s making some difference at least [13:08:31] let me try with `?debug=2` too [13:08:57] I was looking at the network panel and still seeing it requesting P218 [13:09:07] aha, the IETF language tag is en-GB, uppercase, that’s why it doesn’t match [13:09:19] i see failed query expectations logged `readQueryRows <= 10000`, i guess that's not new? [13:09:31] (but I’m not sure if the change was even expected to solve this, it was just the first example that popped into my head) [13:09:39] urbanecm: I think those happen from time to time? [13:09:41] but let me see [13:09:54] it's on a WD table (`SELECT pi_property_id,pi_info FROM `wb_property_info`), hence i mention it. [13:09:59] ah, I see [13:10:00] but yeah, i wouldn't be surprised if it's not new. [13:10:03] pretty sure it’s unrelated [13:10:08] me too. [13:10:08] but yeah we should do something about that ™ [13:10:20] (it’s probably *much* worse on test wikidata, actually, which has way more properties) [13:10:30] ah gotcha, with `debug=2` I also see it requesting P305 [13:10:42] * Lucas_WMDE searches phabricator [13:10:45] so it was just some cache thingy I guess [13:10:56] yeah, apparently Ctrl+F5 wasn’t enough [13:11:03] MichaelG_WMDE: so, that means it's ok to go? 
[13:11:24] yes, I think this change does what it is supposed to do 👍 [13:11:29] great, proceeding [13:11:30] !log urbanecm@deploy2002 migr and urbanecm: Continuing with sync [13:11:50] (03PS1) 10Brouberol: Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout [puppet] - 10https://gerrit.wikimedia.org/r/967930 (https://phabricator.wikimedia.org/T348495) [13:12:19] (03PS8) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [13:12:29] (03CR) 10Brouberol: Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967930 (https://phabricator.wikimedia.org/T348495) (owner: 10Brouberol) [13:12:58] (03CR) 10CI reject: [V: 04-1] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [13:15:57] (03PS1) 10David Caro: networktests: use tool network-tests instead of personal one [puppet] - 10https://gerrit.wikimedia.org/r/967932 [13:16:14] (03CR) 10CI reject: [V: 04-1] networktests: use tool network-tests instead of personal one [puppet] - 10https://gerrit.wikimedia.org/r/967932 (owner: 10David Caro) [13:16:50] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:967429|wikidatawiki: Switch property for determining Lexeme language code (T348923)]] (duration: 12m 50s) [13:16:54] finally [13:16:55] T348923: Switch Property that we use for determining available language codes - https://phabricator.wikimedia.org/T348923 [13:17:13] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:967195|Remove 'currentProto'/'finalProto'/'proto' business (T348852)]], [[gerrit:966915|Remove unused $wgIncludeLegacyJavaScript]], [[gerrit:966919|Remove $wgApiFrameOptions override for enwiki and zhwiki (T131183)]] [13:17:14] MatmaRex: proceeding wiht (some) of your config 
patches + the backport. [13:17:18] T131183: Remove $wgApiFrameOptions = 'SAMEORIGIN' override for enwiki and zhwiki - https://phabricator.wikimedia.org/T131183 [13:17:18] T348852: Remove CentralAuth support for mixed-protocol HTTP/HTTPS wikis - https://phabricator.wikimedia.org/T348852 [13:17:36] okay [13:17:49] urbanecm: filed T349511 for the too many rows read issue [13:17:50] T349511: Wikibase reads too many wb_property_info rows at once (expectation readQueryRows <= 10000 not met) - https://phabricator.wikimedia.org/T349511 [13:17:53] ty [13:18:20] (03CR) 10Jbond: [C: 03+2] sre.gitlab.*: customize lock arguments (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967629 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [13:18:24] Lucas_WMDE: just curious, is it intentional Herald auto-adds a deprecated project? [13:18:24] (03PS3) 10Jbond: sre.gitlab.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967629 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [13:18:26] !log urbanecm@deploy2002 matmarex and urbanecm: Backport for [[gerrit:967195|Remove 'currentProto'/'finalProto'/'proto' business (T348852)]], [[gerrit:966915|Remove unused $wgIncludeLegacyJavaScript]], [[gerrit:966919|Remove $wgApiFrameOptions override for enwiki and zhwiki (T131183)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:18:48] it's the https://phabricator.wikimedia.org/H380 rule, fwiw [13:18:52] that’s T349104 :) [13:18:53] T349104: Update Herald rule to tag Wikidata tech tasks - https://phabricator.wikimedia.org/T349104 [13:18:59] MatmaRex: please test your patches at mwdebug2001 :) [13:19:06] my test plan for this is to just log in and out a couple of times on a couple of wikis. 
give me a few minutes [13:19:21] ack [13:19:23] (03CR) 10Jbond: [C: 03+2] puppet_compiler: add typing_extensions [puppet] - 10https://gerrit.wikimedia.org/r/967889 (owner: 10Jbond) [13:19:31] Lucas_WMDE: okay, everything has a task 'round here :) [13:19:41] ty [13:20:03] (03CR) 10Elukey: [C: 03+1] images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [13:20:23] (03CR) 10Jforrester: Add wikifunctions.org to prod wgLocalVirtualHosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:21:20] (03CR) 10Jforrester: "Same reason for Wikifunctions as Wikidata. Doing this now just means reverting it later." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [13:21:42] !bash Lucas_WMDE: okay, everything has a task 'round here :) [13:21:43] Lucas_WMDE: Stored quip at https://bash.toolforge.org/quip/rSeyXIsBhuQtenzvymsu [13:21:53] :P [13:21:57] :) [13:22:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/967899 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [13:23:05] (03PS9) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [13:23:49] urbanecm: seems good [13:23:51] (03CR) 10CI reject: [V: 04-1] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [13:23:55] thanks, proceeding [13:23:57] !log urbanecm@deploy2002 matmarex and urbanecm: Continuing with sync [13:25:02] (03PS7) 10Urbanecm: New stream for Android Patroller tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965610 
(https://phabricator.wikimedia.org/T348816) (owner: 10Sharvaniharan) [13:25:05] (03CR) 10Urbanecm: [C: 03+2] New stream for Android Patroller tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816) (owner: 10Sharvaniharan) [13:25:37] i noticed something interesting about central login, but it seems unrelated. i logged in on one wiki, visited another wiki where i didn't have a local account yet, and it failed to log me in. i thought that would work [13:25:46] (03Merged) 10jenkins-bot: New stream for Android Patroller tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816) (owner: 10Sharvaniharan) [13:25:56] MatmaRex: that's...definitely supposed to work [13:26:26] well, i'll see if i can reproduce, and file some bugs [13:26:33] (03PS10) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [13:26:56] i just got a https://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Journal&logid=109315835 created for me, so seems to be at least sometimes working [13:27:15] (03CR) 10CI reject: [V: 04-1] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [13:27:46] looking at it now, it actually created the local account, but did not log me into it: https://meta.wikimedia.org/wiki/Special:CentralAuth?target=MatmaBot [13:28:05] interesting. i can't reproduce from Wikipedia to Wikipedia, but can reproduce cross-family. 
[13:28:16] when trying to log in, i was redirected to https://en.wiktionary.org/w/index.php?returnto=Wiktionary%3AMain+Page&title=Special:UserLogin&centralAuthAutologinTried=1&centralAuthError=Local+user+is+not+attached (note the error at the end - it was not shown in the interface) [13:28:24] few visits have fixed that though [13:29:09] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:967195|Remove 'currentProto'/'finalProto'/'proto' business (T348852)]], [[gerrit:966915|Remove unused $wgIncludeLegacyJavaScript]], [[gerrit:966919|Remove $wgApiFrameOptions override for enwiki and zhwiki (T131183)]] (duration: 11m 56s) [13:29:19] T131183: Remove $wgApiFrameOptions = 'SAMEORIGIN' override for enwiki and zhwiki - https://phabricator.wikimedia.org/T131183 [13:29:20] T348852: Remove CentralAuth support for mixed-protocol HTTP/HTTPS wikis - https://phabricator.wikimedia.org/T348852 [13:29:28] MatmaRex: can that be a https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/967195/ fallout in theory? [13:30:02] sharvani__: proceeding with your patch now. [13:30:08] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:965610|New stream for Android Patroller tasks feature (T348816)]] [13:30:08] Thank you [13:30:15] T348816: Create a new stream for Patroller tasks - https://phabricator.wikimedia.org/T348816 [13:30:36] i don't see how, but i guess it's not impossible. i'll file a bug about it later, and look into it this week [13:30:50] i suspect it has been like this for a long time, though, and we just didn't know [13:31:06] or maybe it's a new issue with the top-level autologin [13:31:22] !log urbanecm@deploy2002 urbanecm and sharvaniharan: Backport for [[gerrit:965610|New stream for Android Patroller tasks feature (T348816)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:31:23] thanks [13:31:35] sharvani__: your patch is now at mwdebug2001. can you test that please? 
[13:31:41] i'm working on adding error logging for failed autologins currently, btw :') [13:31:46] tested looks good thank you! [13:31:50] !log urbanecm@deploy2002 urbanecm and sharvaniharan: Continuing with sync [13:32:13] MatmaRex: ❤️, thanks for doing that! [13:33:02] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10User-dcaro: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) [13:33:40] (03PS4) 10Urbanecm: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [13:33:44] (03CR) 10Urbanecm: [C: 03+2] Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [13:33:54] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10User-dcaro: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) 05In progress→03Resolved This went as expected, and all the changes have been applied :) Thanks a lot @cmooney ! [13:34:00] (03PS11) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [13:34:28] actually, we have logging for this specific case! 
and it's not new: https://logstash.wikimedia.org/goto/2363e93df11124694f183ee75814fbbf [13:34:33] (03Merged) 10jenkins-bot: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [13:34:38] you can see my failure on mwdebug2001 a couple minutes ago [13:34:46] but it happens hundreds of times a day [13:34:47] (03CR) 10CI reject: [V: 04-1] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [13:34:48] yay, i guess! [13:34:50] good to know [13:36:22] (03CR) 10David Caro: "retest" [puppet] - 10https://gerrit.wikimedia.org/r/967932 (owner: 10David Caro) [13:36:44] (03CR) 10David Caro: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/967932 (owner: 10David Caro) [13:37:02] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:965610|New stream for Android Patroller tasks feature (T348816)]] (duration: 06m 54s) [13:37:17] T348816: Create a new stream for Patroller tasks - https://phabricator.wikimedia.org/T348816 [13:37:24] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:967302|Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook]] [13:37:30] sharvani__: your patch is now deployed [13:37:41] MatmaRex: proceeding with your last config change [13:37:42] Thank you. [13:37:44] np [13:38:08] alright [13:38:38] i'll need a few minutes to test that as well. 
log in on mobile on mwdebug and prod, compare the cookies i get [13:38:39] !log urbanecm@deploy2002 urbanecm and matmarex: Backport for [[gerrit:967302|Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:41:22] (03PS12) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [13:41:32] ughhh i keep finding unrelated issues. trying to log in on mobile redirects you to desktop the first time, because of the top-level autologin. that's also not new [13:42:58] (03CR) 10Giuseppe Lavagetto: modules: add base.statsd (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [13:44:43] MatmaRex: keeping few minutes's fine, this is the last patch and we've ~15 mins. [13:48:00] urbanecm: looks good [13:48:06] thanks, proceeding [13:48:09] !log urbanecm@deploy2002 urbanecm and matmarex: Continuing with sync [13:49:07] (03PS1) 10David Caro: openstack: add antelope to the tests [puppet] - 10https://gerrit.wikimedia.org/r/967934 [13:51:19] (03CR) 10Btullis: Adapt the reuse-kafka-jumbo partman recipe for the new kafka-jumbo disk layout (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/967930 (https://phabricator.wikimedia.org/T348495) (owner: 10Brouberol) [13:52:00] !log installing batik security updates [13:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:52:54] ehm... 
[13:53:15] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:967302|Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook]] (duration: 15m 50s) [13:55:03] don't see how the spike is related to anything that i've synced, so presumably unrelated. [13:55:31] MatmaRex: anyway, synced to prod :) [13:55:36] and we're just on time :) [13:56:05] thanks urbanecm [13:56:37] (03PS1) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 [13:57:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:57:24] (03Abandoned) 10Ottomata: evenstreams - publicly expose mediawiki.page_change.v1 stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/931646 (https://phabricator.wikimedia.org/T336817) (owner: 10Ottomata) [13:57:46] (03PS1) 10Ilias Sarantopoulos: Revert "ml-services: test kserve batcher for revertrisk-multilingual in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967773 [13:57:54] (03CR) 10CI reject: [V: 04-1] Revert "ml-services: test kserve batcher for revertrisk-multilingual in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967773 (owner: 10Ilias Sarantopoulos) [13:57:57] (03PS2) 10Ottomata: Don't hardcode /opt/conda-analytics in spark3.env.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/821293 (https://phabricator.wikimedia.org/T312882) [13:58:28] (03PS8) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [13:58:46] urbanecm: while you're here, can you check the progress of the scripts on https://phabricator.wikimedia.org/T315510 for me? 
[13:59:34] (03Abandoned) 10Ottomata: Don't hardcode /opt/conda-analytics in spark3.env.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/821293 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [13:59:35] (the s1 one you started a long time ago, and the ones thciprian.i started recently) [14:00:22] (03CR) 10CI reject: [V: 04-1] sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (owner: 10Jbond) [14:00:33] (brb) [14:00:51] (03PS2) 10Ilias Sarantopoulos: Revert "ml-services: test kserve batcher for revertrisk-multilingual in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967773 (https://phabricator.wikimedia.org/T348536) [14:00:54] (03Abandoned) 10Ottomata: Release 2.1.4-py3.7-5 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/793504 (https://phabricator.wikimedia.org/T307115) (owner: 10Ottomata) [14:01:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:01:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:01:09] (03CR) 10CI reject: [V: 04-1] mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [14:01:17] (03CR) 10JMeybohm: [C: 03+2] mw-jobrunner: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967866 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:02:17] (03Merged) 10jenkins-bot: mw-jobrunner: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967866 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:03:50] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967867 
(https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:05:33] !log jayme@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [14:05:33] !log jayme@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [14:05:45] !log jayme@deploy2002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:05:47] !log jayme@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:05:48] !log switched mw-jobrunner (mw-on-k8s) to certmanager certificates - T300033 [14:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:14] T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 [14:06:17] !log jayme@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [14:06:17] !log jayme@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [14:06:27] !log jayme@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:06:28] !log jayme@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:06:32] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:08] (03CR) 10JMeybohm: [C: 03+2] mw-api-ext: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967867 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:07:51] (03Merged) 10jenkins-bot: mw-api-ext: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967867 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:09:21] (03PS1) 10Jbond: pupet_compiler: this module dose not support buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/967937 [14:10:09] (03CR) 10Elukey: "We can leave both in theory, so Aiko will be able to test when she'll be back, what do you think?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967773 (https://phabricator.wikimedia.org/T348536) (owner: 10Ilias Sarantopoulos) [14:10:18] (03CR) 10Jbond: [C: 03+2] pupet_compiler: this module dose not support buster. [puppet] - 10https://gerrit.wikimedia.org/r/967937 (owner: 10Jbond) [14:10:24] (03CR) 10Jbond: [V: 03+2 C: 03+2] pupet_compiler: this module dose not support buster. [puppet] - 10https://gerrit.wikimedia.org/r/967937 (owner: 10Jbond) [14:10:53] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/967868 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:11:23] (03PS9) 10Marostegui: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) [14:11:50] (03CR) 10CI reject: [V: 04-1] mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [14:12:36] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:12:57] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [14:13:15] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:13:23] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:13:33] (03CR) 10Herron: [C: 03+1] alertmanager: allow api access for alertmanagers hosts too [puppet] - 10https://gerrit.wikimedia.org/r/967904 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [14:13:55] (03Abandoned) 10Ilias Sarantopoulos: Revert "ml-services: test kserve batcher for revertrisk-multilingual in staging" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/967773 (https://phabricator.wikimedia.org/T348536) (owner: 10Ilias Sarantopoulos) [14:13:58] !log switched mw-api-ext (mw-on-k8s) to certmanager certificates - T300033 [14:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:03] T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 [14:14:12] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:14:18] (03CR) 10Herron: [C: 03+1] "Good call this will be much friendlier" [puppet] - 10https://gerrit.wikimedia.org/r/967873 (owner: 10Filippo Giunchedi) [14:14:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:14:44] (03PS13) 10Giuseppe Lavagetto: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) [14:14:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:14:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.231 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:04] (03PS1) 10Jbond: pupet_compiler: this module dose not support buster. [puppet] - 10https://gerrit.wikimedia.org/r/967938 [14:15:20] (03CR) 10Jbond: [V: 03+2 C: 03+2] pupet_compiler: this module dose not support buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/967938 (owner: 10Jbond) [14:15:31] (03CR) 10Herron: [C: 03+1] slo_definitions: update all dashboards with the new Istio SLI metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/967423 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [14:15:51] (03PS10) 10Jbond: mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [14:16:01] (03CR) 10Elukey: [V: 03+2 C: 03+2] slo_definitions: update all dashboards with the new Istio SLI metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/967423 (https://phabricator.wikimedia.org/T349072) (owner: 10Elukey) [14:16:06] (03PS2) 10Giuseppe Lavagetto: kubernetes: update common images [puppet] - 10https://gerrit.wikimedia.org/r/967878 [14:17:36] (03PS1) 10Ilias Sarantopoulos: ml-services: rename revertrisk multilingual with batcher support [deployment-charts] - 10https://gerrit.wikimedia.org/r/967939 (https://phabricator.wikimedia.org/T348536) [14:19:46] (03CR) 10Elukey: [C: 03+1] ml-services: rename revertrisk multilingual with batcher support [deployment-charts] - 10https://gerrit.wikimedia.org/r/967939 (https://phabricator.wikimedia.org/T348536) (owner: 10Ilias Sarantopoulos) [14:20:18] (03CR) 10JMeybohm: [C: 03+1] kubernetes: update common images [puppet] - 10https://gerrit.wikimedia.org/r/967878 (owner: 10Giuseppe Lavagetto) [14:20:21] (03CR) 10JMeybohm: [C: 03+1] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [14:21:28] (03CR) 10JMeybohm: [C: 03+2] mw-api-int: Switch to certmanager certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/967868 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:22:11] (03Merged) 10jenkins-bot: mw-api-int: Switch to certmanager certificates [deployment-charts] - 
10https://gerrit.wikimedia.org/r/967868 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:22:16] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: rename revertrisk multilingual with batcher support [deployment-charts] - 10https://gerrit.wikimedia.org/r/967939 (https://phabricator.wikimedia.org/T348536) (owner: 10Ilias Sarantopoulos) [14:23:02] (03Merged) 10jenkins-bot: ml-services: rename revertrisk multilingual with batcher support [deployment-charts] - 10https://gerrit.wikimedia.org/r/967939 (https://phabricator.wikimedia.org/T348536) (owner: 10Ilias Sarantopoulos) [14:24:58] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:25:53] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:26:08] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:26:25] !log switched mw-api-int (mw-on-k8s) to certmanager certificates - T300033 [14:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:35] (03CR) 10Klausman: [V: 03+2 C: 03+2] images: Update kserve/build to kserve v0.11.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/967451 (https://phabricator.wikimedia.org/T337213) (owner: 10Klausman) [14:26:38] T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 [14:26:38] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:27:12] (03PS1) 10JMeybohm: mw-on-k8s: Globally enable certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967940 (https://phabricator.wikimedia.org/T300033) [14:30:17] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[14:30:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kubernetes: update common images [puppet] - 10https://gerrit.wikimedia.org/r/967878 (owner: 10Giuseppe Lavagetto) [14:32:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [14:33:34] (03Merged) 10jenkins-bot: modules: add base.statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [14:33:40] 10SRE, 10ops-eqiad, 10Goal, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10Jclark-ctr) 05Open→03Resolved [14:34:25] (03Abandoned) 10Andrew Bogott: slapd: introduce new slapd.conf template for ldap >= 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/961188 (https://phabricator.wikimedia.org/T331699) (owner: 10Andrew Bogott) [14:35:52] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965664 (https://phabricator.wikimedia.org/T348735) (owner: 10Muehlenhoff) [14:36:10] (03Abandoned) 10Andrew Bogott: Open magnum and heat apis to the greater internet [puppet] - 10https://gerrit.wikimedia.org/r/854092 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [14:36:16] (03CR) 10Jbond: [C: 03+2] mariadb: Productionize pc1016, pc2016 [puppet] - 10https://gerrit.wikimedia.org/r/966329 (https://phabricator.wikimedia.org/T343408) (owner: 10Marostegui) [14:37:25] (03Abandoned) 10Andrew Bogott: ordered_json.rb: add a new function, ordered_json_verbose [puppet] - 10https://gerrit.wikimedia.org/r/589741 (owner: 10Andrew Bogott) [14:37:37] (03Abandoned) 10Andrew Bogott: mcrouter: get some newlines in the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/589742 (owner: 10Andrew Bogott) [14:38:39] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:53] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro We need to start updating firmwares on servers they will need to be restarted to finalize installation. would y... [14:41:20] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1021'] [14:41:35] (03PS1) 10Ilias Sarantopoulos: ml-services: set OMP_NUM_THREADS in readability isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/967943 (https://phabricator.wikimedia.org/T348664) [14:42:14] (03PS1) 10Elukey: admin_ng: raise pod limits for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/967944 [14:42:18] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1021'] [14:42:27] (03CR) 10Elukey: [C: 03+1] ml-services: set OMP_NUM_THREADS in readability isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/967943 (https://phabricator.wikimedia.org/T348664) (owner: 10Ilias Sarantopoulos) [14:43:38] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: set OMP_NUM_THREADS in readability isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/967943 (https://phabricator.wikimedia.org/T348664) (owner: 10Ilias Sarantopoulos) [14:44:25] (03Merged) 10jenkins-bot: ml-services: set OMP_NUM_THREADS in readability isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/967943 (https://phabricator.wikimedia.org/T348664) (owner: 10Ilias Sarantopoulos) [14:45:29] (03CR) 10Muehlenhoff: [C: 03+2] nftables::service Write out notrack rules for services skipping conntrack [puppet] - 10https://gerrit.wikimedia.org/r/965664 
(https://phabricator.wikimedia.org/T348735) (owner: 10Muehlenhoff) [14:46:18] (03CR) 10Klausman: [C: 03+1] admin_ng: raise pod limits for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/967944 (owner: 10Elukey) [14:46:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcephosd1021'] [14:47:31] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: provisionning db1227 - T344036 [14:47:37] T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036 [14:47:57] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: provisionning db1227 - T344036 [14:47:58] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: provisionning db1227 - T344036 [14:48:13] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: provisionning db1227 - T344036 [14:48:58] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . 
[14:49:26] (03CR) 10Elukey: [C: 03+2] admin_ng: raise pod limits for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/967944 (owner: 10Elukey) [14:50:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Provision db1227 depooled as a candidate master for s7', diff saved to https://phabricator.wikimedia.org/P53027 and previous config saved to /var/cache/conftool/dbconfig/20231023-145011-arnaudb.json [14:50:49] (03PS1) 10Muehlenhoff: firewall::service: Remove notrack sanity check now that it's implemented [puppet] - 10https://gerrit.wikimedia.org/r/967945 (https://phabricator.wikimedia.org/T348735) [14:51:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P53028 and previous config saved to /var/cache/conftool/dbconfig/20231023-145101-arnaudb.json [14:51:02] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcephosd1021'] [14:52:23] (03PS1) 10Arnaudb: mariadb: Replace db1127 with db1227 [puppet] - 10https://gerrit.wikimedia.org/r/967907 (https://phabricator.wikimedia.org/T344036) [14:52:59] (03PS1) 10Jforrester: [Staging only] wikifunctions: Update Py WASM evaluator again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967946 [14:53:06] jouncebot: now [14:53:07] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [14:53:17] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:53:18] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Update Py WASM evaluator again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967946 (owner: 10Jforrester) [14:53:39] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:03] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Update Py WASM evaluator again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967946 (owner: 10Jforrester) [14:54:51] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:55:03] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:55:24] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:55:31] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:55:42] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:55:50] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:55:53] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/967945 (https://phabricator.wikimedia.org/T348735) (owner: 10Muehlenhoff) [14:56:00] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:56:12] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:57:11] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: redirect prometheus-site homepage to 'ops' [puppet] - 10https://gerrit.wikimedia.org/r/967873 (owner: 10Filippo Giunchedi) [14:57:13] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: allow api access for alertmanagers hosts too [puppet] - 10https://gerrit.wikimedia.org/r/967904 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [14:57:15] (03PS1) 10Elukey: admin_ng: apply new pod limits to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/967948 [14:58:21] (ProbeDown) firing: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:58:25] (03PS2) 10Elukey: admin_ng: apply new pod limits to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/967948 [15:01:08] (03CR) 10Elukey: [C: 03+2] admin_ng: apply new pod limits to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/967948 (owner: 10Elukey) [15:01:34] (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: apply new pod limits to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/967948 (owner: 10Elukey) [15:02:00] PROBLEM - prometheus-esams.wikimedia.org requires authentication on prometheus3003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... 
not found on https://prometheus-esams.wikimedia.org:443/ - 474 bytes in 0.370 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:02:49] that's me ^ [15:03:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:15] (03CR) 10Muehlenhoff: [C: 03+2] firewall::service: Remove notrack sanity check now that it's implemented [puppet] - 10https://gerrit.wikimedia.org/r/967945 (https://phabricator.wikimedia.org/T348735) (owner: 10Muehlenhoff) [15:05:21] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:05:30] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:05:35] (03PS1) 10Filippo Giunchedi: prometheus: redirect homepage for authenticated requests [puppet] - 10https://gerrit.wikimedia.org/r/967949 [15:09:37] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: redirect homepage for authenticated requests [puppet] - 10https://gerrit.wikimedia.org/r/967949 (owner: 10Filippo Giunchedi) [15:10:02] 10SRE-OnFire, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review, 10Sustainability (Incident Followup): Add external meta-monitoring for metricsinfra - https://phabricator.wikimedia.org/T288053 (10BCornwall) [15:10:10] 10SRE-OnFire, 10Cloud-VPS, 10cloud-services-team, 10Sustainability (Incident Followup), 10User-dcaro: monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs" - https://phabricator.wikimedia.org/T347694 (10BCornwall) [15:10:51] 10SRE-OnFire, 10cloud-services-team, 10Sustainability (Incident Followup), 10User-dcaro: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - 
https://phabricator.wikimedia.org/T347681 (10dcaro) a:03dcaro [15:10:58] 10SRE-OnFire, 10cloud-services-team, 10Sustainability (Incident Followup), 10User-dcaro: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681 (10dcaro) [15:11:18] 10SRE-OnFire, 10cloud-services-team, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Unplanned, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681 (10dcaro) [15:11:20] PROBLEM - Host deploy1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:30] 10SRE-OnFire, 10cloud-services-team, 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Unplanned, and 2 others: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681 (10dcaro) [15:12:06] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:12:48] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:12:50] (03PS1) 10Herron: pyrra: onboard varnish-requests as pilot SLO [puppet] - 10https://gerrit.wikimedia.org/r/967950 (https://phabricator.wikimedia.org/T302995) [15:12:52] (03PS1) 10Muehlenhoff: idp::memcached Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/967951 [15:13:28] (03PS3) 10Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) [15:14:04] (03CR) 10CI 
reject: [V: 04-1] mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [15:14:54] (03PS1) 10Filippo Giunchedi: modules: cleanup last dispatch renmants [puppet] - 10https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) [15:15:52] RECOVERY - prometheus-esams.wikimedia.org requires authentication on prometheus3003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.371 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:16:21] (03PS1) 10Herron: pyrra: add logstash-requests detail [puppet] - 10https://gerrit.wikimedia.org/r/967953 [15:18:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:18] (03CR) 10Herron: [C: 03+2] pyrra: add logstash-requests detail [puppet] - 10https://gerrit.wikimedia.org/r/967953 (owner: 10Herron) [15:20:14] (03PS2) 10Herron: pyrra: onboard varnish-requests as pilot SLO [puppet] - 10https://gerrit.wikimedia.org/r/967950 (https://phabricator.wikimedia.org/T302995) [15:23:17] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10VRiley-WMF) The following has been re-racked deploy1002 - C 3, U 34, CableID 3750, port 40 Ran script and powered the unit on. 
[15:23:38] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10VRiley-WMF) [15:25:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:50] (03CR) 10Herron: "Early days!" [puppet] - 10https://gerrit.wikimedia.org/r/967950 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [15:30:05] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1530) [15:31:34] (03PS1) 10Filippo Giunchedi: prometheus: fix SSO cookie matching for homepage redirect [puppet] - 10https://gerrit.wikimedia.org/r/967956 [15:31:48] (03CR) 10CI reject: [V: 04-1] prometheus: fix SSO cookie matching for homepage redirect [puppet] - 10https://gerrit.wikimedia.org/r/967956 (owner: 10Filippo Giunchedi) [15:34:04] (03PS2) 10Filippo Giunchedi: prometheus: fix SSO cookie matching for homepage redirect [puppet] - 10https://gerrit.wikimedia.org/r/967956 [15:34:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/967951 (owner: 10Muehlenhoff) [15:36:04] (03PS1) 10Elukey: services: Update Docker images of change-prop services [deployment-charts] - 10https://gerrit.wikimedia.org/r/967957 (https://phabricator.wikimedia.org/T348950) [15:39:22] (03CR) 10Hnowlan: [C: 03+1] services: Update Docker images of change-prop services [deployment-charts] - 10https://gerrit.wikimedia.org/r/967957 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [15:42:09] (03CR) 10Elukey: [C: 03+2] services: Update Docker images of change-prop services [deployment-charts] - 10https://gerrit.wikimedia.org/r/967957 
(https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [15:46:44] (03CR) 10Muehlenhoff: "(The failure for idp-test is unrelated, apparently some missing stub secret for Datahub OIDC support)" [puppet] - 10https://gerrit.wikimedia.org/r/967951 (owner: 10Muehlenhoff) [15:50:53] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix SSO cookie matching for homepage redirect [puppet] - 10https://gerrit.wikimedia.org/r/967956 (owner: 10Filippo Giunchedi) [15:51:42] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [15:52:25] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [15:53:05] (03PS1) 10Jforrester: [Staging only] wikifunctions: Update Py WASM evaluator yet again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967959 [15:53:46] arnaudb: there are pendings changes to commit on dbctl (see icinga) [16:05:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:50] (03PS4) 10Gergő Tisza: CentralAuth: Clarify why we don't use second-level domain for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) [16:06:02] (03CR) 10CI reject: [V: 04-1] CentralAuth: Clarify why we don't use second-level domain for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [16:07:07] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Update Py WASM evaluator yet again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967959 (owner: 10Jforrester) [16:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:07:53] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Update Py WASM evaluator yet again [deployment-charts] - 10https://gerrit.wikimedia.org/r/967959 (owner: 10Jforrester) [16:08:19] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:08:59] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:09:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'New host being setup', diff saved to https://phabricator.wikimedia.org/P53029 and previous config saved to /var/cache/conftool/dbconfig/20231023-160926-marostegui.json [16:11:20] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:12:00] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:12:08] (03CR) 10Jforrester: [C: 03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [16:27:16] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:40] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [16:30:06] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:44] (03PS24) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [16:33:00] (03CR) 10CI reject: [V: 04-1] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [16:43:33] (03CR) 10FNegri: [C: 03+1] openstack: add antelope to the tests [puppet] - 10https://gerrit.wikimedia.org/r/967934 (owner: 10David Caro) [16:44:45] (03PS1) 10Jforrester: [Staging only] wikifunctions: Update Py WASM evaluator to one that logs timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/967962 [16:45:07] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Update Py WASM evaluator to one that logs timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/967962 (owner: 10Jforrester) [16:46:05] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Update Py WASM evaluator to one that logs timeouts [deployment-charts] - 10https://gerrit.wikimedia.org/r/967962 (owner: 10Jforrester) [16:46:47] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:47:28] !log jforrester@deploy2002 helmfile 
[staging] DONE helmfile.d/services/wikifunctions: apply [16:52:47] (03PS1) 10Raymond Ndibe: prometheus: add build and envvars api metrics [puppet] - 10https://gerrit.wikimedia.org/r/967963 (https://phabricator.wikimedia.org/T337390) [16:53:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:56:57] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:57:00] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:58:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1700) [17:00:05] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T1700). 
[17:00:19] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:01:30] (03CR) 10Bartosz Dziewoński: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [17:02:47] (03PS5) 10Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) [17:02:59] (03CR) 10CI reject: [V: 04-1] Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [17:03:32] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:03:46] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:07:58] (03PS6) 10Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) [17:08:11] (03CR) 10CI reject: [V: 04-1] Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [17:09:42] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T2000" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński) [17:09:46] (03PS2) 10Bartosz Dziewoński: Update comment about EditAttemptStep instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 [17:09:50] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T2000" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [17:09:56] (03PS5) 10Bartosz Dziewoński: CentralAuth: Clarify why we don't use second-level domain for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [17:10:02] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T2000" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 (owner: 10Bartosz Dziewoński) [17:10:11] (03PS4) 10Bartosz Dziewoński: Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 [17:17:22] (03CR) 10Bartosz Dziewoński: Stop writing to $wgCentralAuthCookieDomain in 'EnterMobileMode' hook (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967302 (owner: 10Bartosz Dziewoński) [17:18:03] (03CR) 10Bartosz Dziewoński: Replace 'EnterMobileMode' hook with usingMobileDomain() check (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 (owner: 10Bartosz Dziewoński) [17:19:09] (03Abandoned) 10Bartosz Dziewoński: Replace 'EnterMobileMode' hook with usingMobileDomain() check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967303 (owner: 10Bartosz Dziewoński) [17:22:34] (03PS7) 10Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [17:22:53] (03PS1) 10BCornwall: 
hiera: remove dns6002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/967968 (https://phabricator.wikimedia.org/T342154) [17:22:55] (03CR) 10Bartosz Dziewoński: "(rebased)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [17:24:32] (03CR) 10Ssingh: hiera: remove dns6002 from authdns_servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967968 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:25:48] (03PS3) 10DLynch: Turn off DiscussionTools A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) (owner: 10Esanders) [17:26:12] (03PS2) 10BCornwall: hiera: remove dns6001 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/967968 (https://phabricator.wikimedia.org/T342154) [17:26:21] (03CR) 10BCornwall: hiera: remove dns6001 from authdns_servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967968 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:31:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [17:31:40] (03PS1) 10Herron: prometheus: apt::pin prometheus package to bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/967969 (https://phabricator.wikimedia.org/T349521) [17:32:05] (03CR) 10CI reject: [V: 04-1] prometheus: apt::pin prometheus package to bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/967969 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron) [17:32:43] (03PS3) 10BCornwall: hiera: remove dns6002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/967968 (https://phabricator.wikimedia.org/T342154) [17:32:51] (03PS2) 10Herron: prometheus: apt::pin prometheus package to bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/967969 (https://phabricator.wikimedia.org/T349521) [17:33:55] (03PS4) 
10Esanders: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) [17:33:57] (03CR) 10Ssingh: [C: 03+1] hiera: remove dns6002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/967968 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:34:13] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns6002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/967968 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:34:17] (03PS5) 10Esanders: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) [17:35:52] (03CR) 10Herron: [C: 03+1] "Thanks! One comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) (owner: 10Filippo Giunchedi) [17:37:25] (03PS1) 10Jforrester: [Staging only] wikifunctions: Raise orchestrator timeout to 20s [deployment-charts] - 10https://gerrit.wikimedia.org/r/967970 [17:38:21] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Raise orchestrator timeout to 20s [deployment-charts] - 10https://gerrit.wikimedia.org/r/967970 (owner: 10Jforrester) [17:38:46] (03PS3) 10Herron: prometheus: apt::pin prometheus package to bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/967969 (https://phabricator.wikimedia.org/T349521) [17:39:07] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Raise orchestrator timeout to 20s [deployment-charts] - 10https://gerrit.wikimedia.org/r/967970 (owner: 10Jforrester) [17:40:30] (03CR) 10Majavah: [C: 04-1] "We currently use patched packages that include features not in the Debian provided packages, including k8s support and a backport for http" [puppet] - 10https://gerrit.wikimedia.org/r/967969 (https://phabricator.wikimedia.org/T349521) (owner: 10Herron) 
[17:40:44] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:41:16] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:44:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:44:47] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns6002.wikimedia.org with OS bookworm [17:44:57] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns6002.wikimedia.org with OS bookworm [17:49:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:49:18] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:49:20] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:53:02] PROBLEM - Host 2a02:ec80:600:2:185:15:58:37 is DOWN: CRITICAL - Destination Unreachable (2a02:ec80:600:2:185:15:58:37) [17:53:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[17:53:21] (03PS1) 10Jforrester: wikifunctions: Set the FUNCTION_EVALUATOR_TIMEOUT_MS value to 9s not 10s [deployment-charts] - 10https://gerrit.wikimedia.org/r/967971 [17:53:39] (JobUnavailable) firing: (4) Reduced availability for job haproxy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:54:47] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Set the FUNCTION_EVALUATOR_TIMEOUT_MS value to 9s not 10s [deployment-charts] - 10https://gerrit.wikimedia.org/r/967971 (owner: 10Jforrester) [17:55:34] (03Merged) 10jenkins-bot: wikifunctions: Set the FUNCTION_EVALUATOR_TIMEOUT_MS value to 9s not 10s [deployment-charts] - 10https://gerrit.wikimedia.org/r/967971 (owner: 10Jforrester) [17:56:19] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:56:34] PROBLEM - Recursive DNS on 185.15.58.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:57:01] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [17:58:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:59:12] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [17:59:59] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:00:06] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:00:46] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:04:42] (03PS1) 10Bartosz Dziewoński: Remove unused VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 [18:08:10] (03CR) 
10Muehlenhoff: [C: 03+2] Failover testreduce to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/965163 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff) [18:08:14] (03PS2) 10Bartosz Dziewoński: Remove unused VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) [18:09:21] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns6002.wikimedia.org with reason: host reimage [18:11:40] (03PS1) 10Jforrester: [Staging only] wikifunctions: Bump PyWASM [deployment-charts] - 10https://gerrit.wikimedia.org/r/967974 [18:11:58] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Bump PyWASM [deployment-charts] - 10https://gerrit.wikimedia.org/r/967974 (owner: 10Jforrester) [18:12:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns6002.wikimedia.org with reason: host reimage [18:12:46] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Bump PyWASM [deployment-charts] - 10https://gerrit.wikimedia.org/r/967974 (owner: 10Jforrester) [18:13:54] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:14:38] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:15:34] (03CR) 10Muehlenhoff: modules: cleanup last dispatch renmants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) (owner: 10Filippo Giunchedi) [18:18:22] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [18:19:09] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [18:21:19] PROBLEM - Recursive DNS on 185.15.58.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:27:05] RECOVERY - Recursive DNS on 185.15.58.37 is OK: DNS_QUERY OK - Success 
https://wikitech.wikimedia.org/wiki/DNS [18:28:39] (JobUnavailable) firing: (4) Reduced availability for job haproxy in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:31:57] !log sretest1001:~/tmp/backfill$ promtool tsdb create-blocks-from rules --start 1672531200 --end 1698080718 --url http://prometheus.svc.eqiad.wmnet/ops/ logstash-requests.yaml T349521 [18:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:03] T349521: Prometheus/Pyrra: establish backfill process for recording rules - https://phabricator.wikimedia.org/T349521 [18:32:57] !log bking@deploy2002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [18:33:55] !log bking@deploy2002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [18:39:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:42:19] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:42:35] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:44:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:45:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns6002.wikimedia.org with OS bookworm [18:45:25] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns6002.wikimedia.org with OS bookworm completed: - dns6002 (**PASS**) - Downtimed on Icinga/Al... [18:48:24] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, and 2 others: [Maintenance] Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10Ahoelzl) [18:52:22] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, and 2 others: [Maintenance] Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10Ahoelzl) 05Open→03Resolved [18:52:37] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, and 2 others: [Maintenance] Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10Ahoelzl) [18:52:55] (03CR) 10Brian Wolff: "Just as somewhat of an aside, this scheme would be much more effective if the cookies were host cookies (had the name prefixed with _Host-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [18:53:17] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:57:17] (03PS1) 10BCornwall: Revert "hiera: remove dns6002 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/967775 [18:59:46] (03CR) 10Ssingh: [C: 03+1] Revert "hiera: remove dns6002 from authdns_servers" [puppet] - 
10https://gerrit.wikimedia.org/r/967775 (owner: 10BCornwall) [18:59:59] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns6002 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/967775 (owner: 10BCornwall) [19:03:40] (03PS2) 10Bartosz Dziewoński: Don't remove current wiki family from $wgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967295 [19:03:42] (03PS8) 10Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [19:03:50] (03CR) 10Bartosz Dziewoński: "(rebased on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/967295)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [19:26:05] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [19:26:07] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:14] (03PS1) 10BCornwall: hiera: remove dns4004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/967976 (https://phabricator.wikimedia.org/T342154) [19:26:27] (03PS1) 10Bartosz Dziewoński: Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 [19:26:43] PROBLEM - PyBal backends health check on lvs3008 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3069.esams.wmnet, cp3073.esams.wmnet, cp3070.esams.wmnet, cp3072.esams.wmnet are marked down but pooled: textlb_443: Servers cp3069.esams.wmnet, cp3066.esams.wmnet, cp3067.esams.wmnet, cp3071.esams.wmnet, 
cp3072.esams.wmnet are marked down but pooled: testlb6_443: Servers cp3066.esams.wmnet, cp3071.esams.wmnet, cp3068.esams.wmne [19:26:43] 3.esams.wmnet, cp3072.esams.wmnet, cp3069.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3069.esams.wmnet, cp3066.esams.wmnet, cp3068.esams.wmnet, cp3073.esams.wmnet, cp3067.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:26:55] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled: textlb_443: Servers cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2041.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled: testlb6_443: Servers cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2037.codfw.wmne [19:26:55] 1.codfw.wmnet are marked down but pooled: textlb6_443: Servers cp2039.codfw.wmnet, cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2033.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:27:07] (ProbeDown) firing: (8) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:27:15] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp2035.codfw.wmnet, cp2037.codfw.wmnet, cp2041.codfw.wmnet are marked down but pooled: testlb6_443: Servers cp2039.codfw.wmnet are marked down but pooled: textlb6_443: Servers cp2029.codfw.wmnet, cp2037.codfw.wmnet, cp2041.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:27:36] ongoing issues on traffic -all dcs- app servers? 
[19:28:07] RECOVERY - PyBal backends health check on lvs3008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:28:19] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:28:37] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:28:42] jumped from 168 to 200 req/s [19:30:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:31:07] (ProbeDown) firing: (19) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:22] (ProbeDown) resolved: (19) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:32:07] (ProbeDown) firing: (21) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:23] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10VRiley-WMF) 05Open→03Resolved [19:33:26] 10SRE, 10Infrastructure-Foundations, 10netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (10VRiley-WMF) [19:33:47] (03PS9) 10Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [19:35:11] (03CR) 
10Bartosz Dziewoński: "(rebased again on top of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/967977 – sorry about all these changes, I'm just t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [19:35:31] (03CR) 10Bartosz Dziewoński: "Should be safe, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 (owner: 10Bartosz Dziewoński) [19:35:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:36:27] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10VRiley-WMF) Hey @taavi and @cmooney Just wanted to see if there was a timeframe for us to move these servers. Any specific time when we know the servers... [19:37:07] (ProbeDown) resolved: (21) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:25] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10Data-Platform-SRE, 10cloud-services-team: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) >>! In T346948#9274072, @VRiley-WMF wrote: > Just wanted to see if there was a timeframe on this move. Like, a specific time when we know the server... [19:39:07] (hi I am a noob) looks resolved, right? did it resolve on its own? [19:43:57] (03CR) 10Jsn.sherman: [C: 03+1] "This looks good to me; we'll need to get it on the backport schedule to get it +2ed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: 10MPGuy2824) [19:46:33] (03CR) 10Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [19:47:26] (03PS1) 10Jforrester: [Staging only] wikifunctions: Bump PyWASM to a non-global log [deployment-charts] - 10https://gerrit.wikimedia.org/r/967979 [19:47:58] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Bump PyWASM to a non-global log [deployment-charts] - 10https://gerrit.wikimedia.org/r/967979 (owner: 10Jforrester) [19:48:05] (03CR) 10Thcipriani: [C: 03+1] "fine by me 😊" [puppet] - 10https://gerrit.wikimedia.org/r/967899 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [19:48:49] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Bump PyWASM to a non-global log [deployment-charts] - 10https://gerrit.wikimedia.org/r/967979 (owner: 10Jforrester) [19:49:29] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:50:10] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:00:07] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T2000). nyaa~ [20:00:07] cormacparle and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:18] hi [20:00:21] my changes are all no-ops [20:00:23] * cormacparle waves [20:01:06] mine's fixing an error [20:04:41] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:11:52] any deployers around today? [20:11:53] (03CR) 10Ssingh: [C: 03+1] hiera: remove dns4004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/967976 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [20:12:05] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns4004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/967976 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [20:18:37] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS bookworm [20:18:47] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns4004.wikimedia.org with OS bookworm [20:19:27] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:20:09] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:22:35] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:22:59] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:25:57] PROBLEM - 
Host 2620:0:863:1:198:35:26:8 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:39] (JobUnavailable) firing: (4) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:37:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:38:18] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:29] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [20:42:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:42:50] (PS42) AOkoth: prometheus: puppetise sql_exporter [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [20:42:52] (PS1) AOkoth: vrts: increase envoy response timeout [puppet] - https://gerrit.wikimedia.org/r/967985 (https://phabricator.wikimedia.org/T349471) [20:43:11] (PS2) AOkoth: vrts: increase envoy response timeout [puppet] - https://gerrit.wikimedia.org/r/967985 (https://phabricator.wikimedia.org/T349471) [20:44:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [20:45:01] thcipriani: FYI seems like we've been a bit light on deployers last few weeks. 
My team mate couldn't find deployers in the last 2 backport windows either. [20:45:16] (PS1) Btullis: Downgrade spark 3.1 to version 3.1.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/967986 (https://phabricator.wikimedia.org/T344910) [20:47:22] (CR) EoghanGaffney: [C: +1] vrts: increase envoy response timeout [puppet] - https://gerrit.wikimedia.org/r/967985 (https://phabricator.wikimedia.org/T349471) (owner: AOkoth) [20:47:57] (PS1) Muehlenhoff: Revert "Failover testreduce to testreduce1002" [dns] - https://gerrit.wikimedia.org/r/967987 [20:48:09] (PS2) Muehlenhoff: Revert "Failover testreduce to testreduce1002" [dns] - https://gerrit.wikimedia.org/r/967987 [20:48:54] (CR) AOkoth: [C: +2] vrts: increase envoy response timeout [puppet] - https://gerrit.wikimedia.org/r/967985 (https://phabricator.wikimedia.org/T349471) (owner: AOkoth) [20:49:13] (CR) Btullis: [V: +2 C: +2] Downgrade spark 3.1 to version 3.1.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/967986 (https://phabricator.wikimedia.org/T344910) (owner: Btullis) [20:49:27] (CR) Muehlenhoff: [C: +2] Revert "Failover testreduce to testreduce1002" [dns] - https://gerrit.wikimedia.org/r/967987 (owner: Muehlenhoff) [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231023T2100). 
[21:00:19] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:00:53] (PS1) Jforrester: [Staging only] wikifunctions: Bump PyWASM to flushing logs [deployment-charts] - https://gerrit.wikimedia.org/r/967990 [21:01:11] (CR) Jforrester: [C: +2] [Staging only] wikifunctions: Bump PyWASM to flushing logs [deployment-charts] - https://gerrit.wikimedia.org/r/967990 (owner: Jforrester) [21:02:06] (Merged) jenkins-bot: [Staging only] wikifunctions: Bump PyWASM to flushing logs [deployment-charts] - https://gerrit.wikimedia.org/r/967990 (owner: Jforrester) [21:04:48] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:05:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:05:26] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:10:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:59] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:05] (CR) Gergő Tisza: [C: +1] Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/967977 (owner: Bartosz Dziewoński) [21:31:05] (PS1) 
Bartosz Dziewoński: Remove no-op $wgHiddenPrefs[] = 'prefershttps' [mediawiki-config] - https://gerrit.wikimedia.org/r/967992 [21:35:01] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:35:37] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:36:45] SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (matmarex) I'm finding it hard to believe, as the rates of errors I linked in T343648#9241155 have not changed. [21:37:42] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns4004.wikimedia.org with OS bookworm [21:37:52] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns4004.wikimedia.org with OS bookworm completed: - dns4004 (**PASS**) - Downtimed on Icinga/Al... 
[21:40:31] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 237.77 ms [21:48:44] (CR) Gergő Tisza: [C: +1] Clean up $wgCentralAuthAutoLoginWikis configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/967977 (owner: Bartosz Dziewoński) [22:01:18] (PS1) BCornwall: Revert "hiera: remove dns4004 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/967776 [22:02:10] (CR) BCornwall: [C: +2] Revert "hiera: remove dns4004 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/967776 (owner: BCornwall) [22:06:29] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall) [22:10:42] (CR) Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza) [22:10:52] (PS6) Gergő Tisza: CentralAuth: Clarify why we don't use second-level domain for some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) [22:10:54] (PS3) Gergő Tisza: Don't remove current wiki family from $wgCentralAuthAutoLoginWikis [mediawiki-config] - https://gerrit.wikimedia.org/r/967295 (owner: Bartosz Dziewoński) [22:10:56] (PS2) Gergő Tisza: Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/967977 (owner: Bartosz Dziewoński) [22:10:58] (PS10) Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) [22:17:17] (CR) Bartosz Dziewoński: Clean up $wgCentralAuthAutoLoginWikis configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/967977 (owner: Bartosz Dziewoński) [22:25:20] (CR) Gergő Tisza: 
Generalize Meta/Commons exceptions for CentralAuth cookie handling (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza) [22:31:14] (CR) Gergő Tisza: Clean up $wgCentralAuthAutoLoginWikis configuration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/967977 (owner: Bartosz Dziewoński) [22:31:53] (CR) Bartosz Dziewoński: [C: +1] Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza) [22:43:43] (PS1) Gergő Tisza: [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - https://gerrit.wikimedia.org/r/967995 [22:51:57] (PS1) Jforrester: [Staging only] wikifunctions: Bump PyWASM to avoid call for version [deployment-charts] - https://gerrit.wikimedia.org/r/967998 [22:52:19] (CR) Jforrester: [C: +2] [Staging only] wikifunctions: Bump PyWASM to avoid call for version [deployment-charts] - https://gerrit.wikimedia.org/r/967998 (owner: Jforrester) [22:53:09] (Merged) jenkins-bot: [Staging only] wikifunctions: Bump PyWASM to avoid call for version [deployment-charts] - https://gerrit.wikimedia.org/r/967998 (owner: Jforrester) [22:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:54:44] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:55:23] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:56:52] (PS1) Jforrester: [Staging only] wikifunctions: Bump PyWASM timeout from 9 to 19s [deployment-charts] - 
https://gerrit.wikimedia.org/r/967999 [22:57:01] (CR) Jforrester: [C: +2] [Staging only] wikifunctions: Bump PyWASM timeout from 9 to 19s [deployment-charts] - https://gerrit.wikimedia.org/r/967999 (owner: Jforrester) [22:57:07] (CR) Bartosz Dziewoński: [C: +1] [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - https://gerrit.wikimedia.org/r/967995 (owner: Gergő Tisza) [22:57:48] (Merged) jenkins-bot: [Staging only] wikifunctions: Bump PyWASM timeout from 9 to 19s [deployment-charts] - https://gerrit.wikimedia.org/r/967999 (owner: Jforrester) [22:58:21] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:58:50] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [23:03:55] (PS1) Jforrester: [Staging only] wikifunctions: Bump PyWASM timeout to 55s [deployment-charts] - https://gerrit.wikimedia.org/r/968001 [23:04:02] (CR) Jforrester: [C: +2] [Staging only] wikifunctions: Bump PyWASM timeout to 55s [deployment-charts] - https://gerrit.wikimedia.org/r/968001 (owner: Jforrester) [23:04:47] (Merged) jenkins-bot: [Staging only] wikifunctions: Bump PyWASM timeout to 55s [deployment-charts] - https://gerrit.wikimedia.org/r/968001 (owner: Jforrester) [23:05:16] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [23:05:45] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [23:14:57] (PS1) Jforrester: [Staging only] wikifunctions: Raise PyWASM CPU limits by 4x [deployment-charts] - https://gerrit.wikimedia.org/r/968004 [23:31:09] (PS1) Eevans: Decommission restbase2012 [puppet] - https://gerrit.wikimedia.org/r/968006 (https://phabricator.wikimedia.org/T349526)