[00:07:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:38:33] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938343
[00:38:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938343 (owner: TrainBranchBot)
[00:54:33] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/938343 (owner: TrainBranchBot)
[01:03:34] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T342071 (phaultfinder)
[01:50:39] <wikibugs>	 (CR) Kaleem Bhatti: "anyone know why error showing what's problem how I can solve this" [mediawiki-config] - https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: Kaleem Bhatti)
[01:54:13] <wikibugs>	 (CR) Tim Starling: "I think you should set $wgAutoCreateTempUser['serialProvider'] = 'centralauth';" [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: Urbanecm)
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0200)
[02:00:49] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:01:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:03:53] <wikibugs>	 (CR) Tim Starling: [C: -1] IP Masking: Enable for cswiki beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: Urbanecm)
[02:07:08] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.18 [core] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/938344 (https://phabricator.wikimedia.org/T340246)
[02:07:10] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.18 [core] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/938344 (https://phabricator.wikimedia.org/T340246) (owner: TrainBranchBot)
[02:08:23] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:21] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.18 [core] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/938344 (https://phabricator.wikimedia.org/T340246) (owner: TrainBranchBot)
[02:29:22] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0300)
[03:00:35] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:05:07] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:06:31] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:11:01] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:17] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:19:57] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:24:25] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:24:51] <wikibugs>	 (CR) Hashar: [C: +1] "There was an issue with some of the grants not having any effect and the issue was that multiple <RequireAll> were used which does not wor" [puppet] - https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: Dzahn)
[05:31:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[05:37:57] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:42:27] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:44:37] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0600)
[06:00:04] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0600).
[06:00:27] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!!" [puppet] - https://gerrit.wikimedia.org/r/938222 (https://phabricator.wikimedia.org/T341455) (owner: Cathal Mooney)
[06:00:33] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:05:03] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:08:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:54] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: add new variable in chart for s3 path [deployment-charts] - https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[06:13:14] <wikibugs>	 (CR) Elukey: eventgate: set a more performant default for queue.buffering.max.ms (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357) (owner: Elukey)
[06:14:05] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:13] <wikibugs>	 (CR) Ayounsi: [C: +2] Add Python 3.11 support [software/homer] - https://gerrit.wikimedia.org/r/922485 (owner: Ayounsi)
[06:16:08] <wikibugs>	 (Merged) jenkins-bot: Add Python 3.11 support [software/homer] - https://gerrit.wikimedia.org/r/922485 (owner: Ayounsi)
[06:17:01] <wikibugs>	 (PS3) Ilias Sarantopoulos: ml-services: add new variable in chart for s3 path [deployment-charts] - https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170)
[06:18:35] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:19:31] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: add new variable in chart for s3 path [deployment-charts] - https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[06:21:08] <wikibugs>	 (Merged) jenkins-bot: ml-services: add new variable in chart for s3 path [deployment-charts] - https://gerrit.wikimedia.org/r/938856 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[06:23:07] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:22] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:30:45] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:26] <wikibugs>	 ops-codfw, Infrastructure-Foundations, netops: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (ayounsi) p:Triage→Low
[06:34:59] <wikibugs>	 SRE, Infrastructure-Foundations, cloud-services-team, netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (ayounsi)
[06:35:07] <wikibugs>	 ops-codfw, Infrastructure-Foundations, netops: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (ayounsi)
[06:36:42] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cr3-knams,cr3-knams IPv6 with reason: Downtime cr3-knams ahead of remote hands moving router
[06:36:57] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cr3-knams,cr3-knams IPv6 with reason: Downtime cr3-knams ahead of remote hands moving router
[06:48:39] <XioNoX>	 !log disable asw-b-codfw:ae0 (to cloudsw1-b1-codfw) - T342076
[06:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:43] <stashbot>	 T342076: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076
[06:53:13] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T0700)
[07:00:05] <jouncebot>	 James_F: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:13] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:08:20] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (ayounsi)
[07:16:46] <elukey>	 !log restart kafka main-codfw rebalances (long maintenance) - T341558
[07:16:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:50] <stashbot>	 T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558
[07:20:17] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] New role: titan [puppet] - https://gerrit.wikimedia.org/r/938866 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[07:24:41] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: create /etc/prometheus when needed [puppet] - https://gerrit.wikimedia.org/r/938867 (https://phabricator.wikimedia.org/T341999) (owner: Filippo Giunchedi)
[07:24:45] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:33:54] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (ayounsi) a:ayounsi
[07:35:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: restore program field to node logs [puppet] - https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[07:35:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: remove grafana log cloning [puppet] - https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[07:36:09] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: remove k8s stats-exporter cloning [puppet] - https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[07:36:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: remove pybal log cloning [puppet] - https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[07:38:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:40:59] <wikibugs>	 (PS1) JMeybohm: deployment_server: Add helmfile value wmf_staging_environment [puppet] - https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033)
[07:42:56] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney) Open→Resolved I’m going to close this task for now.  The problem has been mitigated as best as possible with the current equipment we have.  In time replacing...
[07:46:21] <wikibugs>	 (CR) Vgutierrez: [C: +1] "fix the commit message and LGTM" [puppet] - https://gerrit.wikimedia.org/r/938902 (owner: Fabfur)
[07:47:45] <wikibugs>	 (PS2) Fabfur: hiera: apply silent-drop on port 80 to drmrs cp hosts [puppet] - https://gerrit.wikimedia.org/r/938902 (https://phabricator.wikimedia.org/T340983)
[07:49:08] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:31] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: fix duplicate declaration for statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/939232
[07:52:34] <wikibugs>	 (CR) Ayounsi: [C: +1] CR: cloud-host: allow return traffic for PDNS servers [homer/public] - https://gerrit.wikimedia.org/r/938819 (https://phabricator.wikimedia.org/T341966) (owner: Arturo Borrero Gonzalez)
[07:53:37] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: fix duplicate declaration for statsd_exporter [puppet] - https://gerrit.wikimedia.org/r/939232 (owner: Filippo Giunchedi)
[07:55:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[07:59:01] <elukey>	 this is me --^
[07:59:56] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: Ahmon Dancy)
[08:00:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:08:31] <wikibugs>	 SRE-tools, Spicerack: Spicerack: don't write logs to disk - https://phabricator.wikimedia.org/T342079 (ayounsi)
[08:08:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet
[08:08:50] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1069.eqiad.wmnet
[08:09:17] <topranks>	 !log cr3-knams going offline for move 
[08:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:12] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: deploy updated language identification service [deployment-charts] - https://gerrit.wikimedia.org/r/939233
[08:11:48] <icinga-wm>	 PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:11:48] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:12:04] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:12:06] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:12:20] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:12:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:13:22] <fabfur>	 !log disable puppet on A:cp-drmrs to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/938902/ (T340983)
[08:13:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:25] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[08:14:25] <wikibugs>	 (CR) Fabfur: hiera: apply silent-drop on port 80 to drmrs cp hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/938902 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:14:42] <wikibugs>	 (CR) Fabfur: [C: +2] hiera: apply silent-drop on port 80 to drmrs cp hosts [puppet] - https://gerrit.wikimedia.org/r/938902 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:16:47] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2068.codfw.wmnet
[08:17:47] <fabfur>	 !log enable puppet on A:cp-drmrs for https://gerrit.wikimedia.org/r/c/operations/puppet/+/938902/ (T340983) (hosts will run puppet with the usual schedule)
[08:17:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:55] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2069.codfw.wmnet
[08:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:22:13] <wikibugs>	 (PS2) Alexandros Kosiaris: service.yaml: add iPoid to the service catalogue [puppet] - https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: Effie Mouzeli)
[08:23:08] <wikibugs>	 (PS3) Alexandros Kosiaris: service.yaml: add iPoid to the service catalogue [puppet] - https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: Effie Mouzeli)
[08:24:26] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] service.yaml: add iPoid to the service catalogue [puppet] - https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) (owner: Effie Mouzeli)
[08:24:47] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: deploy models for simplewiki [deployment-charts] - https://gerrit.wikimedia.org/r/938859 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[08:24:54] <wikibugs>	 (CR) Klausman: [C: +1] ml-services: deploy updated language identification service [deployment-charts] - https://gerrit.wikimedia.org/r/939233 (owner: Ilias Sarantopoulos)
[08:25:32] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1069.eqiad.wmnet
[08:25:54] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1070.eqiad.wmnet
[08:27:19] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2069.codfw.wmnet
[08:27:30] <wikibugs>	 (CR) DCausse: [C: +1] Bump version of extra plugin [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[08:28:25] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet
[08:29:50] <wikibugs>	 (CR) Ayounsi: Manage TLS on network devices (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: Ayounsi)
[08:30:33] <wikibugs>	 (CR) Ayounsi: [C: +2] Replace Capirca with Aerleon [software/homer] - https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: Ayounsi)
[08:31:09] <wikibugs>	 (PS27) Ayounsi: Manage TLS on network devices [cookbooks] - https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594)
[08:32:00] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:23] <wikibugs>	 (Merged) jenkins-bot: Replace Capirca with Aerleon [software/homer] - https://gerrit.wikimedia.org/r/929333 (https://phabricator.wikimedia.org/T337082) (owner: Ayounsi)
[08:33:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] nftables: spec: introduce service tests [puppet] - https://gerrit.wikimedia.org/r/937450 (https://phabricator.wikimedia.org/T336497) (owner: Arturo Borrero Gonzalez)
[08:34:38] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1070.eqiad.wmnet
[08:34:45] <wikibugs>	 (PS1) Fabfur: hiera: apply silent-drop on port 80 to eqiad cp hosts [puppet] - https://gerrit.wikimedia.org/r/939235 (https://phabricator.wikimedia.org/T340983)
[08:36:09] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet
[08:36:21] <wikibugs>	 SRE-tools, Infrastructure-Foundations: Add GraphQL support to wmflib - https://phabricator.wikimedia.org/T341968 (ayounsi)
[08:37:37] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet
[08:37:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet
[08:39:58] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (NOOP 9 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42512/console" [puppet] - https://gerrit.wikimedia.org/r/939235 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:40:00] <wikibugs>	 (PS1) Filippo Giunchedi: base: bump cadvisor rollout to 80% [puppet] - https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027)
[08:40:12] <wikibugs>	 (CR) CI reject: [V: -1] base: bump cadvisor rollout to 80% [puppet] - https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:40:21] <wikibugs>	 (PS2) Filippo Giunchedi: base: bump cadvisor rollout to 80% [puppet] - https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027)
[08:40:59] <wikibugs>	 (PS5) Ilias Sarantopoulos: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: Kaleem Bhatti)
[08:41:56] <wikibugs>	 (CR) Ilias Sarantopoulos: "It was just a styling error, it is now resolved!" [mediawiki-config] - https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: Kaleem Bhatti)
[08:42:11] <wikibugs>	 (PS3) Filippo Giunchedi: base: bump cadvisor rollout to 80% [puppet] - https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027)
[08:45:56] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:46:09] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet
[08:46:20] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:46:36] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:46:36] <wikibugs>	 (CR) Vgutierrez: [C: +1] hiera: apply silent-drop on port 80 to eqiad cp hosts [puppet] - https://gerrit.wikimedia.org/r/939235 (https://phabricator.wikimedia.org/T340983) (owner: Fabfur)
[08:46:42] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:47:02] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet
[08:48:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet
[08:48:07] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet
[08:48:22] <Amir1>	 jouncebot: nowandnext
[08:48:22] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 11 minute(s)
[08:48:22] <jouncebot>	 In 1 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1000)
[08:48:28] <Amir1>	 cool
[08:48:45] <wikibugs>	 (CR) Ladsgroup: [C: +2] ores: use envoy proxy for Lift Wing [mediawiki-config] - https://gerrit.wikimedia.org/r/937453 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[08:49:28] <icinga-wm>	 RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:49:35] <wikibugs>	 (Merged) jenkins-bot: ores: use envoy proxy for Lift Wing [mediawiki-config] - https://gerrit.wikimedia.org/r/937453 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[08:50:00] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:50:30] <wikibugs>	 (03PS11) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577)
[08:53:15] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:937453|ores: use envoy proxy for Lift Wing (T319170)]]
[08:53:18] <stashbot>	 T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170
[08:54:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10hashar) > I tested with this commit https://gerrit.wikimedia.org/r/c/operations/dns/+/936686 -- it all worked perfectly.  Looks like the Gerrit acces...
[08:55:28] <fabfur>	 !log disable puppet on A:cp-eqiad to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/939235 (T340983)
[08:55:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:34] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[08:56:09] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] hiera: apply silent-drop on port 80 to eqiad cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/939235 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[08:56:37] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet
[08:56:41] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet
[08:57:33] <logmsgbot>	 !log ladsgroup@deploy1002 isaranto and ladsgroup: Backport for [[gerrit:937453|ores: use envoy proxy for Lift Wing (T319170)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:58:37] <fabfur>	 !log enable puppet on A:cp-eqiad for https://gerrit.wikimedia.org/r/939235 (T340983) (hosts will run puppet with the usual schedule)
[08:58:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:34] <wikibugs>	 (03CR) 10Jbond: vrts: drop bashisms and fix other CI issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[09:01:39] <wikibugs>	 (03PS2) 10Jbond: vrts: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064)
[09:02:05] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2073.codfw.wmnet
[09:02:09] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet
[09:02:31] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) >>! In T320390#9018611, @Jelto wrote: > ... > There are two settings which we may test, one is `send_scope_to_token_endpoin...
[09:02:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] vrts: drop bashisms and fix other CI issues [puppet] - 10https://gerrit.wikimedia.org/r/938894 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[09:02:49] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy models for simplewiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/938859 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[09:03:09] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy updated language identification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/939233 (owner: 10Ilias Sarantopoulos)
[09:03:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] base: bump cadvisor rollout to 80% [puppet] - 10https://gerrit.wikimedia.org/r/939236 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi)
[09:03:39] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: deploy models for simplewiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/938859 (https://phabricator.wikimedia.org/T319170) (owner: 10Ilias Sarantopoulos)
[09:04:06] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: deploy updated language identification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/939233 (owner: 10Ilias Sarantopoulos)
[09:04:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:04:44] <wikibugs>	 (03PS2) 10Jbond: kerberos: fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938895 (https://phabricator.wikimedia.org/T95064)
[09:04:46] <wikibugs>	 (03PS2) 10Jbond: kerberos: Fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938896 (https://phabricator.wikimedia.org/T95064)
[09:07:55] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:08:11] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:937453|ores: use envoy proxy for Lift Wing (T319170)]] (duration: 14m 56s)
[09:08:12] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] kerberos: fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938895 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[09:08:14] <stashbot>	 T319170: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170
[09:08:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] kerberos: Fix bashisms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938896 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[09:09:28] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet
[09:09:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:10:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet
[09:11:58] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Proton: Bump image version of Proton on Beta Cluster - https://phabricator.wikimedia.org/T342087 (10DAlangi_WMF)
[09:14:54] <wikibugs>	 (03PS12) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577)
[09:15:19] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:16:07] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet
[09:16:18] <wikibugs>	 (03CR) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run (035 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[09:16:29] <wikibugs>	 (03PS7) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033)
[09:16:30] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:16:31] <wikibugs>	 (03PS7) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033)
[09:16:33] <wikibugs>	 (03PS7) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[09:16:46] <wikibugs>	 (03PS1) 10Fabfur: hiera: apply silent-drop on port 80 to all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983)
[09:17:03] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:17:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[09:17:48] <wikibugs>	 (03CR) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[09:18:02] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1001.eqiad.wmnet
[09:20:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:20:40] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:21:32] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:21:45] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:24:10] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42513/console" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[09:24:32] <XioNoX>	 !log remove asw-b1-codfw from asw-b-codfw VC - T342076
[09:24:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:36] <stashbot>	 T342076: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076
[09:24:50] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet
[09:24:59] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet
[09:26:04] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:33] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:28:09] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:28:25] <wikibugs>	 (03CR) 10ArielGlenn: "New ppc output: https://puppet-compiler.wmflabs.org/output/938816/42511/" [puppet] - 10https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[09:28:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1001.eqiad.wmnet
[09:28:47] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:28:55] <wikibugs>	 (03CR) 10Kaleem Bhatti: [C: 03+1] sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti)
[09:29:13] <wikibugs>	 (03PS6) 10Kaleem Bhatti: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203)
[09:29:53] <wikibugs>	 (03PS8) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[09:29:54] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2013 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:29:54] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:30:04] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:30:11] <godog>	 the cadvisor unit failures are me
[09:30:13] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:30:17] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42514/console" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[09:30:20] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:30:27] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:30:37] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:30:45] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:33:18] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:50] <wikibugs>	 (03CR) 10Jbond: "thanks see responses inline" [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[09:34:12] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet
[09:34:16] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42515/console" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[09:34:48] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:35:27] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) I looked at the GitLab `gitlabhq_production` database and `identities` table.  I connected to the psql database using: `sud...
[09:35:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add config-master to puppetserver role - https://phabricator.wikimedia.org/T341717 (10Joe) >>! In T341717#9010061, @jbond wrote: > I wonder if we should instead move config-master to a VM.  AFAIK the...
[09:37:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[09:38:17] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42516/console" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[09:39:16] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[09:40:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:41:25] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] make sure certain systemd jobs run only on the primary xml dumps NFS shares [puppet] - 10https://gerrit.wikimedia.org/r/938816 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[09:43:40] <wikibugs>	 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10ayounsi) a:05ayounsi→03None
[09:43:51] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[09:45:15] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] hiera: apply silent-drop on port 80 to all cp hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[09:45:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:46:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add conftool::master to puppetserver - https://phabricator.wikimedia.org/T341721 (10Joe) I think it makes sense to keep the conftool master on the same server as the puppetmaster, as we share reposito...
[09:46:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] CR: cloud-host: allow return traffic for PDNS servers [homer/public] - 10https://gerrit.wikimedia.org/r/938819 (https://phabricator.wikimedia.org/T341966) (owner: 10Arturo Borrero Gonzalez)
[09:47:59] <wikibugs>	 (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[09:48:10] <MichaelG_WMDE>	 A gate-and-submit job of wmf-quibble-vendor-mysql-php81-docker failed with some rsync permission errors on composer cache files:
[09:48:11] <MichaelG_WMDE>	 rsync: [generator] failed to set times on "/cache/.": Operation not permitted (1)
[09:48:14] <wikibugs>	 (03PS2) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[09:48:17] <MichaelG_WMDE>	 rsync: [generator] recv_generator: mkdir "/cache/composer" failed: Permission denied (13)
[09:48:23] <MichaelG_WMDE>	 rsync: [receiver] mkstemp "/cache/.phpcs.02e459ed8923.a8e9cc1cfb23.cache.JKuQZQ" failed: Permission denied (13)
[09:48:31] <MichaelG_WMDE>	 etc
[09:48:32] <MichaelG_WMDE>	 https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php81-docker/6481/console
[09:48:49] <MichaelG_WMDE>	 is this likely a one-time thing and just flakiness, or is there more going on?
[09:50:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:51:01] <wikibugs>	 (03PS2) 10JMeybohm: deployment_server: Add certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033)
[09:51:24] <arturo>	 !log deploying https://gerrit.wikimedia.org/r/c/operations/homer/public/+/938819 via homer to cr-eqiad & cr-codfw
[09:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:47] <wikibugs>	 (03PS2) 10Jbond: monitoring: fix bashisms and other minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064)
[09:51:49] <wikibugs>	 (03PS2) 10Jbond: install_server: updaate to use bash [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064)
[09:51:51] <wikibugs>	 (03PS2) 10Jbond: kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064)
[09:51:56] <Lucas_WMDE>	 MichaelG_WMDE: I saw the same message a few times yesterday, see T341998
[09:51:56] <stashbot>	 T341998: Random CI builds failing with rsync errors - https://phabricator.wikimedia.org/T341998
[09:52:06] <wikibugs>	 (03CR) 10Jbond: install_server: updaate to use bash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[09:52:19] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[09:52:25] <Lucas_WMDE>	 (also, I think #wikimedia-releng is the better channel for that?)
[09:52:37] <wikibugs>	 (03PS2) 10Btullis: Upgrade the search instance of airflow to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933088 (https://phabricator.wikimedia.org/T336286)
[09:52:39] <fabfur>	 !log disable puppet on A:cp-esams to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/939242 (T340983)
[09:52:39] <wikibugs>	 (03PS2) 10Btullis: Upgrade the research instance of airflow to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933089 (https://phabricator.wikimedia.org/T336286)
[09:52:41] <wikibugs>	 (03PS2) 10Btullis: Update the platform_eng airflow instance to version 2.6.1 [puppet] - 10https://gerrit.wikimedia.org/r/933090 (https://phabricator.wikimedia.org/T336286)
[09:52:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:42] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[09:52:43] <wikibugs>	 (03PS2) 10Btullis: Upgrade the analytics_product airflow instance to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/933091 (https://phabricator.wikimedia.org/T336286)
[09:53:23] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:53:42] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:12] <wikibugs>	 (03PS3) 10Btullis: Upgrade the research instance of airflow to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/933089 (https://phabricator.wikimedia.org/T336286)
[09:54:19] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42517/console" [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[09:54:25] <wikibugs>	 (03PS3) 10Jbond: monitoring: fix bashisms and other minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064)
[09:54:27] <wikibugs>	 (03PS3) 10Jbond: install_server: drop Bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064)
[09:54:29] <wikibugs>	 (03PS3) 10Jbond: kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064)
[09:54:35] <wikibugs>	 (03PS3) 10Btullis: Update the platform_eng airflow instance to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/933090 (https://phabricator.wikimedia.org/T336286)
[09:55:05] <wikibugs>	 (03PS3) 10Btullis: Upgrade the search instance of airflow to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/933088 (https://phabricator.wikimedia.org/T336286)
[09:57:27] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10fgiunchedi)
[09:57:58] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[09:58:54] <wikibugs>	 (03PS3) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[09:58:58] <wikibugs>	 (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[09:59:51] <wikibugs>	 (03Merged) 10jenkins-bot: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1000)
[10:00:33] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1071.eqiad.wmnet with OS bullseye
[10:01:33] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] hiera: apply silent-drop on port 80 to all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/939242 (https://phabricator.wikimedia.org/T340983) (owner: 10Fabfur)
[10:02:11] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org
[10:02:17] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[10:02:25] <fabfur>	 !log enable puppet on A:cp-esams for https://gerrit.wikimedia.org/r/939235 (T340983) (hosts will run puppet with the usual schedule)
[10:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:29] <stashbot>	 T340983: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983
[10:02:43] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1002.eqiad.wmnet
[10:02:58] <fabfur>	 !log fix last entry: correct CR is https://gerrit.wikimedia.org/r/939242 
[10:02:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:49] <wikibugs>	 (03Merged) 10jenkins-bot: Manage TLS on network devices [cookbooks] - 10https://gerrit.wikimedia.org/r/933094 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[10:05:43] <wikibugs>	 (03PS1) 10ArielGlenn: Fix lookup ofsettig to enable sytmd timers for primary dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232)
[10:05:57] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org
[10:06:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:07:49] <wikibugs>	 (03PS1) 10JMeybohm: envoy: Add -service-node argument to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939248 (https://phabricator.wikimedia.org/T300033)
[10:08:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:08:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:09:30] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:10:38] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1002.eqiad.wmnet
[10:10:40] <wikibugs>	 (03PS1) 10JMeybohm: deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033)
[10:10:52] <wikibugs>	 (03PS2) 10ArielGlenn: Fix lookup of setting to enable systemd timers for primary dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232)
[10:10:55] <wikibugs>	 (03PS1) 10Jbond: kerberos: fix carriage return [puppet] - 10https://gerrit.wikimedia.org/r/939250 (https://phabricator.wikimedia.org/T95064)
[10:11:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:11:38] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet
[10:11:46] <wikibugs>	 (03CR) 10JMeybohm: "Depends on I8fd8c34091c5c0eca18f3ddbe094f87f8c248722" [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:13:34] <wikibugs>	 10ops-knams: Inbound interface errors - https://phabricator.wikimedia.org/T342097 (10phaultfinder)
[10:15:50] <wikibugs>	 (03CR) 10ArielGlenn: "https://puppet-compiler.wmflabs.org/output/939247/42519/ for ppc. I swear I get this lookup search path stuff wrong every single time." [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[10:16:03] <wikibugs>	 10SRE, 10Traffic: port 80 paging on scheduled single host maintenance in text@esams - https://phabricator.wikimedia.org/T339898 (10Fabfur)
[10:16:06] <wikibugs>	 10SRE, 10Traffic: provide haproxy silent-drop support for port 80 as well - https://phabricator.wikimedia.org/T340983 (10Fabfur) 05Open→03Resolved a:03Fabfur The HAProxy configuration on all DCs has been updated to apply silent-drop to abusive clients hitting port 80, as been already done for port 443....
[10:17:15] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1071.eqiad.wmnet with reason: host reimage
[10:17:23] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Fix lookup of setting to enable systemd timers for primary dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[10:19:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:50] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:19:55] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Fix lookup of setting to enable systemd timers for primary dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939247 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[10:20:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1071.eqiad.wmnet with reason: host reimage
[10:20:22] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.308 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:20:27] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) >>! In T320390#9022500, @Jelto wrote: >>>! In T320390#9018611, @Jelto wrote: >> ... >> There are two settings which we may...
[10:21:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] kerberos: fix carriage return [puppet] - 10https://gerrit.wikimedia.org/r/939250 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[10:24:42] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for cr3-knams,cr3-knams IPv6
[10:24:43] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr3-knams,cr3-knams IPv6
[10:24:47] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet
[10:24:55] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be1004.eqiad.wmnet
[10:27:22] <wikibugs>	 (03PS1) 10Cathal Mooney: repool esams: router migration completed [dns] - 10https://gerrit.wikimedia.org/r/939253 (https://phabricator.wikimedia.org/T337997)
[10:30:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1004.eqiad.wmnet
[10:35:05] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet
[10:35:45] <wikibugs>	 (03CR) 10Jbond: "lgtm see comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi)
[10:36:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] repool esams: router migration completed [dns] - 10https://gerrit.wikimedia.org/r/939253 (https://phabricator.wikimedia.org/T337997) (owner: 10Cathal Mooney)
[10:37:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] repool esams: router migration completed [dns] - 10https://gerrit.wikimedia.org/r/939253 (https://phabricator.wikimedia.org/T337997) (owner: 10Cathal Mooney)
[10:38:09] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:38:32] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:38:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add config-master to puppetserver role - https://phabricator.wikimedia.org/T341717 (10jbond)  > I think it's a perfectly valid idea, and I think it's relatively easy to do. We could just configure the...
[10:38:52] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:39:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:39:04] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:39:12] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:40:10] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:41:11] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:42:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add conftool::master to puppetserver - https://phabricator.wikimedia.org/T341721 (10jbond) > it can be done, but unless there's a huge compelling reason to decouple them, it seems too much work to me...
[10:42:35] <topranks>	 !log repool esams after successful move of cr3-knams to new rack T337997
[10:42:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:43:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:44:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:44:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] cert-manager: convert use of seed_image to image_tag [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/935696 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto)
[10:48:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet
[10:49:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10jbond) We should retest this when everything is on puppet7
[10:49:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:49:23] <elukey>	 this is me --^
[10:49:31] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet
[10:49:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:50:47] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115)
[10:54:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:54:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:54:59] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:55:19] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:55:35] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:55:45] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:55:50] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:56:47] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:56:48] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:57:55] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:58:33] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:02:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:03:43] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2002.codfw.wmnet
[11:05:08] <jinxer-wm>	 (KubernetesAPILatency) firing: (7) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:06:53] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2003.codfw.wmnet
[11:07:05] <wikibugs>	 (03PS1) 10Elukey: knative-serving: bump up container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939257
[11:10:08] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:11:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:13:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1071.eqiad.wmnet with OS bullseye
[11:15:37] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2003.codfw.wmnet
[11:15:47] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2004.codfw.wmnet
[11:16:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:17:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:20:03] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye
[11:20:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1074.eqiad.wmnet with OS bullseye
[11:22:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad
[11:22:22] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad
[11:22:32] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad
[11:22:32] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad
[11:24:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2004.codfw.wmnet
[11:24:56] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
[11:24:56] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad
[11:25:23] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
[11:25:24] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad
[11:26:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) `name=SONiC refresh needed verbose ayounsi@cumin1001:~$ sudo cookbook -v sre.network.tls lsw1-e8-eqiad START - Cookbook sre.network.tls for network device lsw1-e8-eqiad...
[11:27:07] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad
[11:27:07] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad
[11:27:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
[11:27:13] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad
[11:30:25] <wikibugs>	 (03PS1) 10Jbond: WIP: Add check to look for violating hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/939260 (https://phabricator.wikimedia.org/T181971)
[11:30:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: Add check to look for violating hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/939260 (https://phabricator.wikimedia.org/T181971) (owner: 10Jbond)
[11:31:31] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099 (10jbond)
[11:34:54] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Proton: Bump image version of Proton on Beta Cluster - https://phabricator.wikimedia.org/T342087 (10JMeybohm) 05Open→03Resolved a:03JMeybohm Updated in [[ https://horizon.wikimedia.org/project/instances/47f8bf1e-31bb-48a9-a8ad-c116e0ab6112/ | deployment-docker-pr...
[11:37:42] <wikibugs>	 (03PS1) 10Ayounsi: sre.network.tls: fix edge case [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594)
[11:39:18] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
[11:39:28] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f8-eqiad
[11:39:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoy: Add -service-node argument to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939248 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[11:39:37] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
[11:39:46] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f8-eqiad
[11:41:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "out of curiosity: why no setting for the ml-{eqiad,codfw} clusters?" [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[11:41:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[11:44:29] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] deployment_server: Add certmanager defaults (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[11:46:53] <wikibugs>	 (03PS3) 10JMeybohm: deployment_server: Add certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033)
[11:46:56] <wikibugs>	 (03PS2) 10JMeybohm: deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033)
[11:47:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:48:26] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Proton: Bump image version of Proton on Beta Cluster - https://phabricator.wikimedia.org/T342087 (10DAlangi_WMF) Yes, that's right. And I just tested on beta, it's working as expected. Thanks!
[11:48:57] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42520/console" [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[11:49:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall ok if hackish, please see the comments inline." [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[11:50:07] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] knative-serving: bump up container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939257 (owner: 10Elukey)
[11:50:54] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.network.tls for network device lsw1-f8-eqiad
[11:50:54] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-f8-eqiad
[11:51:43] <wikibugs>	 (03PS2) 10Ayounsi: sre.network.tls: fix edge case [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594)
[11:55:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1015.eqiad.wmnet
[11:56:12] <wikibugs>	 (03PS4) 10JMeybohm: deployment_server: Add certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033)
[11:56:14] <wikibugs>	 (03PS3) 10JMeybohm: deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033)
[11:58:03] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42521/console" [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:01:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[12:01:25] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] kubernetes::master: Publish service-account cert to etcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[12:02:29] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy: Add -service-node argument to envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939248 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:03:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[12:04:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[12:04:04] <wikibugs>	 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (10aborrero) a:05aborrero→03Jclark-ctr
[12:04:26] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:04:36] <wikibugs>	 (03PS4) 10JMeybohm: Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033)
[12:04:47] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1074.eqiad.wmnet with reason: host reimage
[12:05:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[12:05:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:05:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy1015.eqiad.wmnet
[12:05:31] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:06:00] <wikibugs>	 (03Merged) 10jenkins-bot: Prepare for new helm module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/937956 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:06:23] <wikibugs>	 (03PS8) 10JMeybohm: Use cert-manager certificates instead of cergen for tls termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/937957 (https://phabricator.wikimedia.org/T300033)
[12:07:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1074.eqiad.wmnet with reason: host reimage
[12:08:51] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:09:29] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:09:43] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:13:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:14:37] <godog>	 I'm taking a look ^
[12:14:52] <wikibugs>	 (03PS1) 10Daimona Eaytoy: prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939286 (https://phabricator.wikimedia.org/T320260)
[12:16:03] <godog>	 sigh that's basically a sync on k8s ml causing the lag
[12:16:15] <wikibugs>	 (03PS2) 10Hashar: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991
[12:17:08] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] deployment_server: Bump default envoy image version to 1.23.10-2 [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:17:11] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server: Add certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939199 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:17:14] <wikibugs>	 (03PS2) 10D3r1ck01: chromium-render: Deploy latest proton build [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297
[12:17:49] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] deployment_server: Bump default envoy image version to 1.23.10-2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939249 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:18:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:19:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar)
[12:19:08] <wikibugs>	 (03CR) 10Hashar: Recognize ~/.config/docker-pkg.yaml (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar)
[12:21:09] <wikibugs>	 (03PS1) 10Daimona Eaytoy: private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939288 (https://phabricator.wikimedia.org/T320260)
[12:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:23:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[12:23:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[12:24:53] <wikibugs>	 (03PS1) 10Jforrester: Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/938683 (https://phabricator.wikimedia.org/T342099)
[12:25:04] <wikibugs>	 (03PS1) 10Jforrester: Slot diff option "contentLanguage" should be a string [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/938684 (https://phabricator.wikimedia.org/T342099)
[12:25:59] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Add core parsoid HTML support config [deployment-charts] - 10https://gerrit.wikimedia.org/r/939292 (https://phabricator.wikimedia.org/T339865)
[12:28:17] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[12:30:12] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/939293
[12:30:17] <wikibugs>	 (03Abandoned) 10JMeybohm: Testing hack: Override envoy entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/937958 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:30:25] <wikibugs>	 (03PS9) 10JMeybohm: Testing hack: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[12:31:26] <wikibugs>	 (03PS10) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[12:32:15] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/939293
[12:32:23] <wikibugs>	 (03CR) 10Jgiannelos: "This patch adds config to allow choosing which page HTML endpoint to use. Enables core page HTML on staging for starters." [deployment-charts] - 10https://gerrit.wikimedia.org/r/939292 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos)
[12:34:44] <wikibugs>	 (03PS11) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[12:37:08] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[12:37:46] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[12:37:59] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[12:39:59] <wikibugs>	 (03PS12) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[12:40:23] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1073.eqiad.wmnet with OS bullseye
[12:41:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:42:54] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.dhcp for host analytics1073.eqiad.wmnet
[12:44:16] <wikibugs>	 (03PS1) 10Marostegui: install_server: Reimage db2188-db2195 [puppet] - 10https://gerrit.wikimedia.org/r/939295 (https://phabricator.wikimedia.org/T341273)
[12:44:46] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2188-db2195 [puppet] - 10https://gerrit.wikimedia.org/r/939295 (https://phabricator.wikimedia.org/T341273) (owner: 10Marostegui)
[12:45:13] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/939293 (owner: 10Ilias Sarantopoulos)
[12:46:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:53:06] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) >>! In T320390#9022808, @jbond wrote: >  >  > The [[ https://docs.gitlab.com/ee/integration/omniauth.html#link-existing-use...
[12:54:40] <wikibugs>	 (03CR) 10Jgiannelos: "For future reference, you don't need to bump the chart version for non chart specific changes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 (owner: 10D3r1ck01)
[12:54:49] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] chromium-render: Deploy latest proton build [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 (owner: 10D3r1ck01)
[12:55:46] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1074.eqiad.wmnet with OS bullseye
[12:57:21] <wikibugs>	 (03PS2) 10Jbond: config-master: drop ssh-fingerprints.txt  file [puppet] - 10https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947)
[12:57:23] <wikibugs>	 (03PS9) 10Jbond: ssh: switch to using the same file we use in production [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1300).
[13:00:04] <jouncebot>	 James_F and Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1300)
[13:00:04] <jouncebot>	 xSavitar: A patch you scheduled for Mobileapps/RESTBase/Wikifeeds is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:13] <xSavitar>	 o/
[13:00:23] <Daimona>	 o/
[13:00:33] <Lucas_WMDE>	 I’m in a meeting, sorry
[13:00:42] <wikibugs>	 (03PS10) 10Jbond: ssh: switch to using the same file we use in production [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[13:00:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons.
[13:01:18] <wikibugs>	 (03CR) 10Jbond: "Ready for review, i have squashed the changes and updated how we publish known_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[13:02:09] <James_F>	 I'm here.
[13:02:20] * TheresNoTime can't deploy today, sorry
[13:02:31] <James_F>	 I can do it.
[13:02:35] <wikibugs>	 (03CR) 10D3r1ck01: [C: 03+2] chromium-render: Deploy latest proton build (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 (owner: 10D3r1ck01)
[13:02:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219) (owner: 10Jforrester)
[13:03:10] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2022/2023-Q4): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) 05Open→03In progress p:05Triage→03High
[13:03:16] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri)
[13:03:22] <wikibugs>	 (03Merged) 10jenkins-bot: chromium-render: Deploy latest proton build [deployment-charts] - 10https://gerrit.wikimedia.org/r/938297 (owner: 10D3r1ck01)
[13:04:14] <wikibugs>	 (03Merged) 10jenkins-bot: Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219) (owner: 10Jforrester)
[13:04:31] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]]
[13:04:36] <logmsgbot>	 !log derick@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[13:04:40] <stashbot>	 T147219: Wikipedia logo in Notification popup is not high-density ready - https://phabricator.wikimedia.org/T147219
[13:04:40] <James_F>	 Daimona: Did you want to do yours yourself?
[13:04:57] <Daimona>	 I'm not a deployer :)
[13:05:05] <James_F>	 Oh, really? We should fix that. :-)
[13:05:42] <James_F>	 xSavitar: BTW, we should really move the Content Transform service window to not clash…
[13:05:58] <Daimona>	 Working as intended for now :P
[13:06:02] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:06:12] <logmsgbot>	 !log derick@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[13:06:36] <xSavitar>	 James_F, maybe! I'll ask Tyler or RelEng. But it seems it's okay to do both concurrently?
[13:06:55] <James_F>	 The whole point of the calendar is that we shouldn't ever have concurrent windows. :-)
[13:07:02] <logmsgbot>	 !log derick@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply
[13:07:09] <James_F>	 It doesn't break things too often, but…
[13:07:16] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[13:07:19] <logmsgbot>	 !log btullis@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[13:07:30] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[13:07:46] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[13:07:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] sre.network.tls: fix edge case [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[13:08:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.mgmt.eqiad.wmnet']
[13:08:46] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.mgmt.eqiad.wmnet']
[13:08:48] <xSavitar>	 James_F, got it. Should we poke Rel-Eng? I can file a task if you don't mind.
[13:09:18] <logmsgbot>	 !log derick@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[13:09:36] <wikibugs>	 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Create dynamic CRL - https://phabricator.wikimedia.org/T340543 (10jbond)
[13:09:40] <James_F>	 xSavitar: The trick is finding a time that works for you – would an hour earlier or later work?
[13:09:52] <James_F>	 xSavitar: If so, I can just write a patch moving the window now.
[13:10:27] <wikibugs>	 (03Merged) 10jenkins-bot: sre.network.tls: fix edge case [cookbooks] - 10https://gerrit.wikimedia.org/r/939261 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi)
[13:10:58] <logmsgbot>	 !log derick@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply
[13:12:18] <xSavitar>	 James_F, an hour earlier works for me but I'm not the entire CT team. But I can bring this up in the CT team channel and ask opinions about moving the window up by 1 hour.
[13:12:26] <James_F>	 Thanks!
[13:12:27] <logmsgbot>	 !log derick@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[13:13:12] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]] (duration: 08m 40s)
[13:13:15] <stashbot>	 T147219: Wikipedia logo in Notification popup is not high-density ready - https://phabricator.wikimedia.org/T147219
[13:13:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[13:13:42] <wikibugs>	 (03PS7) 10Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945)
[13:13:46] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[13:14:00] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[13:14:12] <wikibugs>	 (03Abandoned) 10Jbond: config-master: drop ssh-fingerprints.txt  file [puppet] - 10https://gerrit.wikimedia.org/r/936691 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[13:14:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[13:15:49] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[13:15:55] * James_F sighs.
[13:15:58] * xSavitar done with deployment. Service still works as expected.
[13:16:04] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[13:17:02] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]]
[13:17:08] <wikibugs>	 (03PS2) 10Ayounsi: Add cookbook to manage users SSH keys on SONiC devices [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028)
[13:17:08] <logmsgbot>	 !log jforrester@deploy1002 sync-world aborted: Backport for [[gerrit:934630|Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years (T147219)]] (duration: 00m 06s)
[13:17:22] * James_F sighs at network issues during a deploy. Sorry all.
[13:17:23] <wikibugs>	 (03CR) 10Ayounsi: Add cookbook to manage users SSH keys on SONiC devices (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi)
[13:17:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[13:17:56] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] "Let's just merge this manually." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[13:18:06] <wikibugs>	 (03Merged) 10jenkins-bot: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[13:18:11] <James_F>	 Finally.
[13:18:27] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:771622|Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy (T275945)]]
[13:18:31] <stashbot>	 T275945: Launch Wikifunctions - https://phabricator.wikimedia.org/T275945
[13:20:02] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Backport for [[gerrit:771622|Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy (T275945)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:20:29] <logmsgbot>	 !log stevemunene@deploy1002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided)
[13:20:33] <logmsgbot>	 !log stevemunene@deploy1002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 03s)
[13:20:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi)
[13:21:30] <James_F>	 Daimona: Do you also need me to add a key on Beta Cluster, or are you doing that // it's not installed there?
[13:21:43] <Daimona>	 It was done for beta a couple weeks ago
[13:21:49] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add cookbook to manage users SSH keys on SONiC devices [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi)
[13:21:50] <James_F>	 Aha, excellent.
[13:22:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/939293 (owner: 10Ilias Sarantopoulos)
[13:22:10] <Daimona>	 (T320258)
[13:22:11] <stashbot>	 T320258: Dashboard integration: Configure the P&E Dashboard integration in beta - https://phabricator.wikimedia.org/T320258
[13:22:32] * James_F nods.
[13:24:11] <wikibugs>	 (03PS1) 10Ladsgroup: spicerack: Add config file for MySQL/MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/939302
[13:24:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall)
[13:24:47] <wikibugs>	 (03Merged) 10jenkins-bot: Add cookbook to manage users SSH keys on SONiC devices [cookbooks] - 10https://gerrit.wikimedia.org/r/938853 (https://phabricator.wikimedia.org/T338028) (owner: 10Ayounsi)
[13:25:26] <wikibugs>	 (03PS2) 10Ladsgroup: spicerack: Add config file for MySQL/MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/939302
[13:25:42] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/939302 (owner: 10Ladsgroup)
[13:26:19] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:771622|Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy (T275945)]] (duration: 07m 52s)
[13:26:23] <stashbot>	 T275945: Launch Wikifunctions - https://phabricator.wikimedia.org/T275945
[13:26:55] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host analytics1073.eqiad.wmnet
[13:28:21] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] knative-serving: bump up container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939257 (owner: 10Elukey)
[13:29:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: bump up container limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939257 (owner: 10Elukey)
[13:30:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/938821 (owner: 10Volans)
[13:30:22] * James_F twiddles thumbs.
[13:30:31] <James_F>	 Daimona: Sorry each scap takes so long.
[13:30:45] <Daimona>	 That's fine :)
[13:31:53] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) >>! In T341992#9019625, @Vgutierrez wrote: > @ayounsi @cmooney could you let DCops know which racks would be better for these boxes? Thanks!  I am on-site this week in eqiad.  Can I get...
[13:32:07] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42523/console" [puppet] - 10https://gerrit.wikimedia.org/r/937522 (https://phabricator.wikimedia.org/T341721) (owner: 10Jbond)
[13:32:37] <wikibugs>	 (03PS2) 10Jforrester: prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939286 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy)
[13:33:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939286 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy)
[13:33:33] <James_F>	 Daimona: You OK to test ^ when it lands on a canary?
[13:33:51] <Daimona>	 Yup
[13:33:59] <wikibugs>	 (03Merged) 10jenkins-bot: prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939286 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy)
[13:34:06] <James_F>	 Ace.
[13:34:16] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:939286|prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance (T320260)]]
[13:34:27] <stashbot>	 T320260: Dashboard integration: Configure the P&E Dashboard integration in prod - https://phabricator.wikimedia.org/T320260
[13:35:17] <logmsgbot>	 !log stevemunene@deploy1002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided)
[13:35:21] <logmsgbot>	 !log stevemunene@deploy1002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 03s)
[13:35:45] <logmsgbot>	 !log jforrester@deploy1002 daimona and jforrester: Backport for [[gerrit:939286|prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance (T320260)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:37:36] <James_F>	 Daimona: It's live to test.
[13:38:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/937522 (https://phabricator.wikimedia.org/T341721) (owner: 10Jbond)
[13:38:47] <Daimona>	 Thanks!
[13:38:50] <wikibugs>	 (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.6.3 [software/homer] - 10https://gerrit.wikimedia.org/r/939303
[13:39:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:40:05] <James_F>	 xSavitar: Made https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/35 to shift the window an hour earlier.
[13:40:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:40:22] <wikibugs>	 (03PS2) 10Jforrester: private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939288 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy)
[13:41:31] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10RobH) a:05Jclark-ctr→03Eevans @eevans,  I'm on-site in EQIAD this week (through Friday).  John has already designated where these will land:    >>! In T307035#8565052, @...
[13:41:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:42:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi)
[13:42:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[13:42:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:43:13] <wikibugs>	 (03PS4) 10Herron: promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond)
[13:43:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:43:24] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] CHANGELOG: add changelogs for release v0.6.3 [software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi)
[13:43:30] <jbond>	 !log upload python3-conftool_2.2.2-1
[13:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:40] <jbond>	 !log upload python3-conftool_2.2.2-1+deb12u1
[13:43:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:06] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.3 [software/homer] - 10https://gerrit.wikimedia.org/r/939303 (owner: 10Ayounsi)
[13:46:17] <wikibugs>	 (03CR) 10Jbond: promethus: switch to using cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond)
[13:47:05] <James_F>	 Daimona: Still testing? OK to sync?
[13:47:14] <Daimona>	 Still testing, almost done
[13:47:21] <James_F>	 Ack.
[13:47:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:49:28] <Daimona>	 Testing done, looks great!
[13:49:43] <James_F>	 Cool.
[13:50:30] <wikibugs>	 (03CR) 10Herron: [C: 03+2] promethus: switch to using cfssl [puppet] - 10https://gerrit.wikimedia.org/r/930187 (https://phabricator.wikimedia.org/T326657) (owner: 10Jbond)
[13:50:39] <James_F>	 And then finally just a doc-only one.
[13:51:46] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10cmooney) @RobH There is no real preference on my side.  I would say pick one rack from E1/E2/E3/F1/F2/F3 and put the first 3 of them in that one, then place lvs1016 in a different rack from th...
[13:52:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:53:33] <xSavitar>	 James_F, thanks! Will share on CT Slack. I already asked there, and we'll move forward with whatever is decided.
[13:54:18] <James_F>	 xSavitar: Thank you!
[13:55:08] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:55:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10RobH)
[13:55:35] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:939286|prod: Enable wgCampaignEventsProgramsAndEventsDashboardInstance (T320260)]] (duration: 21m 19s)
[13:55:39] <stashbot>	 T320260: Dashboard integration: Configure the P&E Dashboard integration in prod - https://phabricator.wikimedia.org/T320260
[13:55:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939288 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy)
[13:55:54] <xSavitar>	 James_F, you're welcome. Yiannis is fine with the change as long as it stays friendly to EU time zones. I'll give you feedback once everyone settles. :)
[13:56:22] <wikibugs>	 (03Merged) 10jenkins-bot: private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939288 (https://phabricator.wikimedia.org/T320260) (owner: 10Daimona Eaytoy)
[13:56:25] <wikibugs>	 10SRE: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10TheDJ)
[13:56:32] * James_F nods.
[13:56:39] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:939288|private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret (T320260)]]
[13:56:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1001.eqiad.wmnet
[13:56:41] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[13:58:09] <wikibugs>	 10SRE: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10Jdforrester-WMF) Looking at the SAL, possible fall-out from T337997?
[13:58:21] <logmsgbot>	 !log jforrester@deploy1002 jforrester and daimona: Backport for [[gerrit:939288|private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret (T320260)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[14:00:08] <jinxer-wm>	 (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (PUT deployments) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:00:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Vgutierrez) Thanks @cmooney, @Fabfur will take care of running the decom cookbook (thanks!)
[14:01:19] <icinga-wm>	 PROBLEM - Host lsw1-f2-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:01:20] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) We were unaware that these moves would require an IP change (and by extension/recommendation a reimage).  There is more than 2TB of data (per host) that would have t...
[14:02:37] <icinga-wm>	 PROBLEM - Host ps1-f2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:07] <wikibugs>	 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10Joe)
[14:04:29] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts lvs1013.eqiad.wmnet
[14:04:44] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:939288|private/readme.php: Add $wgCampaignEventsProgramsAndEventsDashboardAPISecret (T320260)]] (duration: 08m 04s)
[14:04:47] <stashbot>	 T320260: Dashboard integration: Configure the P&E Dashboard integration in prod - https://phabricator.wikimedia.org/T320260
[14:05:19] <XioNoX>	 !log asw2-esams# set interfaces xe-4/0/4 disable - T342121
[14:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:23] <stashbot>	 T342121: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121
[14:06:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:06:20] <James_F>	 OK, finally, all done. 65 minutes for 5 patches. :-(
[14:07:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:29] <icinga-wm>	 PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:09:25] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) p:05Triage→03Medium
[14:10:00] <wikibugs>	 (03PS1) 10Ayounsi: Release v0.6.3 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939306
[14:10:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[14:10:33] <wikibugs>	 (03Abandoned) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937092 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz)
[14:12:15] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.dns.netbox
[14:16:27] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[14:16:27] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:16:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1001.eqiad.wmnet on all recursors
[14:16:30] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1001.eqiad.wmnet on all recursors
[14:16:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:17:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:55] <wikibugs>	 (03PS3) 10Hashar: Recognize ~/.config/docker-pkg.yaml [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991
[14:18:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[14:19:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver monitoring - https://phabricator.wikimedia.org/T342125 (10jbond)
[14:19:15] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1001.eqiad.wmnet - bking@cumin1001"
[14:19:15] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:19:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1001.eqiad.wmnet on all recursors
[14:19:18] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1001.eqiad.wmnet on all recursors
[14:19:25] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1001.eqiad.wmnet
[14:19:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppetserver monitoring - https://phabricator.wikimedia.org/T342125 (10jbond) 05Open→03In progress p:05Triage→03Medium
[14:19:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[14:21:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Release v0.6.3 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939306 (owner: 10Ayounsi)
[14:21:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Release v0.6.3 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939306 (owner: 10Ayounsi)
[14:21:34] <wikibugs>	 (03PS1) 10Jelto: gitlab: auto link existing users with OIDC [puppet] - 10https://gerrit.wikimedia.org/r/939307 (https://phabricator.wikimedia.org/T320390)
[14:22:09] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[14:22:37] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42524/console" [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[14:22:56] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1013.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[14:22:56] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:22:57] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs1013.eqiad.wmnet
[14:23:15] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `lvs1013.eqiad.wmnet` - lvs1013.eqiad.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager...
[14:23:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert)
[14:23:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert)
[14:24:37] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42525/console" [puppet] - 10https://gerrit.wikimedia.org/r/939307 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto)
[14:24:46] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur) lvs1013.eqiad.wmnet has been decommissioned via cookbook @Tue 18 Jul 2023 02:24:10 PM UTC
[14:25:05] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur)
[14:25:14] <wikibugs>	 (03CR) 10Hashar: Recognize ~/.config/docker-pkg.yaml (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/935991 (owner: 10Hashar)
[14:25:30] <wikibugs>	 (03PS13) 10Giuseppe Lavagetto: api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert)
[14:27:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert)
[14:27:35] <wikibugs>	 (03PS1) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125)
[14:28:13] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Switch to mw-api-int on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert)
[14:29:33] <wikibugs>	 (03PS2) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125)
[14:29:40] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts lvs1014.eqiad.wmnet
[14:30:33] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[14:31:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:31:49] <wikibugs>	 (03PS2) 10Ayounsi: Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082)
[14:32:26] <wikibugs>	 (03PS3) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125)
[14:33:09] <logmsgbot>	 !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[14:33:19] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[14:33:35] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42528/console" [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[14:34:14] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[14:34:56] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.dns.netbox
[14:35:24] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[14:35:45] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[14:36:56] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.3 - ayounsi@cumin1001
[14:36:58] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[14:37:06] <wikibugs>	 (03CR) 10Eevans: "To be clear: The component/cassandra311 repository has been updated to 3.11.14, making this changeset a no-op." [puppet] - 10https://gerrit.wikimedia.org/r/938917 (owner: 10Eevans)
[14:37:56] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1014.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[14:37:56] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:37:57] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs1014.eqiad.wmnet
[14:38:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `lvs1014.eqiad.wmnet` - lvs1014.eqiad.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager...
[14:38:23] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe)
[14:38:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.3 - ayounsi@cumin1001
[14:38:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[14:39:07] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe)
[14:39:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi)
[14:40:05] <wikibugs>	 (03Merged) 10jenkins-bot: Convert ACL policies to YAML for Aerleon [homer/public] - 10https://gerrit.wikimedia.org/r/929330 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi)
[14:40:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur)
[14:42:33] <wikibugs>	 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10cmooney) @TheDJ thanks for reporting this, indeed it does not look right and was an oversight by myself after we re-pooled esams earlier today.  We did some work earlier moving equipment in one of our...
[14:44:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:45:23] <sukhe>	 !log dns2004 upgrade to pdns-rec 4.8.4: T341611
[14:45:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:26] <stashbot>	 T341611: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611
[14:49:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:50:13] <wikibugs>	 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10TheDJ) 05Open→03Resolved a:03TheDJ Thank you, seems fixed now indeed.
[14:53:50] <icinga-wm>	 PROBLEM - Host db1198 #page is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:15] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts lvs1015.eqiad.wmnet
[14:54:17] <sukhe>	 ^^^
[14:54:31] <jbond>	 marostegui: is this known
[14:54:47] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[14:54:56] <marostegui>	  nop
[14:54:57] <marostegui>	 checking
[14:55:02] <herron>	 I'll ack the page
[14:55:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1198', diff saved to https://phabricator.wikimedia.org/P49571 and previous config saved to /var/cache/conftool/dbconfig/20230718-145529-root.json
[14:55:30] * jbond disconnects from mgmt port
[14:55:31] <marostegui>	 Depooled
[14:55:33] <marostegui>	 Thanks herron 
[14:55:35] <wikibugs>	 (03PS1) 10JMeybohm: deployment_server: Fix structure for certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939311 (https://phabricator.wikimedia.org/T300033)
[14:55:42] <marostegui>	 I am going to create a task so we can follow up there
[14:55:48] <jbond>	 marostegui: ack thanks
[14:55:55] <herron>	 sounds good marostegui 
[14:56:55] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1013 relocation - robh@cumin1001"
[14:57:14] <wikibugs>	 (03PS1) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527)
[14:57:17] <marostegui>	 Thanks both
[14:57:35] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42529/console" [puppet] - 10https://gerrit.wikimedia.org/r/939311 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[14:57:50] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1013 relocation - robh@cumin1001"
[14:57:50] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:58:13] <wikibugs>	 (03CR) 10JHathaway: install_server: drop Bashisms (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[14:58:40] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server: Fix structure for certmanager defaults [puppet] - 10https://gerrit.wikimedia.org/r/939311 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[14:58:41] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10ayounsi)
[14:58:47] <wikibugs>	 (03PS1) 10Marostegui: db1198: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/939313 (https://phabricator.wikimedia.org/T342129)
[15:00:17] <wikibugs>	 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Memory issues: ` Record:      15 Date/Time:   01/19/2023 16:23:12 Source:      system Severity:    Critical Description: Multi-bit memory errors are detected on the memory device at location(s) D...
[15:00:20] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1198: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/939313 (https://phabricator.wikimedia.org/T342129) (owner: 10Marostegui)
[15:00:33] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.provision for host lvs1013.mgmt.eqiad.wmnet with reboot policy FORCED
[15:01:03] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.dns.netbox
[15:01:19] <wikibugs>	 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Actually I realised those errors are old
[15:01:59] <wikibugs>	 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) There are no more errors on the idrac - @Jclark-ctr can you check it onsite? The host seems to be unreachable
[15:02:19] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:02:20] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs1015.eqiad.wmnet
[15:02:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `lvs1015.eqiad.wmnet` - lvs1015.eqiad.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager...
[15:03:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:38] <icinga-wm>	 RECOVERY - Host db1198 #page is UP: PING WARNING - Packet loss = 90%, RTA = 0.28 ms
[15:03:39] <wikibugs>	 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Jclark-ctr) a:03Jclark-ctr @Marostegui  looking at it now
[15:03:44] <wikibugs>	 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Thanks!
[15:03:51] <icinga-wm>	 PROBLEM - SSH on db1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:04:47] <icinga-wm>	 RECOVERY - SSH on db1198 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:05:36] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur)
[15:07:16] <wikibugs>	 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Jclark-ctr) 05Open→03Resolved @Marostegui Replaced cable   link is back up now
[15:08:14] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host lvs1013.mgmt.eqiad.wmnet with reboot policy FORCED
[15:09:07] <wikibugs>	 10ops-eqiad, 10DBA, 10Patch-For-Review: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Thanks John! I can reach the server now! It looks like it indeed didn't crash, the uptime is 70 days and MySQL is also up. Logs confirm it was a network issue: ` [Tue Jul 18 14:48:46 2023] tg3 0...
[15:11:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/939307 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto)
[15:12:38] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/938917 (owner: 10Eevans)
[15:12:40] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] roll-restart-wikimedia-dns: Add reboot action (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall)
[15:13:01] <wikibugs>	 10ops-eqiad, 10DBA: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Jclark-ctr) @Marostegui  Sorry, I did close the task; feel free to reopen it if you want.  I was able to duplicate losing link with the cable. Replaced sfp-t and cable
[15:13:31] <wikibugs>	 10ops-eqiad, 10DBA: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Don't worry, no need to reopen :)
[15:13:41] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: remove pybal log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:14:08] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: remove k8s stats-exporter cloning [puppet] - 10https://gerrit.wikimedia.org/r/937603 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:14:37] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: remove grafana log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937602 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:15:17] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: restore program field to node logs [puppet] - 10https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:16:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42530/console" [puppet] - 10https://gerrit.wikimedia.org/r/939308 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[15:18:03] <wikibugs>	 (03PS1) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125)
[15:18:12] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bullseye
[15:18:19] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye
[15:18:54] <wikibugs>	 (03CR) 10JHathaway: [C: 04-1] monitoring: fix bashisms and other minor lint issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[15:19:57] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10Mpossoupe)
[15:20:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[15:20:45] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1013.eqiad.wmnet with OS bullseye
[15:20:50] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye executed with errors: - lvs1013 (**FAIL**)   - Removed from Pup...
[15:21:11] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs1013
[15:21:13] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1013
[15:22:12] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] kubeadm: the use of read -p suggest this should be using bash [puppet] - 10https://gerrit.wikimedia.org/r/938899 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[15:22:52] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bullseye
[15:22:59] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye
[15:23:22] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.decommission for hosts lvs1016.eqiad.wmnet
[15:23:36] <wikibugs>	 (03PS13) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[15:23:38] <wikibugs>	 (03PS1) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033)
[15:24:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:24:10] <wikibugs>	 (03CR) 10Jsn.sherman: beta: log additional click events on Special:Diff|MobileDiff (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896432 (https://phabricator.wikimedia.org/T326214) (owner: 10Jsn.sherman)
[15:24:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:25:38] <wikibugs>	 (03PS2) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125)
[15:26:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42532/console" [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[15:28:30] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.dns.netbox
[15:30:18] <wikibugs>	 (03PS1) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/939326 (https://phabricator.wikimedia.org/T301944)
[15:31:03] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[15:31:24] <wikibugs>	 (03PS2) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033)
[15:31:26] <wikibugs>	 (03PS14) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[15:31:36] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Fabfur)
[15:31:56] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fabfur@cumin1001"
[15:31:56] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:31:57] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs1016.eqiad.wmnet
[15:32:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by fabfur@cumin1001 for hosts: `lvs1016.eqiad.wmnet` - lvs1016.eqiad.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager...
[15:33:53] <wikibugs>	 (03PS3) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125)
[15:34:39] <wikibugs>	 (03CR) 10JHathaway: ssh: switch to using the same file we use in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[15:34:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42533/console" [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[15:36:52] <wikibugs>	 (03PS4) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125)
[15:37:56] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42535/console" [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[15:39:04] <wikibugs>	 (03PS5) 10Jbond: puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125)
[15:39:55] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] cassandra: transition 3.11.14 from 'dev' to '3.x' [puppet] - 10https://gerrit.wikimedia.org/r/938917 (owner: 10Eevans)
[15:40:05] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42536/console" [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[15:41:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: noc: add script to dump etcd db config (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[15:41:24] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: noc: add script to dump etcd db config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859)
[15:41:26] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: noc/db.php: use the new etcd fetch function [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859)
[15:42:21] <wikibugs>	 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10Papaul)
[15:43:26] <wikibugs>	 (03PS3) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033)
[15:43:28] <wikibugs>	 (03PS15) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[15:43:30] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939318 (https://phabricator.wikimedia.org/T340246)
[15:43:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939318 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot)
[15:44:14] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939318 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot)
[15:44:40] <logmsgbot>	 !log dancy@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.18  refs T340246
[15:44:43] <stashbot>	 T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246
[15:46:24] <wikibugs>	 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10Papaul) Onsite work complete on asw-b1-codfw
[15:46:29] <wikibugs>	 (03PS4) 10Jbond: install_server: drop Bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064)
[15:48:32] <wikibugs>	 (03CR) 10JMeybohm: ".fixtures/service_proxy.yaml can actually be removed now, but I had issues doing so because the git revert code obviously recreates it as " [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[15:48:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:49:11] <wikibugs>	 (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[15:54:09] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] noc: add script to dump etcd db config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[15:55:11] <wikibugs>	 (03PS5) 10Cwhite: Logstash: implement availability SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461)
[15:55:26] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/939314 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[15:59:41] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppetserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:43] <wikibugs>	 (03CR) 10JHathaway: install_server: drop Bashisms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[16:00:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:00:05] <jouncebot>	 jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1600).
[16:00:05] <jouncebot>	 dancy and dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:10] <dancy>	 o/
[16:00:10] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[16:01:00] <wikibugs>	 10ops-eqiad: analytics1073 loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis)
[16:01:29] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs1013.eqiad.wmnet with OS bullseye
[16:01:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:34] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye executed with errors: - lvs1013 (**FAIL**)   - Removed from Pup...
[16:01:54] <wikibugs>	 10ops-eqiad, 10Infrastructure-Foundations: analytics1073 loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis)
[16:02:32] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs10145 relocation - robh@cumin1001"
[16:02:45] <wikibugs>	 (03CR) 10Kaleem Bhatti: "merge please" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti)
[16:02:57] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH)
[16:03:10] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs1014
[16:03:16] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs10145 relocation - robh@cumin1001"
[16:03:16] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:03:17] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1014
[16:03:21] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs1015
[16:03:27] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1015
[16:04:17] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH)
[16:04:46] <wikibugs>	 (03PS1) 10Jbond: puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125)
[16:05:03] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:05:13] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10bking) Was thinking a bit more about this...would it work to do some minimal sanity-checking on the DNS changes (such as t...
[16:05:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:05:52] <dancy>	 jbond: Are you handling the puppet window today?
[16:07:00] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10jbond) i think this will ultimately be solved by adding locking support to cookbooks, see T341973
[16:08:07] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10jbond) > It looks like there is work in progress to add locking to cookbooks, which would be an acceptable workaround. ind...
[16:08:26] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10jbond)
[16:08:30] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10jbond)
[16:08:59] <rzl>	 dancy: sorry I had an interview run long, looking now!
[16:09:04] <dancy>	 Thanks!
[16:09:50] <rzl>	 any preferred order, or do em both at once?
[16:10:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:10:16] <dancy>	 938931 first, then a quick test, then 938939
[16:10:23] <rzl>	 👍
[16:10:37] <wikibugs>	 (03PS2) 10Jbond: puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125)
[16:10:38] <rzl>	 on gitlab-runner1003, yeah?
[16:10:56] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Use buildkit wmf-v0.11-8 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/938931 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy)
[16:11:04] <dancy>	 938931 applies to all of the WMCS and trusted runners (there are several)
[16:11:25] <dancy>	 Likewise for 938939
[16:11:43] <rzl>	 sure, I can run puppet on all of them if you like
[16:11:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42538/console" [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:11:52] <dancy>	 Yes please.
[16:12:20] <rzl>	 is this the right set? gitlab-runner[2002-2004].codfw.wmnet,gitlab-runner[1002-1004].eqiad.wmnet
[16:12:30] <rzl>	 I think you're on your own for the WMCS ones
[16:12:58] <wikibugs>	 (03PS1) 10DCausse: Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate [extensions/CirrusSearch] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939327
[16:13:02] <wikibugs>	 (03PS2) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527)
[16:13:14] <wikibugs>	 (03PS3) 10Jbond: puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125)
[16:13:53] <wikibugs>	 (03PS1) 10DCausse: Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate [extensions/CirrusSearch] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939328
[16:13:54] <dancy>	 rzl: yes that's the right set for the trusted runners
[16:14:19] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42539/console" [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:14:21] <dancy>	 I can wait for the regular puppet runs for the WMCS runners.
[16:15:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: add jmx config [puppet] - 10https://gerrit.wikimedia.org/r/939322 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:15:32] <rzl>	 done, except puppet failed on gitlab-runner2003.codfw.wmnet, having a look
[16:16:09] <rzl>	 oh, it just didn't get the lock because a regular run was already in progress
[16:16:11] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:14] <rzl>	 cool 👍 go ahead and test
[16:16:26] <dancy>	 ok.. in progress.
[16:18:30] <wikibugs>	 (03CR) 10Cwhite: [V: 03+2 C: 03+2] Logstash: implement availability SLO (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) (owner: 10Cwhite)
[16:19:41] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:44] <wikibugs>	 (03PS1) 10Jbond: puppetserver: add dependency [puppet] - 10https://gerrit.wikimedia.org/r/939324
[16:19:50] <rzl>	 dancy: ^ fyi
[16:20:32] <dancy>	 hmmm
[16:21:11] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[16:22:39] <wikibugs>	 (03PS3) 10Mabualruz: Run a synthetic test for client side preferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527)
[16:23:11] <dancy>	 rzl: Looks like that was a one-off glitch (a problem communicating with dockerd?).  Subsequent runs of the service seem to be ok.
[16:23:52] <rzl>	 good enough for me
[16:24:00] <rzl>	 should I go ahead with the next patch?
[16:24:17] <dancy>	 Yes.  First phase of testing passed.  Ready for the next.
[16:24:26] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Restrict buildkitd frontend gateway and allowed sourced on trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/938939 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy)
[16:24:45] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1075.eqiad.wmnet with OS bullseye
[16:25:23] <rzl>	 puppet's running now
[16:26:13] <rzl>	 and done, over to you
[16:28:22] <wikibugs>	 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10colewhite) 05Open→03Resolved We have updated the SLI to an availability.  Changes are applied to the dash...
[16:28:38] <dancy>	 rzl: Second test completed. Thanks for deploying!
[16:28:40] <elukey>	 !log maintenance finished for kafka main-codfw
[16:28:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:48] <rzl>	 rad, thanks!
[16:29:52] <wikibugs>	 (03PS1) 10Ayounsi: Aerleon: workaround regression with includes [homer/public] - 10https://gerrit.wikimedia.org/r/939325 (https://phabricator.wikimedia.org/T337082)
[16:30:42] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Aerleon: workaround regression with includes [homer/public] - 10https://gerrit.wikimedia.org/r/939325 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi)
[16:30:55] <logmsgbot>	 !log dancy@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.18  refs T340246 (duration: 46m 15s)
[16:31:01] <stashbot>	 T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246
[16:31:37] <wikibugs>	 (03Merged) 10jenkins-bot: Aerleon: workaround regression with includes [homer/public] - 10https://gerrit.wikimedia.org/r/939325 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi)
[16:33:09] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.41.0-wmf.16 (duration: 02m 11s)
[16:34:10] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10WDoranWMF)
[16:34:49] <wikibugs>	 (03PS2) 10Jbond: puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125)
[16:36:35] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10WDoranWMF) I'm marking this as high because Thomas will need the access in order to be able to start supporting Ops work starting week...
[16:36:41] <wikibugs>	 (03PS3) 10Jbond: puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125)
[16:36:47] <wikibugs>	 (03PS1) 10Btullis: Upgrade the analytics instance of airflow to version 2.6.3 [puppet] - 10https://gerrit.wikimedia.org/r/939347 (https://phabricator.wikimedia.org/T336286)
[16:37:05] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:39:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:39:09] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) p:05Triage→03High
[16:39:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:39:57] <wikibugs>	 (03PS4) 10Jbond: puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125)
[16:42:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10JAllemandou) Indeed, being part of Data Engineering team, Thomas will be in charge during his ops-week time to restart jobs as the `ana...
[16:44:24] <wikibugs>	 (03PS5) 10Jbond: puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125)
[16:44:30] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) Happily, I can also approve this change, as per: https://gerrit.wikimedia.org/r/c/operations/puppet/+/933976 I'll merge this a...
[16:44:35] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update approvers for analytics posix groups [puppet] - 10https://gerrit.wikimedia.org/r/933976 (owner: 10Ottomata)
[16:45:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42543/console" [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:45:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:46:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetserver: drop monitoring profile and handle jmx in modules [puppet] - 10https://gerrit.wikimedia.org/r/939324 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[16:48:42] <wikibugs>	 (03PS1) 10Btullis: Add tchin to the analytics-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/939348 (https://phabricator.wikimedia.org/T342146)
[16:49:02] <wikibugs>	 (03PS1) 10AOkoth: vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027)
[16:50:40] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE, 10Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) p:05Triage→03High I have created: https://gerrit.wikimedia.org/r/939348
[16:56:52] <wikibugs>	 (03CR) 10Jdlrobson: "Jan: per standup can you run tests locally and compare results with Mo on the ticket. Thanks in advance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939312 (https://phabricator.wikimedia.org/T336527) (owner: 10Mabualruz)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1700)
[17:01:39] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.55.0" for 605 hosts
[17:02:35] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.55.0" completed for 605 hosts
[17:04:02] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough-drmrs and A:wikidough
[17:07:20] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[17:07:34] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:07:52] <sukhe>	 ^ expected
[17:07:56] <wikibugs>	 (03PS6) 10Jdlrobson: Deploy new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260)
[17:08:14] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:08:52] <wikibugs>	 (03PS1) 10Jdlrobson: Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939329 (https://phabricator.wikimedia.org/T340081)
[17:09:01] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10lmata)
[17:09:02] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:09:42] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:10:12] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10lmata) p:05Triage→03Medium
[17:10:20] <wikibugs>	 (03CR) 10Jdlrobson: "FYI. I'll backport this today. Intentionally doing this on English Wikipedia but not wmf18 so we get a couple of days of data since the er" [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939329 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[17:10:37] <wikibugs>	 (03PS1) 10Jdlrobson: Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939330 (https://phabricator.wikimedia.org/T340081)
[17:13:18] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE, 10Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) a:03BTullis
[17:14:45] <wikibugs>	 10SRE, 10Observability-Alerting: Setup some alert mechanism when some 'critical' cron jobs fail - https://phabricator.wikimedia.org/T187101 (10lmata) Understood, we will make a note of it in our backlog and carefully evaluate it when the opportunity arises
[17:14:56] <wikibugs>	 (03PS1) 10Jdlrobson: Fixes: Mobile login watermark large and uncentered [extensions/MobileFrontend] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939331 (https://phabricator.wikimedia.org/T341812)
[17:16:22] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:16:25] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1002.eqiad.wmnet
[17:16:26] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[17:16:36] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:18:06] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:19:09] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001"
[17:19:20] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:19:46] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough-drmrs and A:wikidough
[17:19:55] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001"
[17:19:55] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:19:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1002.eqiad.wmnet on all recursors
[17:19:58] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1002.eqiad.wmnet on all recursors
[17:20:23] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1002.eqiad.wmnet - bking@cumin1001"
[17:20:36] <wikibugs>	 (03CR) 10Ssingh: "So we were testing this today and observed that when the host reboots, Puppet is still disabled. And because it is disabled, it won't find" [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall)
[17:20:50] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: disable unprivileged_userns [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334)
[17:21:06] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1002.eqiad.wmnet - bking@cumin1001"
[17:25:18] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1002.eqiad.wmnet with OS bookworm
[17:25:25] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm
[17:27:05] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[17:28:46] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10lmata) p:05Triage→03Medium
[17:29:01] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016 relocation - robh@cumin1001"
[17:29:08] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10lmata)
[17:29:46] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs1016 relocation - robh@cumin1001"
[17:29:47] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:30:23] <wikibugs>	 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10lmata) Since this is part of core work would this be better fitted as "high" priority?
[17:30:28] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host lvs1016
[17:30:36] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs1016
[17:30:58] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH)
[17:33:58] <wikibugs>	 (03CR) 10Jelto: "I found two more references here:" [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) (owner: 10EoghanGaffney)
[17:40:10] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42544/console" [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy)
[17:42:23] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] Add tchin to the analytics-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/939348 (https://phabricator.wikimedia.org/T342146) (owner: 10Btullis)
[17:45:06] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1075.eqiad.wmnet with OS bullseye
[17:46:31] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner
[17:47:49] <wikibugs>	 (03CR) 10KaleemBot: [C: 03+1] sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti)
[17:52:56] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm. Do you have a ldap_group_sync_bot_token I can put into private puppet? I can also create a token myself if you have a bot-user/proje" [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy)
[17:56:00] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:57:04] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10andrea.denisse) Hi @Mpossoupe , could you please fill the details of your request using the [[ https://phabricator.wikimedia.org/project/profile/1564/ | LDAP-Access-Requests templates ]] and tag your manage...
[17:57:11] <wikibugs>	 (03CR) 10Ahmon Dancy: Run LDAP group sync periodically on gitlab replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932343 (https://phabricator.wikimedia.org/T319211) (owner: 10Ahmon Dancy)
[17:57:20] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.294 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:57:25] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10andrea.denisse) a:03andrea.denisse
[17:58:08] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, will merge this on Thursday if that's fine for you Antoine" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[18:00:04] <jouncebot>	 dancy and dduvall: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T1800).
[18:01:56] <dancy>	 🚂
[18:02:11] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939358 (https://phabricator.wikimedia.org/T340246)
[18:02:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939358 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot)
[18:03:10] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939358 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot)
[18:03:36] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: disable unprivileged_userns [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334) (owner: 10Jelto)
[18:03:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10Mpossoupe) Hi @andrea.denisse , Noted. Will do and let you know.  Thanks
[18:09:23] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:10:03] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.18  refs T340246
[18:10:07] <stashbot>	 T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246
[18:16:27] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1002.eqiad.wmnet with OS bookworm
[18:16:27] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1002.eqiad.wmnet
[18:16:32] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1002.eqiad.wmnet with OS bookworm executed w...
[18:28:29] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh)
[18:28:38] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ssingh) p:05Triage→03Medium
[18:36:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:38:27] <wikibugs>	 (03CR) 10Gergő Tisza: IP Masking: Enable for cswiki beta (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[18:40:42] <wikibugs>	 (03CR) 10Gergő Tisza: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[18:41:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:43:30] <wikibugs>	 (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[18:51:01] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[18:51:18] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 17s)
[18:54:41] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[18:54:46] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s)
[18:57:20] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[18:57:39] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 18s)
[19:08:12] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Traffic: Q3:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH)
[19:10:35] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Traffic: Q3:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH) a:03ssingh Please note parent task 341588 has the range of cp1[090-105] however, cp1090 is already live/in use.  Additionally, we have 4 cp hosts from eqsin to use for CP in eqiad (so c...
[19:10:44] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Traffic: Q3:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH)
[19:10:59] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Traffic: Q3:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH)
[19:14:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:19:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:21:55] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10RobH)
[19:22:01] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10RobH)
[19:34:31] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10RobH)
[19:34:47] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10RobH)
[19:37:31] <wikibugs>	 (03PS7) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945)
[19:37:33] <wikibugs>	 (03PS6) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[19:38:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[19:39:58] <wikibugs>	 (03PS7) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[19:45:14] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10RobH)
[19:45:26] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10RobH)
[19:49:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.mgmt.eqiad.wmnet']
[19:49:13] <logmsgbot>	 !log btullis@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['analytics1073.mgmt.eqiad.wmnet']
[19:49:42] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[19:50:15] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[19:52:48] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[19:53:06] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1073.eqiad.wmnet']
[19:53:33] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['analytics1075.eqiad.wmnet']
[19:53:53] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['analytics1075.eqiad.wmnet']
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230718T2000). Please do the needful.
[20:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:45] <urbanecm>	 i can deploy today
[20:00:47] <urbanecm>	 hi Jdlrobson 
[20:00:59] <Jdlrobson>	 here
[20:01:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939329 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:02:31] <urbanecm>	 Jdlrobson: your comment on the Popups patch says "doing this on English Wikipedia but not wmf18", but a wmf.18 patch is in the calendar as well. is that intentional?
[20:02:46] <Jdlrobson>	 yes that's intentional
[20:02:58] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Fixes: Mobile login watermark large and uncentered [extensions/MobileFrontend] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939331 (https://phabricator.wikimedia.org/T341812) (owner: 10Jdlrobson)
[20:03:19] <urbanecm>	 okay, so +2'ing the other patch too then.
[20:03:22] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939330 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:03:57] <wikibugs>	 (03PS1) 10Jbond: puppetserver: use FQDN in metric [puppet] - 10https://gerrit.wikimedia.org/r/939362 (https://phabricator.wikimedia.org/T342125)
[20:04:07] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10RobH)
[20:04:18] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10RobH)
[20:04:29] <wikibugs>	 (03PS7) 10Urbanecm: Deploy new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson)
[20:04:35] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Deploy new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson)
[20:05:08] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42545/console" [puppet] - 10https://gerrit.wikimedia.org/r/939362 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[20:05:51] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937480 (https://phabricator.wikimedia.org/T341260) (owner: 10Jdlrobson)
[20:06:24] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:937480|Deploy new logos (T341260 T341243 T341912)]]
[20:06:32] <stashbot>	 T341243: Design: Get icons for remaining Wiktionary,  Wikiversity, Wikibooks projects - https://phabricator.wikimedia.org/T341243
[20:06:32] <stashbot>	 T341912: Update knwikisource logos - https://phabricator.wikimedia.org/T341912
[20:06:32] <stashbot>	 T341260: Design: Provide wordmarks for Wikiquote projects - https://phabricator.wikimedia.org/T341260
[20:06:46] <wikibugs>	 (03Merged) 10jenkins-bot: Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939329 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:07:52] <logmsgbot>	 !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:937480|Deploy new logos (T341260 T341243 T341912)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:08:09] <urbanecm>	 Jdlrobson: your config patch is at mwdebug1001, can you test? :)
[20:08:21] <Jdlrobson>	 yep
[20:10:22] <Jdlrobson>	 urbanecm: LGTM
[20:10:30] <urbanecm>	 proceeding
[20:16:14] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:937480|Deploy new logos (T341260 T341243 T341912)]] (duration: 09m 50s)
[20:16:20] <urbanecm>	 and done
[20:16:21] <stashbot>	 T341243: Design: Get icons for remaining Wiktionary,  Wikiversity, Wikibooks projects - https://phabricator.wikimedia.org/T341243
[20:16:21] <stashbot>	 T341912: Update knwikisource logos - https://phabricator.wikimedia.org/T341912
[20:16:22] <stashbot>	 T341260: Design: Provide wordmarks for Wikiquote projects - https://phabricator.wikimedia.org/T341260
[20:16:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:17:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939330 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:17:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939331 (https://phabricator.wikimedia.org/T341812) (owner: 10Jdlrobson)
[20:17:19] <wikibugs>	 (03Merged) 10jenkins-bot: Fixes: Mobile login watermark large and uncentered [extensions/MobileFrontend] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939331 (https://phabricator.wikimedia.org/T341812) (owner: 10Jdlrobson)
[20:17:23] <wikibugs>	 (03Merged) 10jenkins-bot: Add additional debugging closest bug [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939330 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:17:54] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:939329|Add additional debugging closest bug (T340081)]], [[gerrit:939330|Add additional debugging closest bug (T340081)]], [[gerrit:939331|Fixes: Mobile login watermark large and uncentered (T341812)]]
[20:17:59] <stashbot>	 T341812: Mobile login watermark large and uncentered - https://phabricator.wikimedia.org/T341812
[20:17:59] <stashbot>	 T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081
[20:19:28] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:939329|Add additional debugging closest bug (T340081)]], [[gerrit:939330|Add additional debugging closest bug (T340081)]], [[gerrit:939331|Fixes: Mobile login watermark large and uncentered (T341812)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:19:45] <urbanecm>	 Jdlrobson: all three backports are at mwdebug1001, can you test them now please?
[20:20:01] <Jdlrobson>	 looking :)
[20:21:16] <Jdlrobson>	 all of them LGTM
[20:21:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:21:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:22:36] <urbanecm>	 great, proceeding
[20:25:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:26:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1002.eqiad.wmnet
[20:26:54] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[20:28:21] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10andrea.denisse) Hi! I see this task in the SRE Clinic Duty Triage, feel free to let me know if you would like me to help with creating the VMs. :)
[20:28:22] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:939329|Add additional debugging closest bug (T340081)]], [[gerrit:939330|Add additional debugging closest bug (T340081)]], [[gerrit:939331|Fixes: Mobile login watermark large and uncentered (T341812)]] (duration: 10m 28s)
[20:28:27] <stashbot>	 T341812: Mobile login watermark large and uncentered - https://phabricator.wikimedia.org/T341812
[20:28:27] <stashbot>	 T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081
[20:28:27] <urbanecm>	 Jdlrobson: and deployed
[20:28:34] <urbanecm>	 anything else?
[20:28:43] <Jdlrobson>	 thanks urbanecm, I'll keep an eye on logstash. Should need 10 mins to double-check everything is good
[20:29:06] <urbanecm>	 sounds good. feel free to ping me should a revert become necessary.
[20:29:07] <wikibugs>	 (03CR) 10Gergő Tisza: IP Masking: Enable for cswiki beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[20:30:04] <Jdlrobson>	 urbanecm: looks like it's working to me
[20:30:07] <urbanecm>	 awesome
[20:30:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:30:47] <Jdlrobson>	 urbanecm: hm.. it does look a lot higher than expected though
[20:30:56] <urbanecm>	 higher?
[20:31:22] <Jdlrobson>	 urbanecm: yeh i think we might need to follow up or revert.
[20:31:35] <Jdlrobson>	 hang on
[20:31:38] <urbanecm>	 waiting
[20:32:06] <Jdlrobson>	 yeh :(
[20:32:08] <Jdlrobson>	 at least for enwiki for now
[20:32:16] <Jdlrobson>	 I can follow up on the deployment branch
[20:33:23] <urbanecm>	 I'd prefer a revert, unless the fix is very easy -- B&C is usually not intended for code review.
[20:33:41] <Jdlrobson>	 I've got a follow-up ready
[20:33:47] <Jdlrobson>	 the code's in the wrong place
[20:33:51] <Jdlrobson>	 it should be moved down
[20:33:58] <Jdlrobson>	 so a trivial bug on my part :/
[20:34:06] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:34:30] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/939364
[20:35:19] <Jdlrobson>	 ^ urbanecm how do you feel about backporting that?
[20:35:25] <Jdlrobson>	 i can get it merged to master today
[20:35:29] <Jdlrobson>	 just might take longer than 30 mins
[20:35:51] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/939364/2/src/index.js is the significant part of the change
[20:36:03] <urbanecm>	 Jdlrobson: let's try backporting
[20:36:13] <Jdlrobson>	 we can try enwiki first 
[20:36:21] <Jdlrobson>	 1.41.0-wmf.17
[20:36:29] <Jdlrobson>	 if that works then we'll put it on 1.41.0-wmf.18
[20:36:41] <wikibugs>	 (03PS1) 10Jdlrobson: Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081)
[20:36:57] <urbanecm>	 sounds good
[20:37:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:39:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:39:35] <wikibugs>	 (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[20:40:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:40:49] <urbanecm>	 Jdlrobson: ^^^
[20:41:07] <urbanecm>	 i guess let's restart, as it's selenium?
[20:41:12] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001"
[20:42:22] <urbanecm>	 main passed, restarting...
[20:42:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:42:57] <wikibugs>	 (03PS1) 10Jdlrobson: Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939333 (https://phabricator.wikimedia.org/T340081)
[20:43:09] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001"
[20:43:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:43:10] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1002.eqiad.wmnet on all recursors
[20:43:13] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1002.eqiad.wmnet on all recursors
[20:43:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[20:46:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:08] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10RobH)
[20:47:37] <wikibugs>	 (03Merged) 10jenkins-bot: Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939332 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[20:47:55] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10RobH)
[20:48:00] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.dhcp for host analytics1073.eqiad.wmnet
[20:50:53] <wikibugs>	 10SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata)
[20:52:53] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:939332|Don't log for documentElement (nodeType 9) (T340081)]]
[20:52:57] <stashbot>	 T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081
[20:54:10] <wikibugs>	 10SRE, 10Observability-Metrics: node_cpu_frequency_hertz metric no longer present in Bullseye - https://phabricator.wikimedia.org/T286768 (10lmata)
[20:54:24] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:939332|Don't log for documentElement (nodeType 9) (T340081)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:54:38] <urbanecm>	 Jdlrobson: finally on mwdebug. can you test?
[20:55:38] <Jdlrobson>	 if not urbanecm yep
[20:56:33] <Jdlrobson>	 https://en.wikipedia.org/wiki/Main_Page looking good..
[20:57:14] <urbanecm>	 so, let's proceed then
[20:59:38] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add tchin to the analytics-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/939348 (https://phabricator.wikimedia.org/T342146) (owner: 10Btullis)
[21:00:35] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[21:02:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001"
[21:02:55] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:939332|Don't log for documentElement (nodeType 9) (T340081)]] (duration: 10m 01s)
[21:02:58] <stashbot>	 T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081
[21:03:28] <urbanecm>	 so, deployed.
[21:03:31] <urbanecm>	 let's do .18 then?
[21:03:34] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1002.eqiad.wmnet - bking@cumin1001"
[21:03:34] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:03:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1002.eqiad.wmnet on all recursors
[21:03:37] <Jdlrobson>	 urbanecm: just verifying the volume goes down
[21:03:37] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1002.eqiad.wmnet on all recursors
[21:03:42] <urbanecm>	 sure
[21:03:44] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1002.eqiad.wmnet
[21:03:54] <Jdlrobson>	 group 0 / 1 is not a problem for volume as I don't think any projects run page previews
[21:04:07] <urbanecm>	 ack
[21:04:16] <Jdlrobson>	 group 0 sorry
[21:04:21] <Jdlrobson>	 hewiki and cawiki run it and are in group 1
[21:04:52] <Jdlrobson>	 so far so good... https://usercontent.irccloud-cdn.com/file/KDQ2brpU/Screenshot%202023-07-18%20at%202.04.41%20PM.png
[21:05:07] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE, 10Patch-For-Review: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10BTullis) @tchin that is now done. Welcome to the `analytics-admins` group.
[21:05:47] <urbanecm>	 👍
[21:06:51] <Jdlrobson>	 ok yep this looks good
[21:06:54] <Jdlrobson>	 we can backport the other one
[21:07:01] <Jdlrobson>	 I'm seeing the tail :)
[21:08:23] <urbanecm>	 great, let's go for it
[21:08:49] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939333 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[21:08:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939333 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[21:10:50] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[21:10:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[21:13:18] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[21:13:57] <wikibugs>	 (03Merged) 10jenkins-bot: Don't log for documentElement (nodeType 9) [extensions/Popups] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939333 (https://phabricator.wikimedia.org/T340081) (owner: 10Jdlrobson)
[21:14:03] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[21:14:03] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:14:04] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors
[21:14:07] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors
[21:14:25] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:939333|Don't log for documentElement (nodeType 9) (T340081)]]
[21:14:29] <stashbot>	 T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081
[21:14:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[21:15:18] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[21:15:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm
[21:15:34] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[21:15:37] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install X - https://phabricator.wikimedia.org/T342176 (10RobH)
[21:15:55] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10RobH)
[21:15:56] <logmsgbot>	 !log urbanecm@deploy1002 jdlrobson and urbanecm: Backport for [[gerrit:939333|Don't log for documentElement (nodeType 9) (T340081)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[21:16:29] <urbanecm>	 proceeding, additional testing seems unnecessary
[21:16:38] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10RobH)
[21:19:43] <Jdlrobson>	 urbanecm: thanks
[21:22:08] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:939333|Don't log for documentElement (nodeType 9) (T340081)]] (duration: 07m 42s)
[21:22:12] <stashbot>	 T340081: TypeError: n.closest is not a function - https://phabricator.wikimedia.org/T340081
[21:22:21] <urbanecm>	 Jdlrobson: and should be all done
[21:22:28] <urbanecm>	 anything else i can help with today?
[21:28:05] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host analytics1073.eqiad.wmnet
[21:28:13] <Jdlrobson>	 thanks urbanecm so sorry this overran
[21:28:21] <urbanecm>	 no worries
[21:28:47] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10RobH)
[21:29:17] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: restore program field to node logs [puppet] - 10https://gerrit.wikimedia.org/r/937605 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[21:32:30] <jinxer-wm>	 (Traffic bill over quota) firing: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[21:36:08] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) I've upgraded:  * the iDRAC version * the NIC firmware * the BIOS  I tried two versions of the NIC firmware, in case it was t...
[21:46:41] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T342071 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue
[21:51:38] <wikibugs>	 (03CR) 10Subramanya Sastry: Set default for UseLegacyMediaStyles and disable on officewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433) (owner: 10Arlolra)
[21:52:30] <jinxer-wm>	 (Traffic bill over quota) resolved: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[21:59:24] <wikibugs>	 (03PS1) 10Subramanya Sastry: Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433)
[22:00:05] <wikibugs>	 (03CR) 10C. Scott Ananian: Set default for UseLegacyMediaStyles and disable on officewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937544 (https://phabricator.wikimedia.org/T318433) (owner: 10Arlolra)
[22:00:55] <wikibugs>	 (03CR) 10Subramanya Sastry: "To be merged as part of the backport window tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry)
[22:01:09] <wikibugs>	 (03CR) 10C. Scott Ananian: [C: 03+1] "agreed w/ diagnosis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry)
[22:06:55] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1003.eqiad.wmnet with OS bookworm
[22:06:55] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[22:07:00] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[22:07:21] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10MediaWiki-extensions-LdapAuthentication, and 2 others: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10Pppery)
[22:09:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:12:04] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[22:12:05] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[22:16:50] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[22:18:33] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[22:18:34] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:18:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors
[22:18:37] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors
[22:18:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[22:23:25] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[22:24:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[22:24:10] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:24:10] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors
[22:24:13] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors
[22:24:20] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[22:28:15] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: remove pybal log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937600 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[22:31:43] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] "After-the-fact bug report created:" [cookbooks] - 10https://gerrit.wikimedia.org/r/937173 (owner: 10BCornwall)
[22:32:29] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[22:32:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[22:34:21] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[22:34:27] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[22:44:28] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on P{doh5002*} and A:wikidough
[22:44:40] <wikibugs>	 (03CR) 10Mabualruz: [C: 03+1] Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia)
[22:46:31] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Papaul) @BTullis we had the same issue with sessionstore2001 in codfw, see task below. What we did was to replace the 1G RJ45/SFP convert...
[22:48:35] <wikibugs>	 10SRE, 10Icinga, 10Observability-Alerting, 10observability: Icinga passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 (10lmata)
[22:49:20] <wikibugs>	 10SRE, 10Observability-Logging: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10lmata)
[22:50:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops, and 2 others: Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350 (10lmata)
[22:50:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (10lmata)
[22:51:05] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on P{doh5002*} and A:wikidough
[22:51:57] <wikibugs>	 10SRE, 10Observability-Alerting: alertmanager silence confirmation page links to localhost - https://phabricator.wikimedia.org/T328869 (10lmata)
[22:52:05] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "This was run successfully on doh5002: Adding disable_puppet_on_reboot = True behaved as expected." [cookbooks] - 10https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall)
[22:52:15] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "This was run successfully on doh5002: Adding disable_puppet_on_reboot = True behaved as expected." [cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall)
[22:52:43] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10MediaWiki-extensions-LdapAuthentication, and 2 others: wikitech logins set the email address every time - https://phabricator.wikimedia.org/T339917 (10taavi) 05Open→03Resolved a:03taavi Verified my fix works on labtestwikitech, it'll roll out to wikitech...
[22:56:56] <wikibugs>	 (03PS3) 10Mabualruz: Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia)
[23:11:00] <wikibugs>	 10SRE-OnFire, 10Incident Tooling: Provide mechanism to join/leave oncall - https://phabricator.wikimedia.org/T322636 (10lmata)
[23:16:12] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Observability-Alerting, 10observability: RAID check opened a ticket for kubernetes2012 while it was being reimaged - https://phabricator.wikimedia.org/T330150 (10lmata)
[23:19:13] <wikibugs>	 10SRE, 10Observability-Alerting, 10observability: Handle HBA controllers in get-raid-status-hpssacli - https://phabricator.wikimedia.org/T185216 (10lmata)
[23:19:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (10lmata)
[23:22:02] <wikibugs>	 (03CR) 10Jdlrobson: "Kim: do you need help backporting this change?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia)
[23:22:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10lmata)
[23:24:44] <wikibugs>	 10SRE, 10Traffic, 10Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (10lmata)
[23:25:34] <wikibugs>	 10SRE, 10Incident Tooling: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata)