[00:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:49] <brett>	 :O
[00:08:34] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9798051 (10Eevans) The array has rebuilt, but I could swear I hear it ticking...  `lang=sh-session eevans@aqs1013:~$ sudo mdadm --detail /dev/md2  /dev/md2:            Version : 1.2      Creation Time : Thu...
[00:13:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P62397 and previous config saved to /var/cache/conftool/dbconfig/20240515-001352-ladsgroup.json
[00:15:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:29:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T352010)', diff saved to https://phabricator.wikimedia.org/P62398 and previous config saved to /var/cache/conftool/dbconfig/20240515-002900-ladsgroup.json
[00:29:03] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[00:29:05] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:29:17] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[00:29:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T352010)', diff saved to https://phabricator.wikimedia.org/P62399 and previous config saved to /var/cache/conftool/dbconfig/20240515-002923-ladsgroup.json
[01:13:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:33:38] <icinga-wm>	 RECOVERY - Disk space on mw1445 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[02:38:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:38] <icinga-wm>	 PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 9305 MB (2% inode=99%): /tmp 9305 MB (2% inode=99%): /var/tmp 9305 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[03:03:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:02] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:03:28] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:04:25] <wikibugs>	 (03PS2) 10Cwhite: wmerrors: add config and code to copy stats to dogstatsd [puppet] - 10https://gerrit.wikimedia.org/r/1017078 (https://phabricator.wikimedia.org/T356814)
[03:18:12] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:18:34] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:18:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on mw1423:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1423 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:23:36] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:23:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on mw1423:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1423 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:24:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:38] <icinga-wm>	 RECOVERY - Disk space on mw1445 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[04:08:06] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:09:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 5.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:26:45] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr1-magru.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[04:26:46] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[04:31:45] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr1-magru.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[04:31:46] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device asw1-b4-magru.mgmt.magru.wmnet recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[05:33:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T0600)
[06:05:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:22] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364948 (10phaultfinder) 03NEW
[06:08:28] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364948#9798282 (10phaultfinder)
[06:10:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:12:25] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T364948#9798284 (10phaultfinder)
[06:41:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] parsoid/testing: Enable profile::auto_restarts::service for envoy [puppet] - 10https://gerrit.wikimedia.org/r/1028793 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[06:42:57] <wikibugs>	 (03PS1) 10KartikMistry: Enable Content/Section translation in io, nds, nds-nl and, mwl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031758 (https://phabricator.wikimedia.org/T354666)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T0700). Please do the needful.
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:04:20] <logmsgbot>	 !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[07:04:27] <logmsgbot>	 !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:04:45] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[07:04:51] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[07:07:45] <wikibugs>	 (03PS1) 10Muehlenhoff: mx: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145)
[07:08:35] <wikibugs>	 (03PS1) 10Muehlenhoff: vrts: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145)
[07:12:08] <wikibugs>	 (03PS1) 10Fabfur: cache:benthos: switch to production topic names [puppet] - 10https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117)
[07:13:05] <wikibugs>	 (03CR) 10LSobanski: mx: Stop ignoring errors from alias sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff)
[07:14:36] <kart_>	 Sorry for joining late. I'll self-deploy
[07:14:53] <kart_>	 jouncebot now
[07:14:53] <jouncebot>	 For the next 0 hour(s) and 45 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T0700)
[07:17:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031758 (https://phabricator.wikimedia.org/T354666) (owner: 10KartikMistry)
[07:17:48] <wikibugs>	 (03CR) 10Muehlenhoff: mx: Stop ignoring errors from alias sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff)
[07:18:08] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Content/Section translation in io, nds, nds-nl and, mwl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031758 (https://phabricator.wikimedia.org/T354666) (owner: 10KartikMistry)
[07:18:42] <wikibugs>	 (03PS2) 10Muehlenhoff: mx: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145)
[07:19:07] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]]
[07:19:07] <wikibugs>	 (03PS2) 10Muehlenhoff: vrts: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145)
[07:19:12] <stashbot>	 T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[07:19:35] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d1-codfw
[07:19:35] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-d1-codfw
[07:19:47] <wikibugs>	 (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur)
[07:20:11] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device lsw1-d1-codfw
[07:20:12] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device lsw1-d1-codfw
[07:21:31] <wikibugs>	 (03CR) 10Fabfur: [V:03+1 C:04-2] "Do not merge until ready to switch to production topics" [puppet] - 10https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur)
[07:21:51] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:23:56] <wikibugs>	 (03CR) 10David Caro: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[07:26:36] <wikibugs>	 (03PS3) 10David Caro: openstack_apis: use a higher value for rgw [alerts] - 10https://gerrit.wikimedia.org/r/1031494
[07:26:36] <wikibugs>	 (03CR) 10David Caro: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[07:26:45] <wikibugs>	 (03CR) 10David Caro: openstack_apis: use a higher value for rgw (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[07:29:06] <wikibugs>	 (03CR) 10David Caro: [C:03+2] openstack_apis: use a higher value for rgw [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[07:30:20] <wikibugs>	 (03Merged) 10jenkins-bot: openstack_apis: use a higher value for rgw [alerts] - 10https://gerrit.wikimedia.org/r/1031494 (owner: 10David Caro)
[07:30:33] <logmsgbot>	 !log kartik@deploy1002 Sync cancelled.
[07:31:23] <kart_>	 ah. wrong keypress :/
[07:31:28] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]]
[07:31:32] <stashbot>	 T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[07:34:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1172.eqiad.wmnet
[07:34:08] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:35:32] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db1172 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031802 (https://phabricator.wikimedia.org/T349619)
[07:36:36] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[07:37:08] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[07:37:17] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[07:38:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: "No real reason except that it isn't needed AFAIK, I'll put it back though since it isn't really a relevant change and merge." [puppet] - 10https://gerrit.wikimedia.org/r/1031462 (owner: 10Filippo Giunchedi)
[07:38:43] <moritzm>	 !log installing curl security updates
[07:38:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:20] <wikibugs>	 (03PS1) 10KartikMistry: Section Translation: Fix nds-nl language code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031803
[07:39:39] <wikibugs>	 (03CR) 10LSobanski: [C:03+1] mx: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff)
[07:39:48] <wikibugs>	 (03CR) 10LSobanski: [C:03+1] vrts: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff)
[07:43:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:44:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db1172 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031802 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[07:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:49:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1172.eqiad.wmnet
[07:49:35] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] (duration: 18m 06s)
[07:49:38] <stashbot>	 T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[07:50:42] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Looks good, and PCC shows a NOOP. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1031429 (owner: 10Muehlenhoff)
[07:52:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1177.eqiad.wmnet
[07:53:05] <kart_>	 I would like to deploy another quick fix (wrong language code) for the previous patch.
[07:53:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db1177 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031804 (https://phabricator.wikimedia.org/T349619)
[07:55:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031803 (owner: 10KartikMistry)
[07:55:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "Thank you Scott for the review" [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi)
[07:56:01] <wikibugs>	 (03Merged) 10jenkins-bot: Section Translation: Fix nds-nl language code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031803 (owner: 10KartikMistry)
[07:56:31] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:1031803|Section Translation: Fix nds-nl language code]]
[07:58:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db1177 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031804 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[07:59:16] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1031803|Section Translation: Fix nds-nl language code]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:00:05] <jouncebot>	 hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T0800)
[08:00:15] <andre>	 o/
[08:00:32] <wikibugs>	 (03PS3) 10Filippo Giunchedi: utils: use HEAD for get_config7.sh [puppet] - 10https://gerrit.wikimedia.org/r/1031462
[08:00:32] <wikibugs>	 (03PS4) 10Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - 10https://gerrit.wikimedia.org/r/1031463
[08:00:32] <wikibugs>	 (03PS4) 10Filippo Giunchedi: zookeeper: add Bookworm compat [puppet] - 10https://gerrit.wikimedia.org/r/1031465
[08:00:36] <wikibugs>	 (03PS6) 10Ayounsi: Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325)
[08:01:26] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[08:01:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] utils: use HEAD for get_config7.sh [puppet] - 10https://gerrit.wikimedia.org/r/1031462 (owner: 10Filippo Giunchedi)
[08:01:46] <hashar>	 andre: good morning. I am in the deployment google meet :)
[08:01:50] <kart_>	 andre: I'm finishing my deployment..
[08:02:10] <hashar>	 no worries kart_ , let us know when it has completed
[08:03:06] <kart_>	 sure
[08:03:52] <moritzm>	 !log installing nodejs security updates on buster
[08:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1177.eqiad.wmnet
[08:04:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: update chart to 3.0.7 / f3c883908e576 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030950 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[08:05:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[08:05:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[08:05:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: update aux values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030951 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[08:06:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] jaeger: update bitnami/common to 2.19.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030952 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[08:07:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T364299)', diff saved to https://phabricator.wikimedia.org/P62401 and previous config saved to /var/cache/conftool/dbconfig/20240515-080700-marostegui.json
[08:07:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Spicerack module for gNMI [software/spicerack] - 10https://gerrit.wikimedia.org/r/1015334 (https://phabricator.wikimedia.org/T344325) (owner: 10Ayounsi)
[08:07:06] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:07:37] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:13:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: jaeger: add back port names for otlp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031805 (https://phabricator.wikimedia.org/T364477)
[08:13:46] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1031803|Section Translation: Fix nds-nl language code]] (duration: 17m 14s)
[08:13:49] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:13:52] <kart_>	 hashar: done
[08:14:00] <hashar>	 excellent!
[08:14:01] <hashar>	 :)
[08:14:23] * hashar looks at logs
[08:15:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw
[08:15:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1178.eqiad.wmnet
[08:15:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] jaeger: add back port names for otlp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031805 (https://phabricator.wikimedia.org/T364477) (owner: 10Filippo Giunchedi)
[08:16:03] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli)
[08:16:10] <godog>	 the !log there from me actually did nothing, I pressed "n" at confirmation time
[08:16:29] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:16:46] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031806 (https://phabricator.wikimedia.org/T361399)
[08:16:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031806 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot)
[08:17:15] <logmsgbot>	 !log filippo@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:17:46] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031806 (https://phabricator.wikimedia.org/T361399) (owner: 10TrainBranchBot)
[08:18:52] <wikibugs>	 (03Merged) 10jenkins-bot: flink-kubernetes-operator: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1029573 (https://phabricator.wikimedia.org/T287491) (owner: 10Effie Mouzeli)
[08:18:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db1178 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031807 (https://phabricator.wikimedia.org/T349619)
[08:19:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T352010)', diff saved to https://phabricator.wikimedia.org/P62402 and previous config saved to /var/cache/conftool/dbconfig/20240515-081934-ladsgroup.json
[08:19:39] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[08:20:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[08:21:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad
[08:22:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P62403 and previous config saved to /var/cache/conftool/dbconfig/20240515-082209-marostegui.json
[08:22:45] <hashar>	 oh joy
[08:22:47] <hashar>	 httpbb fails
[08:22:50] <hashar>	 fun
[08:24:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db1178 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031807 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[08:25:25] <hashar>	 hmm
[08:26:06] <hashar>	 so the scap deployment failed again due to httpbb
[08:26:19] <hashar>	 mwdebug2002 yields a 503 error for one of the test pages
[08:26:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[08:26:51] <hashar>	 and looking at logstash for that host, php7.4-fpm sent an rsyslog message `[NOTICE] exiting, bye-bye!`
[08:27:04] <hashar>	 which is hmm.. confusing
[08:27:30] <hashar>	 oh that is the php fpm restart
[08:27:41] <hashar>	 but that also means that after restarting, the server is not immediately available
[08:27:50] <hashar>	 and we tend to ignore those timeout/503 errors in log spam
[08:29:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1178.eqiad.wmnet
[08:29:57] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:30:10] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:30:48] <moritzm>	 !log installing openjdk-17/jetty9 security updates on Bookworm
[08:30:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1192.eqiad.wmnet
[08:32:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db1192 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031808 (https://phabricator.wikimedia.org/T349619)
[08:34:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P62404 and previous config saved to /var/cache/conftool/dbconfig/20240515-083443-ladsgroup.json
[08:35:41] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.5  refs T361399
[08:35:44] <stashbot>	 T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399
[08:37:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P62405 and previous config saved to /var/cache/conftool/dbconfig/20240515-083717-marostegui.json
[08:38:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:38:19] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:38:43] <wikibugs>	 (03PS1) 10JMeybohm: rdf-streaming-updater: Remove duplicate definition of k8s api-servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031810 (https://phabricator.wikimedia.org/T287491)
[08:40:18] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:40:36] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:42:21] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[08:45:47] <wikibugs>	 (PS1) JMeybohm: Remove kubernetesMasters definition from all wikikube values [deployment-charts] - https://gerrit.wikimedia.org/r/1031811 (https://phabricator.wikimedia.org/T287491)
[08:48:07] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/refinery@88ed505]: Regular analytics weekly train [analytics/refinery@88ed505e]
[08:48:43] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1192 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031808 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:49:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P62406 and previous config saved to /var/cache/conftool/dbconfig/20240515-084950-ladsgroup.json
[08:52:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T364299)', diff saved to https://phabricator.wikimedia.org/P62407 and previous config saved to /var/cache/conftool/dbconfig/20240515-085224-marostegui.json
[08:52:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[08:52:29] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:52:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1192.eqiad.wmnet
[08:52:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[08:52:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T364299)', diff saved to https://phabricator.wikimedia.org/P62408 and previous config saved to /var/cache/conftool/dbconfig/20240515-085247-marostegui.json
[08:58:39] <wikibugs>	 (PS1) Muehlenhoff: profile::kafka::broker: Drop support for non PKI configs [puppet] - https://gerrit.wikimedia.org/r/1031813
[08:58:44] <wikibugs>	 (PS1) Vgutierrez: pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257)
[09:00:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on seaborgium.wikimedia.org with reason: OS update
[09:01:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on seaborgium.wikimedia.org with reason: OS update
[09:01:15] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9798634 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=be009031-0cc0-4a4d-97a0-f4d990831efe) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with...
[09:02:30] <wikibugs>	 (PS1) JMeybohm: Remove kubestagetcd200[123] from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1031816 (https://phabricator.wikimedia.org/T363307)
[09:02:38] <wikibugs>	 (CR) CI reject: [V:-1] pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[09:02:46] <wikibugs>	 (PS1) Superpes15: [enwiki] Throttle exemption for Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1031817 (https://phabricator.wikimedia.org/T364708)
[09:02:48] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/refinery@88ed505]: Regular analytics weekly train [analytics/refinery@88ed505e] (duration: 14m 41s)
[09:03:51] <moritzm>	 !log upgrade seaborgium to bullseye T364823
[09:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:58] <stashbot>	 T364823: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823
[09:04:44] <wikibugs>	 (CR) JMeybohm: [V:+1 C:+2] "In my tests a couple of seconds (in staging). It's been a narrow race there and publish-sa-certs failed on 1 of 3 new masters." [puppet] - https://gerrit.wikimedia.org/r/1031507 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[09:04:57] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/refinery@88ed505] (thin): Regular analytics weekly train THIN [analytics/refinery@88ed505e]
[09:04:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T352010)', diff saved to https://phabricator.wikimedia.org/P62409 and previous config saved to /var/cache/conftool/dbconfig/20240515-090458-ladsgroup.json
[09:05:01] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[09:05:07] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:05:14] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[09:05:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T352010)', diff saved to https://phabricator.wikimedia.org/P62410 and previous config saved to /var/cache/conftool/dbconfig/20240515-090522-ladsgroup.json
[09:07:59] <wikibugs>	 (PS2) Vgutierrez: pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257)
[09:09:15] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/refinery@88ed505] (thin): Regular analytics weekly train THIN [analytics/refinery@88ed505e] (duration: 04m 17s)
[09:09:19] <wikibugs>	 (PS1) Fabfur: benthos:cache: removed unused fields from grok pattern [puppet] - https://gerrit.wikimedia.org/r/1031818 (https://phabricator.wikimedia.org/T358109)
[09:10:55] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.mysql.copy (exit_code=97) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[09:11:07] <logmsgbot>	 !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=kubestagemaster200[12].codfw.wmnet
[09:13:33] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[09:13:51] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:14:00] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[09:14:07] <wikibugs>	 (PS3) Vgutierrez: pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257)
[09:14:35] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[09:19:41] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[09:20:21] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.mysql.copy (exit_code=97) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[09:20:22] <wikibugs>	 (CR) Ladsgroup: [C:+1] "I think this is fine to merge!" [puppet] - https://gerrit.wikimedia.org/r/1031033 (https://phabricator.wikimedia.org/T362786) (owner: Scott French)
[09:22:02] <Dreamy_Jazz>	 !log Starting MediaModeration script on group2 wikis for a test
[09:22:03] <wikibugs>	 (PS1) Mvolz: Update Zotero to 2024-04-30-130428-production [deployment-charts] - https://gerrit.wikimedia.org/r/1031819 (https://phabricator.wikimedia.org/T350880)
[09:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:12] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1031813 (owner: Muehlenhoff)
[09:23:02] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:23:51] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:25:07] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[09:25:42] <icinga-wm>	 PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[09:26:59] <hashar>	 MediaWiki train looks good
[09:28:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org
[09:28:51] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:29:21] <wikibugs>	 (PS1) Effie Mouzeli: flink-kubernetes-operator: fix typos [deployment-charts] - https://gerrit.wikimedia.org/r/1031821
[09:32:15] <wikibugs>	 (PS2) Zabe: filtered_tables: Remove gu_salt [puppet] - https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435)
[09:32:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org
[09:33:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:51] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:34:12] <wikibugs>	 (CR) Ladsgroup: "I'd say drop it in prod first, don't worry about this here, it won't break things AFAIK" [puppet] - https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435) (owner: Zabe)
[09:35:44] <wikibugs>	 (CR) DCausse: [C:+1] flink-kubernetes-operator: fix typos [deployment-charts] - https://gerrit.wikimedia.org/r/1031821 (owner: Effie Mouzeli)
[09:36:25] <wikibugs>	 (CR) JMeybohm: [C:+1] flink-kubernetes-operator: fix typos [deployment-charts] - https://gerrit.wikimedia.org/r/1031821 (owner: Effie Mouzeli)
[09:36:38] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] flink-kubernetes-operator: fix typos [deployment-charts] - https://gerrit.wikimedia.org/r/1031821 (owner: Effie Mouzeli)
[09:36:46] <wikibugs>	 (CR) JMeybohm: [C:-1] "Chart version bump" [deployment-charts] - https://gerrit.wikimedia.org/r/1031821 (owner: Effie Mouzeli)
[09:37:53] <wikibugs>	 (PS1) Jelto: gitlab: bump exporter version to v1.0.3 [puppet] - https://gerrit.wikimedia.org/r/1031822 (https://phabricator.wikimedia.org/T354656)
[09:38:02] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service kubestagemaster2001:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:34] <wikibugs>	 (PS1) Effie Mouzeli: flink-kubernetes-operator: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1031823
[09:39:33] <wikibugs>	 (Merged) jenkins-bot: flink-kubernetes-operator: fix typos [deployment-charts] - https://gerrit.wikimedia.org/r/1031821 (owner: Effie Mouzeli)
[09:40:13] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2451/co" [puppet] - https://gerrit.wikimedia.org/r/1031822 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[09:40:26] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/refinery@88ed505] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@88ed505e]
[09:41:48] <wikibugs>	 SRE, Infrastructure-Foundations, LDAP: Upgrade r/w LDAP servers to Bullseye - https://phabricator.wikimedia.org/T364823#9798753 (MoritzMuehlenhoff) Open→Resolved Both production LDAP r/w servers have been migrated to Bullseye.
[09:42:31] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] flink-kubernetes-operator: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1031823 (owner: Effie Mouzeli)
[09:42:49] <wikibugs>	 (PS1) DCausse: rdf-streaming-updater: cleanup duplicated network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1031824
[09:42:51] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab: bump exporter version to v1.0.3 [puppet] - https://gerrit.wikimedia.org/r/1031822 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[09:43:14] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagemaster[2001-2002].codfw.wmnet
[09:43:19] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/refinery@88ed505] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@88ed505e] (duration: 02m 53s)
[09:43:56] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove obsolete certs [puppet] - https://gerrit.wikimedia.org/r/1031451 (owner: Muehlenhoff)
[09:44:55] <wikibugs>	 (Merged) jenkins-bot: flink-kubernetes-operator: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1031823 (owner: Effie Mouzeli)
[09:47:20] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:47:33] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:48:51] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ldap in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:49:20] <wikibugs>	 (CR) DCausse: [C:+2] rdf-streaming-updater: cleanup duplicated network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1031824 (owner: DCausse)
[09:49:45] <claime>	 !log Manually relaunching mediawiki_job_update_special_pages_s5.service 
[09:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:59] <wikibugs>	 (Merged) jenkins-bot: rdf-streaming-updater: cleanup duplicated network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1031824 (owner: DCausse)
[09:50:03] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[09:50:30] <wikibugs>	 (PS1) JMeybohm: Decom kubestagemaster200[12] [puppet] - https://gerrit.wikimedia.org/r/1031825 (https://phabricator.wikimedia.org/T363307)
[09:52:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[09:53:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster[2001-2002].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[09:53:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:53:48] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagemaster[2001-2002].codfw.wmnet
[09:53:54] <wikibugs>	 (CR) JMeybohm: [C:+2] Decom kubestagemaster200[12] [puppet] - https://gerrit.wikimedia.org/r/1031825 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[09:54:11] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:54:44] <wikibugs>	 (PS1) Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - https://gerrit.wikimedia.org/r/1031826
[09:54:47] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:57:06] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:57:13] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[09:58:43] <wikibugs>	 (PS1) Vgutierrez: hiera: Enable IPIP encapsulation on upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257)
[09:59:20] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[09:59:30] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1000)
[10:00:20] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[10:02:13] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[10:02:24] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[10:03:31] <wikibugs>	 (CR) Vgutierrez: [C:-2] "do not merge till tcp-mss-clamper is ready" [puppet] - https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[10:06:00] <logmsgbot>	 !log btullis@deploy1002 Started deploy [airflow-dags/analytics_test@ecf603d]: (no justification provided)
[10:06:11] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [airflow-dags/analytics_test@ecf603d]: (no justification provided) (duration: 00m 11s)
[10:06:23] <logmsgbot>	 !log btullis@deploy1002 Started deploy [airflow-dags/analytics@ecf603d]: (no justification provided)
[10:06:45] <wikibugs>	 (CR) Fabfur: [C:+1] "+1 for me" [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[10:06:54] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@ecf603d]: (no justification provided) (duration: 00m 30s)
[10:08:42] <wikibugs>	 ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364948#9798853 (phaultfinder)
[10:09:34] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:09:51] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:09:51] <wikibugs>	 (PS2) JMeybohm: Remove kubestagetcd200[123] from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1031816 (https://phabricator.wikimedia.org/T363307)
[10:12:40] <wikibugs>	 ops-codfw, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T364948#9798897 (phaultfinder)
[10:15:12] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[10:15:31] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:20:04] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[10:21:15] <wikibugs>	 (PS2) DCausse: rdf-streaming-updater: Remove duplicate definition of k8s and zk [deployment-charts] - https://gerrit.wikimedia.org/r/1031810 (https://phabricator.wikimedia.org/T287491) (owner: JMeybohm)
[10:22:10] <wikibugs>	 (CR) DCausse: [C:+1] "testing on staging showed that these policies are indeed no longer needed" [deployment-charts] - https://gerrit.wikimedia.org/r/1031810 (https://phabricator.wikimedia.org/T287491) (owner: JMeybohm)
[10:26:47] <wikibugs>	 (PS5) Slyngshede: Build Bitu contain image using Blubber. [software/bitu] - https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318)
[10:27:00] <wikibugs>	 (CR) Ladsgroup: configure parsercache servers via dbconfig in etcd (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1031583 (owner: Scott French)
[10:27:30] <wikibugs>	 (CR) Slyngshede: "This still needs to be hooked up to the build pipeline, but that will happen in another CR." [software/bitu] - https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) (owner: Slyngshede)
[10:27:51] <wikibugs>	 (PS1) Alexandros Kosiaris: preseed for kafka-main10(0[6789]|10) [puppet] - https://gerrit.wikimedia.org/r/1031832 (https://phabricator.wikimedia.org/T363212)
[10:28:00] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:28:25] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:29:00] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device cloudsw1-e4-eqiad
[10:31:12] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.network.tls (exit_code=99) for network device cloudsw1-e4-eqiad
[10:32:10] <wikibugs>	 (PS1) Zabe: Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031483 (https://phabricator.wikimedia.org/T364974)
[10:32:30] <wikibugs>	 (PS1) Zabe: Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031484 (https://phabricator.wikimedia.org/T364974)
[10:32:32] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:32:43] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:32:47] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] preseed for kafka-main10(0[6789]|10) [puppet] - https://gerrit.wikimedia.org/r/1031832 (https://phabricator.wikimedia.org/T363212) (owner: Alexandros Kosiaris)
[10:32:58] <wikibugs>	 (CR) Slyngshede: [C:+2] Configuration for disabling signup. [software/bitu] - https://gerrit.wikimedia.org/r/1030891 (owner: Slyngshede)
[10:34:20] <wikibugs>	 (Merged) jenkins-bot: Configuration for disabling signup. [software/bitu] - https://gerrit.wikimedia.org/r/1030891 (owner: Slyngshede)
[10:34:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9799039 (akosiaris) >>! In T363212#9797805, @Jclark-ctr wrote: > @akosiaris   could you please update preseed.yaml file?  Done. Note t...
[10:37:51] <Lucas_WMDE>	 jouncebot: now
[10:37:51] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1000)
[10:38:20] <Lucas_WMDE>	 zabe, Amir1: should we just deploy the FlaggedRevs fix now? (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/1031484 and wmf.5)
[10:39:02] <zabe>	 sure
[10:40:31] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[10:40:45] <Amir1>	 go ahead and thanks for the fix
[10:40:47] <wikibugs>	 (CR) Zabe: [C:+2] Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031484 (https://phabricator.wikimedia.org/T364974) (owner: Zabe)
[10:40:49] <Lucas_WMDE>	 just checking if I can reproduce the issue at the moment
[10:40:51] <wikibugs>	 (CR) Zabe: [C:+2] Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031483 (https://phabricator.wikimedia.org/T364974) (owner: Zabe)
[10:40:58] <Lucas_WMDE>	 yup
[10:41:05] <claime>	 Lucas_WMDE: Thanks for the fix <3
[10:41:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031484 (https://phabricator.wikimedia.org/T364974) (owner: Zabe)
[10:41:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031483 (https://phabricator.wikimedia.org/T364974) (owner: Zabe)
[10:41:17] <Lucas_WMDE>	 np :)
[10:44:44] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Clarify totoro.wikimedia.org test [puppet] - https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880)
[10:44:57] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Clarify totoro.wikimedia.org test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: Lucas Werkmeister (WMDE))
[10:45:56] <wikibugs>	 (PS1) Slyngshede: P:ganeti Prometheus monitoring of ganeti services. [puppet] - https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694)
[10:47:23] <wikibugs>	 (CR) Clément Goubert: [C:+1] Clarify totoro.wikimedia.org test [puppet] - https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: Lucas Werkmeister (WMDE))
[10:49:36] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:49:39] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:50:04] <wikibugs>	 (Merged) jenkins-bot: Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031484 (https://phabricator.wikimedia.org/T364974) (owner: Zabe)
[10:50:06] <wikibugs>	 (Merged) jenkins-bot: Fix capitalization of Subquery [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031483 (https://phabricator.wikimedia.org/T364974) (owner: Zabe)
[10:50:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1031484|Fix capitalization of Subquery (T364974)]], [[gerrit:1031483|Fix capitalization of Subquery (T364974)]]
[10:50:44] <stashbot>	 T364974: mediawiki_job_update_special_pages crashes with Error: Class 'Wikimedia\Rdbms\SubQuery' not found - https://phabricator.wikimedia.org/T364974
[10:52:09] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release flink-operator/flink-operator on k8s-staging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-staging&var-namespace=flink-operator - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:52:41] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm
[10:52:57] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, netops: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799115 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm
[10:53:21] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 zabe and lucaswerkmeister-wmde: Backport for [[gerrit:1031484|Fix capitalization of Subquery (T364974)]], [[gerrit:1031483|Fix capitalization of Subquery (T364974)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:53:33] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Zookeeper: New options for using firewall::service [puppet] - https://gerrit.wikimedia.org/r/1031429 (owner: Muehlenhoff)
[10:53:49] <logmsgbot>	 !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:53:51] <logmsgbot>	 !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:53:55] <Lucas_WMDE>	 oop, different error
[10:54:33] <logmsgbot>	 !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[10:54:36] <logmsgbot>	 !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[10:54:49] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt1041: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1031836 (https://phabricator.wikimedia.org/T319184)
[10:55:00] <Lucas_WMDE>	 pasted the error at https://phabricator.wikimedia.org/T364974#9799122
[10:56:43] <Lucas_WMDE>	 at this point I’m tempted to say let’s just revert the SQB migration there
[10:56:45] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: cloudvirt1042: move to modern single NIC setup [puppet] - https://gerrit.wikimedia.org/r/1031839 (https://phabricator.wikimedia.org/T319184)
[10:59:07] <taavi>	 reverting at least in wmf.5 seems reasonable to me
[10:59:08] <wikibugs>	 (PS1) Muehlenhoff: an-test-druid: Use firewall::service for Zookeeper firewall config [puppet] - https://gerrit.wikimedia.org/r/1031842
[10:59:16] <taavi>	 i posted an untested patch to fix that specific error
[10:59:31] <Lucas_WMDE>	 I’m about to test a very similar patch
[10:59:51] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1031842 (owner: Muehlenhoff)
[11:00:04] <jouncebot>	 mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1100).
[11:00:39] <Lucas_WMDE>	 I’m still deploying, sorry mvolz :/
[11:00:53] <Lucas_WMDE>	 okay, now the script worked
[11:01:46] <Lucas_WMDE>	 taavi: do you prefer ...$timeCondition or ->andWhere( $timeCondition )?
[11:01:57] <Lucas_WMDE>	 in either case I’d say let’s deploy that rather than revert
[11:02:12] <Lucas_WMDE>	 since it then seems to work
[11:02:54] <taavi>	 Lucas_WMDE: I think I slightly prefer andWhere() since that'd work even if $timeCondition were changed to something other than an array
[11:03:01] <Lucas_WMDE>	 yeah, makes sense
[11:03:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Sync cancelled.
[11:03:43] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): backend: Fix Unknown column 'Array' in 'where clause' [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031485 (https://phabricator.wikimedia.org/T364974)
[11:03:54] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): backend: Fix Unknown column 'Array' in 'where clause' [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031846 (https://phabricator.wikimedia.org/T364974)
[11:04:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031485 (https://phabricator.wikimedia.org/T364974) (owner: Lucas Werkmeister (WMDE))
[11:04:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031846 (https://phabricator.wikimedia.org/T364974) (owner: Lucas Werkmeister (WMDE))
[11:05:36] <logmsgbot>	 !log aborrero@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1041.eqiad.wmnet with OS bookworm
[11:05:53] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799155 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with...
[11:06:00] <Lucas_WMDE>	 and what lesson do we learn from this? FlaggedRevs is cursed, avoid it like the plague
[11:06:18] <Lucas_WMDE>	 (or slightly more seriously, FlaggedRevs is severely undertested, so be extra careful when making changes to it)
[11:06:40] <Lucas_WMDE>	 hi mvolz! I’m deploying in your window because a backport took longer than expected, sorry :/
[11:06:58] <mvolz>	 Lucas_WMDE: no worries
[11:07:34] <mvolz>	 if you'd ping when you're done that'd be great
[11:07:41] <Lucas_WMDE>	 sure, can do
[11:08:49] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2452/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[11:09:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1193.eqiad.wmnet
[11:09:56] <Lucas_WMDE>	 heh, the backport will merge before the master change does, because the master change is chained behind a core change which has slower CI
[11:10:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db1193 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031843 (https://phabricator.wikimedia.org/T349619)
[11:10:59] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm
[11:11:11] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799171 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet...
[11:12:37] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323)
[11:12:38] <wikibugs>	 (03Merged) 10jenkins-bot: backend: Fix Unknown column 'Array' in 'where clause' [extensions/FlaggedRevs] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031485 (https://phabricator.wikimedia.org/T364974) (owner: 10Lucas Werkmeister (WMDE))
[11:12:40] <wikibugs>	 (03Merged) 10jenkins-bot: backend: Fix Unknown column 'Array' in 'where clause' [extensions/FlaggedRevs] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031846 (https://phabricator.wikimedia.org/T364974) (owner: 10Lucas Werkmeister (WMDE))
[11:13:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1031485|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]], [[gerrit:1031846|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]]
[11:13:18] <stashbot>	 T364974: mediawiki_job_update_special_pages crashes with Error: Class 'Wikimedia\Rdbms\SubQuery' not found - https://phabricator.wikimedia.org/T364974
[11:14:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db1193 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031843 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:15:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1031485|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]], [[gerrit:1031846|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:16:02] <Lucas_WMDE>	 testing the script on mwdebug2002 again…
[11:16:14] <Lucas_WMDE>	 works
[11:16:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync
[11:16:50] <wikibugs>	 (03PS2) 10Slyngshede: P:ganeti Prometheus monitoring of ganeti services. [puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694)
[11:18:12] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2454/console" [puppet] - 10https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[11:18:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1193.eqiad.wmnet
[11:24:50] <icinga-wm>	 PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[11:28:02] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudvirt1041: move to modern single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1031836 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[11:28:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1031485|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]], [[gerrit:1031846|backend: Fix Unknown column 'Array' in 'where clause' (T364974)]] (duration: 15m 36s)
[11:28:53] <stashbot>	 T364974: mediawiki_job_update_special_pages crashes with Error: Class 'Wikimedia\Rdbms\SubQuery' not found - https://phabricator.wikimedia.org/T364974
[11:28:56] * Lucas_WMDE done deploying
[11:28:58] <Lucas_WMDE>	 mvolz: all yours :)
[11:29:05] <mvolz>	 ty
[11:29:11] <Lucas_WMDE>	 claime: want to try starting the service again? (I doubt I have permission to do it ^^)
[11:29:15] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] Update Zotero to 2024-04-30-130428-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031819 (https://phabricator.wikimedia.org/T350880) (owner: 10Mvolz)
[11:29:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: openldap::rw
[11:30:03] <claime>	 Lucas_WMDE: Sure, I'll find a broken section with a reasonable runtime
[11:30:05] <wikibugs>	 (03Merged) 10jenkins-bot: Update Zotero to 2024-04-30-130428-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031819 (https://phabricator.wikimedia.org/T350880) (owner: 10Mvolz)
[11:30:16] <Lucas_WMDE>	 hehe
[11:30:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch openldap::rw to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031868 (https://phabricator.wikimedia.org/T349619)
[11:31:03] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[11:31:25] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[11:31:27] * Lucas_WMDE afk for lunch
[11:31:35] <claime>	 Which is none of them x)
[11:31:59] <claime>	 Well I'll relaunch s5, it seems to be the shortest
[11:32:44] <claime>	 The rest may have to wait until the next scheduled run during the night because I don't want to hammer the dbs
[11:32:44] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[11:33:14] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[11:33:49] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[11:34:03] <Lucas_WMDE>	 yeah, sounds fair
[11:34:21] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[11:39:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch openldap::rw to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031868 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:42:32] <wikibugs>	 (03PS4) 10Jsn.sherman: extension-list: Add AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034)
[11:42:32] <wikibugs>	 (03PS4) 10Jsn.sherman: InitialiseSettings.php: Add wmgUseAutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034)
[11:42:33] <wikibugs>	 (03PS4) 10Jsn.sherman: InitialiseSettings-labs.php: Deploy AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034)
[11:42:33] <wikibugs>	 (03PS5) 10Jsn.sherman: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034)
[11:44:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: openldap::rw
[11:45:54] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984 (10aborrero) 03NEW
[11:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:11] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9799282 (10aborrero)
[11:50:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1031872
[11:50:58] <icinga-wm>	 PROBLEM - snapshot of s7 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s7 at eqiad (db1171) taken on 2024-05-15 10:57:54 is 871 GiB, but the previous one was 1058 GiB, a change of -17.7 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
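The -17.7 % figure in the alert above follows from the two snapshot sizes it reports. A minimal sketch of that arithmetic, assuming the check compares the latest size against the previous one relative to the previous one (the function name `percent_change` is hypothetical, and the alerting threshold is not shown in the log, so none is assumed here):

```python
def percent_change(previous_gib, latest_gib):
    """Relative size change of the latest snapshot versus the previous one, in percent."""
    return (latest_gib - previous_gib) / previous_gib * 100.0


# Sizes taken from the alert text: previous 1058 GiB, latest 871 GiB.
change = percent_change(1058, 871)
print(f"{change:.1f} %")  # -17.7 %
```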
[11:52:35] <logmsgbot>	 !log aborrero@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1041.eqiad.wmnet with OS bookworm
[11:52:49] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9799293 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with...
[11:58:07] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9799298 (10aborrero) p:05Triage→03Medium hey @Jclark-ctr or @Jhancock.wm could you please advise / help with this server? thanks in advance.
[11:58:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1031872 (owner: 10Muehlenhoff)
[11:59:11] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9799310 (10MoritzMuehlenhoff)
[12:02:40] <wikibugs>	 (03PS15) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434)
[12:03:10] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9799319 (10aborrero) additional information: when reimaging the server, the debian installer failed, complaining about the volume group name being in use already.  To try to workarou...
[12:04:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031826 (owner: 10Muehlenhoff)
[12:05:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "Double-checked package names." [puppet] - 10https://gerrit.wikimedia.org/r/1031826 (owner: 10Muehlenhoff)
[12:05:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] profile::kafka::broker: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1031813 (owner: 10Muehlenhoff)
[12:06:07] <wikibugs>	 (03PS1) 10Clément Goubert: httpbb: Add tests for new redirects [puppet] - 10https://gerrit.wikimedia.org/r/1031874 (https://phabricator.wikimedia.org/T25216)
[12:08:52] <icinga-wm>	 PROBLEM - Etcd cluster health on kubestagetcd2001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[12:08:52] <icinga-wm>	 PROBLEM - Etcd cluster health on kubestagetcd2003 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[12:08:52] <icinga-wm>	 PROBLEM - Etcd cluster health on kubestagetcd2002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[12:10:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1031826 (owner: 10Muehlenhoff)
[12:10:17] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Looks good! Nice work :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[12:10:27] <wikibugs>	 (03PS1) 10Muehlenhoff: Undeploy openldap prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1031875
[12:11:08] <claime>	 Lucas_WMDE: Well it didn't crash on dewiki this time so I'm inclined to call this solved
[12:11:26] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031875 (owner: 10Muehlenhoff)
[12:13:02] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job kubetcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:14:23] <wikibugs>	 (03PS2) 10Muehlenhoff: Undeploy openldap prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1031875
[12:14:43] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[12:16:26] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031875 (owner: 10Muehlenhoff)
[12:16:43] <wikibugs>	 (03PS1) 10Majavah: P:openstack: neutron: add required control plane config for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373)
[12:18:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:openstack: neutron: add required control plane config for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah)
[12:18:44] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 5 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah)
[12:19:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1203.eqiad.wmnet
[12:19:49] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah)
[12:20:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db1203 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1031881 (https://phabricator.wikimedia.org/T349619)
[12:20:13] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master: Make etcd_urls optional [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307)
[12:20:23] <wikibugs>	 (03CR) 10Krinkle: "I believe, if the Varnish approach works out, this would not be needed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024932 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza)
[12:21:27] <wikibugs>	 (03PS3) 10Phuedx: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia)
[12:21:33] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Remove kubestagetcd200[123] from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1031816 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[12:22:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia)
[12:23:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on kubestagetcd[2001-2003].codfw.wmnet with reason: decom
[12:23:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on kubestagetcd[2001-2003].codfw.wmnet with reason: decom
[12:25:12] <wikibugs>	 (03CR) 10Phuedx: "Recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia)
[12:25:20] <Lucas_WMDE>	 claime: nice \o/
[12:26:47] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master: Make etcd_urls optional [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307)
[12:28:19] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2457/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[12:30:26] <wikibugs>	 (03CR) 10Phuedx: "The latest PS should have the effect that you want :) To test this locally, run composer buildConfigCache and check values in tests/data/c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia)
[12:31:27] <wikibugs>	 (03CR) 10Majavah: [C:03+1] httpbb: Add tests for new redirects [puppet] - 10https://gerrit.wikimedia.org/r/1031874 (https://phabricator.wikimedia.org/T25216) (owner: 10Clément Goubert)
[12:37:38] <wikibugs>	 (03PS3) 10JMeybohm: kubernetes::master: Make etcd_urls optional [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307)
[12:38:47] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[12:45:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[12:46:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagetcd[2001-2003].codfw.wmnet
[12:51:32] <wikibugs>	 (03PS1) 10DCausse: cirrus-streaming-updater: remove zk network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031892 (https://phabricator.wikimedia.org/T287491)
[12:51:36] <wikibugs>	 (03CR) 10Elukey: [C:03+1] kubernetes::master: Make etcd_urls optional [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[12:52:50] <wikibugs>	 (03PS2) 10DCausse: cirrus-streaming-updater: remove zk network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031892 (https://phabricator.wikimedia.org/T287491)
[12:53:58] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[12:54:00] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1 C:03+2] kubernetes::master: Make etcd_urls optional [puppet] - 10https://gerrit.wikimedia.org/r/1031882 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm)
[12:56:25] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagetcd[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[12:57:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:57:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagetcd[2001-2003].codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[12:57:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:57:58] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagetcd[2001-2003].codfw.wmnet
[12:58:26] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] "Ack on naming - LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1031762 (https://phabricator.wikimedia.org/T351117) (owner: 10Fabfur)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1300).
[13:00:04] <jouncebot>	 JSherman and Jdrewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:18] <wikibugs>	 (03PS2) 10JMeybohm: Remove kubernetesMasters definition from all wikikube values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031811 (https://phabricator.wikimedia.org/T287491)
[13:00:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:00:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:00:45] <JSherman>	 Roan has agreed to pair with me on my patches
[13:01:10] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[13:01:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[13:02:09] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release flink-operator/flink-operator on k8s-staging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-staging&var-namespace=flink-operator - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:02:14] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS bullseye
[13:02:29] <Lucas_WMDE>	 JSherman: I was looking at your patches too
[13:02:37] <jan_drewniak>	 0/ 
[13:02:38] <Lucas_WMDE>	 (but happy to let RoanKattouw take the lead and deploy)
[13:02:53] <Lucas_WMDE>	 some of the “how to new extension” docs on wikitech seem quite outdated :S
[13:03:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:03:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:03:30] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[13:04:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[13:06:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman)
[13:07:13] <vgutierrez>	 !log uploaded golang-github-florianl-go-tc 0.4.4-0.20240511074908-d584238bf6cb to apt.wm.o (bookworm-wikimedia)
[13:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:20] <wikibugs>	 (03Merged) 10jenkins-bot: extension-list: Add AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026972 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman)
[13:09:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:09:40] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:09:51] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1026972|extension-list: Add AutoModerator (T364034)]]
[13:09:56] <stashbot>	 T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034
[13:10:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:10:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:11:12] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:11:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:11:35] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:11:41] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:12:22] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:12:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:14:58] <wikibugs>	 (03CR) 10Hashar: [C:03+1] Clarify totoro.wikimedia.org test [puppet] - 10https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: 10Lucas Werkmeister (WMDE))
[13:15:21] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:15:21] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:16:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove auto restarts for containerd/docker [puppet] - 10https://gerrit.wikimedia.org/r/1031899 (https://phabricator.wikimedia.org/T364979)
[13:16:37] <wikibugs>	 (03CR) 10Elukey: [C:03+1] profile::kafka::broker: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1031813 (owner: 10Muehlenhoff)
[13:16:46] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:16:47] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:17:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:17:40] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.reboot_sanitaria (exit_code=99) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:18:02] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job kubetcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:18:37] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:18:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:19:14] <wikibugs>	 (03PS1) 10Brouberol: admin_ng: decommision the flink-operator on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031900 (https://phabricator.wikimedia.org/T365010)
[13:19:23] <wikibugs>	 (03PS3) 10JMeybohm: Remove kubernetesMasters definition from all wikikube values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031811 (https://phabricator.wikimedia.org/T287491)
[13:19:23] <wikibugs>	 (03PS1) 10JMeybohm: Remove kubernetesMasters definition from staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031901 (https://phabricator.wikimedia.org/T287491)
[13:21:36] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove auto restarts for containerd/docker [puppet] - 10https://gerrit.wikimedia.org/r/1031899 (https://phabricator.wikimedia.org/T364979)
[13:22:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage
[13:23:02] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Remove kubernetesMasters definition from staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031901 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm)
[13:24:13] <jan_drewniak>	 JSherman: Hi, let me know when you're done deploying your patches
[13:24:57] <JSherman>	 jan_drewniak: will do; still waiting on the extension list build steps
[13:25:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage
[13:25:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:25:41] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.reboot_sanitaria (exit_code=99) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:26:14] <jan_drewniak>	 JSherman: ok, gotcha
[13:26:21] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:26:23] <wikibugs>	 (03CR) 10Ssingh: "Looks good, thanks! One minor nit in-line; feel free to fix later/ignore." [puppet] - 10https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:26:29] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] admin_ng: decommision the flink-operator on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031900 (https://phabricator.wikimedia.org/T365010) (owner: 10Brouberol)
[13:26:42] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.mysql.reboot_sanitaria (exit_code=97) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:27:15] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:27:17] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:27:27] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.reboot_sanitaria Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:27:29] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.reboot_sanitaria (exit_code=0) Will restart a pool of Sanitarium MariaDB instances and/or hosts.
[13:29:43] <wikibugs>	 (PS1) Cathal Mooney: Enable gNMI / gRPC on cloudsw [homer/public] - https://gerrit.wikimedia.org/r/1031904 (https://phabricator.wikimedia.org/T365012)
[13:30:53] <JSherman>	 the docker_pull_k8s step has been taking forever, but we're at 85% now
[13:31:01] <wikibugs>	 (CR) Hashar: [C:+1] Remove auto restarts for containerd/docker [puppet] - https://gerrit.wikimedia.org/r/1031899 (https://phabricator.wikimedia.org/T364979) (owner: Muehlenhoff)
[13:31:02] <wikibugs>	 (CR) Brouberol: [C:+2] admin_ng: decommision the flink-operator on dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1031900 (https://phabricator.wikimedia.org/T365010) (owner: Brouberol)
[13:31:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove auto restarts for containerd/docker [puppet] - https://gerrit.wikimedia.org/r/1031899 (https://phabricator.wikimedia.org/T364979) (owner: Muehlenhoff)
[13:32:08] <wikibugs>	 (PS1) Cathal Mooney: Add cloudsw to list of roles we enable gnmic telemtry for [puppet] - https://gerrit.wikimedia.org/r/1031905 (https://phabricator.wikimedia.org/T365012)
[13:32:35] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:32:46] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:33:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:19] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] P:ganeti Prometheus monitoring of ganeti services. [puppet] - https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[13:34:21] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=thanos-fe1001.eqiad.wmnet
[13:34:44] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1203 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031881 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[13:34:46] <elukey>	 !log depool thanos-fe1001 and move envoy to PKI TLS cert
[13:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:06] <wikibugs>	 (CR) Elukey: [C:+2] Move Swift on thanos-fe1001 to PKI TLS cert (7 comments) [puppet] - https://gerrit.wikimedia.org/r/1031439 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[13:37:16] <wikibugs>	 (PS4) Vgutierrez: pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257)
[13:38:31] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[13:39:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1203.eqiad.wmnet
[13:40:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1209.eqiad.wmnet
[13:40:27] <JSherman>	 we're on scap-cdb-rebuild
[13:40:31] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thanos-fe1001.eqiad.wmnet
[13:40:38] <wikibugs>	 (CR) Eevans: [C:+2] echostore: update cluster hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1030175 (owner: Eevans)
[13:41:19] <moritzm>	 !log installing libpgjava security updates
[13:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:29] <wikibugs>	 (Merged) jenkins-bot: echostore: update cluster hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1030175 (owner: Eevans)
[13:41:48] <wikibugs>	 (CR) Ayounsi: [C:-1] "I was wondering why I didn't do it sooner, but remembered why." [homer/public] - https://gerrit.wikimedia.org/r/1031904 (https://phabricator.wikimedia.org/T365012) (owner: Cathal Mooney)
[13:42:00] <elukey>	 brouberol: o/ saw https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1030175 passing by, this can move to the new awesome calico netpolicies right?
[13:42:09] <wikibugs>	 (CR) Ssingh: [C:+1] pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[13:42:21] <wikibugs>	 (PS1) Muehlenhoff: Switch db1209 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031927 (https://phabricator.wikimedia.org/T349619)
[13:42:26] <logmsgbot>	 !log jsn@deploy1002 jsn: Backport for [[gerrit:1026972|extension-list: Add AutoModerator (T364034)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:42:29] <stashbot>	 T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034
[13:42:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:43:00] <brouberol>	 elukey indeed this is a prime candidate
[13:43:11] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1209 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031927 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[13:43:17] <brouberol>	 but we need to expose the restbase host IPs to k8s via puppet's global_config manifest first
[13:43:38] <logmsgbot>	 !log jsn@deploy1002 jsn: Continuing with sync
[13:44:01] <brouberol>	 oh wait, I'm seeing port 9042, so that smells like cassandra, which is already exposed
[13:44:28] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply
[13:44:31] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 93 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:44:32] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[13:44:34] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[13:44:42] <elukey>	 brouberol: it is cassandra yes
[13:44:48] <brouberol>	 root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get svc -n external-services | grep rest
[13:44:48] <brouberol>	 cassandra-restbase-a-codfw                            ClusterIP   None         <none>        9042/TCP                     16d
[13:44:48] <brouberol>	 cassandra-restbase-a-eqiad                            ClusterIP   None         <none>        9042/TCP                     16d
[13:44:48] <brouberol>	 cassandra-restbase-b-codfw                            ClusterIP   None         <none>        9042/TCP                     16d
[13:44:48] <brouberol>	 cassandra-restbase-b-eqiad                            ClusterIP   None         <none>        9042/TCP                     16d
[13:44:49] <brouberol>	 cassandra-restbase-c-codfw                            ClusterIP   None         <none>        9042/TCP                     16d
[13:44:49] <brouberol>	 cassandra-restbase-c-eqiad                            ClusterIP   None         <none>        9042/TCP                     16d
[13:44:52] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[13:44:58] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[13:45:27] <logmsgbot>	 !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply
[13:45:49] <brouberol>	 so, theoretically, all you have to do is specify external_services.cassandra: [cassandra-restbase-a-codfw, cassandra-restbase-a-eqiad, cassandra-restbase-b-codfw, cassandra-restbase-b-eqiad, cassandra-restbase-c-codfw, cassandra-restbase-c-eqiad]  
[13:46:02] <wikibugs>	 (CR) TChin: [C:+2] "Self-merging now :) Anything else we can fix later" [deployment-charts] - https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[13:46:09] <brouberol>	 (with a nested dict and not a dot, but I formatted it this way because IRC)
[13:46:21] <elukey>	 brouberol: ack thanks!
[13:46:28] <brouberol>	 that diff will be nice :D
[13:46:32] <brouberol>	 yw
[13:46:45] <wikibugs>	 (CR) Elukey: "Same comment that Hugh made - let's use the new external-services-networkpolicies, so we can drop all IPs and let puppet populate the rela" [deployment-charts] - https://gerrit.wikimedia.org/r/1030175 (owner: Eevans)
[13:46:47] <wikibugs>	 (Merged) jenkins-bot: Add datasets-config helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[13:48:26] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply
[13:48:27] <icinga-wm>	 RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.01 ms
[13:48:29] <icinga-wm>	 PROBLEM - ps1-c2-codfw-infeed-load-tower-A-phase-X on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:48:29] <icinga-wm>	 PROBLEM - ps1-c2-codfw-infeed-load-tower-B-phase-Y on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:48:29] <icinga-wm>	 PROBLEM - ps1-c2-codfw-infeed-load-tower-B-phase-X on ps1-c2-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:48:32] <wikibugs>	 (CR) Cathal Mooney: Enable gNMI / gRPC on cloudsw (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1031904 (https://phabricator.wikimedia.org/T365012) (owner: Cathal Mooney)
[13:48:35] <icinga-wm>	 RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[13:48:35] <icinga-wm>	 RECOVERY - ps1-c2-codfw-infeed-load-tower-A-phase-X on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-A-phase-X 374 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:48:35] <icinga-wm>	 RECOVERY - ps1-c2-codfw-infeed-load-tower-B-phase-Y on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-B-phase-Y 282 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:48:35] <icinga-wm>	 RECOVERY - ps1-c2-codfw-infeed-load-tower-B-phase-X on ps1-c2-codfw is OK: SNMP OK - ps1-c2-codfw-infeed-load-tower-B-phase-X 277 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:48:37] <icinga-wm>	 RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.87 ms
[13:49:01] <icinga-wm>	 PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[13:49:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:49:10] <sukhe>	 hmmm
[13:49:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2006.codfw.wmnet with OS bullseye
[13:49:31] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 75 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:49:42] <wikibugs>	 (CR) TChin: [C:+2] Add datasets-config and datasets-config-next helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[13:49:50] <wikibugs>	 (CR) CI reject: [V:-1] Add datasets-config and datasets-config-next helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[13:49:54] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[13:50:04] <wikibugs>	 (PS11) TChin: Add datasets-config and datasets-config-next helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434)
[13:50:44] <wikibugs>	 (CR) TChin: [V:+2 C:+2] Add datasets-config and datasets-config-next helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[13:51:25] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: apply
[13:51:33] <wikibugs>	 (Merged) jenkins-bot: Add datasets-config and datasets-config-next helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[13:52:06] <brouberol>	 elukey: thanks for pointing this out. I'm really glad to see that this is getting traction :)
[13:52:14] <wikibugs>	 (CR) Muehlenhoff: "PCC output is outdated in terms of exported hosts, seaborgium was reimaged to Bookworm (like serpens yesterday( to Bullseye" [puppet] - https://gerrit.wikimedia.org/r/1031875 (owner: Muehlenhoff)
[13:52:17] <wikibugs>	 (CR) Hashar: [C:-1] "The two existing entries are there for legacy reasons. That was for mobile repositories which once were hosted on Gerrit and got migrated " [puppet] - https://gerrit.wikimedia.org/r/1029212 (https://phabricator.wikimedia.org/T333029) (owner: Addshore)
[13:52:37] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply
[13:53:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1209.eqiad.wmnet
[13:54:42] <wikibugs>	 (CR) Brouberol: "As I mentioned it to Elukey on IRC, this would entail having the following block in your values (wherever it makes sense)" [deployment-charts] - https://gerrit.wikimedia.org/r/1030175 (owner: Eevans)
[13:54:45] <moritzm>	 !log installing nghttp2 security updates
[13:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1211.eqiad.wmnet
[13:57:40] <wikibugs>	 (CR) Eevans: [C:+2] "Sorry, I only just noticed your comment.  I would definitely like to learn more!" [deployment-charts] - https://gerrit.wikimedia.org/r/1030175 (owner: Eevans)
[13:57:58] <wikibugs>	 (PS1) Muehlenhoff: Switch db1211 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031929 (https://phabricator.wikimedia.org/T349619)
[13:58:46] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1211 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031929 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1400)
[14:00:23] <logmsgbot>	 !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply
[14:00:33] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM. No alerts appear to be based on the exporter. Didn't check dashboard though." [puppet] - https://gerrit.wikimedia.org/r/1031875 (owner: Muehlenhoff)
[14:00:38] <hnowlan>	 jouncebot: nowandnext
[14:00:38] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1400)
[14:00:38] <jouncebot>	 In 2 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1700)
[14:01:00] <vgutierrez>	 !log disable puppet on A:lvs before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031814 - T357257
[14:01:01] <JSherman>	 we're running over on backport
[14:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:04] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[14:01:32] <JSherman>	 My first patch is nearly done, but it's been running for 50 minutes
[14:01:35] <Lucas_WMDE>	 JSherman: I guess it’s rebuilding the full l10n cache due to the new extension?
[14:01:36] <logmsgbot>	 !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1026972|extension-list: Add AutoModerator (T364034)]] (duration: 51m 44s)
[14:01:36] <RoanKattouw>	 We synced an extension-list change adding a new extension, which caused the i18n caches to be rebuilt, and deploying that apparently takes 50 minutes and counting
[14:01:39] <stashbot>	 T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034
[14:01:44] <Lucas_WMDE>	 jinx
[14:01:44] <RoanKattouw>	 Lucas_WMDE: Yes exactly
[14:01:46] <Lucas_WMDE>	 :/
[14:02:05] <wikibugs>	 (CR) Vgutierrez: [C:+2] pybal,wmflib: Allow toggling IPIP per site and svc [puppet] - https://gerrit.wikimedia.org/r/1031814 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[14:02:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye
[14:02:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS bullseye
[14:02:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye
[14:02:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye
[14:02:35] <JSherman>	 hnowlan: are you here for the wikifunction services deploy? I was hoping to finish this extension deploy
[14:02:47] <hnowlan>	 JSherman: nah I'd like to deploy restbase, but there's no rush 
[14:02:57] <jan_drewniak>	 I was also hoping to backport my changes :/ 
[14:02:59] <hnowlan>	 take your time 
[14:03:11] <RoanKattouw>	 Hopefully the next few changes will be faster
[14:03:13] <wikibugs>	 (CR) Cathal Mooney: [C:-1] "We can't do this yet as the devices in racks c8 and d5 eqiad do not fully support enabling gNMI via mgmt routing instance.  Need to upgrad" [puppet] - https://gerrit.wikimedia.org/r/1031905 (https://phabricator.wikimedia.org/T365012) (owner: Cathal Mooney)
[14:03:18] <JSherman>	 hnowlan: thanks!
[14:03:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1211.eqiad.wmnet
[14:04:23] <wikibugs>	 (CR) Muehlenhoff: P:ganeti Prometheus monitoring of ganeti services. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[14:04:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034) (owner: Jsn.sherman)
[14:05:37] <wikibugs>	 (CR) Muehlenhoff: "We have https://grafana.wikimedia.org/d/DnxQ26qmk/ldap?orgId=1, I'll delete it once the patch is deployed." [puppet] - https://gerrit.wikimedia.org/r/1031875 (owner: Muehlenhoff)
[14:05:38] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings.php: Add wmgUseAutoModerator [mediawiki-config] - https://gerrit.wikimedia.org/r/1026973 (https://phabricator.wikimedia.org/T364034) (owner: Jsn.sherman)
[14:05:49] <JSherman>	 jan_drewniak: I can deploy for you after I get through this; I have RoanKattouw pairing with me for training.
[14:06:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1214.eqiad.wmnet
[14:06:06] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1026973|InitialiseSettings.php: Add wmgUseAutoModerator (T364034)]]
[14:06:38] <jan_drewniak>	 JSherman:  yeah sure, that'd be great. don't worry about it taking so long (these things usually do)
[14:07:29] <wikibugs>	 (CR) Bking: [C:+1] cirrus: add alerts on fetch error rates [alerts] - https://gerrit.wikimedia.org/r/1031522 (https://phabricator.wikimedia.org/T364837) (owner: DCausse)
[14:07:50] <jan_drewniak>	 at least mine can all be done with one command now :) `scap backport 1031477 1031479 1031478`
[14:07:57] <wikibugs>	 (PS1) Muehlenhoff: Switch db1214 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031930 (https://phabricator.wikimedia.org/T349619)
[14:08:23] <claime>	 JSherman: for future deployments, you could probably have backported 1026973 1026974 and 1026975 together in one command
[14:08:36] <wikibugs>	 (CR) Hashar: [C:+1] "An alternative is to use the hostname in the configuration file and when `profile::ci::manager_host` changes on one of the hosts, restart " [puppet] - https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[14:08:50] <claime>	 Oh you haven't started with those yet
[14:08:59] <wikibugs>	 (PS1) Btullis: Remove kubernetesMasters definition from dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1031593 (https://phabricator.wikimedia.org/T287491)
[14:09:56] <logmsgbot>	 !log jsn@deploy1002 jsn: Backport for [[gerrit:1026973|InitialiseSettings.php: Add wmgUseAutoModerator (T364034)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:09:57] <claime>	 That would avoid 3 image rebuild and pulls and make it only one
[14:09:58] <vgutierrez>	 !log re-enable puppet on A:lvs - T357257
[14:10:00] <stashbot>	 T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034
[14:10:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[14:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:03] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[14:10:25] <logmsgbot>	 !log jsn@deploy1002 jsn: Continuing with sync
[14:11:28] <wikibugs>	 (PS1) DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069)
[14:11:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1214 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1031930 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:11:50] <JSherman>	 claime: I was just planning on doing a git pull for the labs-only changes
[14:11:55] <claime>	 fair enough
[14:14:12] <wikibugs>	 (CR) JHathaway: [C:+1] mx: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: Muehlenhoff)
[14:14:31] <RoanKattouw>	 claime: Thanks for the tip, I forgot that scap backport could do that
[14:14:41] <Lucas_WMDE>	 I think `scap backport` does that anyway if the changes only touch *-labs.php
[14:14:51] <RoanKattouw>	 Oh really, is it that smart?
[14:14:53] <Lucas_WMDE>	 (“does that” = only pulling)
[14:15:01] <Lucas_WMDE>	 yeah
[14:15:24] <Lucas_WMDE>	 I wasn’t initially fond of it, since it does mean outdated code on the other servers (even if it’s only files that should™ never be used)
[14:15:29] <Lucas_WMDE>	 but it’s at least a timesaver
[14:15:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1214.eqiad.wmnet
[14:15:51] <JSherman>	 Lucas_WMDE: ah thanks, I'll backport the remaining patches together
[14:16:10] <Lucas_WMDE>	 (if you backport all three it’ll still be a full deploy because of InitialiseSettings.php, but that’s fine)
[14:16:23] <RoanKattouw>	 We're already running the InitialiseSettings.php now
[14:16:29] <Lucas_WMDE>	 ok
[14:16:33] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:17:21] <vgutierrez>	 !log uploaded tcp-mss-clamper 0.5.1 to bullseye-wikimedia (apt.wm.o) - T357257
[14:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:24] <RoanKattouw>	 Damn I keep learning about new `scap backport` smartness every time I do a deploy
[14:17:26] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[14:17:31] <JSherman>	 Lucas_WMDE: after this, can I backport my remaining two changes + jan's in one swoop?
[14:18:02] <Lucas_WMDE>	 I’d just to your two remaining changes together
[14:18:11] <Lucas_WMDE>	 then you can see that (I think) scap backport will do the smart thing ^^
[14:18:15] <wikibugs>	 (CR) CI reject: [V:-1] wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[14:18:25] <Lucas_WMDE>	 and then maybe jan’s three changes together
[14:18:53] <JSherman>	 ack
[14:19:02] <Lucas_WMDE>	 (oops, “I’d just do” → “to” ^^)
[14:19:22] <wikibugs>	 (CR) Vgutierrez: hiera: Enable IPIP encapsulation on upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) (owner: Vgutierrez)
[14:19:31] <wikibugs>	 (CR) Muehlenhoff: [C:+2] mx: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031760 (https://phabricator.wikimedia.org/T284145) (owner: Muehlenhoff)
[14:19:35] <wikibugs>	 (PS2) Vgutierrez: hiera: Enable IPIP encapsulation on upload@ulsfo [puppet] - https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257)
[14:19:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage
[14:20:15] <logmsgbot>	 !log fab@deploy1002 Started deploy [airflow-dags/research@ecf603d]: (no justification provided)
[14:20:48] <logmsgbot>	 !log fab@deploy1002 Finished deploy [airflow-dags/research@ecf603d]: (no justification provided) (duration: 00m 32s)
[14:20:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage
[14:22:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2007.codfw.wmnet with reason: host reimage
[14:22:12] <JSherman>	 Lucas_WMDE: would +2ing jan's patches ahead of time save time for scap?
[14:22:21] <Lucas_WMDE>	 yes, that’s a good idea
[14:22:27] <Lucas_WMDE>	 Vector CI probably takes a while
[14:22:34] <wikibugs>	 (PS2) DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069)
[14:22:48] <wikibugs>	 (CR) Jsn.sherman: [C:+2] [Follow-up] Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031477 (https://phabricator.wikimedia.org/T363861) (owner: Jdlrobson)
[14:22:51] <logmsgbot>	 !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1026973|InitialiseSettings.php: Add wmgUseAutoModerator (T364034)]] (duration: 16m 44s)
[14:22:55] <stashbot>	 T364034: Deploy the AutoModerator extension to Beta Cluster - https://phabricator.wikimedia.org/T364034
[14:23:14] <wikibugs>	 (CR) Jsn.sherman: [C:+2] Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: Jdlrobson)
[14:23:40] <wikibugs>	 (CR) Jsn.sherman: [C:+2] Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031478 (https://phabricator.wikimedia.org/T363814) (owner: Jdlrobson)
[14:24:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage
[14:24:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034) (owner: Jsn.sherman)
[14:24:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) (owner: Jsn.sherman)
[14:24:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage
[14:25:17] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings-labs.php: Deploy AutoModerator [mediawiki-config] - https://gerrit.wikimedia.org/r/1026974 (https://phabricator.wikimedia.org/T364034) (owner: Jsn.sherman)
[14:25:21] <wikibugs>	 (Merged) jenkins-bot: CommonSettings-labs: Load AutoModerator extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1026975 (https://phabricator.wikimedia.org/T364034) (owner: Jsn.sherman)
[14:25:59] <wikibugs>	 (PS1) TChin: Add datasets-config values files for dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434)
[14:26:05] <wikibugs>	 (CR) CI reject: [V:-1] wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[14:26:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage
[14:27:18] <JSherman>	 okay, we're merged, going to do jan_drewniak's patches now
[14:27:22] <wikibugs>	 (PS1) Vgutierrez: "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257)
[14:27:41] <wikibugs>	 (PS2) Vgutierrez: "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257)
[14:27:51] <wikibugs>	 (PS3) Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257)
[14:27:55] <wikibugs>	 (PS4) Vgutierrez: depool upload@ulsfo before enabling IPIP encapsulation [dns] - https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257)
[14:28:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jsn@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031477 (https://phabricator.wikimedia.org/T363861) (owner: Jdlrobson)
[14:28:08] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jsn@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: Jdlrobson)
[14:28:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jsn@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1031478 (https://phabricator.wikimedia.org/T363814) (owner: Jdlrobson)
[14:28:44] <JSherman>	 zuul says we have about a 15 minute wait
[14:28:51] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:31:02] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:32:28] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] depool upload@ulsfo before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1031917 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:32:41] <vgutierrez>	 !log depool upload@ulsfo before enabling IPIP encapsulation - T357257
[14:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:45] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[14:33:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "mx: Stop ignoring errors from alias sync" [puppet] - 10https://gerrit.wikimedia.org/r/1031942 (https://phabricator.wikimedia.org/T284145)
[14:36:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Revert "mx: Stop ignoring errors from alias sync" [puppet] - 10https://gerrit.wikimedia.org/r/1031942 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff)
[14:37:24] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP on upload@ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257)
[14:37:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:38:02] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:26] <claime>	 !log Removing downtime on mw2286.codfw.wmnet - T364863
[14:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:30] <stashbot>	 T364863: InterfaceSpeedError - mw2286 - https://phabricator.wikimedia.org/T364863
[14:38:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for mw2286.codfw.wmnet
[14:38:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mw2286.codfw.wmnet
[14:39:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:39:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2007.codfw.wmnet with OS bullseye
[14:39:37] <claime>	 !log Repooling mw2286.codfw.wmnet - T364863
[14:39:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:41:19] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] Add datasets-config values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[14:41:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:41:54] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2008.codfw.wmnet with OS bullseye
[14:41:56] <wikibugs>	 (03CR) 10Gmodena: [C:04-1] Add datasets-config values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[14:42:17] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:43:02] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:43:36] <wikibugs>	 (03CR) 10Gmodena: "nit: deployment-charts is a multi-project repo. Could you specify the subsystem you are touching in your commit message?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[14:43:51] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:30] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[14:44:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2010.codfw.wmnet with OS bullseye
[14:45:19] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Enable IPIP on upload@ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257)
[14:45:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete wmflabs certs [puppet] - 10https://gerrit.wikimedia.org/r/1031947
[14:45:55] <wikibugs>	 (03Merged) 10jenkins-bot: [Follow-up] Override VE overlays in night-mode [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031477 (https://phabricator.wikimedia.org/T363861) (owner: 10Jdlrobson)
[14:46:33] <wikibugs>	 (03Merged) 10jenkins-bot: Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031479 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson)
[14:46:51] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:47:37] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 75 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:47:52] <wikibugs>	 (03Merged) 10jenkins-bot: Mark night mode as a valid beta feature [skins/Vector] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1031478 (https://phabricator.wikimedia.org/T363814) (owner: 10Jdlrobson)
[14:48:27] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1031477|[Follow-up] Override VE overlays in night-mode (T363861)]], [[gerrit:1031479|Mark night mode as a valid beta feature (T363814)]], [[gerrit:1031478|Mark night mode as a valid beta feature (T363814)]]
[14:48:33] <stashbot>	 T363861: Visual Editor overlays do not work in night theme - https://phabricator.wikimedia.org/T363861
[14:48:34] <stashbot>	 T363814: Release dark mode as a beta feature on desktop (May 15th)  - https://phabricator.wikimedia.org/T363814
[14:48:43] <wikibugs>	 (03PS2) 10TChin: datasets-config: add values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434)
[14:48:55] <wikibugs>	 (03CR) 10Majavah: [C:03+1] Remove obsolete wmflabs certs [puppet] - 10https://gerrit.wikimedia.org/r/1031947 (owner: 10Muehlenhoff)
[14:48:59] <wikibugs>	 (03CR) 10TChin: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[14:50:07] <wikibugs>	 (03CR) 10Majavah: [C:03+1] Remove obsolete wmflabs certs (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031947 (owner: 10Muehlenhoff)
[14:51:08] <logmsgbot>	 !log jsn@deploy1002 jsn and jdlrobson: Backport for [[gerrit:1031477|[Follow-up] Override VE overlays in night-mode (T363861)]], [[gerrit:1031479|Mark night mode as a valid beta feature (T363814)]], [[gerrit:1031478|Mark night mode as a valid beta feature (T363814)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:51:41] <vgutierrez>	 !log disable puppet on A:lvs before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031827 - T357257
[14:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:46] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[14:52:04] <wikibugs>	 (03PS3) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069)
[14:52:21] <JSherman>	 jan_drewniak: your patches are ready for testing
[14:52:32] <jan_drewniak>	 JSherman: ok one sec
[14:53:07] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP encapsulation on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031827 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:53:46] <jan_drewniak>	 JSherman: ok, good to sync
[14:53:55] <JSherman>	 syncing
[14:53:56] <logmsgbot>	 !log jsn@deploy1002 jsn and jdlrobson: Continuing with sync
[14:55:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse)
[14:57:40] <vgutierrez>	 !log re-enable puppet on A:lvs - T357257
[14:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:45] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[14:58:29] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on upload@ulsfo cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1031944 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[14:58:51] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:10] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] datasets-config: add values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[15:05:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye
[15:06:54] <logmsgbot>	 !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1031477|[Follow-up] Override VE overlays in night-mode (T363861)]], [[gerrit:1031479|Mark night mode as a valid beta feature (T363814)]], [[gerrit:1031478|Mark night mode as a valid beta feature (T363814)]] (duration: 18m 26s)
[15:07:01] <stashbot>	 T363861: Visual Editor overlays do not work in night theme - https://phabricator.wikimedia.org/T363861
[15:07:01] <stashbot>	 T363814: Release dark mode as a beta feature on desktop (May 15th)  - https://phabricator.wikimedia.org/T363814
[15:07:19] <JSherman>	 jan_drewniak: you should be good to go!
[15:08:10] <jan_drewniak>	 JSherman: thank you! sorry for making you stick around so long :P hopefully it won't usually take this long.
[15:09:35] <wikibugs>	 (03PS1) 10CDanis: otelcol: Use the version tag with the v prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031951 (https://phabricator.wikimedia.org/T364907)
[15:09:37] <wikibugs>	 (03PS1) 10CDanis: otelcol: attempt to fix service name confusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407)
[15:09:39] <wikibugs>	 (03PS1) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953
[15:10:03] <wikibugs>	 (03PS1) 10C. Scott Ananian: [ParserCache] Preserve information from the JsonException when logging failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031918 (https://phabricator.wikimedia.org/T365036)
[15:10:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[15:10:46] <JSherman>	 jan_drewniak: no worries, it was that i18n cache rebuild on the new extension that ate up most of the time
[15:12:36] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: Use the version tag with the v prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031951 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis)
[15:14:22] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: Use the version tag with the v prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031951 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis)
[15:14:37] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: disable rp_filter on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031955 (https://phabricator.wikimedia.org/T357257)
[15:15:57] <wikibugs>	 (03CR) 10CDanis: "I've tested this locally, and it seems to "work" inasmuch it doesn't do anything I don't expect.  But I haven't yet managed to locally rep" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[15:16:09] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2463/co" [puppet] - 10https://gerrit.wikimedia.org/r/1031955 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[15:16:23] <wikibugs>	 (03CR) 10JMeybohm: sre.hosts.rename: initial commit (7 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi)
[15:16:49] <wikibugs>	 (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: disable rp_filter on upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1031955 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[15:18:33] <wikibugs>	 (03CR) 10JMeybohm: sre.hosts.rename: initial commit (1 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi)
[15:20:55] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:21:11] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:21:46] <wikibugs>	 (03PS2) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953
[15:22:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[15:23:06] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] otelcol: attempt to fix service name confusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[15:24:39] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 104 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:24:48] <claime>	 jouncebot: nowandnext
[15:24:48] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 35 minute(s)
[15:24:49] <jouncebot>	 In 1 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1700)
[15:25:03] <icinga-wm>	 PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[15:25:23] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[15:25:46] <vgutierrez>	 !log rolling restart of pybal on lvs4010 and lvs4009 - T357257
[15:25:48] <hnowlan>	 claime: mind if I do a restbase deploy? 
[15:25:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:51] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[15:26:02] <claime>	 hnowlan: Nah, I can wait until after you do it, np
[15:26:03] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:26:05] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:26:05] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: attempt to fix service name confusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[15:26:10] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@92abb6a]: Deploying new wikis T360304 T360311 T363244 T363250 T363257 T363264 T363271
[15:26:12] <claime>	 it'll just sit there for a minute
[15:26:25] <stashbot>	 T360304: Add kuswiki to RESTBase - https://phabricator.wikimedia.org/T360304
[15:26:26] <stashbot>	 T360311: Add bewwiki to RESTBase - https://phabricator.wikimedia.org/T360311
[15:26:26] <stashbot>	 T363244: Add kawikisource to RESTBase - https://phabricator.wikimedia.org/T363244
[15:26:26] <stashbot>	 T363250: Post-creation work for mswikisource - https://phabricator.wikimedia.org/T363250
[15:26:27] <stashbot>	 T363257: Add kaawiktionary to RESTBase - https://phabricator.wikimedia.org/T363257
[15:26:27] <stashbot>	 T363264: Add iglwiki to RESTBase - https://phabricator.wikimedia.org/T363264
[15:26:28] <stashbot>	 T363271: Add mywikisource to RESTBase - https://phabricator.wikimedia.org/T363271
[15:26:31] <cdanis>	 claime: thanks for the +1
[15:26:45] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:26:49] <wikibugs>	 (03Merged) 10jenkins-bot: mw-on-k8s: Bump maxUnavailable to 6% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031844 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[15:26:58] <claime>	 cdanis: figured you'd want to test this quickly, and since it's borked anyways
[15:27:30] <cdanis>	 claime: oh if I didn't get a +1 soon I was just going to deploy it by hand without merging 😅
[15:27:36] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: attempt to fix service name confusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031952 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[15:27:37] <claime>	 lol
[15:28:37] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[15:28:42] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[15:29:37] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 80 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:29:52] <wikibugs>	 (03CR) 10TChin: [C:03+2] datasets-config: add values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[15:29:55] <wikibugs>	 (03PS1) 10CDanis: otelcol: reference the changed transformprocessor name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031956 (https://phabricator.wikimedia.org/T363407)
[15:30:03] <wikibugs>	 (03PS3) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953
[15:30:03] <wikibugs>	 (03PS1) 10David Caro: openstack: use bobcat for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957
[15:30:05] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: reference the changed transformprocessor name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031956 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[15:30:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[15:30:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack: use bobcat for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 (owner: 10David Caro)
[15:30:50] <wikibugs>	 (03Merged) 10jenkins-bot: datasets-config: add values files for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031938 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin)
[15:31:13] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: reference the changed transformprocessor name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031956 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis)
[15:31:40] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[15:31:43] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1377 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:31:53] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[15:32:42] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[15:32:42] <wikibugs>	 (03PS2) 10David Caro: openstack: use bobcat/bookworm for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957
[15:32:42] <wikibugs>	 (03PS4) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953
[15:32:44] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[15:33:02] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:33:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[15:33:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack: use bobcat/bookworm for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 (owner: 10David Caro)
[15:33:59] <wikibugs>	 (03PS1) 10CDanis: I got YAMLed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031959
[15:34:19] <wikibugs>	 (03CR) 10CDanis: [C:03+2] I got YAMLed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031959 (owner: 10CDanis)
[15:34:57] * Lucas_WMDE resets “days since last YAML accident” to 0
[15:35:32] <wikibugs>	 (03Merged) 10jenkins-bot: I got YAMLed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031959 (owner: 10CDanis)
[15:35:41] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[15:35:41] <claime>	 " This file only contains whitespace changes. Modify the whitespace setting to see the changes. "
[15:35:43] <claime>	 yeah
[15:35:49] <claime>	 yaml
[15:35:51] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[15:36:29] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply
[15:36:34] <logmsgbot>	 !log cdanis@deploy1002 helmfile [staging] DONE helmfile.d/services/opentelemetry-collector: apply
[15:36:59] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/services/opentelemetry-collector: apply
[15:37:18] <logmsgbot>	 !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/opentelemetry-collector: apply
[15:38:16] <cdanis>	 sorry for the noise
[15:38:40] <James_F>	 I blame scap.
[15:38:55] <claime>	 narrator: He didn't use scap
[15:39:05] <James_F>	 I blame it for not doing helm stuff.
[15:39:07] <wikibugs>	 (03PS1) 10Vgutierrez: Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031919
[15:40:29] <wikibugs>	 (03PS3) 10David Caro: openstack: use bobcat/supported os for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957
[15:40:29] <wikibugs>	 (03PS5) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953
[15:40:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[15:41:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack: use bobcat/supported os for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 (owner: 10David Caro)
[15:41:39] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 99 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:41:47] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031919 (owner: 10Vgutierrez)
[15:42:07] <wikibugs>	 (03CR) 10BBlack: [C:03+1] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031919 (owner: 10Vgutierrez)
[15:42:15] <wikibugs>	 (03PS1) 10Kosta Harlan: AbuseFilterHooks: Provide feature flags for AF custom actions [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110)
[15:43:03] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@92abb6a]: Deploying new wikis T360304 T360311 T363244 T363250 T363257 T363264 T363271 (duration: 16m 52s)
[15:43:14] <stashbot>	 T360304: Add kuswiki to RESTBase - https://phabricator.wikimedia.org/T360304
[15:43:15] <stashbot>	 T360311: Add bewwiki to RESTBase - https://phabricator.wikimedia.org/T360311
[15:43:15] <stashbot>	 T363244: Add kawikisource to RESTBase - https://phabricator.wikimedia.org/T363244
[15:43:16] <stashbot>	 T363250: Post-creation work for mswikisource - https://phabricator.wikimedia.org/T363250
[15:43:16] <stashbot>	 T363257: Add kaawiktionary to RESTBase - https://phabricator.wikimedia.org/T363257
[15:43:17] <stashbot>	 T363264: Add iglwiki to RESTBase - https://phabricator.wikimedia.org/T363264
[15:43:17] <stashbot>	 T363271: Add mywikisource to RESTBase - https://phabricator.wikimedia.org/T363271
[15:43:19] <claime>	 James_F: We do use it for helmfile stuff for mw-on-k8s, but I don't know if we want it to be used for all helmfile things
[15:43:30] <James_F>	 Ack.
[15:43:43] <James_F>	 It's just helmfile is so much a black-box compared to scap's wrapper.
[15:43:49] <claime>	 'cause well, it'd be a wrapper around helmfile, that's a wrapper around helm, that's a wrapper around kube yaml
[15:43:56] <James_F>	 Fair.
[15:44:02] <James_F>	 But progress meters!
[15:44:04] <claime>	 lol
[15:45:08] <cdanis>	 James_F: yeahhhhhh
[15:45:25] <James_F>	 `helmfile apply` … and wait minutes to see if you maybe just broke stuff.
[15:46:15] <sukhe>	 what's up with gerrit
[15:46:22] <claime>	 was about to ask
[15:46:28] * sukhe looks
[15:46:30] <Daimona>	 +1
[15:46:39] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 79 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:47:01] <claime>	 hnowlan: all good for me to proceed with my deployment?
[15:47:08] <claime>	 (I don't need no gerrit :p)
[15:47:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:47:34] <sukhe>	 hmm I see nothing wrong on the dashboard though
[15:47:35] <sukhe>	 oh hello
[15:47:52] <hnowlan>	 claime: yep, thanks! 
[15:47:55] <cdanis>	 sukhe: I see no data points on the dashboard for the past few minutes
[15:48:02] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:48:03] <sukhe>	 something is up for sure yeah
[15:49:34] <logmsgbot>	 !log cgoubert@deploy1002 Started scap: mw-on-k8s: Bump maxUnavailable to 6% - T362323
[15:49:35] <sukhe>	 weird
[15:49:40] <stashbot>	 T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323
[15:50:38] <sukhe>	 The last Puppet run was at Fri Apr 26 17:27:53 UTC 2024 (19 minutes ago). 
[15:50:59] <logmsgbot>	 !log cgoubert@deploy1002 Finished scap: mw-on-k8s: Bump maxUnavailable to 6% - T362323 (duration: 02m 01s)
[15:53:02] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:53:08] <Dreamy_Jazz>	 Is gerrit down?
[15:53:12] <sukhe>	 yeah it is 
[15:53:16] <sukhe>	 looking into it
[15:53:20] <Dreamy_Jazz>	 Thanks
[15:54:02] <claime>	 hashar is filing a task for it btw
[15:54:08] <wikibugs>	 (03PS4) 10David Caro: openstack: use bobcat/supported os for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957
[15:54:09] <wikibugs>	 (03PS6) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953
[15:54:10] <sukhe>	 The last Puppet run was at Tue Apr 30 07:18:32 UTC 2024 (0 minutes ago).  
[15:54:13] <sukhe>	 fun
[15:54:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] AbuseFilterHooks: Provide feature flags for AF custom actions [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan)
[15:54:56] <Dreamy_Jazz>	 Seems back up for me now
[15:55:11] <claime>	 https://phabricator.wikimedia.org/T365041
[15:55:15] <logmsgbot>	 !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply
[15:55:16] <mutante>	 it's back
[15:55:19] <sukhe>	 thanks claime 
[15:55:24] <vgutierrez>	 same here 
[15:55:25] <sukhe>	 mutante: did you restart?
[15:55:30] <mutante>	 I was about to restart the service but then did not.. after I saw discussion in -releng
[15:55:39] <sukhe>	 I see. so do we know what happened?
[15:55:41] <mutante>	 and then hashar made https://phabricator.wikimedia.org/T365041#9800848
[15:55:42] <wikibugs>	 (03CR) 10Dreamy Jazz: "recheck" [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: 10Kosta Harlan)
[15:56:09] <sukhe>	 the system time is correct but the last puppet runs are certainly not
[15:56:37] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] "Interesting, I wonder who the original owner of this is? I'd like to understand if Data Products can also mine this." [puppet] - 10https://gerrit.wikimedia.org/r/1031416 (https://phabricator.wikimedia.org/T364820) (owner: 10Btullis)
[15:56:40] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/opentelemetry-collector: apply
[15:56:54] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/opentelemetry-collector: apply
[15:57:26] <sukhe>	 so I guess two unanswered questions:
[15:57:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:57:35] <hashar>	 hi, Gerrit had some issue between 15:42 and 15:55 it is recovering
[15:57:41] <hashar>	 filed as v
[15:57:42] <sukhe>	 hashar: hi
[15:57:46] <hashar>	 T365041
[15:57:46] <stashbot>	 T365041: Gerrit not reachable over HTTPS - https://phabricator.wikimedia.org/T365041
[15:58:02] <jinxer-wm>	 FIRING: [6x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:58:08] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@ulsfo before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1031919 (owner: 10Vgutierrez)
[15:58:13] <sukhe>	 so two things though, we didn't get paged for it (maybe we should) and the puppet agent run motd is not correct, even though the system time is
[15:58:17] <sukhe>	 just as an fyi for what we know so far
[15:58:20] <hashar>	 and please don't blindly restart services :)
[15:58:32] <sukhe>	 hashar: ok, nothing was restarted here
[15:58:36] <vgutierrez>	 !log repool upload@ulsfo with IPIP encapsulation enabled - T357257
[15:58:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:40] <stashbot>	 T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[15:59:30] <wikibugs>	 (03CR) 10David Caro: openstack::bobcat: apply cloud yaml patch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[16:00:17] <mutante>	 sukhe: The last Puppet run was at Fri Apr 26 17:27:53 UTC 2024 (19 minutes ago). 
[16:00:24] <mutante>	 what the heck
[16:00:42] <sukhe>	 yeah. the system time is correct though
[16:00:55] <claime>	 That's fun
[16:01:14] <sukhe>	      Active: inactive (dead) since Wed 2024-05-15 15:48:20 UTC; 12min ago
[16:01:18] <sukhe>	      Loaded: loaded (/lib/systemd/system/puppet-agent-timer.service; static)
[16:01:39] <mutante>	 no failed units though
[16:01:43] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1377 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:02:27] <mutante>	 runs the same command that this service uses to run puppet
[16:04:43] <mutante>	 well, that finished but changed nothing.. then let's see the command that builds the motd from the snippets
[16:06:15] <hashar>	 so that Gerrit issue was transient. It has self recovered and I have closed the task
[16:06:48] <hashar>	 !log Gerrit was briefly unreachable between 15:42 and 15:55 UTC | T365041
[16:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:54] <stashbot>	 T365041: Gerrit not reachable over HTTPS - https://phabricator.wikimedia.org/T365041
[16:07:10] <sukhe>	 that doesn't explain the wrong puppet run timer though but we will look at that independently
[16:07:49] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datasets-config - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:11:39] <wikibugs>	 (03PS1) 10Superpes15: [pswiki] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031963 (https://phabricator.wikimedia.org/T360851)
[16:12:54] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1031842 (owner: 10Muehlenhoff)
[16:17:17] <mutante>	 so regarding the not-updating MOTD mystery:
[16:17:32] <mutante>	 when I manually run "run-parts /etc/update-motd.d" I get the right parts:
[16:17:35] <mutante>	 The last Puppet run was at Wed May 15 16:02:53 UTC 2024 (13 minutes ago). 
[16:17:46] <mutante>	 and other bullseye hosts don't have this issue
[16:24:26] <mutante>	 !log gerrit1003 - MOTD wasn't updating anymore but manual "run-parts /etc/update-motd.d" showed updated data - while /run/motd.dynamic was outdated. fixed by manually renaming /run/motd.dynamic.new to /run/motd.dynamic and logging in because it's triggered by PAM.. but .. why 
[16:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:22] <mutante>	 same issue on the other gerrit server.. so it's the puppet role? what
[16:27:38] <sukhe>	 so far just these two hosts yep
[16:28:05] <wikibugs>	 (03CR) 10Aklapper: [C:03+2] "Applies cleanly locally, and from a quick random test seems correct too. :D Again thanks a lot for your patience!" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: 10Pppery)
[16:28:30] <wikibugs>	 (03Abandoned) 10Brennen Bearnes: WIP: gitlab: enable agent server for kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/767249 (https://phabricator.wikimedia.org/T283894) (owner: 10Brennen Bearnes)
[16:28:51] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Undo qqq.json overwrites [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/975438 (https://phabricator.wikimedia.org/T351363) (owner: 10Pppery)
[16:30:16] <mutante>	 also not a permission issue on those files in /run 
[16:30:50] <mutante>	 even #debian is confused because motd became way too complex .. when it used to be a simple file to edit :)
[16:31:04] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Use correct wikiids parameter name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031966
[16:31:46] <wikibugs>	 (03CR) 10Peter Fischer: [C:03+2] cirrus: Use correct wikiids parameter name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031966 (owner: 10Ebernhardson)
[16:31:53] <mutante>	 !log gerrit2002 - mv /run/motd.dynamic.new /run/motd.dynamic
[16:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:37] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Use correct wikiids parameter name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031966 (owner: 10Ebernhardson)
[16:33:32] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
[16:33:37] <sukhe>	 thanks mutante! 
[16:34:10] <logmsgbot>	 !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply
[16:34:39] <mutante>	 sukhe: yw, so it seems like it works normally again. it creates a new "motd.dynamic.new" file on login
[16:34:52] <sukhe>	 the mystery remains on why just gerrit but I guess ... :)
[16:37:52] <logmsgbot>	 !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[16:37:59] <logmsgbot>	 !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[16:39:07] <mutante>	 I confirmed the relevant line is in /etc/pam.d/login and /etc/pam.d/sshd as normal
[16:40:39] <logmsgbot>	 !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply
[16:47:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T352010)', diff saved to https://phabricator.wikimedia.org/P62416 and previous config saved to /var/cache/conftool/dbconfig/20240515-164713-ladsgroup.json
[16:47:21] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[16:48:23] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543 (owner: 10Ryan Kemper)
[16:50:46] <logmsgbot>	 !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply
[16:51:54] <wikibugs>	 06SRE, 10Scap, 06serviceops-radar, 13Patch-For-Review: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#9801162 (10hashar) I had a similar issue while deploying the train this morning. One of the httpbb test failed due to mwdebug2002...
[16:59:07] <wikibugs>	 07Puppet, 06SRE: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870#9801220 (10SMMPakPanel) Its all-encompassing strategy for social media marketing in Pakistan makes [[ https://smmpakpanel.com/ | SMM Pak Panel ]] unique. It helps businesses efficiently improve their we...
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1700)
[17:01:55] <mutante>	 deleted phab spam
[17:02:22] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P62417 and previous config saved to /var/cache/conftool/dbconfig/20240515-170221-ladsgroup.json
[17:02:50] <wikibugs>	 07Puppet, 06SRE: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870#9801243 (10hashar)
[17:07:17] <TheresNoTime>	 phab spam on my humorous (?) task, how sad
[17:07:55] <wikibugs>	 (03PS1) 10Eevans: Add user xcollazo to cassandra-staging-devs group [puppet] - 10https://gerrit.wikimedia.org/r/1031976 (https://phabricator.wikimedia.org/T364588)
[17:08:33] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9801254 (10Eevans)
[17:10:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9801259 (10VRiley-WMF) a:03VRiley-WMF
[17:15:57] <wikibugs>	 (03PS4) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069)
[17:17:09] <wikibugs>	 (03PS5) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069)
[17:17:17] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1007.eqiad.wmnet with OS bullseye
[17:17:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P62418 and previous config saved to /var/cache/conftool/dbconfig/20240515-171729-ladsgroup.json
[17:17:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9801312 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye
[17:20:16] <wikibugs>	 (03PS6) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069)
[17:21:37] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons.
[17:22:19] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons.
[17:24:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse)
[17:26:45] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 13), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9801327 (10Scott_French) Hi @SGupta-WMF and @mforns - Any updates on the timel...
[17:28:48] <logmsgbot>	 !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply
[17:32:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T352010)', diff saved to https://phabricator.wikimedia.org/P62419 and previous config saved to /var/cache/conftool/dbconfig/20240515-173236-ladsgroup.json
[17:32:39] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[17:32:41] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:32:52] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[17:33:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T352010)', diff saved to https://phabricator.wikimedia.org/P62420 and previous config saved to /var/cache/conftool/dbconfig/20240515-173259-ladsgroup.json
[17:33:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:34:23] <sukhe>	 ^ has anyone looked at this and why it alerts so frequently? it's a significant contributor to AlertFatigue ratio
[17:35:10] <mutante>	 sukhe: I made https://phabricator.wikimedia.org/T364931 yesterday for that
[17:35:33] <sukhe>	 wow thank you
[17:35:47] <sukhe>	 adding my .1 cents :)
[17:36:31] <wikibugs>	 (03PS7) 10Andrew Bogott: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[17:36:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[17:37:44] <wikibugs>	 (03CR) 10Andrew Bogott: "latest version uses openstack::patch for consistency.  I tested the patch application on bookworm and it worked with fuzz 1." [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[17:37:46] <sukhe>	 mutante: the real lesson here is to not care about alerts that are not paging I guess :)
[17:38:11] <icinga-wm>	 PROBLEM - MediaWiki CirrusSearch update rate - codfw on alert1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[17:39:01] <wikibugs>	 (03CR) 10Dzahn: contint: set new default docker version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[17:40:05] <logmsgbot>	 !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply
[17:41:58] <wikibugs>	 (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[17:43:54] <wikibugs>	 (03PS8) 10Andrew Bogott: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[17:43:54] <wikibugs>	 (03PS1) 10Andrew Bogott: Update cinder_backup_spec to test with bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031981
[17:44:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[17:44:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update cinder_backup_spec to test with bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031981 (owner: 10Andrew Bogott)
[17:46:05] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet
[17:49:17] <wikibugs>	 (03CR) 10Scott French: "Thanks, Filippo!" [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi)
[18:00:05] <jouncebot>	 hashar and andre: Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T1800)
[18:01:25] <wikibugs>	 (03PS1) 10CDanis: Revert "otelcol: reference the changed transformprocessor name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031924
[18:01:31] <wikibugs>	 (03PS2) 10CDanis: Revert "otelcol: reference the changed transformprocessor name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031924
[18:01:37] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Revert "otelcol: reference the changed transformprocessor name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031924 (owner: 10CDanis)
[18:02:29] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "otelcol: reference the changed transformprocessor name" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031924 (owner: 10CDanis)
[18:03:28] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1007.eqiad.wmnet with OS bullseye
[18:03:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9801460 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye executed...
[18:06:57] <wikibugs>	 (03PS1) 10Andrew Bogott: Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985
[18:08:15] <wikibugs>	 (03PS2) 10Andrew Bogott: Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985
[18:08:38] <wikibugs>	 (03Abandoned) 10Andrew Bogott: Update cinder_backup_spec to test with bobcat [puppet] - 10https://gerrit.wikimedia.org/r/1031981 (owner: 10Andrew Bogott)
[18:09:53] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031985 (owner: 10Andrew Bogott)
[18:10:03] <wikibugs>	 (03Abandoned) 10BCornwall: testing, please ignore [dns] - 10https://gerrit.wikimedia.org/r/1031071 (owner: 10BCornwall)
[18:10:11] <wikibugs>	 (03PS1) 10BCornwall: [ncmonitor] Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1031991
[18:10:35] <wikibugs>	 (03Abandoned) 10BCornwall: [ncmonitor] Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1031991 (owner: 10BCornwall)
[18:11:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985 (owner: 10Andrew Bogott)
[18:11:59] <wikibugs>	 (03PS3) 10Andrew Bogott: Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985
[18:12:05] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1031985 (owner: 10Andrew Bogott)
[18:13:56] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons.
[18:18:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Rip out code for cinder-backup [puppet] - 10https://gerrit.wikimedia.org/r/1031985 (owner: 10Andrew Bogott)
[18:20:55] <wikibugs>	 (03PS9) 10Andrew Bogott: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro)
[18:22:00] <wikibugs>	 (03CR) 10Dzahn: "Yea, fair enough. Though restarting zuul-merger might also be forgotten and puppet at least runs by itself after a while." [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[18:22:50] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] ci: avoid hardcoded IP in Hiera, lookup contint.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[18:24:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "duh!  DNS lookup failed for 127.0.0.1 Resolv::DNS::Resource::IN::A" [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[18:39:04] <wikibugs>	 (03PS1) 10Dzahn: ci/zuul: use localhost as gearman server [puppet] - 10https://gerrit.wikimedia.org/r/1032010 (https://phabricator.wikimedia.org/T334517)
[18:39:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] ci/zuul: use localhost as gearman server [puppet] - 10https://gerrit.wikimedia.org/r/1032010 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[18:44:45] <logmsgbot>	 !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply
[18:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:50:35] <wikibugs>	 (03PS1) 10Dzahn: zuul: add DNS lookup for gearman server IP [puppet] - 10https://gerrit.wikimedia.org/r/1032013 (https://phabricator.wikimedia.org/T334517)
[18:52:33] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] zuul: add DNS lookup for gearman server IP [puppet] - 10https://gerrit.wikimedia.org/r/1032013 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[18:56:27] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "needed 2 follow-ups but is working now:" [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[18:56:56] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "[contint2002:~] $ sudo grep -A1 "\[gearman\]" /etc/zuul/zuul-*.conf" [puppet] - 10https://gerrit.wikimedia.org/r/1032010 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[18:57:49] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[18:59:32] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "first applied on contint2002 - fixed issues - then applied on contint1002 and it was complete noop. configs are the same before and after " [puppet] - 10https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[19:02:16] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "looks like the function name was copied but not adjusted yet from "debian_php_version" to "wmf_php_version"." [puppet] - 10https://gerrit.wikimedia.org/r/1029900 (owner: 10Muehlenhoff)
[19:03:06] <wikibugs>	 (03PS1) 10Jdlrobson: [Beta cluster] Set wgVectorFontSizeConfigurableOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032019 (https://phabricator.wikimedia.org/T364887)
[19:04:04] <wikibugs>	 (03PS2) 10Jdlrobson: [Beta cluster] Set wgVectorFontSizeConfigurableOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032019 (https://phabricator.wikimedia.org/T364887)
[19:05:13] <wikibugs>	 (03CR) 10Jdrewniak: [C:03+2] [Beta cluster] Set wgVectorFontSizeConfigurableOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032019 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson)
[19:06:02] <wikibugs>	 (03Merged) 10jenkins-bot: [Beta cluster] Set wgVectorFontSizeConfigurableOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032019 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson)
[19:09:55] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:10:12] <wikibugs>	 (03PS3) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583
[19:13:47] <wikibugs>	 (03PS2) 10Dzahn: contint: set new default docker version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517)
[19:15:55] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:17:56] <wikibugs>	 (03PS1) 10Dzahn: ci: set puppet7 on role level [puppet] - 10https://gerrit.wikimedia.org/r/1032023 (https://phabricator.wikimedia.org/T334517)
[19:18:25] <wikibugs>	 (03PS2) 10Dzahn: ci: set puppet7 at role level [puppet] - 10https://gerrit.wikimedia.org/r/1032023 (https://phabricator.wikimedia.org/T334517)
[19:26:39] <wikibugs>	 (03CR) 10AOkoth: [C:03+1] vrts: Stop ignoring errors from alias sync [puppet] - 10https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145) (owner: 10Muehlenhoff)
[19:40:06] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031594
[19:41:03] <TheresNoTime>	 jouncebot: nowandnext
[19:41:03] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 18 minute(s)
[19:41:04] <jouncebot>	 In 0 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T2000)
[19:42:18] <cscott>	 i'm here for the backport window
[19:45:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T364299)', diff saved to https://phabricator.wikimedia.org/P62423 and previous config saved to /var/cache/conftool/dbconfig/20240515-194514-marostegui.json
[19:45:20] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[19:48:28] <TheresNoTime>	 cscott: shall we start your patch merging now?
[19:48:36] <cscott>	 Sure!  Thanks!
[19:48:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1031918 (https://phabricator.wikimedia.org/T365036) (owner: 10C. Scott Ananian)
[19:54:08] <cscott76>	 Sorry for the aliasing, joining from my phone as well as my desktop
[19:55:11] <TheresNoTime>	 np :)
[19:55:32] <wikibugs>	 (03PS2) 10Jdlrobson: Enable night mode as a desktop beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031561 (https://phabricator.wikimedia.org/T363814)
[19:55:44] <wikibugs>	 (03PS2) 10Superpes15: [enwiki] Throttle exemption for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031817 (https://phabricator.wikimedia.org/T364708)
[19:56:12] <wikibugs>	 (03CR) 10Scott French: "Thanks, Amir!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 (owner: 10Scott French)
[19:58:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:58:07] <wikibugs>	 (03PS8) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1030440
[19:58:52] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "no change on the current prod server:  https://puppet-compiler.wmflabs.org/output/1020344/2465/" [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[19:59:06] <wikibugs>	 (03PS4) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 (https://phabricator.wikimedia.org/T362786)
[19:59:20] <wikibugs>	 (03PS2) 10Jsn.sherman: CommonSettings-labs: Correct wgAutoModeratorLiftWingBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031999 (https://phabricator.wikimedia.org/T364034)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T2000).
[20:00:04] <jouncebot>	 Jdlrobson, Superpes, cscott, and Dreamy_Jazz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:15] <Dreamy_Jazz>	 \o
[20:00:21] <cscott>	 You know, I've never gotten my sticker(s)
[20:00:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P62424 and previous config saved to /var/cache/conftool/dbconfig/20240515-200022-marostegui.json
[20:00:28] <Dreamy_Jazz>	 :D
[20:00:39] <cscott>	 and i've broken wikis back when i should have been rewarded with a t-shirt for the feat
[20:00:57] <Dreamy_Jazz>	 I am happy to deploy, but don't mind if someone wants to combine my patch with another deploy.
[20:01:02] <TheresNoTime>	 (currently merging cscott's patch, another 10 minutes or so)
[20:01:12] <Dreamy_Jazz>	 👍
[20:01:39] <Jdlrobson>	 o/
[20:01:42] <wikibugs>	 (CR) Scott French: "Thanks for the feedback on I0a62da18de21b609b7f07b075bd9be99cd8b8b9f, Amir. Let's go that route and continue the conversation there." [mediawiki-config] - https://gerrit.wikimedia.org/r/1030440 (owner: Scott French)
[20:02:17] <wikibugs>	 (Abandoned) Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - https://gerrit.wikimedia.org/r/1030440 (owner: Scott French)
[20:09:55] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "no change on contint1002 - error on contint2002 will resolve once I reimage tomorrow morning" [puppet] - https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[20:11:23] <wikibugs>	 (Merged) jenkins-bot: [ParserCache] Preserve information from the JsonException when logging failures [core] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031918 (https://phabricator.wikimedia.org/T365036) (owner: C. Scott Ananian)
[20:12:15] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:1031918|[ParserCache] Preserve information from the JsonException when logging failures (T365036)]]
[20:12:17] <TheresNoTime>	 here's hoping cscott returns..
[20:12:19] <stashbot>	 T365036: JSON serialization failures on media files - https://phabricator.wikimedia.org/T365036
[20:15:01] <logmsgbot>	 !log samtar@deploy1002 cscott and samtar: Backport for [[gerrit:1031918|[ParserCache] Preserve information from the JsonException when logging failures (T365036)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:15:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P62425 and previous config saved to /var/cache/conftool/dbconfig/20240515-201529-marostegui.json
[20:16:12] <logmsgbot>	 !log samtar@deploy1002 cscott and samtar: Continuing with sync
[20:16:52] <wikibugs>	 (PS7) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255)
[20:16:52] <wikibugs>	 (CR) Andrea Denisse: "Thanks for your review and the explanation on what is expected of these metrics. I've grouped them under the same metric name (`wmfstatic_" [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: Andrea Denisse)
[20:18:50] <wikibugs>	 (PS8) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255)
[20:20:05] <wikibugs>	 (PS6) Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202)
[20:20:44] <cscott>	 i'm back, sorry libera.chat bounced me
[20:20:55] <TheresNoTime>	 cscott: hi, I went ahead and started the sync for your patch, but you're welcome to test it on mwdebug now while it syncs
[20:21:04] <cscott>	 i'm back, sorry libera.chat bounced me
[20:21:12] <cscott>	 ok, testing now
[20:21:39] <wikibugs>	 (PS7) Dzahn: stewards: add rsync server, let lists primary host pull data [puppet] - https://gerrit.wikimedia.org/r/1031565 (https://phabricator.wikimedia.org/T351202)
[20:24:11] <wikibugs>	 SRE, Cassandra, serviceops, Data Products (Data Products Sprint 13), Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801879 (Eevans)
[20:25:46] <wikibugs>	 (PS1) Dzahn: lists: move definition of primary and standby host to common hiera [puppet] - https://gerrit.wikimedia.org/r/1032032
[20:26:58] <wikibugs>	 (PS2) Dzahn: lists: move definition of primary and standby host to common hiera [puppet] - https://gerrit.wikimedia.org/r/1032032
[20:28:28] <cscott>	 TheresNoTime: tested, looks good. thanks.
[20:28:38] <TheresNoTime>	 cscott: ack :)
[20:28:42] <TheresNoTime>	 Jdlrobson: doing your config patch next (combined with Superpes' throttle rule)
[20:28:56] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1031918|[ParserCache] Preserve information from the JsonException when logging failures (T365036)]] (duration: 16m 41s)
[20:29:00] <stashbot>	 T365036: JSON serialization failures on media files - https://phabricator.wikimedia.org/T365036
[20:29:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1031561 (https://phabricator.wikimedia.org/T363814) (owner: Jdlrobson)
[20:29:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1031817 (https://phabricator.wikimedia.org/T364708) (owner: Superpes15)
[20:30:02] <wikibugs>	 (Merged) jenkins-bot: Enable night mode as a desktop beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/1031561 (https://phabricator.wikimedia.org/T363814) (owner: Jdlrobson)
[20:30:02] <wikibugs>	 SRE, Cassandra, serviceops, Data Products (Data Products Sprint 13), Service-deployment-requests: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801884 (Eevans) A Docker image is now published:  ` docker pull docker-registry.wikimedia.org/repos/sre...
[20:30:04] <wikibugs>	 (Merged) jenkins-bot: [enwiki] Throttle exemption for Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1031817 (https://phabricator.wikimedia.org/T364708) (owner: Superpes15)
[20:30:37] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:1031561|Enable night mode as a desktop beta feature (T363814)]], [[gerrit:1031817|[enwiki] Throttle exemption for Editathon (T364708)]]
[20:30:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T364299)', diff saved to https://phabricator.wikimedia.org/P62426 and previous config saved to /var/cache/conftool/dbconfig/20240515-203037-marostegui.json
[20:30:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[20:30:46] <stashbot>	 T363814: Release dark mode as a beta feature on desktop (May 15th)  - https://phabricator.wikimedia.org/T363814
[20:30:46] <stashbot>	 T364708: Temp lift of IP cap for Chronobiology Edit-a-thon 18th May 2024 - https://phabricator.wikimedia.org/T364708
[20:30:51] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[20:30:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[20:30:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[20:31:00] <Jdlrobson>	 exciting stuff
[20:31:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[20:31:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T364299)', diff saved to https://phabricator.wikimedia.org/P62427 and previous config saved to /var/cache/conftool/dbconfig/20240515-203116-marostegui.json
[20:33:15] <logmsgbot>	 !log samtar@deploy1002 samtar and superpes and jdlrobson: Backport for [[gerrit:1031561|Enable night mode as a desktop beta feature (T363814)]], [[gerrit:1031817|[enwiki] Throttle exemption for Editathon (T364708)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:33:19] <TheresNoTime>	 Jdlrobson: live on mwdebug :)
[20:34:27] <Jdlrobson>	 TheresNoTime: looking!
[20:35:09] <Jdlrobson>	 TheresNoTime: please sync!
[20:35:15] <logmsgbot>	 !log samtar@deploy1002 samtar and superpes and jdlrobson: Continuing with sync
[20:35:18] <wikibugs>	 (PS1) Eevans: cassandra: add data_gateway Cassandra role [puppet] - https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921)
[20:36:28] <TheresNoTime>	 Dreamy_Jazz: will start your patch merging now
[20:36:36] <Dreamy_Jazz>	 Thanks
[20:36:44] <wikibugs>	 (CR) Samtar: [C:+2] "prep for deploy" [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: Kosta Harlan)
[20:36:46] <wikibugs>	 SRE, Cassandra, serviceops, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801918 (Eevans)
[20:44:38] <Jdlrobson>	 thanks for the deploy!
[20:44:47] <TheresNoTime>	 np!
[20:48:13] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1031561|Enable night mode as a desktop beta feature (T363814)]], [[gerrit:1031817|[enwiki] Throttle exemption for Editathon (T364708)]] (duration: 17m 35s)
[20:48:18] <stashbot>	 T363814: Release dark mode as a beta feature on desktop (May 15th)  - https://phabricator.wikimedia.org/T363814
[20:48:18] <stashbot>	 T364708: Temp lift of IP cap for Chronobiology Edit-a-thon 18th May 2024 - https://phabricator.wikimedia.org/T364708
[20:50:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1002 using scap backport" [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: Kosta Harlan)
[20:51:56] <Superpes>	 Thanks TheresNoTime
[20:51:57] <Superpes>	 :3
[20:52:07] <TheresNoTime>	 no worries :D
[20:54:54] <TheresNoTime>	 Dreamy_Jazz: about 5mins to merging, you okay to hang on?
[20:55:12] <Dreamy_Jazz>	 Yes, I can hang around till then.
[20:56:07] <Dreamy_Jazz>	 gate-and-submit is certainly taking its time :D
[20:56:22] <wikibugs>	 (PS9) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255)
[20:56:41] <TheresNoTime>	 didn't think ConfirmEdit was that slow to merge..!
[20:57:27] <Dreamy_Jazz>	 Ikr. I'd expect this for CheckUser or something, but ConfirmEdit seems an odd one. Perhaps it's because it now loads AbuseFilter as a dependency?
[20:58:08] <Sohom_Datta>	 Um, about the dark mode patch, is it expected to be automatic by default?
[20:58:46] <Dreamy_Jazz>	 Do you have all beta features enabled?
[20:59:03] <wikibugs>	 (Merged) jenkins-bot: AbuseFilterHooks: Provide feature flags for AF custom actions [extensions/ConfirmEdit] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1031921 (https://phabricator.wikimedia.org/T20110) (owner: Kosta Harlan)
[20:59:39] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:1031921|AbuseFilterHooks: Provide feature flags for AF custom actions (T20110)]]
[20:59:51] <stashbot>	 T20110: Define AbuseFilter consequence to display a CAPTCHA - https://phabricator.wikimedia.org/T20110
[21:00:02] <Sohom_Datta>	 I had the "Accessibility for Reading (Vector 2022)" feature enabled by default I guess
[21:00:03] <Dreamy_Jazz>	 At least for me Dark Mode isn't enabled by default on production
[21:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240515T2100)
[21:00:12] <Dreamy_Jazz>	 That might be it then
[21:00:51] <TheresNoTime>	 turning on "Accessibility for Reading (Vector 2022)" sets it to "Automatic" for me fwiw (so currently dark mode)
[21:01:26] <wikibugs>	 SRE, Cassandra, serviceops, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801985 (Eevans) p:Triage→High
[21:01:34] <wikibugs>	 (PS10) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255)
[21:02:20] <logmsgbot>	 !log samtar@deploy1002 samtar and kharlan: Backport for [[gerrit:1031921|AbuseFilterHooks: Provide feature flags for AF custom actions (T20110)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:02:29] <TheresNoTime>	 Dreamy_Jazz: on mwdebug
[21:02:36] <Dreamy_Jazz>	 Ty
[21:02:38] <Dreamy_Jazz>	 Testing
[21:03:05] <Sohom_Datta>	 Yeah, I guess it was more sudden than I expected (also I was on Special:Watchlist which looks particularly bad (to me))
[21:03:27] <Dreamy_Jazz>	 TheresNoTime: Test successful.
[21:03:32] <logmsgbot>	 !log samtar@deploy1002 samtar and kharlan: Continuing with sync
[21:03:50] <wikibugs>	 SRE, Cassandra, serviceops, Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9801987 (Eevans) p:High→Triage
[21:03:51] <TheresNoTime>	 Dreamy_Jazz: syncing :)
[21:04:00] <Dreamy_Jazz>	 :D
[21:04:23] <Dreamy_Jazz>	 I do see what you are saying about the Watchlist :)
[21:04:58] <Dreamy_Jazz>	 Some of the colours seem to not yet be adapted for dark mode
[21:06:58] <wikibugs>	 (PS1) Dzahn: admin: add Dennis Mburugu to ldap_only users (wmf) [puppet] - https://gerrit.wikimedia.org/r/1032047 (https://phabricator.wikimedia.org/T364320)
[21:07:07] <Sohom_Datta>	 Yep, looks like anything that has hardcoded styles is in pretty bad shape lol
[21:07:27] <Sohom_Datta>	 Special:NewPagesFeed is also pretty bad
[21:07:48] <TheresNoTime>	 lucky you, getting to fix it! :D
[21:08:50] <Sohom_Datta>	 :)
[21:09:20] <wikibugs>	 (CR) Dzahn: [V:+1] "https://wikimedia.namely.com/people/bc0ae9bc-9dd7-4390-afae-8bab4dc49684/show/personal/employee-information/" [puppet] - https://gerrit.wikimedia.org/r/1032047 (https://phabricator.wikimedia.org/T364320) (owner: Dzahn)
[21:10:22] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320#9802031 (Dzahn) Open→In progress
[21:14:02] <wikibugs>	 (CR) Dzahn: [C:+1] "lgtm, has the approvals now" [puppet] - https://gerrit.wikimedia.org/r/1031976 (https://phabricator.wikimedia.org/T364588) (owner: Eevans)
[21:14:18] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[21:16:10] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1031921|AbuseFilterHooks: Provide feature flags for AF custom actions (T20110)]] (duration: 16m 31s)
[21:16:14] <stashbot>	 T20110: Define AbuseFilter consequence to display a CAPTCHA - https://phabricator.wikimedia.org/T20110
[21:16:16] <TheresNoTime>	 and done
[21:16:24] <TheresNoTime>	 !log UTC late backport window complete
[21:16:24] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1-d1-codfw mgmt ip - cmooney@cumin1002"
[21:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:13] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1-d1-codfw mgmt ip - cmooney@cumin1002"
[21:17:13] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:17:17] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:21:11] <wikibugs>	 (PS3) JHathaway: postfix: prometheus ops config [puppet] - https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395)
[21:22:46] <wikibugs>	 (PS4) JHathaway: postfix: prometheus ops config for mx-out boxes [puppet] - https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395)
[21:23:02] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:27:23] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[21:44:07] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@718b2dd]: specify analytics-hadoop in hdfs urls
[21:44:33] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@718b2dd]: specify analytics-hadoop in hdfs urls (duration: 00m 25s)
[21:51:20] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074 (Milimetric) NEW
[21:54:13] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: delete ssw1-d1-codfw mgmt ip - cmooney@cumin1002"
[21:55:07] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: delete ssw1-d1-codfw mgmt ip - cmooney@cumin1002"
[21:55:07] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:07:32] <wikibugs>	 (CR) Eevans: [C:+2] Add user xcollazo to cassandra-staging-devs group [puppet] - https://gerrit.wikimedia.org/r/1031976 (https://phabricator.wikimedia.org/T364588) (owner: Eevans)
[22:13:03] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9802260 (Eevans) This is now done.  The document is here: https://wikitech.wikimedia.org/wiki/Cassandra/Staging (it's still quite bare, so if you have any q...
[22:13:15] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9802261 (Eevans) In progress→Resolved
[22:13:55] <wikibugs>	 (PS11) Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255)
[22:17:17] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:18:02] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:26:28] <wikibugs>	 (PS2) Jdlrobson: Disable wgParserEnableLegacyMediaDOM [mediawiki-config] - https://gerrit.wikimedia.org/r/1031610 (https://phabricator.wikimedia.org/T363597)
[22:34:23] <jinxer-wm>	 FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:39:23] <jinxer-wm>	 RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:40:49] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@12e0cb9]: bump discolytics to 0.19.0
[22:41:17] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@12e0cb9]: bump discolytics to 0.19.0 (duration: 00m 27s)
[22:48:02] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:58:04] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:16:09] <wikibugs>	 (PS1) BCornwall: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1032086
[23:21:25] <icinga-wm>	 PROBLEM - carbon-cache write error on graphite1005 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [8.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=30
[23:28:32] <wikibugs>	 (PS1) Jdlrobson: Disable font size configuration on talk pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1032088 (https://phabricator.wikimedia.org/T364887)
[23:38:27] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1031596
[23:38:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1031596 (owner: TrainBranchBot)
[23:43:25] <icinga-wm>	 RECOVERY - carbon-cache write error on graphite1005 is OK: OK: Less than 80.00% above the threshold [1.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=30
[23:43:32] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Eric! Two minor questions, but otherwise looks good." [puppet] - https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) (owner: Eevans)
[23:47:25] <icinga-wm>	 PROBLEM - carbon-cache write error on graphite1005 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [8.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=30
[23:58:02] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable