[00:00:33] <jinxer-wm>	 (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:04:26] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2035 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:48] <jinxer-wm>	 (DatasourceError) resolved: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:18:30] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 60849 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[00:38:23] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962240
[00:38:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962240 (owner: TrainBranchBot)
[00:52:39] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962240 (owner: TrainBranchBot)
[01:16:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:02] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 131 probes of 705 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:19:24] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 211 probes of 712 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:21:30] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 77 probes of 705 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:25:48] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:22] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 85 probes of 712 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:38:48] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:48] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:22:52] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 68973 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[03:34:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:39:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:03:48] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 57509 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[04:24:33] <wikibugs>	 (PS1) Andrew Bogott: Neutron: update init script for api service in Antelope [puppet] - https://gerrit.wikimedia.org/r/963429
[04:27:41] <wikibugs>	 (PS2) Andrew Bogott: Neutron: update init script for api service in Antelope [puppet] - https://gerrit.wikimedia.org/r/963429
[04:29:36] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Neutron: update init script for api service in Antelope [puppet] - https://gerrit.wikimedia.org/r/963429 (owner: Andrew Bogott)
[05:10:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T343198)', diff saved to https://phabricator.wikimedia.org/P52827 and previous config saved to /var/cache/conftool/dbconfig/20231005-051056-arnaudb.json
[05:11:01] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:17:03] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (DDeSouza)
[05:21:23] <wikibugs>	 SRE, All-and-every-Wiktionary, Product-Analytics, SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (Nikerabbit) 1) I can't reproduce {F37983700}  2) Not in scope for the Language team either, maybe SRE?
[05:24:07] <wikibugs>	 (PS1) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[05:26:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P52828 and previous config saved to /var/cache/conftool/dbconfig/20231005-052602-arnaudb.json
[05:28:00] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 135 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:29:05] <denisse>	 !log Deleting old Jenkins builds on pcc-worker1002 to free disk space
[05:29:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P52829 and previous config saved to /var/cache/conftool/dbconfig/20231005-054109-arnaudb.json
[05:46:12] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 61088 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[05:47:26] <wikibugs>	 (PS2) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[05:55:14] <wikibugs>	 (PS3) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[05:56:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T343198)', diff saved to https://phabricator.wikimedia.org/P52830 and previous config saved to /var/cache/conftool/dbconfig/20231005-055615-arnaudb.json
[05:56:18] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[05:56:20] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:56:31] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[05:56:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T343198)', diff saved to https://phabricator.wikimedia.org/P52831 and previous config saved to /var/cache/conftool/dbconfig/20231005-055637-arnaudb.json
[05:57:51] <wikibugs>	 (PS4) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[05:58:48] <wikibugs>	 SRE, All-and-every-Wiktionary, Product-Analytics, SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (R4356th) Can repro. {F37984723} {F37984772} And it's affecting more than just English and French entries. {F37984865}
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0600)
[06:00:06] <jouncebot>	 kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0600).
[06:03:40] <wikibugs>	 (PS5) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[06:05:56] <wikibugs>	 (PS6) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[06:11:47] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/963432/43891/" [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) (owner: Andrea Denisse)
[06:14:06] <wikibugs>	 (PS1) Umherirrender: Revert "Use HookHandlers for core hooks" [extensions/WikibaseCirrusSearch] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/963354 (https://phabricator.wikimedia.org/T348181)
[06:14:36] <wikibugs>	 (CR) Ayounsi: [C: +1] "two possible improvements, but overall lgtm" [puppet] - https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174) (owner: Ssingh)
[06:26:37] <wikibugs>	 SRE, Traffic, Patch-For-Review: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (ayounsi) Thanks, I think the scope should be larger than just those two variables if we want to remove the term "anycast" as much as possible....
[06:28:26] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:46:07] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (ayounsi) > Secondary Link Migration Looking at link usage, it's fine to drop the secondary link and keep it at 10G. https://librenm...
[06:47:38] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 63195 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[06:56:41] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Looks good, merging." [puppet] - https://gerrit.wikimedia.org/r/963413 (https://phabricator.wikimedia.org/T345220) (owner: Subramanya Sastry)
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0700).
[07:00:22] <apergos>	 Morning!  There are no trainees signed up for this morning's awesome opportunity to learn all about deployment.  But that's ok, because there are no patches scheduled for deployment either. Have a great day and we'll see you here next time! 
[07:03:18] <hashar>	 apergos: I was going to ask :-]
[07:03:48] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:04:07] <apergos>	 it's wikimedia connect, I assume that may have cut into the appetite for deployment somewhat
[07:09:24] <wikibugs>	 SRE, All-and-every-Wiktionary, Product-Analytics, SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (Pamputt) I submitted a feedback to Google 10 days ago but nothing has changed since then.
[07:13:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (Jelto) p:Triage→Medium
[07:16:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (Jelto) Thanks for the access request.  I need approval from @Bethany and @thcipriani here to proceed.
[07:18:59] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff
[07:27:15] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (ayounsi) > I would propose using free port ge-0/0/3 to add the new routed link, and bringing up the BGP peering before we touch the existing link....
[07:27:42] <wikibugs>	 (PS1) Krinkle: logging: Remove redundant  setTimezone() call for UTC [mediawiki-config] - https://gerrit.wikimedia.org/r/963662 (https://phabricator.wikimedia.org/T99581)
[07:51:35] <godog>	 !log bounce vopsbot on alert1001
[07:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) (owner: Andrea Denisse)
[07:55:54] <wikibugs>	 (CR) Hashar: [C: +1] "I got confused in my earlier comment, this sets the timestamp of files for the application being built rather than the dependency wheels :)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[07:59:14] <moritzm>	 !log installing jetty9 security updates
[07:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:38] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "See inline" [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) (owner: Andrea Denisse)
[08:09:38] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 69024 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[08:19:50] <wikibugs>	 (PS2) Cathal Mooney: Add ns0 and ns1 /32 routes to anycast_prefixes list [homer/public] - https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041)
[08:20:48] <wikibugs>	 (CR) Cathal Mooney: Add ns0 and ns1 /32 routes to anycast_prefixes list (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041) (owner: Cathal Mooney)
[08:26:13] <wikibugs>	 (PS1) Volans: openldap: cross-validate-accounts wording [puppet] - https://gerrit.wikimedia.org/r/963666
[08:29:02] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/963666 (owner: Volans)
[08:29:33] <wikibugs>	 (CR) Volans: [C: +2] openldap: cross-validate-accounts wording [puppet] - https://gerrit.wikimedia.org/r/963666 (owner: Volans)
[08:30:12] <volans>	 moritzm: can I puppet-merge your change too? parsoid-rt-client: Further reduce worker pool to 16 clients
[08:30:31] <wikibugs>	 SRE, SRE-Access-Requests: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (Jelto)
[08:30:39] <wikibugs>	 SRE, SRE-Access-Requests: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (Jelto) p:Triage→Medium
[08:33:48] <volans>	 mine can be merged anytime
[08:34:40] <wikibugs>	 SRE, SRE-Access-Requests: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (Peachey88) See also {T346921}
[08:36:20] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney) p:Triage→Medium
[08:37:11] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney)
[08:37:27] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney)
[08:37:33] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney)
[08:40:36] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney)
[08:43:22] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:47:10] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney)
[08:47:50] <wikibugs>	 (CR) EoghanGaffney: [C: +1] python-build: provide a python2 Bullseye image [docker-images/production-images] - https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[08:50:12] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:50:34] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 68273 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[08:51:18] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:51:34] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:52:38] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:58:47] <wikibugs>	 (PS2) Jelto: [gitlab/switchover] Update DNS for gitlab/gitlab-replica [dns] - https://gerrit.wikimedia.org/r/963161 (https://phabricator.wikimedia.org/T345531) (owner: EoghanGaffney)
[08:59:02] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[09:00:05] <jouncebot>	 jelto and eoghan: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab DC switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0900).
[09:01:13] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab2002.wikimedia.org
[09:05:02] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[09:05:48] <icinga-wm>	 PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[09:06:28] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro)
[09:08:16] <icinga-wm>	 ACKNOWLEDGEMENT - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused Jelto switchover to codfw - T345531 - The acknowledgement expires at: 2023-10-06 11:25:35. https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[09:08:48] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:08:52] <volans>	 moritzm: re: puppet-merge, your patch is still pending; mine can be merged anytime, so I'll leave it to you to merge when yours is ready
[09:18:07] <moritzm>	 oh yes, sorry. just merged both
[09:19:14] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:19:32] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:19:38] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:21:34] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Reinstate absented package list bullseye without ISC libraries [puppet] - https://gerrit.wikimedia.org/r/961983 (owner: Muehlenhoff)
[09:28:04] <wikibugs>	 (PS1) Volans: spicerack: improve cookbooks help message [software/spicerack] - https://gerrit.wikimedia.org/r/963669
[09:29:59] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Add new codfw private vlan sub-interfaces to lvs2013 and lvs2014 - https://phabricator.wikimedia.org/T348225 (cmooney) p:Triage→Medium
[09:30:17] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Add new codfw private vlan sub-interfaces to lvs2013 and lvs2014 - https://phabricator.wikimedia.org/T348225 (cmooney)
[09:30:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[09:33:37] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[09:34:57] <wikibugs>	 (CR) DCausse: "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson)
[09:36:57] <wikibugs>	 (CR) Brouberol: [C: +1] Bump the maximum number of HDFS files before triggering an alert [alerts] - https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) (owner: Btullis)
[09:37:16] <wikibugs>	 (CR) Btullis: [C: +2] Bump the maximum number of HDFS files before triggering an alert [alerts] - https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) (owner: Btullis)
[09:44:29] <wikibugs>	 (CR) EoghanGaffney: [V: +2 C: +2] python-build: provide a python2 Bullseye image [docker-images/production-images] - https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[09:50:59] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[09:51:27] <wikibugs>	 (CR) Jbond: [C: +2] gerrit: remove duplicate $gerrit_site definition [puppet] - https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: Hashar)
[09:51:40] <wikibugs>	 (CR) Jbond: [C: +2] "merged" [puppet] - https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: Hashar)
[09:52:34] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/963258 (owner: Slyngshede)
[09:53:23] <jbond>	 hashar: fyi i merged the gerrit cr and ran puppet on gerrit1003
[09:54:23] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[09:55:09] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts ores1001.eqiad.wmnet
[09:55:21] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/963669 (owner: Volans)
[09:56:45] <wikibugs>	 (CR) Btullis: [C: +1] "OK, so this should add druid1009 and druid1010 to the cluster, correct?" [puppet] - https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[09:57:18] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:59:36] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm, optional nit" [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[09:59:49] <moritzm>	 !log installing python2.7 security updates
[09:59:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] <jouncebot>	 jelto and eoghan: Your horoscope predicts another unfortunate GitLab DC switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0900).
[10:00:05] <jouncebot>	 mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1000)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1000)
[10:00:47] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[10:02:00] <wikibugs>	 (CR) Stevemunene: druid: Bring druid1010.eqiad.wmnet into service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[10:02:35] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet CI, Release-Engineering-Team: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (jbond) @hashar could you check if this is on the jenkins side pcc-worker1002 looks healthy to me
[10:03:51] <wikibugs>	 (CR) Stevemunene: [C: +2] druid: Bring druid1010.eqiad.wmnet into service [puppet] - https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[10:08:08] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[10:09:14] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[10:09:14] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:09:15] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores1001.eqiad.wmnet
[10:09:45] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet
[10:13:35] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Ship it!" [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:14:24] <wikibugs>	 (CR) Majavah: [C: -1] site.pp: Move ORES machines to spare::system for decomming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:15:21] <wikibugs>	 (PS1) Slyngshede: Implement Codex design for properties page. [software/bitu] - https://gerrit.wikimedia.org/r/963681
[10:16:18] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[10:16:36] <wikibugs>	 (CR) Klausman: site.pp: Move ORES machines to spare::system for decomming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:19:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:21:02] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[10:21:52] <wikibugs>	 (PS2) Klausman: site.pp: Move ORES machines to spare::system for decomming [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144)
[10:21:58] <wikibugs>	 (CR) Klausman: site.pp: Move ORES machines to spare::system for decomming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:23:18] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[10:23:18] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:23:19] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet
[10:24:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:27:21] <wikibugs>	 (Abandoned) Ladsgroup: mariadb: Add grants for testreduce1002 [puppet] - https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: Ladsgroup)
[10:30:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:35:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:39:51] <wikibugs>	 (CR) Jelto: [C: +2] [gitlab/switchover] Change profile::gitlab::service_name for switchover [puppet] - https://gerrit.wikimedia.org/r/963160 (https://phabricator.wikimedia.org/T345531) (owner: EoghanGaffney)
[10:40:06] <wikibugs>	 (CR) Majavah: site.pp: Move ORES machines to spare::system for decomming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:40:42] <wikibugs>	 (PS3) Klausman: site.pp: Remove ORES machines (real and VMs) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144)
[10:40:44] <wikibugs>	 (PS13) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[10:40:57] <wikibugs>	 (CR) Majavah: [C: +1] site.pp: Remove ORES machines (real and VMs) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:41:23] <wikibugs>	 (CR) Klausman: [C: +2] site.pp: Remove ORES machines (real and VMs) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:42:14] <wikibugs>	 (CR) Btullis: airflow-wmde: configure wmde airflow instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[10:43:21] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[10:46:41] <wikibugs>	 (PS14) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[10:49:26] <wikibugs>	 (CR) Jbond: [C: +2] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[10:49:30] <wikibugs>	 (CR) Jbond: [C: +2] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[10:49:33] <wikibugs>	 (CR) Jbond: [C: +2] prometheus: switch to wmflib::get_clusters [puppet] - https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[10:49:36] <wikibugs>	 (CR) Jbond: [C: +2] get_clusters: remove legacy functions [puppet] - https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[10:50:18] <wikibugs>	 (CR) Hnowlan: thumbor: add imagemagick policy file (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: Hnowlan)
[10:50:35] <wikibugs>	 (CR) Hnowlan: [C: +2] Don't ignore imagemagick exit status [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/961968 (https://phabricator.wikimedia.org/T344233) (owner: Tim Starling)
[10:53:48] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:57:37] <wikibugs>	 (PS16) Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373)
[10:58:01] <wikibugs>	 (PS10) Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373)
[10:58:28] <wikibugs>	 (PS12) Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373)
[10:59:09] <wikibugs>	 (Merged) jenkins-bot: Don't ignore imagemagick exit status [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/961968 (https://phabricator.wikimedia.org/T344233) (owner: Tim Starling)
[11:01:35] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:02:04] <wikibugs>	 (PS10) Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373)
[11:03:49] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 4 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43895/console" [puppet] - https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:04:11] <wikibugs>	 (CR) Muehlenhoff: [C: +2] mirrors: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/962032 (owner: Muehlenhoff)
[11:04:24] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 5 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43896/console" [puppet] - https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:05:24] <jinxer-wm>	 (ProbeDown) firing: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:06:37] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 7 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43897/console" [puppet] - https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:07:31] <wikibugs>	 (Abandoned) Jbond: firewall: migrate ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: Jbond)
[11:07:46] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:10:36] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:14:24] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:14:33] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 4 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43898/console" [puppet] - https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:14:55] <wikibugs>	 (PS1) Muehlenhoff: Remove pentesters group [puppet] - https://gerrit.wikimedia.org/r/963688
[11:15:04] <wikibugs>	 (Abandoned) Muehlenhoff: Mark pentesters as deprecated [puppet] - https://gerrit.wikimedia.org/r/960547 (https://phabricator.wikimedia.org/T276465) (owner: Muehlenhoff)
[11:17:07] <wikibugs>	 (PS4) Jbond: redis::slave: switch to puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373)
[11:17:11] <wikibugs>	 (PS6) Jbond: prometheus::pdu_config: update to use wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373)
[11:17:22] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:17:57] <wikibugs>	 (PS15) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[11:18:28] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43899/console" [puppet] - https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:19:11] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:20:02] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] redis::slave: switch to puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:20:27] <wikibugs>	 (PS1) Hnowlan: thumbor: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233)
[11:21:08] <wikibugs>	 (CR) Jelto: [C: +2] [gitlab/switchover] Update DNS for gitlab/gitlab-replica [dns] - https://gerrit.wikimedia.org/r/963161 (https://phabricator.wikimedia.org/T345531) (owner: EoghanGaffney)
[11:22:11] <wikibugs>	 (PS16) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[11:22:29] <wikibugs>	 (CR) Muehlenhoff: [C: -1] "We still need to add a constant for NETWORK_INFRA for nftables, -1ing for now" [puppet] - https://gerrit.wikimedia.org/r/954287 (owner: Muehlenhoff)
[11:22:58] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43900/console" [puppet] - https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:23:04] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors
[11:23:08] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors
[11:23:34] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors
[11:23:39] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors
[11:24:01] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] prometheus::pdu_config: update to use wmflib::puppetdb_query [puppet] - https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:24:03] <wikibugs>	 (PS1) Jbond: prometheus::cluster_config: sort targets [puppet] - https://gerrit.wikimedia.org/r/963693
[11:24:38] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors
[11:24:42] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors
[11:25:47] <icinga-wm>	 RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:29:41] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43901/console" [puppet] - https://gerrit.wikimedia.org/r/963693 (owner: Jbond)
[11:29:43] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] prometheus::cluster_config: sort targets [puppet] - https://gerrit.wikimedia.org/r/963693 (owner: Jbond)
[11:29:55] <wikibugs>	 (PS2) Slyngshede: Implement Codex design for properties page. [software/bitu] - https://gerrit.wikimedia.org/r/963681
[11:30:53] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:24] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) (owner: Hnowlan)
[11:33:12] <wikibugs>	 (Merged) jenkins-bot: thumbor: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) (owner: Hnowlan)
[11:34:41] <icinga-wm>	 PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:34:44] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (jbond)
[11:35:19] <wikibugs>	 (PS1) Jelto: gitlab: make gitlab2002 the active host [puppet] - https://gerrit.wikimedia.org/r/963706 (https://phabricator.wikimedia.org/T345531)
[11:36:37] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[11:36:37] <wikibugs>	 (CR) LSobanski: [C: +1] gitlab: make gitlab2002 the active host [puppet] - https://gerrit.wikimedia.org/r/963706 (https://phabricator.wikimedia.org/T345531) (owner: Jelto)
[11:36:46] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[11:36:50] <wikibugs>	 (CR) Jelto: [C: +2] gitlab: make gitlab2002 the active host [puppet] - https://gerrit.wikimedia.org/r/963706 (https://phabricator.wikimedia.org/T345531) (owner: Jelto)
[11:36:53] <icinga-wm>	 RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:43:00] <wikibugs>	 (CR) WMDE-Fisch: "Exciting and thanks for taking care! When will this be live in production?" [deployment-charts] - https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) (owner: Hnowlan)
[11:46:26] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab1004.wikimedia.org to gitlab2002.wikimedia.org
[11:47:57] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:43] <wikibugs>	 (PS1) Jbond: puppetdbquery: remove all refrences to query_resources [puppet] - https://gerrit.wikimedia.org/r/963709 (https://phabricator.wikimedia.org/T341373)
[11:48:53] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:50:24] <jinxer-wm>	 (ProbeDown) resolved: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:54:23] <wikibugs>	 (PS1) Jbond: puppetdbquery: drop module [puppet] - https://gerrit.wikimedia.org/r/963710 (https://phabricator.wikimedia.org/T341373)
[11:54:37] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:42] <wikibugs>	 (CR) Jbond: [C: +2] puppetdbquery: remove all refrences to query_resources [puppet] - https://gerrit.wikimedia.org/r/963709 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:56:01] <wikibugs>	 (CR) Jbond: [C: +2] puppetdbquery: drop module [puppet] - https://gerrit.wikimedia.org/r/963710 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[11:57:16] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet2005-dev
[11:57:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:57:18] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudnet2005-dev
[12:00:05] <jouncebot>	 jelto and eoghan: It is that lovely time of the day again! You are hereby commanded to deploy GitLab DC switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0900).
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1200)
[12:01:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye
[12:01:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye
[12:01:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye
[12:01:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye
[12:01:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye
[12:02:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[12:03:34] <wikibugs>	 (PS5) Muehlenhoff: scap: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945749
[12:06:27] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (jbond)
[12:06:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1063']
[12:07:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1064']
[12:07:23] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Core, Patch-For-Review, Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (jbond) Open→Resolved a:jbond puppetdbquery has now been removed
[12:07:34] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1064']
[12:07:37] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1063']
[12:07:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1063']
[12:08:19] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (jbond)
[12:08:25] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1063']
[12:08:32] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945749 (owner: Muehlenhoff)
[12:09:05] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (jbond) Open→In progress p:Triage→Medium
[12:10:34] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetboard2002.codfw.wmnet,puppetboard1002.eqiad.wmnet
[12:11:58] <wikibugs>	 SRE, Cloud-VPS, cloud-services-team: cloudlb2001-dev and cloudlb2002-dev connected at different speeds - https://phabricator.wikimedia.org/T348173 (LSobanski)
[12:12:31] <wikibugs>	 (PS1) Jbond: hieradata: remove puppetdb[12]002 [puppet] - https://gerrit.wikimedia.org/r/963714 (https://phabricator.wikimedia.org/T347286)
[12:13:08] <wikibugs>	 (CR) Jbond: [C: +2] hieradata: remove puppetdb[12]002 [puppet] - https://gerrit.wikimedia.org/r/963714 (https://phabricator.wikimedia.org/T347286) (owner: Jbond)
[12:13:37] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet
[12:14:29] <wikibugs>	 (CR) DCausse: cirrus streaming updater service (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson)
[12:17:27] <wikibugs>	 (PS1) Jbond: hieradata: rename puppetdb [puppet] - https://gerrit.wikimedia.org/r/963716
[12:17:39] <wikibugs>	 (CR) Jbond: [C: +2] hieradata: rename puppetdb [puppet] - https://gerrit.wikimedia.org/r/963716 (owner: Jbond)
[12:21:56] <wikibugs>	 (PS17) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[12:22:24] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.netbox
[12:23:24] <wikibugs>	 (PS1) Jbond: puppetboard: use global aka site specific puppetdb_host [puppet] - https://gerrit.wikimedia.org/r/963717 (https://phabricator.wikimedia.org/T347286)
[12:24:12] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.netbox
[12:26:29] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001"
[12:26:34] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43908/console" [puppet] - https://gerrit.wikimedia.org/r/963717 (https://phabricator.wikimedia.org/T347286) (owner: Jbond)
[12:26:49] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppetboard: use global aka site specific puppetdb_host [puppet] - https://gerrit.wikimedia.org/r/963717 (https://phabricator.wikimedia.org/T347286) (owner: Jbond)
[12:27:05] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:27:06] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts puppetboard2002.codfw.wmnet,puppetboard1002.eqiad.wmnet
[12:27:18] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetboard2002.codfw.wmnet,p...
[12:27:31] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001"
[12:27:31] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:27:32] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet
[12:27:37] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): decomission puppetdb[12]002 - https://phabricator.wikimedia.org/T347285 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet` - p...
[12:29:07] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (jbond) > Failed to run the sre.dns.netbox cookbook, run it manually This was completed by a different cookbook run
[12:29:29] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Blacklist exfat [puppet] - https://gerrit.wikimedia.org/r/950145 (owner: Muehlenhoff)
[12:38:39] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[12:39:04] <wikibugs>	 (PS18) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[12:40:27] <wikibugs>	 (PS1) Jbond: pupetboard: move to puppetboard role [puppet] - https://gerrit.wikimedia.org/r/963718 (https://phabricator.wikimedia.org/T347286)
[12:41:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists1001.wikimedia.org
[12:41:35] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43909/console" [puppet] - https://gerrit.wikimedia.org/r/963718 (https://phabricator.wikimedia.org/T347286) (owner: Jbond)
[12:41:44] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] pupetboard: move to puppetboard role [puppet] - https://gerrit.wikimedia.org/r/963718 (https://phabricator.wikimedia.org/T347286) (owner: Jbond)
[12:42:57] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[12:45:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists1001.wikimedia.org
[12:48:12] <wikibugs>	 (PS1) Jbond: puppetdb: move bookworm db's to puppetdb role [puppet] - https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285)
[12:49:53] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43910/console" [puppet] - https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) (owner: Jbond)
[12:49:58] <wikibugs>	 (CR) Cathal Mooney: [C: +1] cloudgw: refactor interfaces setting to use the base module [puppet] - https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[12:50:51] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[12:51:31] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[12:51:39] <wikibugs>	 (03PS1) 10Ladsgroup: Switch ES cluster to cluster28 and cluster29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685)
[12:51:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet
[12:54:14] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet
[12:57:11] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[12:57:47] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[12:58:09] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez)
[12:58:27] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[12:58:34] <wikibugs>	 (03CR) 10Ladsgroup: [C: 04-2] "Until Jaime is back from ooo so he can switch the backups." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup)
[12:58:38] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469)
[12:58:44] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469)
[12:59:30] <wikibugs>	 (03PS19) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[12:59:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:18] <Lucas_WMDE>	 in a meeting at the moment, but I might deploy a config change later
[13:00:22] <Lucas_WMDE>	 maybe in 20 minutes or so
[13:00:42] <urbanecm>	 Lucas_WMDE: that's perfect, i'll do one now, as i'll have to leave in 30 mins or so :)
[13:00:47] <Lucas_WMDE>	 ok \o/
[13:02:14] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] enwiki: Enable mentorship for 50% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963722 (https://phabricator.wikimedia.org/T341399)
[13:02:22] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [Growth] enwiki: Enable mentorship for 50% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963722 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm)
[13:02:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:02:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963722 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm)
[13:03:04] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] enwiki: Enable mentorship for 50% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963722 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm)
[13:04:22] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:963722|[Growth] enwiki: Enable mentorship for 50% of new users (T341399)]]
[13:04:29] <stashbot>	 T341399: Increase percentage of newcomers who receive Growth mentorship at English Wikipedia - https://phabricator.wikimedia.org/T341399
[13:04:44] <wikibugs>	 (03PS20) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[13:04:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host matomo1002.eqiad.wmnet
[13:04:54] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet
[13:05:46] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-master1002.eqiad.wmnet
[13:05:50] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:963722|[Growth] enwiki: Enable mentorship for 50% of new users (T341399)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:05:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: wmcs: cloudgw: rename resource to avoid clash [puppet] - 10https://gerrit.wikimedia.org/r/963723 (https://phabricator.wikimedia.org/T347469)
[13:06:11] <wikibugs>	 (03PS11) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[13:06:31] <wikibugs>	 (03PS12) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[13:07:45] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] wmcs: cloudgw: rename resource to avoid clash [puppet] - 10https://gerrit.wikimedia.org/r/963723 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:08:05] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[13:08:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:08:47] <claime>	 !log respawning two misbehaving thumbor pods in codfw
[13:08:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1002.eqiad.wmnet
[13:09:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: cloudgw: rename resource to avoid clash [puppet] - 10https://gerrit.wikimedia.org/r/963723 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:11:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[13:11:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[13:11:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[13:12:50] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-master1002.eqiad.wmnet
[13:12:52] <wikibugs>	 (03PS13) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[13:13:08] <wikibugs>	 (03PS2) 10Jbond: puppetdb: move bookworm db's to puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285)
[13:13:47] <wikibugs>	 (03PS21) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[13:14:30] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:963722|[Growth] enwiki: Enable mentorship for 50% of new users (T341399)]] (duration: 10m 08s)
[13:14:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:14:35] <stashbot>	 T341399: Increase percentage of newcomers who receive Growth mentorship at English Wikipedia - https://phabricator.wikimedia.org/T341399
[13:14:39] * urbanecm done
[13:14:44] <urbanecm>	 Lucas_WMDE: feel free to go ahead once done with your meeting
[13:14:50] <wikibugs>	 (03PS3) 10Jbond: puppetdb: move bookworm db's to puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285)
[13:15:09] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:15:18] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1002.eqiad.wmnet
[13:16:35] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43912/console" [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond)
[13:17:17] <Lucas_WMDE>	 urbanecm: ok thanks!
[13:18:36] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151)
[13:19:01] <wikibugs>	 (03PS14) 10Ilias Sarantopoulos: team-ml: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[13:19:10] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: move bookworm db's to puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond)
[13:19:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:21:57] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye
[13:22:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1002.eqiad.wmnet
[13:22:03] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[13:22:04] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye
[13:22:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro...
[13:22:10] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye
[13:22:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[13:22:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro...
[13:22:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro...
[13:22:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): decomission puppetdb[12]002 - https://phabricator.wikimedia.org/T347285 (10jbond) 05Open→03Resolved a:03jbond
[13:22:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (10jbond) 05In progress→03Resolved a:03jbond
[13:24:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/963688 (owner: 10Muehlenhoff)
[13:27:22] <wikibugs>	 (03CR) 10Jbond: mariadb: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[13:27:57] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: create the vrf-cloudgw device via static file [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469)
[13:27:59] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] team-ml: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:28:55] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.48.236:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.236 port 9042 https://phabricator.wikimedia.org/T93886
[13:29:42] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "should be ok, the dir is sourced before the rest of them in /etc/network/interfaces" [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:30:27] <Lucas_WMDE>	 alright, I can deploy now
[13:30:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudgw: create the vrf-cloudgw device via static file [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:31:48] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudgw: create the vrf-cloudgw device via static file [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469)
[13:32:26] <urandom>	 !log starting Cassandra rebuild, restbase1030-c — T346803
[13:32:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove pentesters group [puppet] - 10https://gerrit.wikimedia.org/r/963688 (owner: 10Muehlenhoff)
[13:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:32] <stashbot>	 T346803: Unable to bootstrap restbase1030-{a,b,c} - https://phabricator.wikimedia.org/T346803
[13:33:28] <wikibugs>	 (03PS1) 10Jbond: hieradata: drop old hiera files [puppet] - 10https://gerrit.wikimedia.org/r/963729
[13:33:55] <wikibugs>	 (03PS6) 10Lucas Werkmeister (WMDE): Drop old VP8 video transcodes, enable HLS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) (owner: 10Brion VIBBER)
[13:34:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu] - 10https://gerrit.wikimedia.org/r/963681 (owner: 10Slyngshede)
[13:34:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[13:35:01] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43913/console" [puppet] - 10https://gerrit.wikimedia.org/r/963729 (owner: 10Jbond)
[13:35:02] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:35:14] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:35:20] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:35:38] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:35:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) (owner: 10Brion VIBBER)
[13:36:21] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:36:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:36:28] <wikibugs>	 (03Merged) 10jenkins-bot: Drop old VP8 video transcodes, enable HLS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) (owner: 10Brion VIBBER)
[13:36:29] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:36:48] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/963730 (https://phabricator.wikimedia.org/T341509)
[13:36:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:961864|Drop old VP8 video transcodes, enable HLS on testwiki (T312152 T309823)]]
[13:36:58] <stashbot>	 T309823: Disable old WebM VP8 transcodes except for 360p - https://phabricator.wikimedia.org/T309823
[13:36:59] <stashbot>	 T312152: Clean up video transcode config for speed/bitrate balance - https://phabricator.wikimedia.org/T312152
[13:38:17] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/963730 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott)
[13:38:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 brion and lucaswerkmeister-wmde: Backport for [[gerrit:961864|Drop old VP8 video transcodes, enable HLS on testwiki (T312152 T309823)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:38:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: create the vrf-cloudgw device via static file [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:39:57] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] hieradata: drop old hiera files [puppet] - 10https://gerrit.wikimedia.org/r/963729 (owner: 10Jbond)
[13:41:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] scap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff)
[13:41:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) @cmooney's comment above on the default routing policy and priority of routes got me thinking: if...
[13:41:40] <wikibugs>	 (03PS1) 10Bking: flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315)
[13:41:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye
[13:41:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye
[13:42:09] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) (owner: 10Bking)
[13:42:38] <wikibugs>	 (03CR) 10Bking: [C: 03+2] flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) (owner: 10Bking)
[13:42:48] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) (owner: 10Bking)
[13:42:50] <Lucas_WMDE>	 (config change tested over in #mediawiki)
[13:42:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 brion and lucaswerkmeister-wmde: Continuing with sync
[13:43:34] <wikibugs>	 (03Merged) 10jenkins-bot: flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) (owner: 10Bking)
[13:43:48] <wikibugs>	 (03PS1) 10Jbond: augeas_core: refresh module [puppet] - 10https://gerrit.wikimedia.org/r/963732
[13:44:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] augeas_core: refresh module [puppet] - 10https://gerrit.wikimedia.org/r/963732 (owner: 10Jbond)
[13:44:50] <wikibugs>	 (03PS1) 10Andrew Bogott: Horizon: update eqiad1 horizon version [puppet] - 10https://gerrit.wikimedia.org/r/963733 (https://phabricator.wikimedia.org/T341509)
[13:44:55] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:45:09] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:45:11] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: reorder post-up commands and other fixes [puppet] - 10https://gerrit.wikimedia.org/r/963734 (https://phabricator.wikimedia.org/T347469)
[13:45:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update eqiad1 horizon version [puppet] - 10https://gerrit.wikimedia.org/r/963733 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott)
[13:46:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:46:11] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:47:01] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] install_server: replace ntp.$site with anycasted ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/961812 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[13:47:58] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[13:48:47] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.400 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:48:52] <wikibugs>	 10SRE, 10SRE-Access-Requests: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (10Bawolff) I left WMF about 3 years ago.
[13:48:57] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:49:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:961864|Drop old VP8 video transcodes, enable HLS on testwiki (T312152 T309823)]] (duration: 12m 07s)
[13:49:06] <stashbot>	 T309823: Disable old WebM VP8 transcodes except for 360p - https://phabricator.wikimedia.org/T309823
[13:49:07] <stashbot>	 T312152: Clean up video transcode config for speed/bitrate balance - https://phabricator.wikimedia.org/T312152
[13:49:21] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:49:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST ipreservations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:49:48] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: reorder post-up commands and other fixes [puppet] - 10https://gerrit.wikimedia.org/r/963734 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:50:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[13:51:02] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host sretest1001.eqiad.wmnet with OS b...
[13:51:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] scap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff)
[13:51:18] <Lucas_WMDE>	 hehe, I ran enough shell.php that “Writing to directory /var/www/.config/psysh is not allowed.” shows up in logspam-watch now :D
[13:51:39] <Lucas_WMDE>	 (T228041, known issue, no big deal)
[13:51:39] <stashbot>	 T228041: Using shell.php in production fails to load personal configuration and sends warnings to Logstash - https://phabricator.wikimedia.org/T228041
[13:51:50] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't declare the vrf-interface [puppet] - 10https://gerrit.wikimedia.org/r/963736 (https://phabricator.wikimedia.org/T347469)
[13:53:38] <Lucas_WMDE>	 I’ll just chuck in the revert for the train blocker too
[13:53:40] <Lucas_WMDE>	 jouncebot: next
[13:53:40] <jouncebot>	 In 2 hour(s) and 6 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1600)
[13:53:43] <Lucas_WMDE>	 yeah should be enough time
[13:53:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseCirrusSearch] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963354 (https://phabricator.wikimedia.org/T348181) (owner: 10Umherirrender)
[13:53:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[13:54:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (14) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:54:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: don't declare the vrf-interface [puppet] - 10https://gerrit.wikimedia.org/r/963736 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez)
[13:55:21] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack::clientpackages::antelope::buster: typo correction [puppet] - 10https://gerrit.wikimedia.org/r/963737
[13:56:44] <wikibugs>	 (03PS22) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[13:57:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) >>! In T348041#9228328, @ayounsi wrote: > That's only for equal prefix length. For example a stati...
[13:58:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack::clientpackages::antelope::buster: typo correction [puppet] - 10https://gerrit.wikimedia.org/r/963737 (owner: 10Andrew Bogott)
[14:00:17] <wikibugs>	 (03PS1) 10Jelto: gitlab/failover: remove deploy-page at the end of cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/963739 (https://phabricator.wikimedia.org/T345531)
[14:01:35] <Lucas_WMDE>	 jouncebot: now
[14:01:35] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 58 minute(s)
[14:01:39] * Lucas_WMDE overrunning the window
[14:04:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye
[14:04:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye
[14:04:11] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye
[14:04:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro...
[14:05:01] <wikibugs>	 (03PS1) 10Jelto: admin: change email of bawolff [puppet] - 10https://gerrit.wikimedia.org/r/963741 (https://phabricator.wikimedia.org/T348216)
[14:05:34] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (10ssingh) >>! In T348174#9227047, @ayounsi wrote: > Thanks, I think the scope should be larger than just those two variables if we want to remove...
[14:05:42] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: [DON'T MERGE UNLESS IN EMERGENCY] cloudgw: revert recent changes [puppet] - 10https://gerrit.wikimedia.org/r/963742
[14:05:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[14:07:48] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Use HookHandlers for core hooks" [extensions/WikibaseCirrusSearch] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963354 (https://phabricator.wikimedia.org/T348181) (owner: 10Umherirrender)
[14:08:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963354|Revert "Use HookHandlers for core hooks" (T348181)]]
[14:08:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [DON'T MERGE UNLESS IN EMERGENCY] cloudgw: revert recent changes [puppet] - 10https://gerrit.wikimedia.org/r/963742 (owner: 10Arturo Borrero Gonzalez)
[14:08:23] <stashbot>	 T348181: TypeError: Argument 1 passed to Wikibase\Search\Elastic\CirrusShowSearchHitHandler::newFromGlobalState() must implement interface IContextSource, instance of MediaWiki\Config\GlobalVarConfig given, called in /srv/mediawiki/php- - https://phabricator.wikimedia.org/T348181
[14:08:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye
[14:08:36] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (Jelto) a:Bawolff→Jelto >>! In T348216#9228274, @Bawolff wrote: > I left WMF about 3 years ago. >  > My current email is bawolff@gmail.com . I hav...
[14:08:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye
[14:09:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] "don't merge unless in emergency" [puppet] - https://gerrit.wikimedia.org/r/963742 (owner: Arturo Borrero Gonzalez)
[14:09:08] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[14:09:24] <wikibugs>	 (PS1) Andrew Bogott: Revert "Horizon: update horizon version" [puppet] - https://gerrit.wikimedia.org/r/963743
[14:09:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 umherirrender and lucaswerkmeister-wmde: Backport for [[gerrit:963354|Revert "Use HookHandlers for core hooks" (T348181)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:09:44] <Lucas_WMDE>	 testing…
[14:09:58] <Lucas_WMDE>	 seems to fix search on testwikidata
[14:09:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 umherirrender and lucaswerkmeister-wmde: Continuing with sync
[14:10:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[14:10:44] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "Horizon: update horizon version" [puppet] - https://gerrit.wikimedia.org/r/963743 (owner: Andrew Bogott)
[14:11:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[14:11:11] <wikibugs>	 (CR) EoghanGaffney: [C: +1] gitlab/failover: remove deploy-page at the end of cookbook [cookbooks] - https://gerrit.wikimedia.org/r/963739 (https://phabricator.wikimedia.org/T345531) (owner: Jelto)
[14:13:19] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] team-ml: add alerts for ml services [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[14:13:44] <wikibugs>	 (CR) Ilias Sarantopoulos: team-ml: add alerts for ml services [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[14:14:44] <wikibugs>	 (CR) Ssingh: bird: rename ACAST_PS_ADVERTISE to BIRD_IP{4,6}_ADVERTISE (2 comments) [puppet] - https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174) (owner: Ssingh)
[14:14:58] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/963741 (https://phabricator.wikimedia.org/T348216) (owner: Jelto)
[14:15:51] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:16:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PUT endpointslices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:17:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[14:17:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963354|Revert "Use HookHandlers for core hooks" (T348181)]] (duration: 08m 50s)
[14:17:17] <stashbot>	 T348181: TypeError: Argument 1 passed to Wikibase\Search\Elastic\CirrusShowSearchHitHandler::newFromGlobalState() must implement interface IContextSource, instance of MediaWiki\Config\GlobalVarConfig given, called in /srv/mediawiki/php- - https://phabricator.wikimedia.org/T348181
[14:17:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[14:17:43] <wikibugs>	 (CR) Ssingh: bird: rename ACAST_PS_ADVERTISE to BIRD_IP{4,6}_ADVERTISE (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174) (owner: Ssingh)
[14:18:33] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:18:35] * Lucas_WMDE done
[14:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:51] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:20:53] <wikibugs>	 (PS1) Ssingh: wikimedia.org: update CNAMEs for ntp.$site [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054)
[14:21:24] <wikibugs>	 (PS1) Muehlenhoff: bacula::director: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/963745
[14:21:35] <jinxer-wm>	 (KubernetesAPILatency) firing: (18) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:22:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[14:22:34] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test2004']
[14:22:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti-test2004']
[14:24:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2004.codfw.wmnet with OS bullseye
[14:25:31] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[14:25:41] <wikibugs>	 SRE, Traffic, Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host sretest1001.eqiad.wmnet with OS bulls...
[14:25:59] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:26:21] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (18) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:29:49] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:29:52] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:29:54] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[14:30:12] <wikibugs>	 (PS1) Cwhite: icinga: round elasticsearch shard size check to 2 decimal places [puppet] - https://gerrit.wikimedia.org/r/962243 (https://phabricator.wikimedia.org/T327218)
[14:30:59] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:32:20] <wikibugs>	 (CR) Ayounsi: "Overall lgtm." [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[14:33:11] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:33:51] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:33:53] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (cmooney) @papaul I've moved the google meet for this to the week after - Oct 17th.  There are few other moving parts in...
[14:38:19] <claime>	 !log Bumping kubectd100[4-6].eqiad.wmnet vcpu to 2 - T348228
[14:38:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:23] <stashbot>	 T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228
[14:38:34] <claime>	 !log Bumping kubetcd100[4-6].eqiad.wmnet vcpu to 2 - T348228
[14:38:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:48] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] icinga: round elasticsearch shard size check to 2 decimal places [puppet] - https://gerrit.wikimedia.org/r/962243 (https://phabricator.wikimedia.org/T327218) (owner: Cwhite)
[14:41:00] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd1006.eqiad.wmnet with reason: Pick up vcpu change
[14:41:15] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd1006.eqiad.wmnet with reason: Pick up vcpu change
[14:41:35] <claime>	 !log rebooting kubetcd1006.eqiad.wmnet - T348228
[14:41:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:15] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963745 (owner: Muehlenhoff)
[14:43:05] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:16] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd1006.eqiad.wmnet
[14:44:17] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd1006.eqiad.wmnet
[14:44:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd1005.eqiad.wmnet with reason: Pick up vcpu change
[14:44:56] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd1005.eqiad.wmnet with reason: Pick up vcpu change
[14:45:51] <wikibugs>	 (PS15) Ilias Sarantopoulos: team-ml: add alerts for Kafka consumer lag for ores extension [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[14:46:16] <wikibugs>	 (PS16) Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151)
[14:46:23] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd1005.eqiad.wmnet
[14:46:24] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd1005.eqiad.wmnet
[14:46:39] <claime>	 !log rebooted kubetcd1005.eqiad.wmnet - T348228
[14:46:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:44] <stashbot>	 T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228
[14:47:04] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd1004.eqiad.wmnet with reason: Pick up vcpu change
[14:47:18] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd1004.eqiad.wmnet with reason: Pick up vcpu change
[14:47:24] <claime>	 !log rebooting kubetcd1004.eqiad.wmnet - T348228
[14:47:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:11] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:50:16] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd1004.eqiad.wmnet
[14:50:16] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd1004.eqiad.wmnet
[14:52:45] <claime>	 !log Bumping kubemaster100[1-2].eqiad.wmnet vcpu to 2, ram to 4G - T348228
[14:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:48] <stashbot>	 T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228
[14:52:55] <wikibugs>	 (PS2) Muehlenhoff: bacula::director: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/963745
[14:53:01] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubemaster1002.eqiad.wmnet with reason: Pick up vcpu change
[14:53:15] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubemaster1002.eqiad.wmnet with reason: Pick up vcpu change
[14:53:27] <claime>	 !log rebooting kubemaster1002.eqiad.wmnet - T348228
[14:53:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[14:56:51] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 67897 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[14:57:46] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubemaster1002.eqiad.wmnet
[14:57:47] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubemaster1002.eqiad.wmnet
[14:58:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:59:28] <wikibugs>	 (PS3) Majavah: wikimediacloud: Add a dedicated CNAME for object storage [dns] - https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380)
[14:59:39] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubemaster1001.eqiad.wmnet with reason: Pick up vcpu change
[14:59:53] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubemaster1001.eqiad.wmnet with reason: Pick up vcpu change
[15:00:00] <claime>	 !log rebooting kubemaster1001.eqiad.wmnet - T348228
[15:00:36] <claime>	 stashbot: you ok?
[15:02:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
[15:02:38] <wikibugs>	 (CR) Majavah: [V: +2 C: +2] wikimediacloud: Add a dedicated CNAME for object storage (1 comment) [dns] - https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) (owner: Majavah)
[15:03:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:32] <claime>	 That took a second
[15:03:33] <stashbot>	 T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228
[15:03:33] <stashbot>	 See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[15:03:57] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubemaster1001.eqiad.wmnet
[15:03:57] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubemaster1001.eqiad.wmnet
[15:06:38] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro)
[15:06:48] <claime>	 !log Bumping kubetcd200[4-6].eqiad.wmnet vcpu to 2 - T348228
[15:06:48] <wikibugs>	 (PS1) Majavah: hieradata: acme_chief: update openstack cert config [puppet] - https://gerrit.wikimedia.org/r/963752
[15:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:08] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd2006.codfw.wmnet with reason: Pick up vcpu change
[15:07:33] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd2006.codfw.wmnet with reason: Pick up vcpu change
[15:07:39] <claime>	 !log rebooting kubetcd2006.codfw.wmnet - T348228
[15:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:56] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd2006.codfw.wmnet
[15:08:56] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd2006.codfw.wmnet
[15:09:38] <claime>	 !log rebooting kubetcd2005.codfw.wmnet - T348228
[15:09:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:41] <stashbot>	 T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228
[15:09:46] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd2005.codfw.wmnet with reason: Pick up vcpu change
[15:10:00] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd2005.codfw.wmnet with reason: Pick up vcpu change
[15:10:58] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd2005.codfw.wmnet
[15:10:59] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd2005.codfw.wmnet
[15:12:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:12:37] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd2004.codfw.wmnet with reason: Pick up vcpu change
[15:12:42] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2023.codfw.wmnet
[15:13:02] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd2004.codfw.wmnet with reason: Pick up vcpu change
[15:13:26] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963745 (owner: Muehlenhoff)
[15:13:35] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti2023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[15:13:35] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti2023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[15:13:35] <icinga-wm>	 PROBLEM - ganeti-mond running on ganeti2023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti
[15:13:54] <claime>	 !log rebooting kubetcd2004.codfw.wmnet - T348228
[15:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:50] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T343198)', diff saved to https://phabricator.wikimedia.org/P52832 and previous config saved to /var/cache/conftool/dbconfig/20231005-151450-arnaudb.json
[15:15:01] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[15:16:22] <wikibugs>	 (CR) Ssingh: wikimedia.org: update CNAMEs for ntp.$site (1 comment) [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[15:16:41] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: bump version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) (owner: Hnowlan)
[15:16:42] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd2004.codfw.wmnet
[15:16:42] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd2004.codfw.wmnet
[15:17:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (Bethany) Approved  >>! In T348209#9227084, @Jelto wrote: > Thanks for the access request. >  > I need approval from @Bethany and @thcipriani here to proceed.
[15:17:17] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: OpenSent - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:17:59] <icinga-wm>	 PROBLEM - SSH on kubemaster2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:18:14] <claime>	 I haven't even touched it yet
[15:18:43] <wikibugs>	 (CR) Ssingh: wikimedia.org: update CNAMEs for ntp.$site (1 comment) [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[15:18:43] <claime>	 it's up too
[15:19:17] <icinga-wm>	 RECOVERY - SSH on kubemaster2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:19:47] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubemaster2002.codfw.wmnet with reason: Pick up vcpu change
[15:19:55] <wikibugs>	 (CR) Jforrester: "G2G once wmf.30 is everywhere and won't revert." [mediawiki-config] - https://gerrit.wikimedia.org/r/963662 (https://phabricator.wikimedia.org/T99581) (owner: Krinkle)
[15:20:12] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubemaster2002.codfw.wmnet with reason: Pick up vcpu change
[15:20:19] <godog>	 claime: it is clingy
[15:20:26] <claime>	 !log rebooting kubemaster2002.codfw.wmnet - T348228
[15:20:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:30] <stashbot>	 T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228
[15:23:15] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:24:05] <wikibugs>	 (PS1) Muehlenhoff: puppetdb: Fix duplicated nginx entry [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529)
[15:24:55] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubemaster2002.codfw.wmnet
[15:24:55] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubemaster2002.codfw.wmnet
[15:25:17] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.8% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[15:25:40] <claime>	 Wanna bet I'm gonna have to rebalance
[15:25:50] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubemaster2001.codfw.wmnet with reason: Pick up vcpu change
[15:25:57] <claime>	 !log rebooting kubemaster2001.codfw.wmnet - T348228
[15:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:04] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubemaster2001.codfw.wmnet with reason: Pick up vcpu change
[15:26:05] <stashbot>	 T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228
[15:26:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2023.codfw.wmnet with reason: reimage to bullseye
[15:26:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2023.codfw.wmnet with reason: reimage to bullseye
[15:27:12] <claime>	 Hmm no, ganeti1019 doesn't have any of the instances I touched
[15:27:47] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:28:02] <claime>	 This one's me
[15:28:19] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.8% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[15:28:47] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye
[15:28:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro...
[15:29:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P52834 and previous config saved to /var/cache/conftool/dbconfig/20231005-152956-arnaudb.json
[15:30:25] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubemaster2001.codfw.wmnet
[15:30:25] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubemaster2001.codfw.wmnet
[15:30:34] <wikibugs>	 (CR) Volans: [C: +2] spicerack: improve cookbooks help message [software/spicerack] - https://gerrit.wikimedia.org/r/963669 (owner: Volans)
[15:30:52] <wikibugs>	 (PS1) Volans: dhcp: always rewrite the DHCP snippet [software/spicerack] - https://gerrit.wikimedia.org/r/963756
[15:30:54] <wikibugs>	 (PS1) Volans: dhcp: simplify tests [software/spicerack] - https://gerrit.wikimedia.org/r/963757
[15:31:08] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[15:31:20] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti-test2004.codfw.wmnet with OS bullseye
[15:31:21] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet
[15:31:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:33:58] <claime>	 Yeah, settle down
[15:34:38] <wikibugs>	 (Merged) jenkins-bot: spicerack: improve cookbooks help message [software/spicerack] - https://gerrit.wikimedia.org/r/963669 (owner: Volans)
[15:36:06] <wikibugs>	 (PS2) Muehlenhoff: puppetdb: Fix duplicated nginx entry [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529)
[15:36:55] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet
[15:37:26] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 68178 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[15:37:50] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[15:37:57] <volans>	 !log installed 7.3.1 on cumin2002
[15:37:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:38:53] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[15:39:02] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:39:41] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppetboard.restart-reboot rolling reboot on A:puppetboard
[15:39:58] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.8% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[15:40:13] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:40:14] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors
[15:40:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:40:17] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. on all recursors
[15:41:18] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:51] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:42:42] <wikibugs_>	 (PS3) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[15:43:15] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[15:43:32] <claime>	 moritzm: Should I rebalance the ganeti group for ganeti1019 or are you doing things on the cluster?
[15:43:56] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppet.renew-cert for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002
[15:44:17] <wikibugs>	 (CR) Bking: wdqs: Set up graph_split hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[15:44:22] <logmsgbot>	 !log jbond@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002
[15:44:26] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[15:44:34] <icinga-wm>	 PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.8% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure
[15:45:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:45:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P52835 and previous config saved to /var/cache/conftool/dbconfig/20231005-154502-arnaudb.json
[15:48:26] <wikibugs>	 (PS1) Jbond: sre.__init__: remove sleep as set_tries works now [cookbooks] - https://gerrit.wikimedia.org/r/963759
[15:48:28] <wikibugs>	 (PS1) Jbond: sre.puppet.renew-cert: drop support for allow_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/963760
[15:48:32] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet
[15:48:59] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/963760 (owner: Jbond)
[15:49:22] <wikibugs>	 (PS2) Jbond: sre.puppet.renew-cert: drop support for allow_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/963760
[15:50:26] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/963759 (owner: Jbond)
[15:50:36] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:52:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:52:53] <wikibugs>	 (PS4) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[15:53:19] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[15:54:06] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet
[15:54:29] <wikibugs>	 (PS5) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[15:54:56] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[15:57:09] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (aborrero) The cloudgw side is now completed. We may want to refresh the neutron side as well:  `lang=shell-session aborrero@cloudcontrol100...
[15:58:50] <wikibugs>	 (PS6) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[15:59:17] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:00:06] <jouncebot>	 jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1600).
[16:00:07] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T343198)', diff saved to https://phabricator.wikimedia.org/P52836 and previous config saved to /var/cache/conftool/dbconfig/20231005-160009-arnaudb.json
[16:00:11] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[16:00:24] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[16:00:27] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[16:00:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52837 and previous config saved to /var/cache/conftool/dbconfig/20231005-160030-arnaudb.json
[16:00:50] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors
[16:00:54] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. on all recursors
[16:01:12] <wikibugs>	 (PS7) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[16:01:39] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:01:41] <wikibugs>	 (CR) Jbond: [C: +2] sre.puppet.renew-cert: drop support for allow_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/963760 (owner: Jbond)
[16:02:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:02:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 1TiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[16:04:26] <wikibugs>	 (PS8) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[16:05:18] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppet.renew-cert for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002
[16:05:21] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.puppetboard.restart-reboot (exit_code=0) rolling reboot on A:puppetboard
[16:05:33] <claime>	 jouncebot: nowandnext
[16:05:33] <jouncebot>	 For the next 0 hour(s) and 54 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1600)
[16:05:33] <jouncebot>	 In 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1700)
[16:05:39] <logmsgbot>	 !log cgoubert@deploy2002 Started scap: Testing mw-on-k8s deployment for T348228
[16:05:47] <rzl>	 claime: puppet window is empty, help yourse-- oh good :)
[16:05:57] <claime>	 rzl: x)
[16:06:03] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (Papaul) @cmooney no problem
[16:06:04] <stashbot>	 T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228
[16:06:36] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002
[16:07:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (bking) a:VRiley-WMF
[16:07:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:07:05] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:07:55] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap: Testing mw-on-k8s deployment for T348228 (duration: 02m 15s)
[16:07:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 1TiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[16:09:56] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppet.renew-cert for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002
[16:09:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 1TiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[16:09:59] <logmsgbot>	 !log jbond@cumin2002 END (ERROR) - Cookbook sre.puppet.renew-cert (exit_code=97) for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002
[16:10:35] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.puppetboard.restart-reboot rolling reboot on A:puppetboard
[16:10:55] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors
[16:10:58] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. on all recursors
[16:11:04] <wikibugs>	 (CR) Jbond: [C: +2] sre.__init__: remove sleep as set_tries works now [cookbooks] - https://gerrit.wikimedia.org/r/963759 (owner: Jbond)
[16:12:26] <wikibugs>	 (PS9) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[16:12:34] <dcausse>	 !log cleaning up rdf-streaming-updater-staging swift bucket
[16:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:30] <brennen>	 jouncebot nowandnext
[16:13:30] <jouncebot>	 For the next 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1600)
[16:13:30] <jouncebot>	 In 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1700)
[16:14:04] <wikibugs>	 SRE, Observability-Metrics, SRE Observability (FY2023/2024-Q2): Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (lmata)
[16:14:58] <jinxer-wm>	 (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 1TiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[16:15:15] <logmsgbot>	 !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors
[16:15:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:15:19] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. on all recursors
[16:15:43] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:19:46] <logmsgbot>	 !log jbond@cumin2002 END (PASS) - Cookbook sre.puppetboard.restart-reboot (exit_code=0) rolling reboot on A:puppetboard
[16:22:09] <volans>	 !log installed 7.3.1 on cumin1001
[16:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:58] <wikibugs>	 (CR) Bking: [C: -1] "We need to wait on T342538/hosts to be moved to production role before we merge this." [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:24:56] <wikibugs>	 (CR) Bking: [C: -1] wdqs: Set up graph_split hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:25:26] <wikibugs>	 (CR) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:27:22] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM however i think do we not also need to update the reimage cookbook to add a lock with concurrency 1? (still happy for this be merged " [software/spicerack] - https://gerrit.wikimedia.org/r/963756 (owner: Volans)
[16:29:17] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/963757 (owner: Volans)
[16:30:13] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff)
[16:34:43] <wikibugs>	 (CR) Volans: [C: -1] "Pending the release of locking on spicerack" [software/spicerack] - https://gerrit.wikimedia.org/r/963756 (owner: Volans)
[16:40:42] <wikibugs>	 (CR) Jbond: [C: +1] dhcp: always rewrite the DHCP snippet (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/963756 (owner: Volans)
[16:41:26] <wikibugs>	 (PS1) Bking: wdqs: bring graph split hosts into service [puppet] - https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505)
[16:43:08] <wikibugs>	 (PS10) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[16:43:35] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:45:39] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[16:47:25] <bvibber>	 !log brion running requeueTranscodes.php on mwmaint2002 for VP9 transcode cleanup for T312153
[16:47:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:35] <stashbot>	 T312153: Batch run of TMH requeueTranscodes to remove now-unused 120p and 180p low-res files - https://phabricator.wikimedia.org/T312153
[16:49:41] <wikibugs>	 (CR) Andrea Denisse: [C: +2] alertmanager: Add the "Auto-Submitted: auto-generated" header to AM emails [puppet] - https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) (owner: Andrea Denisse)
[16:50:48] <wikibugs>	 (PS4) Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739)
[16:53:49] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: Jbond)
[16:54:17] <wikibugs>	 (CR) JHathaway: [C: +2] puppetdb: add ability to configure db_ro_host [puppet] - https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[16:55:01] <wikibugs>	 (CR) JHathaway: [C: +2] postgresql: fix ordering on a new install [puppet] - https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[16:57:18] <bvibber>	 !log scaling back batch jobs for T312153 and T312152, will run these in further chunks as the new config rolls out
[16:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:24] <stashbot>	 T312153: Batch run of TMH requeueTranscodes to remove now-unused 120p and 180p low-res files - https://phabricator.wikimedia.org/T312153
[16:57:24] <stashbot>	 T312152: Clean up video transcode config for speed/bitrate balance - https://phabricator.wikimedia.org/T312152
[16:57:43] <wikibugs>	 (PS5) Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739)
[16:59:37] <wikibugs>	 (PS1) Slyngshede: Style ssh key management using Codex. [software/bitu] - https://gerrit.wikimedia.org/r/963779
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1700)
[17:00:08] <wikibugs>	 (PS1) Jbond: sretest1003: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/963780
[17:01:13] <wikibugs>	 (PS6) Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739)
[17:01:23] <wikibugs>	 (CR) Jbond: [C: +2] sretest1003: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/963780 (owner: Jbond)
[17:03:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[17:16:45] <wikibugs>	 (PS11) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[17:17:02] <wikibugs>	 (PS24) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[17:17:34] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: Btullis)
[17:19:18] <moritzm>	 claime: I can take care of it tomorrow
[17:19:43] <wikibugs>	 (PS1) Jbond: os_updates: make class ensurable [puppet] - https://gerrit.wikimedia.org/r/963781
[17:19:46] <icinga-wm>	 PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 67476 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[17:20:09] <wikibugs>	 (CR) CI reject: [V: -1] os_updates: make class ensurable [puppet] - https://gerrit.wikimedia.org/r/963781 (owner: Jbond)
[17:20:26] <wikibugs>	 (PS25) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[17:21:40] <wikibugs>	 (PS1) Majavah: hieradata: bump striker docker image [puppet] - https://gerrit.wikimedia.org/r/963782 (https://phabricator.wikimedia.org/T347631)
[17:22:34] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: bump striker docker image [puppet] - https://gerrit.wikimedia.org/r/963782 (https://phabricator.wikimedia.org/T347631) (owner: Majavah)
[17:23:33] <wikibugs>	 (PS26) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[17:25:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[17:26:44] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:27:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[17:28:59] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking)
[17:29:57] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[17:30:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[17:31:05] <wikibugs>	 (PS2) Jbond: os_updates: make class ensurable [puppet] - https://gerrit.wikimedia.org/r/963781
[17:31:31] <wikibugs>	 (CR) CI reject: [V: -1] os_updates: make class ensurable [puppet] - https://gerrit.wikimedia.org/r/963781 (owner: Jbond)
[17:32:45] <jinxer-wm>	 (Traffic bill over quota) firing: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:32:58] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43921/console" [puppet] - https://gerrit.wikimedia.org/r/963781 (owner: Jbond)
[17:33:08] <wikibugs>	 (PS27) Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910)
[17:34:00] <wikibugs>	 (PS3) Jbond: os_updates: make class ensurable [puppet] - https://gerrit.wikimedia.org/r/963781
[17:35:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED
[17:35:43] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43923/console" [puppet] - https://gerrit.wikimedia.org/r/963781 (owner: Jbond)
[17:36:35] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] os_updates: make class ensurable [puppet] - https://gerrit.wikimedia.org/r/963781 (owner: Jbond)
[17:38:45] <wikibugs>	 (PS4) JHathaway: puppetdb: avoid creating database users via dbconfig [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842)
[17:38:47] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43924/console" [puppet] - https://gerrit.wikimedia.org/r/963781 (owner: Jbond)
[17:39:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test2004']
[17:39:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti-test2004']
[17:40:04] <wikibugs>	 (CR) JHathaway: puppetdb: avoid creating database users via dbconfig (1 comment) [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[17:40:10] <wikibugs>	 (CR) JHathaway: [C: +2] puppetdb: avoid creating database users via dbconfig [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[17:52:45] <jinxer-wm>	 (Traffic bill over quota) resolved: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:53:26] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Fix outstanding puppet 7 issue - https://phabricator.wikimedia.org/T348272 (jbond)
[17:59:52] <wikibugs>	 (PS1) Jbond: htpasswd: update to make a bit closer to standard puppet style [puppet] - https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272)
[18:00:05] <jouncebot>	 jeena and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1800). nyaa~
[18:00:54] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/963789 (https://phabricator.wikimedia.org/T347080)
[18:00:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/963789 (https://phabricator.wikimedia.org/T347080) (owner: TrainBranchBot)
[18:01:11] <wikibugs>	 (CR) CI reject: [V: -1] htpasswd: update to make a bit closer to standard puppet style [puppet] - https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: Jbond)
[18:02:12] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/963789 (https://phabricator.wikimedia.org/T347080) (owner: TrainBranchBot)
[18:03:46] <wikibugs>	 (CR) Ladsgroup: mariadb: update the ssl-ca value used by mariadb (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[18:04:30] <wikibugs>	 (PS2) Jbond: htpasswd: update to make a bit closer to standard puppet style [puppet] - https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272)
[18:06:40] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43926/console" [puppet] - https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: Jbond)
[18:07:25] <wikibugs>	 (CR) CI reject: [V: -1] htpasswd: update to make a bit closer to standard puppet style [puppet] - https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: Jbond)
[18:07:50] <wikibugs>	 (PS12) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505)
[18:08:56] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.29  refs T347080
[18:09:12] <stashbot>	 T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080
[18:10:41] <thcipriani>	 yay, unblocked, thanks jeena :)
[18:10:50] <dancy>	 yay!
[18:11:26] <wikibugs>	 (PS3) Jbond: htpasswd: update to make a bit closer to standard puppet style [puppet] - https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272)
[18:11:53] <jeena>	 thanks to the unblockers :P
[18:12:18] <dancy>	 Unblockers be unblockin'
[18:12:56] <thcipriani>	 ^ typical
[18:12:58] <thcipriani>	 :)
[18:15:03] <wikibugs>	 (CR) Ssingh: [C: +2] wikimedia.org: update CNAMEs for ntp.$site [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[18:15:07] <wikibugs>	 (PS2) Ssingh: wikimedia.org: update CNAMEs for ntp.$site [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054)
[18:15:09] <logmsgbot>	 !log jhuneidi@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.29  refs T347080 (duration: 06m 12s)
[18:15:24] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 17): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43927/console" [puppet] - https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: Jbond)
[18:15:25] <stashbot>	 T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080
[18:15:51] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] htpasswd: update to make a bit closer to standard puppet style [puppet] - https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: Jbond)
[18:17:06] <sukhe>	 !log running authdns-update: T347054
[18:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:10] <stashbot>	 T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054
[18:17:28] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:18:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:24:30] <wikibugs>	 (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: Jbond)
[18:26:19] <wikibugs>	 (PS1) TrainBranchBot: group2 wikis to 1.41.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/963791 (https://phabricator.wikimedia.org/T347080)
[18:26:21] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group2 wikis to 1.41.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/963791 (https://phabricator.wikimedia.org/T347080) (owner: TrainBranchBot)
[18:27:21] <wikibugs>	 (Merged) jenkins-bot: group2 wikis to 1.41.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/963791 (https://phabricator.wikimedia.org/T347080) (owner: TrainBranchBot)
[18:34:01] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.29  refs T347080
[18:34:10] <stashbot>	 T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080
[18:42:40] <wikibugs>	 (03PS1) 10Ammarpad: Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814)
[18:43:52] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[18:43:58] <wikibugs>	 (03PS1) 10Jdrewniak: [Prototype] Add screen resolution to Typography prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963810
[18:44:23] <wikibugs>	 (03PS2) 10Varnent: Add Endowment, Agenda, Committee, and Memory namespaces and include in Visual Editor and search settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762)
[18:45:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add Endowment, Agenda, Committee, and Memory namespaces and include in Visual Editor and search settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent)
[18:45:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:46:00] <wikibugs>	 (03PS1) 10Jdrewniak: [Prototype] Edit project link page on reading prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963811
[18:47:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test2004']
[18:47:38] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti-test2004']
[18:49:21] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:50:03] <wikibugs>	 (03PS3) 10Varnent: Add Endowment, Agenda, Committee, and Memory namespaces and include in Visual Editor and search settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762)
[18:50:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add Endowment, Agenda, Committee, and Memory namespaces and include in Visual Editor and search settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent)
[18:51:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2004.codfw.wmnet with OS bullseye
[19:09:19] <wikibugs>	 (03PS4) 10Jforrester: [foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent)
[19:14:27] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:14:47] <wikibugs>	 (03PS1) 10Varnent: Provide 'translationadmin' group with 'edit-legal' right. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801
[19:15:46] <wikibugs>	 (03PS2) 10Varnent: Provide 'translationadmin' group with 'edit-legal' right. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187)
[19:15:48] <wikibugs>	 (03PS3) 10Jforrester: [foundationwiki] Provide 'translationadmin' group with 'edit-legal' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent)
[19:20:17] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:26:40] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond)
[19:32:51] <icinga-wm>	 PROBLEM - Check systemd state on sretest1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:43:21] <icinga-wm>	 RECOVERY - Check systemd state on sretest1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:00:04] <jouncebot>	 brennen and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T2000).
[20:00:05] <jouncebot>	 Daimona, Ammar, jan_drewniak, and James_F: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:05] <James_F>	 o/
[20:00:13] <Daimona>	 o/
[20:01:04] <jan_drewniak>	 O/
[20:02:46] <thcipriani>	 ooh, right. I suppose I can be your backporter today :)
[20:03:39] <wikibugs>	 (03PS2) 10Thcipriani: beta: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962625 (https://phabricator.wikimedia.org/T339246) (owner: 10Daimona Eaytoy)
[20:04:40] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] beta: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962625 (https://phabricator.wikimedia.org/T339246) (owner: 10Daimona Eaytoy)
[20:05:07] <thcipriani>	 ^ Daimona beta only! Off to a good start. Should be live in the next 10 minutes on beta.
[20:05:18] <Daimona>	 Yay, thank you :)
[20:05:27] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962625 (https://phabricator.wikimedia.org/T339246) (owner: 10Daimona Eaytoy)
[20:06:20] <thcipriani>	 Ammar: around for your change?
[20:06:24] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10Jclark-ctr)
[20:06:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr)
[20:06:41] <Ammar>	 Yes
[20:07:09] <thcipriani>	 ok, you'll be up next
[20:07:31] <wikibugs>	 (03PS2) 10Thcipriani: Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad)
[20:08:00] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] [Prototype] Add screen resolution to Typography prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963810 (owner: 10Jdrewniak)
[20:09:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad)
[20:09:46] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] [Prototype] Edit project link page on reading prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963811 (owner: 10Jdrewniak)
[20:09:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad)
[20:10:14] <brennen>	 (getting the vector ones going)
[20:11:06] <thcipriani>	 eeeh...that looked like a weird CI fail. Let's retry that one.
[20:11:43] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad)
[20:12:38] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad)
[20:12:43] <thcipriani>	 there we go
[20:13:16] <logmsgbot>	 !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:963799|Enable Minerva site notice for Nepali Wikipedia (newiki) (T347814)]]
[20:13:23] <stashbot>	 T347814: Enable wgMinervaEnableSiteNotice for newiki - https://phabricator.wikimedia.org/T347814
[20:14:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[20:14:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[20:14:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[20:14:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[20:14:31] <logmsgbot>	 !log thcipriani@deploy2002 ammarpad and thcipriani: Backport for [[gerrit:963799|Enable Minerva site notice for Nepali Wikipedia (newiki) (T347814)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:14:43] <thcipriani>	 ^ Ammar live on mwdebug if you can test
[20:16:09] <Ammar>	 Thanks. It looks OK to me. There's a site notice and I can see it on the mobile domain now
[20:16:24] <thcipriani>	 thanks Ammar continuing with sync
[20:16:41] <logmsgbot>	 !log thcipriani@deploy2002 ammarpad and thcipriani: Continuing with sync
[20:22:14] <logmsgbot>	 !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:963799|Enable Minerva site notice for Nepali Wikipedia (newiki) (T347814)]] (duration: 08m 57s)
[20:22:18] <stashbot>	 T347814: Enable wgMinervaEnableSiteNotice for newiki - https://phabricator.wikimedia.org/T347814
[20:22:25] <thcipriani>	 ^ Ammar all done!
[20:22:46] <wikibugs>	 (03Merged) 10jenkins-bot: [Prototype] Add screen resolution to Typography prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963810 (owner: 10Jdrewniak)
[20:22:58] <thcipriani>	 ^ one down, one to go...
[20:23:43] <wikibugs>	 (03Merged) 10jenkins-bot: [Prototype] Edit project link page on reading prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963811 (owner: 10Jdrewniak)
[20:24:07] <brennen>	 jan_drewniak: you're up
[20:24:19] <thcipriani>	 jan_drewniak: and I'm going to go both at once
[20:24:34] <jan_drewniak>	 Sounds good
[20:24:57] <Ammar>	 thcipriani, thank you.
[20:25:10] <logmsgbot>	 !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:963810|[Prototype] Add screen resolution to Typography prototype]], [[gerrit:963811|[Prototype] Edit project link page on reading prototype]]
[20:25:36] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway)
[20:27:56] <thcipriani>	 jan_drewniak: l10n change means this might take a minute
[20:28:19] <jan_drewniak>	 Gotcha
[20:28:23] <jan_drewniak>	 I
[20:37:14] <logmsgbot>	 !log thcipriani@deploy2002 jdrewniak and thcipriani: Backport for [[gerrit:963810|[Prototype] Add screen resolution to Typography prototype]], [[gerrit:963811|[Prototype] Edit project link page on reading prototype]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:37:26] <thcipriani>	 jan_drewniak: ^ on test servers, check please
[20:39:36] <jan_drewniak>	 thcipriani: good to sync
[20:39:44] <thcipriani>	 cool going live
[20:39:51] <logmsgbot>	 !log thcipriani@deploy2002 jdrewniak and thcipriani: Continuing with sync
[20:49:07] <logmsgbot>	 !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:963810|[Prototype] Add screen resolution to Typography prototype]], [[gerrit:963811|[Prototype] Edit project link page on reading prototype]] (duration: 23m 57s)
[20:49:25] <thcipriani>	 jan_drewniak: ^ should be live
[20:49:42] <thcipriani>	 James_F: still here? :)
[20:49:51] * jan_drewniak thcipriani: awesome, thanks!
[20:50:11] <James_F>	 thcipriani: Yes. 
[20:50:30] <James_F>	 thcipriani: You can sling my two out together. Just boring standard config changes
[20:50:39] <wikibugs>	 (03PS5) 10Thcipriani: [foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent)
[20:50:49] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] [foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent)
[20:51:32] <wikibugs>	 (03Merged) 10jenkins-bot: [foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent)
[20:51:33] <thcipriani>	 k, going to merge these the old fashioned way. my mental model of "rebase if necessary" is...wrong somehow
[20:51:47] <wikibugs>	 (03PS4) 10Thcipriani: [foundationwiki] Provide 'translationadmin' group with 'edit-legal' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent)
[20:51:59] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] [foundationwiki] Provide 'translationadmin' group with 'edit-legal' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent)
[20:52:02] <James_F>	 thcipriani: Gerrit is dark and full of shadows.
[20:52:19] <brennen>	 small creatures skittering in the corners
[20:52:36] <thcipriani>	 just git until it's not
[20:53:08] <wikibugs>	 (03Merged) 10jenkins-bot: [foundationwiki] Provide 'translationadmin' group with 'edit-legal' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent)
[20:53:56] <logmsgbot>	 !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:962082|[foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces (T347762 T347822 T348268)]], [[gerrit:963801|[foundationwiki] Provide 'translationadmin' group with 'edit-legal' right (T346187)]]
[20:54:04] <stashbot>	 T347762: Add "Endowment" namespace to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T347762
[20:54:05] <stashbot>	 T348268: Add "Memory" namespace to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T348268
[20:54:05] <stashbot>	 T347822: Add "Agenda" and "Committee" namespaces to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T347822
[20:54:05] <stashbot>	 T346187: Give Translate Admin group on edit-legal rights on Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T346187
[20:54:54] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway)
[20:55:13] <logmsgbot>	 !log thcipriani@deploy2002 thcipriani and varnent: Backport for [[gerrit:962082|[foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces (T347762 T347822 T348268)]], [[gerrit:963801|[foundationwiki] Provide 'translationadmin' group with 'edit-legal' right (T346187)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:56:15] <thcipriani>	 ^ James_F up on testwikis, look good?
[20:57:14] <James_F>	 thcipriani: Yup!
[20:58:15] <logmsgbot>	 !log thcipriani@deploy2002 thcipriani and varnent: Continuing with sync
[20:58:32] <thcipriani>	 thanks James_F continuing
[20:58:36] <James_F>	 Thank you.
[20:59:37] <thcipriani>	 need me to run namespace dupes? Or are you already on it?
[21:00:28] <James_F>	 I don't think it'll trigger anything.
[21:00:38] <James_F>	 But if you could run just in case that'd be great.
[21:03:53] <logmsgbot>	 !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:962082|[foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces (T347762 T347822 T348268)]], [[gerrit:963801|[foundationwiki] Provide 'translationadmin' group with 'edit-legal' right (T346187)]] (duration: 09m 56s)
[21:04:12] <stashbot>	 T347762: Add "Endowment" namespace to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T347762
[21:04:13] <stashbot>	 T348268: Add "Memory" namespace to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T348268
[21:04:13] <stashbot>	 T347822: Add "Agenda" and "Committee" namespaces to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T347822
[21:04:13] <stashbot>	 T346187: Give Translate Admin group on edit-legal rights on Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T346187
[21:04:19] <thcipriani>	 James_F: 0 pages to fix, 0 were resolvable. All done!
[21:04:24] <James_F>	 Ta!
[21:07:32] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2020.codfw.wmnet: Maybe cleanup leaked file descriptors(?) - eevans@cumin1001
[21:14:21] <wikibugs>	 (03PS2) 10JHathaway: puppet agent: protect against a missing client bucket path [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970)
[21:15:03] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: 10JHathaway)
[21:17:08] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2020.codfw.wmnet: Maybe cleanup leaked file descriptors(?) - eevans@cumin1001
[21:27:47] <wikibugs>	 (03PS1) 10Ladsgroup: Set virtual domain mapping for url shortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963837 (https://phabricator.wikimedia.org/T330590)
[21:30:23] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Look sane to me, minor things to fix inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond)
[21:32:36] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye
[21:32:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w...
[21:34:50] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye
[21:34:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w...
[21:39:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:44:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 50% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:44:47] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (10ATsay-WMF) 05Resolved→03Open Hello, I'd like to request access to analytics-privatedata-users as well. Thanks!
[21:57:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:02:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[22:37:19] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[22:37:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudelastic1007....
[22:37:53] <wikibugs>	 (03PS1) 10Cwhite: opensearch: disable shard size check on logging opensearch [puppet] - 10https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262)
[22:38:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] opensearch: disable shard size check on logging opensearch [puppet] - 10https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) (owner: 10Cwhite)
[22:40:16] <wikibugs>	 (03PS2) 10Cwhite: opensearch: disable shard size check on logging opensearch [puppet] - 10https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262)
[22:42:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] opensearch: disable shard size check on logging opensearch [puppet] - 10https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) (owner: 10Cwhite)
[22:47:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:48:55] <wikibugs>	 (03PS1) 10Urbanecm: cswiki: Remove engineer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963843 (https://phabricator.wikimedia.org/T348279)
[22:52:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:53:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:58:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED
[22:58:37] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED
[22:59:29] <logmsgbot>	 !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[22:59:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudelastic1007.eqia...
[23:00:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED
[23:00:22] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Papaul) @bking I tried to do the re-images on cloudelastic1007, the re-image finished with the OS install without a...
[23:00:57] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 208 probes of 709 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:02:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED
[23:02:41] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye
[23:02:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye
[23:12:50] <wikibugs>	 (03PS3) 10Cwhite: opensearch: disable shard size check on logging opensearch [puppet] - 10https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262)
[23:17:29] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 89 probes of 709 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:19:26] <wikibugs>	 (03CR) 10Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/962244/43928/" [puppet] - 10https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) (owner: 10Cwhite)
[23:19:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1003.mgmt.eqiad.wmnet with reboot policy FORCED
[23:22:14] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-master1003.mgmt.eqiad.wmnet with reboot policy FORCED
[23:22:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye
[23:22:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye
[23:29:49] <wikibugs>	 (03CR) 10Tim Starling: [C: 04-1] thumbor: add imagemagick policy file (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan)
[23:33:05] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:33:39] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:34:29] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:35:01] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:36:43] <wikibugs>	 (03CR) 10Tim Starling: [C: 04-1] thumbor: add imagemagick policy file (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan)
[23:39:51] <wikibugs>	 (03CR) 10Tim Starling: [C: 04-1] thumbor: add imagemagick policy file (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan)
[23:43:22] <wikibugs>	 (03PS1) 10Jclark-ctr: correct roles for an-master1003,4 [puppet] - 10https://gerrit.wikimedia.org/r/963845 (https://phabricator.wikimedia.org/T342291)
[23:43:52] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] correct roles for an-master1003,4 [puppet] - 10https://gerrit.wikimedia.org/r/963845 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr)
[23:48:25] <icinga-wm>	 RECOVERY - Disk space on restbase2020 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops