[00:00:33] (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:04:26] PROBLEM - Check systemd state on kubernetes2035 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:48] (DatasourceError) resolved: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:18:30] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 60849 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[00:38:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962240
[00:38:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962240 (owner: TrainBranchBot)
[00:52:39] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962240 (owner: TrainBranchBot)
[01:16:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 131 probes of 705 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:19:24] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 211 probes of 712 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:21:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 77 probes of 705 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:25:48] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:22] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 85 probes of 712 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:38:48] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:48] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:22:52] PROBLEM - Disk space on restbase2020 is
CRITICAL: DISK CRITICAL - free space: /srv/sdb4 68973 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[03:34:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:39:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:03:48] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 57509 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[04:24:33] (PS1) Andrew Bogott: Neutron: update init script for api service in Antelope [puppet] - https://gerrit.wikimedia.org/r/963429
[04:27:41] (PS2) Andrew Bogott: Neutron: update init script for api service in Antelope [puppet] - https://gerrit.wikimedia.org/r/963429
[04:29:36] (CR) Andrew Bogott: [C: +2] Neutron: update init script for api service in Antelope [puppet] - https://gerrit.wikimedia.org/r/963429 (owner: Andrew Bogott)
[05:10:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T343198)', diff saved to https://phabricator.wikimedia.org/P52827 and previous config saved to /var/cache/conftool/dbconfig/20231005-051056-arnaudb.json
[05:11:01] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:17:03] SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (DDeSouza)
[05:21:23] SRE, All-and-every-Wiktionary, Product-Analytics, SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (Nikerabbit) 1) I can't reproduce {F37983700} 2) Not in scope for the Language team either, maybe SRE?
[05:24:07] (PS1) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[05:26:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P52828 and previous config saved to /var/cache/conftool/dbconfig/20231005-052602-arnaudb.json
[05:28:00] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 135 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:29:05] !log Deleting old Jenkins builds on pcc-worker1002 to free disk space
[05:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P52829 and previous config saved to /var/cache/conftool/dbconfig/20231005-054109-arnaudb.json
[05:46:12] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 61088 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[05:47:26] (PS2) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432
(https://phabricator.wikimedia.org/T345791)
[05:55:14] (PS3) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[05:56:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T343198)', diff saved to https://phabricator.wikimedia.org/P52830 and previous config saved to /var/cache/conftool/dbconfig/20231005-055615-arnaudb.json
[05:56:18] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[05:56:20] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[05:56:31] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[05:56:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T343198)', diff saved to https://phabricator.wikimedia.org/P52831 and previous config saved to /var/cache/conftool/dbconfig/20231005-055637-arnaudb.json
[05:57:51] (PS4) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[05:58:48] SRE, All-and-every-Wiktionary, Product-Analytics, SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (R4356th) Can repro. {F37984723} {F37984772} And it's affecting more than just English and French entries. {F37984865}
[06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0600)
[06:00:06] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0600).
[06:03:40] (PS5) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[06:05:56] (PS6) Andrea Denisse: webperf: Move navtiming logs to statsd-exporter [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791)
[06:11:47] (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/963432/43891/" [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) (owner: Andrea Denisse)
[06:14:06] (PS1) Umherirrender: Revert "Use HookHandlers for core hooks" [extensions/WikibaseCirrusSearch] (wmf/1.41.0-wmf.29) - https://gerrit.wikimedia.org/r/963354 (https://phabricator.wikimedia.org/T348181)
[06:14:36] (CR) Ayounsi: [C: +1] "two possible improvement, but overall lgtm" [puppet] - https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174) (owner: Ssingh)
[06:26:37] SRE, Traffic, Patch-For-Review: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (ayounsi) Thanks, I think the scope should be larger than just those two variables if we want to remove the term "anycast" as much as possible....
[06:28:26] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:46:07] SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (ayounsi) > Secondary Link Migration Looking at link usage, it's fine to drop the secondary link and keep it at 10G. https://librenm...
[06:47:38] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 63195 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[06:56:41] (CR) Muehlenhoff: [C: +2] "Looks good, merging." [puppet] - https://gerrit.wikimedia.org/r/963413 (https://phabricator.wikimedia.org/T345220) (owner: Subramanya Sastry)
[07:00:05] Amir1, apergos, and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0700).
[07:00:22] Morning! There are no trainees signed up for this morning's awesome opportunity to learn all about deployment. But that's ok, because there are no patches scheduled for deployment either. Have a great day and we'll see you here next time!
[07:03:18] apergos: I was going to ask :-]
[07:03:48] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:04:07] it's wikimedia connect, I assume that may have cut into the appetite for deployment somewhat
[07:09:24] SRE, All-and-every-Wiktionary, Product-Analytics, SEO: Google displays “Wikipedia” as site title for some Wiktionary pages - https://phabricator.wikimedia.org/T348203 (Pamputt) I submitted a feedback to Google 10 days ago but nothing has changed since then.
[07:13:06] SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (Jelto) p:Triage→Medium
[07:16:39] SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (Jelto) Thanks for the access request. I need approval from @Bethany and @thcipriani here to proceed.
[07:18:59] SRE, Infrastructure-Foundations: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff
[07:27:15] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (ayounsi) > I would propose using free port ge-0/0/3 to add the new routed link, and bringing up the BGP peering before we touch the existing link....
[07:27:42] (PS1) Krinkle: logging: Remove redundant setTimezone() call for UTC [mediawiki-config] - https://gerrit.wikimedia.org/r/963662 (https://phabricator.wikimedia.org/T99581)
[07:51:35] !log bounce vopsbot on alert1001
[07:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:34] (CR) Filippo Giunchedi: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) (owner: Andrea Denisse)
[07:55:54] (CR) Hashar: [C: +1] "I got confused in my earlier comment, this set the timestamp of files for the application being build rather than the dependency wheels :)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[07:59:14] !log installing jetty9 security updates
[07:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:38] (CR) Filippo Giunchedi: [C: -1] "See inline" [puppet] - https://gerrit.wikimedia.org/r/963432 (https://phabricator.wikimedia.org/T345791) (owner: Andrea Denisse)
[08:09:38] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 69024 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[08:19:50] (PS2) Cathal Mooney: Add ns0 and ns1 /32 routes to anycast_prefixes list [homer/public] - https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041)
[08:20:48] (CR) Cathal Mooney: Add ns0 and ns1 /32 routes to anycast_prefixes list (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/963375 (https://phabricator.wikimedia.org/T348041) (owner: Cathal Mooney)
[08:26:13] (PS1) Volans: openldap: cross-validate-accounts wording [puppet] - https://gerrit.wikimedia.org/r/963666
[08:29:02] (CR) Muehlenhoff: [C: +1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/963666 (owner: Volans)
[08:29:33] (CR) Volans: [C: +2] openldap: cross-validate-accounts wording [puppet] - https://gerrit.wikimedia.org/r/963666 (owner: Volans)
[08:30:12] moritzm: can I puppet-merge your change too?
parsoid-rt-client: Further reduce worker pool to 16 clients
[08:30:31] SRE, SRE-Access-Requests: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (Jelto)
[08:30:39] SRE, SRE-Access-Requests: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (Jelto) p:Triage→Medium
[08:33:48] mine can be merged anytime
[08:34:40] SRE, SRE-Access-Requests: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (Peachey88) See also {T346921}
[08:36:20] SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney) p:Triage→Medium
[08:37:11] SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney)
[08:37:27] SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney)
[08:37:33] SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney)
[08:40:36] SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (cmooney)
[08:43:22] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[08:47:10] SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney)
[08:47:50] (CR) EoghanGaffney: [C: +1] python-build: provide a python2 Bullseye image [docker-images/production-images] - https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[08:50:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:50:34] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 68273 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops
[08:51:18] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:51:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:52:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:58:47] (PS2) Jelto: [gitlab/switchover] Update DNS for gitlab/gitlab-replica [dns] - https://gerrit.wikimedia.org/r/963161 (https://phabricator.wikimedia.org/T345531) (owner: EoghanGaffney)
[08:59:02] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[09:00:05] jelto and eoghan: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for GitLab DC switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0900).
[09:01:13] !log jelto@cumin1001 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab2002.wikimedia.org
[09:05:02] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[09:05:48] PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[09:06:28] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro)
[09:08:16] ACKNOWLEDGEMENT - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused Jelto switchover to codfw - T345531 - The acknowledgement expires at: 2023-10-06 11:25:35.
https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[09:08:48] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:08:52] moritzm: re: puppet-merge, there is still your patch pending, mine can be merged anytime, so I'll leave it to you to merge when ready yours
[09:18:07] oh yes, sorry. just merged both
[09:19:14] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:19:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:19:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:21:34] (CR) Muehlenhoff: [C: +2] Reinstate absented package list bullseye without ISC libraries [puppet] - https://gerrit.wikimedia.org/r/961983 (owner: Muehlenhoff)
[09:28:04] (PS1) Volans: spicerack: improve cookbooks help message [software/spicerack] - https://gerrit.wikimedia.org/r/963669
[09:29:59] SRE, Infrastructure-Foundations, Traffic, netops: Add new codfw private vlan sub-interfaces to lvs2013 and lvs2014 - https://phabricator.wikimedia.org/T348225 (cmooney) p:Triage→Medium
[09:30:17] SRE, Infrastructure-Foundations, Traffic, netops: Add new codfw private vlan sub-interfaces to lvs2013 and lvs2014 - https://phabricator.wikimedia.org/T348225 (cmooney)
[09:30:23] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch
refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[09:33:37] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[09:34:57] (CR) DCausse: "lgtm!" [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: Ebernhardson)
[09:36:57] (CR) Brouberol: [C: +1] Bump the maximum number of HDFS files before triggering an alert [alerts] - https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) (owner: Btullis)
[09:37:16] (CR) Btullis: [C: +2] Bump the maximum number of HDFS files before triggering an alert [alerts] - https://gerrit.wikimedia.org/r/963327 (https://phabricator.wikimedia.org/T342587) (owner: Btullis)
[09:44:29] (CR) EoghanGaffney: [V: +2 C: +2] python-build: provide a python2 Bullseye image [docker-images/production-images] - https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[09:50:59] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[09:51:27] (CR) Jbond: [C: +2] gerrit: remove duplicate $gerrit_site definition [puppet] - https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: Hashar)
[09:51:40] (CR) Jbond: [C: +2] "merged" [puppet] - https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143) (owner: Hashar)
[09:52:34] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/963258 (owner: Slyngshede)
[09:53:23] hashar: fyi i merged the gerrit cr and ran puppet on gerrit1003
[09:54:23] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[09:55:09] !log klausman@cumin1001
START - Cookbook sre.hosts.decommission for hosts ores1001.eqiad.wmnet
[09:55:21] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/963669 (owner: Volans)
[09:56:45] (CR) Btullis: [C: +1] "OK, so this should add druid1009 and druid1010 to the cluster, correct?" [puppet] - https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[09:57:18] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:59:36] (CR) Jbond: [C: +1] "lgtm, optional nit" [puppet] - https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: JHathaway)
[09:59:49] !log installing python2.7 security updates
[09:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] jelto and eoghan: Your horoscope predicts another unfortunate GitLab DC switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0900).
[10:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1000)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1000)
[10:00:47] !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[10:02:00] (CR) Stevemunene: druid: Bring druid1010.eqiad.wmnet into service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[10:02:35] SRE, Infrastructure-Foundations, Puppet CI, Release-Engineering-Team: PCC failing with "No space left on device" - https://phabricator.wikimedia.org/T348176 (jbond) @hashar could you check if this is on the jenkins side pcc-worker1002 looks healthy to me
[10:03:51] (CR) Stevemunene: [C: +2] druid: Bring druid1010.eqiad.wmnet into service [puppet] - https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[10:08:08] !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[10:09:14] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ores1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[10:09:14] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:09:15] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ores1001.eqiad.wmnet
[10:09:45] !log klausman@cumin1001 START - Cookbook sre.hosts.decommission for hosts orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet
[10:13:35] (CR) Filippo Giunchedi: [C: +1] "Ship it!"
[puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:14:24] (CR) Majavah: [C: -1] site.pp: Move ORES machines to spare::system for decomming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:15:21] (PS1) Slyngshede: Implement Codex design for properties page. [software/bitu] - https://gerrit.wikimedia.org/r/963681
[10:16:18] !log klausman@cumin1001 START - Cookbook sre.dns.netbox
[10:16:36] (CR) Klausman: site.pp: Move ORES machines to spare::system for decomming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:19:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:21:02] !log klausman@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[10:21:52] (PS2) Klausman: site.pp: Move ORES machines to spare::system for decomming [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144)
[10:21:58] (CR) Klausman: site.pp: Move ORES machines to spare::system for decomming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: Klausman)
[10:23:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox:
orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - klausman@cumin1001"
[10:23:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:23:19] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts orespoolcounter[2003-2004].codfw.wmnet,orespoolcounter[1003-1004].eqiad.wmnet
[10:24:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:27:21] (Abandoned) Ladsgroup: mariadb: Add grants for testreduce1002 [puppet] - https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: Ladsgroup)
[10:30:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:35:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:39:51] (CR) Jelto: [C: +2] [gitlab/switchover] Change profile::gitlab::service_name for switchover [puppet] - https://gerrit.wikimedia.org/r/963160 (https://phabricator.wikimedia.org/T345531) (owner: EoghanGaffney)
[10:40:06] (CR) Majavah: site.pp: Move ORES machines to spare::system for decomming (1 comment) [puppet] -
10https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: 10Klausman) [10:40:42] (03PS3) 10Klausman: site.pp: Remove ORES machines (real and VMs) [puppet] - 10https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) [10:40:44] (03PS13) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [10:40:57] (03CR) 10Majavah: [C: 03+1] site.pp: Remove ORES machines (real and VMs) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: 10Klausman) [10:41:23] (03CR) 10Klausman: [C: 03+2] site.pp: Remove ORES machines (real and VMs) [puppet] - 10https://gerrit.wikimedia.org/r/963313 (https://phabricator.wikimedia.org/T348144) (owner: 10Klausman) [10:42:14] (03CR) 10Btullis: airflow-wmde: configure wmde airflow instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [10:43:21] (03CR) 10CI reject: [V: 04-1] [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [10:46:41] (03PS14) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [10:49:26] (03CR) 10Jbond: [C: 03+2] wmflib::get_clusters: create a puppet version of get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/962225 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:49:30] (03CR) 10Jbond: [C: 03+2] P:prometheus::ops: convert to using wmflib::get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/963026 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:49:33] (03CR) 10Jbond: [C: 03+2] prometheus: switch to wmflib::get_clusters [puppet] - 
10https://gerrit.wikimedia.org/r/963027 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:49:36] (03CR) 10Jbond: [C: 03+2] get_clusters: remove legacy functions [puppet] - 10https://gerrit.wikimedia.org/r/963028 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:50:18] (03CR) 10Hnowlan: thumbor: add imagemagick policy file (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [10:50:35] (03CR) 10Hnowlan: [C: 03+2] Don't ignore imagemagick exit status [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/961968 (https://phabricator.wikimedia.org/T344233) (owner: 10Tim Starling) [10:53:48] (JobUnavailable) firing: (4) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:57:37] (03PS16) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [10:58:01] (03PS10) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [10:58:28] (03PS12) 10Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) [10:59:09] (03Merged) 10jenkins-bot: Don't ignore imagemagick exit status [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/961968 (https://phabricator.wikimedia.org/T344233) (owner: 10Tim Starling) [11:01:35] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[11:02:04] (03PS10) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [11:03:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43895/console" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:04:11] (03CR) 10Muehlenhoff: [C: 03+2] mirrors: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/962032 (owner: 10Muehlenhoff) [11:04:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43896/console" [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:05:24] (ProbeDown) firing: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:06:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 7 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43897/console" [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:07:31] (03Abandoned) 10Jbond: firewall: migrate ferm::service to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/919062 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [11:07:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:10:36] (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus::resource_config: update to use 
wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:14:24] (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:14:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43898/console" [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:14:55] (03PS1) 10Muehlenhoff: Remove pentesters group [puppet] - 10https://gerrit.wikimedia.org/r/963688 [11:15:04] (03Abandoned) 10Muehlenhoff: Mark pentesters as deprecated [puppet] - 10https://gerrit.wikimedia.org/r/960547 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [11:17:07] (03PS4) 10Jbond: redis::slave: switch to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) [11:17:11] (03PS6) 10Jbond: prometheus::pdu_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373) [11:17:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:17:57] (03PS15) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [11:18:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43899/console" [puppet] - 10https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:19:11] (JobUnavailable) firing: (4) Reduced 
availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:20:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] redis::slave: switch to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:20:27] (03PS1) 10Hnowlan: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) [11:21:08] (03CR) 10Jelto: [C: 03+2] [gitlab/switchover] Update DNS for gitlab/gitlab-replica [dns] - 10https://gerrit.wikimedia.org/r/963161 (https://phabricator.wikimedia.org/T345531) (owner: 10EoghanGaffney) [11:22:11] (03PS16) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [11:22:29] (03CR) 10Muehlenhoff: [C: 04-1] "We still need to add a constant for NETWORK_INFRA for nftables, -1ing for now" [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff) [11:22:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43900/console" [puppet] - 10https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:23:04] !log jelto@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [11:23:08] !log jelto@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [11:23:34] !log jelto@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [11:23:39] !log jelto@cumin1001 END (PASS) - Cookbook 
sre.dns.wipe-cache (exit_code=0) https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [11:24:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus::pdu_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:24:03] (03PS1) 10Jbond: prometheus::cluster_config: sort targets [puppet] - 10https://gerrit.wikimedia.org/r/963693 [11:24:38] !log jelto@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [11:24:42] !log jelto@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab.wikimedia.org/ https://gitlab-replica.wikimedia.org/ on all recursors [11:25:47] RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:29:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43901/console" [puppet] - 10https://gerrit.wikimedia.org/r/963693 (owner: 10Jbond) [11:29:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] prometheus::cluster_config: sort targets [puppet] - 10https://gerrit.wikimedia.org/r/963693 (owner: 10Jbond) [11:29:55] (03PS2) 10Slyngshede: Implement Codex design for properties page. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/963681 [11:30:53] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:24] (03CR) 10Hnowlan: [C: 03+2] thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) (owner: 10Hnowlan) [11:33:12] (03Merged) 10jenkins-bot: thumbor: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) (owner: 10Hnowlan) [11:34:41] PROBLEM - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 22: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:34:44] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) [11:35:19] (03PS1) 10Jelto: gitlab: make gitlab2002 the active host [puppet] - 10https://gerrit.wikimedia.org/r/963706 (https://phabricator.wikimedia.org/T345531) [11:36:37] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [11:36:37] (03CR) 10LSobanski: [C: 03+1] gitlab: make gitlab2002 the active host [puppet] - 10https://gerrit.wikimedia.org/r/963706 (https://phabricator.wikimedia.org/T345531) (owner: 10Jelto) [11:36:46] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:36:50] (03CR) 10Jelto: [C: 03+2] gitlab: make gitlab2002 the active host [puppet] - 10https://gerrit.wikimedia.org/r/963706 (https://phabricator.wikimedia.org/T345531) (owner: 10Jelto) [11:36:53] RECOVERY - Gitlab SSH healthcheck git daemon on gitlab.wikimedia.org is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:43:00] (03CR) 10WMDE-Fisch: "Exciting and thanks for taking care! 
When will this be live in production?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) (owner: 10Hnowlan) [11:46:26] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab1004.wikimedia.org to gitlab2002.wikimedia.org [11:47:57] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:43] (03PS1) 10Jbond: puppetdbquery: remove all refrences to query_resources [puppet] - 10https://gerrit.wikimedia.org/r/963709 (https://phabricator.wikimedia.org/T341373) [11:48:53] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:50:24] (ProbeDown) resolved: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:23] (03PS1) 10Jbond: puppetdbquery: drop module [puppet] - 10https://gerrit.wikimedia.org/r/963710 (https://phabricator.wikimedia.org/T341373) [11:54:37] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:42] (03CR) 10Jbond: [C: 03+2] puppetdbquery: remove all refrences to query_resources [puppet] - 10https://gerrit.wikimedia.org/r/963709 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:56:01] (03CR) 10Jbond: [C: 03+2] puppetdbquery: drop module [puppet] - 10https://gerrit.wikimedia.org/r/963710 
(https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:57:16] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet2005-dev [11:57:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:57:18] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudnet2005-dev [12:00:05] jelto and eoghan: It is that lovely time of the day again! You are hereby commanded to deploy GitLab DC switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T0900). [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1200) [12:01:41] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [12:01:47] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bullseye [12:01:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [12:01:54] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bullseye [12:01:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [12:02:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [12:03:34] (03PS5) 10Muehlenhoff: scap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945749 [12:06:27] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [12:06:32] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1063'] [12:07:18] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1064'] [12:07:23] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) 05Open→03Resolved a:03jbond puppetdbquery has now been removed [12:07:34] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1064'] [12:07:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1063'] [12:07:47] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1063'] [12:08:19] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [12:08:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1063'] [12:08:32] (03CR) 10Muehlenhoff: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff) [12:09:05] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (10jbond) 05Open→03In progress p:05Triage→03Medium [12:10:34] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetboard2002.codfw.wmnet,puppetboard1002.eqiad.wmnet [12:11:58] 10SRE, 10Cloud-VPS, 10cloud-services-team: cloudlb2001-dev and cloudlb2002-dev connected at different speeds - https://phabricator.wikimedia.org/T348173 (10LSobanski) [12:12:31] (03PS1) 10Jbond: hieradata: remove puppetdb[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/963714 (https://phabricator.wikimedia.org/T347286) [12:13:08] (03CR) 10Jbond: [C: 03+2] hieradata: remove puppetdb[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/963714 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [12:13:37] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet [12:14:29] (03CR) 10DCausse: cirrus streaming updater service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (https://phabricator.wikimedia.org/T347075) (owner: 10Ebernhardson) [12:17:27] (03PS1) 10Jbond: hieradata: rename puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963716 [12:17:39] (03CR) 10Jbond: [C: 03+2] hieradata: rename puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/963716 (owner: 10Jbond) [12:21:56] (03PS17) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [12:22:24] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [12:23:24] (03PS1) 10Jbond: puppetboard: use global aka site specific puppetdb_host [puppet] - 10https://gerrit.wikimedia.org/r/963717 (https://phabricator.wikimedia.org/T347286) [12:24:12] !log 
jbond@cumin1001 START - Cookbook sre.dns.netbox [12:26:29] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [12:26:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43908/console" [puppet] - 10https://gerrit.wikimedia.org/r/963717 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [12:26:49] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetboard: use global aka site specific puppetdb_host [puppet] - 10https://gerrit.wikimedia.org/r/963717 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [12:27:05] !log jbond@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:27:06] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts puppetboard2002.codfw.wmnet,puppetboard1002.eqiad.wmnet [12:27:18] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetboard2002.codfw.wmnet,p... 
[12:27:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [12:27:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:27:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet [12:27:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): decomission puppetdb[12]002 - https://phabricator.wikimedia.org/T347285 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetdb2002.codfw.wmnet,puppetdb1002.eqiad.wmnet` - p... [12:29:07] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (10jbond) > Failed to run the sre.dns.netbox cookbook, run it manually This was completed by a different cookbook run [12:29:29] (03CR) 10Muehlenhoff: [C: 03+2] Blacklist exfat [puppet] - 10https://gerrit.wikimedia.org/r/950145 (owner: 10Muehlenhoff) [12:38:39] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [12:39:04] (03PS18) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [12:40:27] (03PS1) 10Jbond: pupetboard: move to puppetboard role [puppet] - 10https://gerrit.wikimedia.org/r/963718 (https://phabricator.wikimedia.org/T347286) [12:41:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists1001.wikimedia.org [12:41:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43909/console" [puppet] - 10https://gerrit.wikimedia.org/r/963718 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [12:41:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] pupetboard: move to puppetboard role [puppet] - 10https://gerrit.wikimedia.org/r/963718 (https://phabricator.wikimedia.org/T347286) (owner: 10Jbond) [12:42:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [12:45:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists1001.wikimedia.org [12:48:12] (03PS1) 10Jbond: puppetdb: move bookworm db's to puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) [12:49:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43910/console" [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [12:49:58] (03CR) 10Cathal Mooney: [C: 03+1] cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [12:50:51] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:51:31] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:51:39] (03PS1) 10Ladsgroup: Switch ES cluster to cluster28 and cluster29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) [12:51:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet [12:54:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
an-test-presto1001.eqiad.wmnet [12:57:11] (03CR) 10Cathal Mooney: [C: 03+1] cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [12:57:47] (03CR) 10Cathal Mooney: [C: 03+1] cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [12:58:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: refactor interfaces setting to use the base module [puppet] - 10https://gerrit.wikimedia.org/r/922105 (https://phabricator.wikimedia.org/T347687) (owner: 10Arturo Borrero Gonzalez) [12:58:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [12:58:34] (03CR) 10Ladsgroup: [C: 04-2] "Until Jaime is back from ooo so he can switch the backups." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/963720 (https://phabricator.wikimedia.org/T342685) (owner: 10Ladsgroup) [12:58:38] (03PS7) 10Arturo Borrero Gonzalez: cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet [puppet] - 10https://gerrit.wikimedia.org/r/963298 (https://phabricator.wikimedia.org/T347469) [12:58:44] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) [12:59:30] (03PS19) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [12:59:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:18] in a meeting at the moment, but I might deploy a config change later [13:00:22] maybe in 20 minutes or so [13:00:42] Lucas_WMDE: that's perfect, i'll do one now, as i'll have to leave in 30 mins or so :) [13:00:47] ok \o/ [13:02:14] (03PS1) 10Urbanecm: [Growth] enwiki: Enable mentorship for 50% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963722 (https://phabricator.wikimedia.org/T341399) [13:02:22] (03CR) 10Urbanecm: [C: 03+2] [Growth] enwiki: Enable mentorship for 50% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963722 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm) [13:02:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: move routes out of keepalived into interfaces [puppet] - 10https://gerrit.wikimedia.org/r/963311 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:02:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963722 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm) [13:03:04] (03Merged) 10jenkins-bot: [Growth] enwiki: Enable mentorship for 50% of new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963722 (https://phabricator.wikimedia.org/T341399) (owner: 10Urbanecm) [13:04:22] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:963722|[Growth] enwiki: Enable mentorship for 50% of new users (T341399)]] [13:04:29] T341399: Increase percentage of newcomers who receive Growth mentorship at English Wikipedia - https://phabricator.wikimedia.org/T341399 [13:04:44] (03PS20) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [13:04:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host matomo1002.eqiad.wmnet [13:04:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host an-test-druid1001.eqiad.wmnet [13:05:46] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-master1002.eqiad.wmnet [13:05:50] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:963722|[Growth] enwiki: Enable mentorship for 50% of new users (T341399)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:50] (03PS1) 10Arturo Borrero Gonzalez: wmcs: cloudgw: rename resource to avoid clash [puppet] - 10https://gerrit.wikimedia.org/r/963723 (https://phabricator.wikimedia.org/T347469) [13:06:11] (03PS11) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:06:31] (03PS12) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:07:45] (03CR) 10Cathal Mooney: [C: 03+1] wmcs: cloudgw: rename resource to avoid clash [puppet] - 10https://gerrit.wikimedia.org/r/963723 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:08:05] !log urbanecm@deploy2002 urbanecm: Continuing with sync [13:08:27] (03CR) 10CI reject: [V: 04-1] ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:08:47] !log respawning two misbehaving thumbor pods in codfw [13:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host matomo1002.eqiad.wmnet [13:09:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: cloudgw: rename resource to avoid clash [puppet] - 10https://gerrit.wikimedia.org/r/963723 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:11:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for 
draining ganeti node ganeti2023.codfw.wmnet [13:11:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [13:11:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [13:12:50] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-master1002.eqiad.wmnet [13:12:52] (03PS13) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:13:08] (03PS2) 10Jbond: puppetdb: move bookworm db's to puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) [13:13:47] (03PS21) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [13:14:30] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:963722|[Growth] enwiki: Enable mentorship for 50% of new users (T341399)]] (duration: 10m 08s) [13:14:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:14:35] T341399: Increase percentage of newcomers who receive Growth mentorship at English Wikipedia - https://phabricator.wikimedia.org/T341399 [13:14:39] * urbanecm done [13:14:44] Lucas_WMDE: feel free to go ahead once done with your meeting [13:14:50] (03PS3) 10Jbond: puppetdb: move bookworm db's to puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) [13:15:09] (03CR) 10Ilias Sarantopoulos: ml-alerts: add alerts for ml services (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/962056 
(https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:15:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1002.eqiad.wmnet [13:16:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43912/console" [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [13:17:17] urbanecm: ok thanks! [13:18:36] (03PS1) 10Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services [alerts] - 10https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) [13:19:01] (03PS14) 10Ilias Sarantopoulos: team-ml: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [13:19:10] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: move bookworm db's to puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/963719 (https://phabricator.wikimedia.org/T347285) (owner: 10Jbond) [13:19:35] (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye [13:22:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1002.eqiad.wmnet [13:22:03] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [13:22:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bullseye [13:22:06] 10SRE, 10ops-eqiad, 10DC-Ops, 
10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro... [13:22:10] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bullseye [13:22:12] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [13:22:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye executed with erro... [13:22:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye executed with erro... 
[13:22:37] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): decomission puppetdb[12]002 - https://phabricator.wikimedia.org/T347285 (10jbond) 05Open→03Resolved a:03jbond [13:22:49] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): decomission puppetboard[12]002 - https://phabricator.wikimedia.org/T347286 (10jbond) 05In progress→03Resolved a:03jbond [13:24:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/963688 (owner: 10Muehlenhoff) [13:27:22] (03CR) 10Jbond: mariadb: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:27:57] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: create the vrf-cloudgw device via static file [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469) [13:27:59] (03CR) 10Klausman: [C: 03+1] team-ml: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:28:55] RECOVERY - cassandra-c CQL 10.64.48.236:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.236 port 9042 https://phabricator.wikimedia.org/T93886 [13:29:42] (03CR) 10Cathal Mooney: [C: 03+1] "should be ok, the dir is sourced before the rest of them in /etc/network/interfaces" [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:30:27] alright, I can deploy now [13:30:42] (03CR) 10CI reject: [V: 04-1] cloudgw: create the vrf-cloudgw device via static file [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:31:48] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: create the vrf-cloudgw device via static file [puppet] - 
10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469) [13:32:26] !log starting Cassandra rebuild, restbase1030-c — T346803 [13:32:28] (03CR) 10Muehlenhoff: [C: 03+2] Remove pentesters group [puppet] - 10https://gerrit.wikimedia.org/r/963688 (owner: 10Muehlenhoff) [13:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:32] T346803: Unable to bootstrap restbase1030-{a,b,c} - https://phabricator.wikimedia.org/T346803 [13:33:28] (03PS1) 10Jbond: hieradata: drop old hiera files [puppet] - 10https://gerrit.wikimedia.org/r/963729 [13:33:55] (03PS6) 10Lucas Werkmeister (WMDE): Drop old VP8 video transcodes, enable HLS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) (owner: 10Brion VIBBER) [13:34:52] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu] - 10https://gerrit.wikimedia.org/r/963681 (owner: 10Slyngshede) [13:34:58] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:35:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43913/console" [puppet] - 10https://gerrit.wikimedia.org/r/963729 (owner: 10Jbond) [13:35:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:35:14] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:35:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:35:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:35:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) (owner: 10Brion VIBBER) [13:36:21] !log 
bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:36:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:28] (03Merged) 10jenkins-bot: Drop old VP8 video transcodes, enable HLS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) (owner: 10Brion VIBBER) [13:36:29] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:36:48] (03PS1) 10Andrew Bogott: Horizon: update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/963730 (https://phabricator.wikimedia.org/T341509) [13:36:54] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:961864|Drop old VP8 video transcodes, enable HLS on testwiki (T312152 T309823)]] [13:36:58] T309823: Disable old WebM VP8 transcodes except for 360p - https://phabricator.wikimedia.org/T309823 [13:36:59] T312152: Clean up video transcode config for speed/bitrate balance - https://phabricator.wikimedia.org/T312152 [13:38:17] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update codfw1dev horizon version [puppet] - 10https://gerrit.wikimedia.org/r/963730 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott) [13:38:26] !log lucaswerkmeister-wmde@deploy2002 brion and lucaswerkmeister-wmde: Backport for [[gerrit:961864|Drop old VP8 video transcodes, enable HLS on testwiki (T312152 T309823)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:38:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: create the vrf-cloudgw device via static file [puppet] - 10https://gerrit.wikimedia.org/r/963727 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:39:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] hieradata: drop old hiera files [puppet] - 10https://gerrit.wikimedia.org/r/963729 (owner: 10Jbond) [13:41:03] (03CR) 10Jbond: [C: 03+1] scap: 
Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff) [13:41:21] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) @cmooney's comment above on the default routing policy and priority of routes got me thinking: if... [13:41:40] (03PS1) 10Bking: flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) [13:41:52] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [13:41:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [13:42:09] (03CR) 10DCausse: [C: 03+1] flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) (owner: 10Bking) [13:42:38] (03CR) 10Bking: [C: 03+2] flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) (owner: 10Bking) [13:42:48] (03CR) 10Btullis: [C: 03+1] flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) (owner: 10Bking) [13:42:50] (config change tested over in #mediawiki) [13:42:53] !log lucaswerkmeister-wmde@deploy2002 brion and lucaswerkmeister-wmde: Continuing with sync [13:43:34] (03Merged) 10jenkins-bot: flink-app: redeploy from new savepoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/963731 (https://phabricator.wikimedia.org/T346315) (owner: 10Bking) [13:43:48] (03PS1) 
10Jbond: augeas_core: refresh module [puppet] - 10https://gerrit.wikimedia.org/r/963732 [13:44:18] (03CR) 10Jbond: [C: 03+2] augeas_core: refresh module [puppet] - 10https://gerrit.wikimedia.org/r/963732 (owner: 10Jbond) [13:44:50] (03PS1) 10Andrew Bogott: Horizon: update eqiad1 horizon version [puppet] - 10https://gerrit.wikimedia.org/r/963733 (https://phabricator.wikimedia.org/T341509) [13:44:55] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:45:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:45:11] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: reorder post-up commands and other fixes [puppet] - 10https://gerrit.wikimedia.org/r/963734 (https://phabricator.wikimedia.org/T347469) [13:45:56] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update eqiad1 horizon version [puppet] - 10https://gerrit.wikimedia.org/r/963733 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott) [13:46:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:46:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:47:01] (03CR) 10Ssingh: [C: 03+2] install_server: replace ntp.$site with anycasted ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/961812 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [13:47:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [13:48:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.400 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:48:52] 10SRE, 10SRE-Access-Requests: bawolff is in nda group, but 
registered with a WMF account - https://phabricator.wikimedia.org/T348216 (10Bawolff) I left WMF about 3 years ago. [13:48:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:49:02] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:961864|Drop old VP8 video transcodes, enable HLS on testwiki (T312152 T309823)]] (duration: 12m 07s) [13:49:06] T309823: Disable old WebM VP8 transcodes except for 360p - https://phabricator.wikimedia.org/T309823 [13:49:07] T312152: Clean up video transcode config for speed/bitrate balance - https://phabricator.wikimedia.org/T312152 [13:49:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:49:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST ipreservations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:49:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: reorder post-up commands and other fixes [puppet] - 10https://gerrit.wikimedia.org/r/963734 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:50:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [13:51:02] 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host sretest1001.eqiad.wmnet with OS b... 
[13:51:10] (03CR) 10Muehlenhoff: [C: 03+2] scap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff) [13:51:18] hehe, I ran enough shell.php that “Writing to directory /var/www/.config/psysh is not allowed.” shows up in logspam-watch now :D [13:51:39] (T228041, known issue, no big deal) [13:51:39] T228041: Using shell.php in production fails to load personal configuration and sends warnings to Logstash - https://phabricator.wikimedia.org/T228041 [13:51:50] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't declare the vrf-interface [puppet] - 10https://gerrit.wikimedia.org/r/963736 (https://phabricator.wikimedia.org/T347469) [13:53:38] I’ll just chuck in the revert for the train blocker too [13:53:40] jouncebot: next [13:53:40] In 2 hour(s) and 6 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1600) [13:53:43] yeah should be enough time [13:53:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseCirrusSearch] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963354 (https://phabricator.wikimedia.org/T348181) (owner: 10Umherirrender) [13:53:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [13:54:35] (KubernetesAPILatency) resolved: (14) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:54:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: don't declare the vrf-interface [puppet] - 10https://gerrit.wikimedia.org/r/963736 (https://phabricator.wikimedia.org/T347469) (owner: 10Arturo Borrero Gonzalez) [13:55:21] (03PS1) 10Andrew Bogott: openstack::clientpackages::antelope::buster: typo correction [puppet] - 10https://gerrit.wikimedia.org/r/963737 [13:56:44] (03PS22) 
10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [13:57:38] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 (10ssingh) >>! In T348041#9228328, @ayounsi wrote: > That's only for equal prefix length. For example a stati... [13:58:24] (03CR) 10Andrew Bogott: [C: 03+2] openstack::clientpackages::antelope::buster: typo correction [puppet] - 10https://gerrit.wikimedia.org/r/963737 (owner: 10Andrew Bogott) [14:00:17] (03PS1) 10Jelto: gitlab/failover: remove deploy-page at the end of cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/963739 (https://phabricator.wikimedia.org/T345531) [14:01:35] jouncebot: now [14:01:35] No deployments scheduled for the next 1 hour(s) and 58 minute(s) [14:01:39] * Lucas_WMDE overrunning the window [14:04:03] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [14:04:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [14:04:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye [14:04:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro... 
[14:05:01] (03PS1) 10Jelto: admin: change email of bawolff [puppet] - 10https://gerrit.wikimedia.org/r/963741 (https://phabricator.wikimedia.org/T348216) [14:05:34] 10SRE, 10Traffic, 10Patch-For-Review: Rename ACAST_PS_ADVERTISE in bird and anycast-healthchecker to BIRD_IP_ADVERTISE - https://phabricator.wikimedia.org/T348174 (10ssingh) >>! In T348174#9227047, @ayounsi wrote: > Thanks, I think the scope should be larger than just those two variables if we want to remove... [14:05:42] (03PS1) 10Arturo Borrero Gonzalez: [DON'T MERGE UNLESS IN EMERGENCY] cloudgw: revert recent changes [puppet] - 10https://gerrit.wikimedia.org/r/963742 [14:05:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [14:07:48] (03Merged) 10jenkins-bot: Revert "Use HookHandlers for core hooks" [extensions/WikibaseCirrusSearch] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963354 (https://phabricator.wikimedia.org/T348181) (owner: 10Umherirrender) [14:08:12] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:963354|Revert "Use HookHandlers for core hooks" (T348181)]] [14:08:18] (03CR) 10CI reject: [V: 04-1] [DON'T MERGE UNLESS IN EMERGENCY] cloudgw: revert recent changes [puppet] - 10https://gerrit.wikimedia.org/r/963742 (owner: 10Arturo Borrero Gonzalez) [14:08:23] T348181: TypeError: Argument 1 passed to Wikibase\Search\Elastic\CirrusShowSearchHitHandler::newFromGlobalState() must implement interface IContextSource, instance of MediaWiki\Config\GlobalVarConfig given, called in /srv/mediawiki/php- - https://phabricator.wikimedia.org/T348181 [14:08:32] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bullseye [14:08:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: bawolff is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T348216 (10Jelto) a:05Bawolff→03Jelto >>! 
In T348216#9228274, @Bawolff wrote: > I left WMF about 3 years ago. > > My current email is bawolff@gmail.com . I hav... [14:08:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye [14:09:06] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "don't merge unless in emergency" [puppet] - 10https://gerrit.wikimedia.org/r/963742 (owner: 10Arturo Borrero Gonzalez) [14:09:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [14:09:24] (03PS1) 10Andrew Bogott: Revert "Horizon: update horizon version" [puppet] - 10https://gerrit.wikimedia.org/r/963743 [14:09:37] !log lucaswerkmeister-wmde@deploy2002 umherirrender and lucaswerkmeister-wmde: Backport for [[gerrit:963354|Revert "Use HookHandlers for core hooks" (T348181)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:09:44] testing… [14:09:58] seems to fix search on testwikidata [14:09:59] !log lucaswerkmeister-wmde@deploy2002 umherirrender and lucaswerkmeister-wmde: Continuing with sync [14:10:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [14:10:44] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: update horizon version" [puppet] - 10https://gerrit.wikimedia.org/r/963743 (owner: 10Andrew Bogott) [14:11:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [14:11:11] (03CR) 10EoghanGaffney: [C: 03+1] gitlab/failover: remove deploy-page at the end of cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/963739 (https://phabricator.wikimedia.org/T345531) (owner: 10Jelto) [14:13:19] 
(03CR) 10Ilias Sarantopoulos: [C: 03+2] team-ml: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [14:13:44] (03CR) 10Ilias Sarantopoulos: team-ml: add alerts for ml services [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [14:14:44] (03CR) 10Ssingh: bird: rename ACAST_PS_ADVERTISE to BIRD_IP{4,6}_ADVERTISE (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174) (owner: 10Ssingh) [14:14:58] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/963741 (https://phabricator.wikimedia.org/T348216) (owner: 10Jelto) [14:15:51] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:16:34] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PUT endpointslices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:17:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [14:17:02] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:963354|Revert "Use HookHandlers for core hooks" (T348181)]] (duration: 08m 50s) [14:17:17] T348181: TypeError: Argument 1 passed to Wikibase\Search\Elastic\CirrusShowSearchHitHandler::newFromGlobalState() must implement interface IContextSource, instance of MediaWiki\Config\GlobalVarConfig given, called in /srv/mediawiki/php- - 
https://phabricator.wikimedia.org/T348181 [14:17:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [14:17:43] (03CR) 10Ssingh: bird: rename ACAST_PS_ADVERTISE to BIRD_IP{4,6}_ADVERTISE (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963385 (https://phabricator.wikimedia.org/T348174) (owner: 10Ssingh) [14:18:33] !log UTC afternoon backport+config window done [14:18:35] * Lucas_WMDE done [14:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:51] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:20:53] (03PS1) 10Ssingh: wikimedia.org: update CNAMEs for ntp.$site [dns] - 10https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) [14:21:24] (03PS1) 10Muehlenhoff: bacula::director: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/963745 [14:21:35] (KubernetesAPILatency) firing: (18) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:22:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [14:22:34] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test2004'] [14:22:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti-test2004'] [14:24:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2004.codfw.wmnet with OS bullseye [14:25:31] !log 
sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [14:25:41] 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host sretest1001.eqiad.wmnet with OS bulls... [14:25:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:26:21] RECOVERY - Check systemd state on kubernetes2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:35] (KubernetesAPILatency) resolved: (18) High Kubernetes API latency (GET blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:29:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:29:52] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:29:54] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:30:12] (03PS1) 10Cwhite: icinga: round elasticsearch shard size check to 2 decimal places [puppet] - 10https://gerrit.wikimedia.org/r/962243 (https://phabricator.wikimedia.org/T327218) [14:30:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.codfw.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:32:20] (03CR) 10Ayounsi: "Overall lgtm." [dns] - 10https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:33:11] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:33:51] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:33:53] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) @papaul I've moved the google meet for this to the week after - Oct 17th. There are few other moving parts in... 
[14:38:19] !log Bumping kubectd100[4-6].eqiad.wmnet vcpu to 2 - T348228 [14:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:23] T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 [14:38:34] !log Bumping kubetcd100[4-6].eqiad.wmnet vcpu to 2 - T348228 [14:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:48] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:06] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: round elasticsearch shard size check to 2 decimal places [puppet] - 10https://gerrit.wikimedia.org/r/962243 (https://phabricator.wikimedia.org/T327218) (owner: 10Cwhite) [14:41:00] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd1006.eqiad.wmnet with reason: Pick up vcpu change [14:41:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd1006.eqiad.wmnet with reason: Pick up vcpu change [14:41:35] !log rebooting kubetcd1006.eqiad.wmnet - T348228 [14:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963745 (owner: 10Muehlenhoff) [14:43:05] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:16] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd1006.eqiad.wmnet [14:44:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd1006.eqiad.wmnet [14:44:42] !log 
cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd1005.eqiad.wmnet with reason: Pick up vcpu change [14:44:56] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd1005.eqiad.wmnet with reason: Pick up vcpu change [14:45:51] (PS15) Ilias Sarantopoulos: team-ml: add alerts for Kafka consumer lag for ores extension [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [14:46:16] (PS16) Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension [alerts] - https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) [14:46:23] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd1005.eqiad.wmnet [14:46:24] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd1005.eqiad.wmnet [14:46:39] !log rebooted kubetcd1005.eqiad.wmnet - T348228 [14:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:44] T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 [14:47:04] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd1004.eqiad.wmnet with reason: Pick up vcpu change [14:47:18] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd1004.eqiad.wmnet with reason: Pick up vcpu change [14:47:24] !log rebooting kubetcd1004.eqiad.wmnet - T348228 [14:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:11] (JobUnavailable) firing: (3) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:50:16] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for 
kubetcd1004.eqiad.wmnet [14:50:16] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd1004.eqiad.wmnet [14:52:45] !log Bumping kubemaster100[1-2].eqiad.wmnet vcpu to 2, ram to 4G - T348228 [14:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:48] T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 [14:52:55] (PS2) Muehlenhoff: bacula::director: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/963745 [14:53:01] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubemaster1002.eqiad.wmnet with reason: Pick up vcpu change [14:53:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubemaster1002.eqiad.wmnet with reason: Pick up vcpu change [14:53:27] !log rebooting kubemaster1002.eqiad.wmnet - T348228 [14:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [14:56:51] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 67897 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops [14:57:46] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubemaster1002.eqiad.wmnet [14:57:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubemaster1002.eqiad.wmnet [14:58:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[14:59:28] (PS3) Majavah: wikimediacloud: Add a dedicated CNAME for object storage [dns] - https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) [14:59:39] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubemaster1001.eqiad.wmnet with reason: Pick up vcpu change [14:59:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubemaster1001.eqiad.wmnet with reason: Pick up vcpu change [15:00:00] !log rebooting kubemaster1001.eqiad.wmnet - T348228 [15:00:36] stashbot: you ok? [15:02:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet [15:02:38] (CR) Majavah: [V: +2 C: +2] wikimediacloud: Add a dedicated CNAME for object storage (1 comment) [dns] - https://gerrit.wikimedia.org/r/963330 (https://phabricator.wikimedia.org/T341380) (owner: Majavah) [15:03:18] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:32] That took a second [15:03:33] T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 [15:03:33] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[15:03:57] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubemaster1001.eqiad.wmnet [15:03:57] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubemaster1001.eqiad.wmnet [15:06:38] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro) [15:06:48] !log Bumping kubetcd200[4-6].eqiad.wmnet vcpu to 2 - T348228 [15:06:48] (PS1) Majavah: hieradata: acme_chief: update openstack cert config [puppet] - https://gerrit.wikimedia.org/r/963752 [15:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:08] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd2006.codfw.wmnet with reason: Pick up vcpu change [15:07:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd2006.codfw.wmnet with reason: Pick up vcpu change [15:07:39] !log rebooting kubetcd2006.codfw.wmnet - T348228 [15:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:56] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd2006.codfw.wmnet [15:08:56] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd2006.codfw.wmnet [15:09:38] !log rebooting kubetcd2005.codfw.wmnet - T348228 [15:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:41] T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 [15:09:46] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd2005.codfw.wmnet with reason: Pick up vcpu change [15:10:00] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd2005.codfw.wmnet with reason: Pick up vcpu change [15:10:58] 
!log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd2005.codfw.wmnet [15:10:59] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd2005.codfw.wmnet [15:12:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:12:37] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubetcd2004.codfw.wmnet with reason: Pick up vcpu change [15:12:42] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti2023.codfw.wmnet [15:13:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubetcd2004.codfw.wmnet with reason: Pick up vcpu change [15:13:26] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963745 (owner: Muehlenhoff) [15:13:35] PROBLEM - ganeti-confd running on ganeti2023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [15:13:35] PROBLEM - ganeti-noded running on ganeti2023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [15:13:35] PROBLEM - ganeti-mond running on ganeti2023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [15:13:54] !log rebooting kubetcd2004.codfw.wmnet - T348228 [15:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T343198)', diff saved to 
https://phabricator.wikimedia.org/P52832 and previous config saved to /var/cache/conftool/dbconfig/20231005-151450-arnaudb.json [15:15:01] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:16:22] (CR) Ssingh: wikimedia.org: update CNAMEs for ntp.$site (1 comment) [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh) [15:16:41] (CR) Hnowlan: [C: +2] thumbor: bump version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/963690 (https://phabricator.wikimedia.org/T344233) (owner: Hnowlan) [15:16:42] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubetcd2004.codfw.wmnet [15:16:42] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubetcd2004.codfw.wmnet [15:17:17] SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (Bethany) Approved >>! In T348209#9227084, @Jelto wrote: > Thanks for the access request. > > I need approval from @Bethany and @thcipriani here to proceed. 
[15:17:17] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: OpenSent - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:59] PROBLEM - SSH on kubemaster2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:18:14] I haven't even touched it yet [15:18:43] (CR) Ssingh: wikimedia.org: update CNAMEs for ntp.$site (1 comment) [dns] - https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh) [15:18:43] it's up too [15:19:17] RECOVERY - SSH on kubemaster2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:19:47] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubemaster2002.codfw.wmnet with reason: Pick up vcpu change [15:19:55] (CR) Jforrester: "G2G once wmf.30 is everywhere and won't revert." [mediawiki-config] - https://gerrit.wikimedia.org/r/963662 (https://phabricator.wikimedia.org/T99581) (owner: Krinkle) [15:20:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubemaster2002.codfw.wmnet with reason: Pick up vcpu change [15:20:19] claime: it is clingy [15:20:26] !log rebooting kubemaster2002.codfw.wmnet - T348228 [15:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:30] T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 [15:23:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:24:05] (PS1) Muehlenhoff: puppetdb: Fix duplicated nginx entry [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) [15:24:55] !log cgoubert@cumin1001 START - Cookbook 
sre.hosts.remove-downtime for kubemaster2002.codfw.wmnet [15:24:55] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubemaster2002.codfw.wmnet [15:25:17] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.8% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [15:25:40] Wanna bet I'm gonna have to rebalance [15:25:50] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kubemaster2001.codfw.wmnet with reason: Pick up vcpu change [15:25:57] !log rebooting kubemaster2001.codfw.wmnet - T348228 [15:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:04] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kubemaster2001.codfw.wmnet with reason: Pick up vcpu change [15:26:05] T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 [15:26:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2023.codfw.wmnet with reason: reimage to bullseye [15:26:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2023.codfw.wmnet with reason: reimage to bullseye [15:27:12] Hmm no, ganeti1019 doesn't have any of the instances I touched [15:27:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:28:02] This one's me [15:28:19] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. 
Largest process: qemu-system-x86 (278306) = 12.8% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [15:28:47] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1067.eqiad.wmnet with OS bullseye [15:28:54] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bullseye executed with erro... [15:29:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P52834 and previous config saved to /var/cache/conftool/dbconfig/20231005-152956-arnaudb.json [15:30:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for kubemaster2001.codfw.wmnet [15:30:25] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubemaster2001.codfw.wmnet [15:30:34] (CR) Volans: [C: +2] spicerack: improve cookbooks help message [software/spicerack] - https://gerrit.wikimedia.org/r/963669 (owner: Volans) [15:30:52] (PS1) Volans: dhcp: always rewrite the DHCP snippet [software/spicerack] - https://gerrit.wikimedia.org/r/963756 [15:30:54] (PS1) Volans: dhcp: simplify tests [software/spicerack] - https://gerrit.wikimedia.org/r/963757 [15:31:08] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff) [15:31:20] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti-test2004.codfw.wmnet with OS bullseye [15:31:21] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [15:31:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with 
reboot policy FORCED [15:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:33:58] Yeah, settle down [15:34:38] (Merged) jenkins-bot: spicerack: improve cookbooks help message [software/spicerack] - https://gerrit.wikimedia.org/r/963669 (owner: Volans) [15:36:06] (PS2) Muehlenhoff: puppetdb: Fix duplicated nginx entry [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) [15:36:55] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [15:37:26] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK CRITICAL - free space: /srv/sdb4 68178 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops [15:37:50] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) (owner: Muehlenhoff) [15:37:57] !log installed 7.3.1 on cumin2002 [15:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:38:53] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [15:39:02] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:41] !log jbond@cumin2002 START - Cookbook sre.puppetboard.restart-reboot rolling reboot on A:puppetboard [15:39:58] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.8% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [15:40:13] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:40:14] !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors [15:40:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:40:17] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. on all recursors [15:41:18] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:51] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:42:42] (PS3) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [15:43:15] (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking) [15:43:32] moritzm: Should I rebalance the ganeti group for ganeti1019 or are you doing things on the cluster ? 
[15:43:56] !log jbond@cumin2002 START - Cookbook sre.puppet.renew-cert for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002 [15:44:17] (CR) Bking: wdqs: Set up graph_split hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking) [15:44:22] !log jbond@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002 [15:44:26] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [15:44:34] PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (278306) = 12.8% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [15:45:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [15:45:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P52835 and previous config saved to /var/cache/conftool/dbconfig/20231005-154502-arnaudb.json [15:48:26] (PS1) Jbond: sre.__init__: remove sleep as set_tries works now [cookbooks] - https://gerrit.wikimedia.org/r/963759 [15:48:28] (PS1) Jbond: sre.puppet.renew-cert: drop support for allow_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/963760 [15:48:32] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [15:48:59] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/963760 (owner: Jbond) [15:49:22] (PS2) Jbond: sre.puppet.renew-cert: drop support for allow_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/963760 [15:50:26] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/963759 (owner: Jbond) [15:50:36] RECOVERY - Check unit status 
of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:52:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:52:53] (PS4) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [15:53:19] (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking) [15:54:06] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [15:54:29] (PS5) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [15:54:56] (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking) [15:57:09] SRE, Cloud-VPS, Infrastructure-Foundations, netops: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (aborrero) The cloudgw side is now completed. We may want to refresh the neutron side as well: `lang=shell-session aborrero@cloudcontrol100... 
[15:58:50] (PS6) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [15:59:17] (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking) [16:00:06] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1600). [16:00:07] No Gerrit patches in the queue for this window AFAICS. [16:00:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T343198)', diff saved to https://phabricator.wikimedia.org/P52836 and previous config saved to /var/cache/conftool/dbconfig/20231005-160009-arnaudb.json [16:00:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [16:00:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [16:00:27] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [16:00:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T343198)', diff saved to https://phabricator.wikimedia.org/P52837 and previous config saved to /var/cache/conftool/dbconfig/20231005-160030-arnaudb.json [16:00:50] !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors [16:00:54] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. 
on all recursors [16:01:12] (PS7) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [16:01:39] (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking) [16:01:41] (CR) Jbond: [C: +2] sre.puppet.renew-cert: drop support for allow_alt_names [cookbooks] - https://gerrit.wikimedia.org/r/963760 (owner: Jbond) [16:02:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:02:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 1TiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [16:04:26] (PS8) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [16:05:18] !log jbond@cumin2002 START - Cookbook sre.puppet.renew-cert for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002 [16:05:21] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppetboard.restart-reboot (exit_code=0) rolling reboot on A:puppetboard [16:05:33] jouncebot: nowandnext [16:05:33] For the next 0 hour(s) and 54 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1600) [16:05:33] In 0 hour(s) and 54 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1700) [16:05:39] !log cgoubert@deploy2002 Started scap: 
Testing mw-on-k8s deployment for T348228 [16:05:47] claime: puppet window is empty, help yourse-- oh good :) [16:05:57] rzl: x) [16:06:03] SRE, ops-codfw, Infrastructure-Foundations, netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (Papaul) @cmooney no problem [16:06:04] T348228: KubernetesAPILatency alert fires on scap deploy - https://phabricator.wikimedia.org/T348228 [16:06:36] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002 [16:07:01] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (bking) a:VRiley-WMF [16:07:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:07:05] (CR) CI reject: [V: -1] wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking) [16:07:55] !log cgoubert@deploy2002 Finished scap: Testing mw-on-k8s deployment for T348228 (duration: 02m 15s) [16:07:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 1TiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [16:09:56] !log jbond@cumin2002 START - Cookbook sre.puppet.renew-cert for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002 [16:09:58] 
(RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 1TiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [16:09:59] !log jbond@cumin2002 END (ERROR) - Cookbook sre.puppet.renew-cert (exit_code=97) for sretest1001.eqiad.wmnet: Renew puppet certificate - jbond@cumin2002 [16:10:35] !log jbond@cumin2002 START - Cookbook sre.puppetboard.restart-reboot rolling reboot on A:puppetboard [16:10:55] !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors [16:10:58] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. on all recursors [16:11:04] (CR) Jbond: [C: +2] sre.__init__: remove sleep as set_tries works now [cookbooks] - https://gerrit.wikimedia.org/r/963759 (owner: Jbond) [16:12:26] (PS9) Bking: wdqs: Set up graph_split hosts [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [16:12:34] !log cleaning up rdf-streaming-updater-staging swift bucket [16:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:30] jouncebot nowandnext [16:13:30] For the next 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1600) [16:13:30] In 0 hour(s) and 46 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1700) [16:14:04] SRE, Observability-Metrics, SRE Observability (FY2023/2024-Q2): Grafana dashboard doesn't load anything: "WebSocket connection failed" - https://phabricator.wikimedia.org/T347936 (lmata) [16:14:58] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 1TiB of storage - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [16:15:15] !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet. on all recursors [16:15:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:15:19] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet. on all recursors [16:15:43] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: Bking) [16:19:46] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppetboard.restart-reboot (exit_code=0) rolling reboot on A:puppetboard [16:22:09] !log installed 7.3.1 on cumin1001 [16:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:58] (CR) Bking: [C: -1] "We need to wait on T342538/hosts to be moved to production role before we merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [16:24:56] (03CR) 10Bking: [C: 04-1] wdqs: Set up graph_split hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [16:25:26] (03CR) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [16:27:22] (03CR) 10Jbond: [C: 03+1] "LGTM however i think do we not also need to update the reimage cookbook to add a lock with concurrency 1? (still happy for this be merged " [software/spicerack] - 10https://gerrit.wikimedia.org/r/963756 (owner: 10Volans) [16:29:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/963757 (owner: 10Volans) [16:30:13] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/963755 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [16:34:43] (03CR) 10Volans: [C: 04-1] "Pending the release of locking on spicerack" [software/spicerack] - 10https://gerrit.wikimedia.org/r/963756 (owner: 10Volans) [16:40:42] (03CR) 10Jbond: [C: 03+1] dhcp: always rewrite the DHCP snippet (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/963756 (owner: 10Volans) [16:41:26] (03PS1) 10Bking: wdqs: bring graph split hosts into service [puppet] - 10https://gerrit.wikimedia.org/r/963777 (https://phabricator.wikimedia.org/T347505) [16:43:08] (03PS10) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [16:43:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [16:45:39] (03CR) 10CI reject: [V: 04-1] wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 
(https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [16:47:25] !log brion running requeueTranscodes.php on mwmaint2002 for VP9 transcode cleanup for T312153 [16:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:35] T312153: Batch run of TMH requeueTranscodes to remove now-unused 120p and 180p low-res files - https://phabricator.wikimedia.org/T312153 [16:49:41] (03CR) 10Andrea Denisse: [C: 03+2] alertmanager: Add the "Auto-Submitted: auto-generated" header to AM emails [puppet] - 10https://gerrit.wikimedia.org/r/963368 (https://phabricator.wikimedia.org/T347850) (owner: 10Andrea Denisse) [16:50:48] (03PS4) 10Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) [16:53:49] (03CR) 10CI reject: [V: 04-1] sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [16:54:17] (03CR) 10JHathaway: [C: 03+2] puppetdb: add ability to configure db_ro_host [puppet] - 10https://gerrit.wikimedia.org/r/959229 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [16:55:01] (03CR) 10JHathaway: [C: 03+2] postgresql: fix ordering on a new install [puppet] - 10https://gerrit.wikimedia.org/r/959228 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [16:57:18] !log scaling back batch jobs for T312153 and T312152, will run these in further chunks as the new config rolls out [16:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:24] T312153: Batch run of TMH requeueTranscodes to remove now-unused 120p and 180p low-res files - https://phabricator.wikimedia.org/T312153 [16:57:24] T312152: Clean up video transcode config for speed/bitrate balance - https://phabricator.wikimedia.org/T312152 [16:57:43] (03PS5) 10Jbond: sre.puppet.migrate_host: migrate 
hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) [16:59:37] (03PS1) 10Slyngshede: Style ssh key management using Codex. [software/bitu] - 10https://gerrit.wikimedia.org/r/963779 [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1700) [17:00:08] (03PS1) 10Jbond: sretest1003: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/963780 [17:01:13] (03PS6) 10Jbond: sre.puppet.migrate_host: migrate hosts from puppet5 to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) [17:01:23] (03CR) 10Jbond: [C: 03+2] sretest1003: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/963780 (owner: 10Jbond) [17:03:32] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [17:16:45] (03PS11) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [17:17:02] (03PS24) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [17:17:34] (03CR) 10CI reject: [V: 04-1] [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [17:19:18] claime: I can take care of it tomorrow [17:19:43] (03PS1) 10Jbond: os_updates: make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/963781 [17:19:46] PROBLEM - Disk space on restbase2020 is CRITICAL: DISK 
CRITICAL - free space: /srv/sdb4 67476 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops [17:20:09] (03CR) 10CI reject: [V: 04-1] os_updates: make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/963781 (owner: 10Jbond) [17:20:26] (03PS25) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [17:21:40] (03PS1) 10Majavah: hieradata: bump striker docker image [puppet] - 10https://gerrit.wikimedia.org/r/963782 (https://phabricator.wikimedia.org/T347631) [17:22:34] (03CR) 10Majavah: [C: 03+2] hieradata: bump striker docker image [puppet] - 10https://gerrit.wikimedia.org/r/963782 (https://phabricator.wikimedia.org/T347631) (owner: 10Majavah) [17:23:33] (03PS26) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [17:25:19] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [17:26:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:27:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [17:28:59] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) (owner: 10Bking) [17:29:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [17:30:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [17:31:05] (03PS2) 10Jbond: os_updates: make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/963781 [17:31:31] (03CR) 10CI reject: [V: 04-1] 
os_updates: make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/963781 (owner: 10Jbond) [17:32:45] (Traffic bill over quota) firing: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:32:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43921/console" [puppet] - 10https://gerrit.wikimedia.org/r/963781 (owner: 10Jbond) [17:33:08] (03PS27) 10Btullis: [WIP] Support multiple spark yarn shufflers in parallel [puppet] - 10https://gerrit.wikimedia.org/r/963281 (https://phabricator.wikimedia.org/T344910) [17:34:00] (03PS3) 10Jbond: os_updates: make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/963781 [17:35:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [17:35:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43923/console" [puppet] - 10https://gerrit.wikimedia.org/r/963781 (owner: 10Jbond) [17:36:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] os_updates: make class ensurable [puppet] - 10https://gerrit.wikimedia.org/r/963781 (owner: 10Jbond) [17:38:45] (03PS4) 10JHathaway: puppetdb: avoid creating database users via dbconfig [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) [17:38:47] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43924/console" [puppet] - 10https://gerrit.wikimedia.org/r/963781 (owner: 10Jbond) [17:39:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test2004'] [17:39:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts 
['ganeti-test2004'] [17:40:04] (03CR) 10JHathaway: puppetdb: avoid creating database users via dbconfig (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [17:40:10] (03CR) 10JHathaway: [C: 03+2] puppetdb: avoid creating database users via dbconfig [puppet] - 10https://gerrit.wikimedia.org/r/959231 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [17:52:45] (Traffic bill over quota) resolved: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:53:26] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Fix outstanding puppet 7 issue - https://phabricator.wikimedia.org/T348272 (10jbond) [17:59:52] (03PS1) 10Jbond: htpasswd: update to make a bit closer to standard puppet style [puppet] - 10https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) [18:00:05] jeena and dduvall: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T1800). 
nyaa~ [18:00:54] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963789 (https://phabricator.wikimedia.org/T347080) [18:00:56] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963789 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:01:11] (03CR) 10CI reject: [V: 04-1] htpasswd: update to make a bit closer to standard puppet style [puppet] - 10https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: 10Jbond) [18:02:12] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963789 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:03:46] (03CR) 10Ladsgroup: mariadb: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [18:04:30] (03PS2) 10Jbond: htpasswd: update to make a bit closer to standard puppet style [puppet] - 10https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) [18:06:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43926/console" [puppet] - 10https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: 10Jbond) [18:07:25] (03CR) 10CI reject: [V: 04-1] htpasswd: update to make a bit closer to standard puppet style [puppet] - 10https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: 10Jbond) [18:07:50] (03PS12) 10Bking: wdqs: Set up graph_split hosts [puppet] - 10https://gerrit.wikimedia.org/r/963404 (https://phabricator.wikimedia.org/T347505) [18:08:56] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.29 refs T347080 [18:09:12] T347080: 1.41.0-wmf.29 deployment 
blockers - https://phabricator.wikimedia.org/T347080 [18:10:41] yay, unblocked, thanks jeena :) [18:10:50] yay! [18:11:26] (03PS3) 10Jbond: htpasswd: update to make a bit closer to standard puppet style [puppet] - 10https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) [18:11:53] thanks to the unblockers :P [18:12:18] Unblockers be unblockin' [18:12:56] ^ typical [18:12:58] :) [18:15:03] (03CR) 10Ssingh: [C: 03+2] wikimedia.org: update CNAMEs for ntp.$site [dns] - 10https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:15:07] (03PS2) 10Ssingh: wikimedia.org: update CNAMEs for ntp.$site [dns] - 10https://gerrit.wikimedia.org/r/963744 (https://phabricator.wikimedia.org/T347054) [18:15:09] !log jhuneidi@deploy2002 Synchronized php: group1 wikis to 1.41.0-wmf.29 refs T347080 (duration: 06m 12s) [18:15:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 17): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43927/console" [puppet] - 10https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: 10Jbond) [18:15:25] T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080 [18:15:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] htpasswd: update to make a bit closer to standard puppet style [puppet] - 10https://gerrit.wikimedia.org/r/963788 (https://phabricator.wikimedia.org/T348272) (owner: 10Jbond) [18:17:06] !log running authdns-update: T347054 [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:10] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [18:17:28] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:46] PROBLEM - Check 
unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:24:30] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [18:26:19] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963791 (https://phabricator.wikimedia.org/T347080) [18:26:21] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963791 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:27:21] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.29 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963791 (https://phabricator.wikimedia.org/T347080) (owner: 10TrainBranchBot) [18:34:01] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.29 refs T347080 [18:34:10] T347080: 1.41.0-wmf.29 deployment blockers - https://phabricator.wikimedia.org/T347080 [18:42:40] (03PS1) 10Ammarpad: Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) [18:43:52] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [18:43:58] (03PS1) 10Jdrewniak: [Prototype] Add screen resolution to Typography prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963810 [18:44:23] (03PS2) 10Varnent: Add Endowment, Agenda, Committee, and Memory namespaces and include in Visual Editor and search settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) [18:45:05] (03CR) 10CI reject: [V: 04-1] Add Endowment, Agenda, Committee, and Memory namespaces and include in Visual Editor and search settings. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent) [18:45:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:46:00] (03PS1) 10Jdrewniak: [Prototype] Edit project link page on reading prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963811 [18:47:24] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti-test2004'] [18:47:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti-test2004'] [18:49:21] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:50:03] (03PS3) 10Varnent: Add Endowment, Agenda, Committee, and Memory namespaces and include in Visual Editor and search settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) [18:50:40] (03CR) 10CI reject: [V: 04-1] Add Endowment, Agenda, Committee, and Memory namespaces and include in Visual Editor and search settings. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent) [18:51:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2004.codfw.wmnet with OS bullseye [19:09:19] (03PS4) 10Jforrester: [foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent) [19:14:27] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:47] (03PS1) 10Varnent: Provide 'translationadmin' group with 'edit-legal' right. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 [19:15:46] (03PS2) 10Varnent: Provide 'translationadmin' group with 'edit-legal' right. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) [19:15:48] (03PS3) 10Jforrester: [foundationwiki] Provide 'translationadmin' group with 'edit-legal' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent) [19:20:17] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:26:40] (03CR) 10JHathaway: [C: 03+1] "looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [19:32:51] PROBLEM - Check systemd state on sretest1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:43:21] RECOVERY - Check systemd state on sretest1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:04] brennen and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231005T2000). [20:00:05] Daimona, Ammar, jan_drewniak, and James_F: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:05] o/ [20:00:13] o/ [20:01:04] O/ [20:02:46] ooh, right. I suppose I can be your backporter today :) [20:03:39] (03PS2) 10Thcipriani: beta: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962625 (https://phabricator.wikimedia.org/T339246) (owner: 10Daimona Eaytoy) [20:04:40] (03CR) 10Thcipriani: [C: 03+2] beta: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962625 (https://phabricator.wikimedia.org/T339246) (owner: 10Daimona Eaytoy) [20:05:07] ^ Daimona beta only! Off to a good start. Should be live in the next 10 minutes on beta. [20:05:18] Yay, thank you :) [20:05:27] (03Merged) 10jenkins-bot: beta: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962625 (https://phabricator.wikimedia.org/T339246) (owner: 10Daimona Eaytoy) [20:06:20] Ammar: around for your change? 
[20:06:24] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10Jclark-ctr) [20:06:31] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr) [20:06:41] Yes [20:07:09] ok, you'll be up next [20:07:31] (03PS2) 10Thcipriani: Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad) [20:08:00] (03CR) 10Brennen Bearnes: [C: 03+2] [Prototype] Add screen resolution to Typography prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963810 (owner: 10Jdrewniak) [20:09:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad) [20:09:46] (03CR) 10Brennen Bearnes: [C: 03+2] [Prototype] Edit project link page on reading prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963811 (owner: 10Jdrewniak) [20:09:54] (03CR) 10CI reject: [V: 04-1] Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad) [20:10:14] (getting the vector ones going) [20:11:06] eeeh...that looked like a weird CI fail. Let's retry that one. 
[20:11:43] (03CR) 10Thcipriani: [C: 03+2] Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad) [20:12:38] (03Merged) 10jenkins-bot: Enable Minerva site notice for Nepali Wikipedia (newiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963799 (https://phabricator.wikimedia.org/T347814) (owner: 10Ammarpad) [20:12:43] there we go [20:13:16] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:963799|Enable Minerva site notice for Nepali Wikipedia (newiki) (T347814)]] [20:13:23] T347814: Enable wgMinervaEnableSiteNotice for newiki - https://phabricator.wikimedia.org/T347814 [20:14:01] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [20:14:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [20:14:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [20:14:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [20:14:31] !log thcipriani@deploy2002 ammarpad and thcipriani: Backport for [[gerrit:963799|Enable Minerva site notice for Nepali Wikipedia (newiki) (T347814)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:14:43] ^ Ammar live on mwdebug if you can test [20:16:09] Thanks. It looks OK to me. 
There's a site notice and I can see on the mobile domain now [20:16:24] thanks Ammar continuing with sync [20:16:41] !log thcipriani@deploy2002 ammarpad and thcipriani: Continuing with sync [20:22:14] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:963799|Enable Minerva site notice for Nepali Wikipedia (newiki) (T347814)]] (duration: 08m 57s) [20:22:18] T347814: Enable wgMinervaEnableSiteNotice for newiki - https://phabricator.wikimedia.org/T347814 [20:22:25] ^ Ammar all done! [20:22:46] (03Merged) 10jenkins-bot: [Prototype] Add screen resolution to Typography prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963810 (owner: 10Jdrewniak) [20:22:58] ^ one down, one to go... [20:23:43] (03Merged) 10jenkins-bot: [Prototype] Edit project link page on reading prototype [skins/Vector] (wmf/1.41.0-wmf.29) - 10https://gerrit.wikimedia.org/r/963811 (owner: 10Jdrewniak) [20:24:07] jan_drewniak: you're up [20:24:19] jan_drewniak: and I'm going to go both at once [20:24:34] Sounds good [20:24:57] thcipriani, thank you. 
[20:25:10] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:963810|[Prototype] Add screen resolution to Typography prototype]], [[gerrit:963811|[Prototype] Edit project link page on reading prototype]] [20:25:36] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/959230 (https://phabricator.wikimedia.org/T346842) (owner: 10JHathaway) [20:27:56] jan_drewniak: l10n change means this might take a minute [20:28:19] Gotcha [20:28:23] I [20:37:14] !log thcipriani@deploy2002 jdrewniak and thcipriani: Backport for [[gerrit:963810|[Prototype] Add screen resolution to Typography prototype]], [[gerrit:963811|[Prototype] Edit project link page on reading prototype]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:37:26] jan_drewniak: ^ on test servers, check please [20:39:36] thcipriani: good to sync [20:39:44] cool going live [20:39:51] !log thcipriani@deploy2002 jdrewniak and thcipriani: Continuing with sync [20:49:07] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:963810|[Prototype] Add screen resolution to Typography prototype]], [[gerrit:963811|[Prototype] Edit project link page on reading prototype]] (duration: 23m 57s) [20:49:25] jan_drewniak: ^ should be live [20:49:42] James_F: still here? :) [20:49:51] * jan_drewniak thcipriani: awesome, thanks! [20:50:11] thcipriani: Yes. [20:50:30] thcipriani: You can sling my two out together. 
Just boring standard config changes [20:50:39] (03PS5) 10Thcipriani: [foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent) [20:50:49] (03CR) 10Thcipriani: [C: 03+2] [foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent) [20:51:32] (03Merged) 10jenkins-bot: [foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962082 (https://phabricator.wikimedia.org/T347762) (owner: 10Varnent) [20:51:33] k, going to merge these the old fashioned way. my mental model of "rebase if necessary" is...wrong some how [20:51:47] (03PS4) 10Thcipriani: [foundationwiki] Provide 'translationadmin' group with 'edit-legal' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent) [20:51:59] (03CR) 10Thcipriani: [C: 03+2] [foundationwiki] Provide 'translationadmin' group with 'edit-legal' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent) [20:52:02] thcipriani: Gerrit is dark and full of shadows. 
[20:52:19] small creatures skittering in the corners [20:52:36] just git until it's not [20:53:08] (03Merged) 10jenkins-bot: [foundationwiki] Provide 'translationadmin' group with 'edit-legal' right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/963801 (https://phabricator.wikimedia.org/T346187) (owner: 10Varnent) [20:53:56] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:962082|[foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces (T347762 T347822 T348268)]], [[gerrit:963801|[foundationwiki] Provide 'translationadmin' group with 'edit-legal' right (T346187)]] [20:54:04] T347762: Add "Endowment" namespace to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T347762 [20:54:05] T348268: Add "Memory" namespace to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T348268 [20:54:05] T347822: Add "Agenda" and "Committee" namespaces to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T347822 [20:54:05] T346187: Give Translate Admin group on edit-legal rights on Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T346187 [20:54:54] (03CR) 10JHathaway: [C: 03+2] ferm: fix ferm-status on container bullseye instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/959236 (https://phabricator.wikimedia.org/T344868) (owner: 10JHathaway) [20:55:13] !log thcipriani@deploy2002 thcipriani and varnent: Backport for [[gerrit:962082|[foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces (T347762 T347822 T348268)]], [[gerrit:963801|[foundationwiki] Provide 'translationadmin' group with 'edit-legal' right (T346187)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:56:15] ^ James_F up on testwikis, look good? [20:57:14] thcipriani: Yup! 
[20:58:15] !log thcipriani@deploy2002 thcipriani and varnent: Continuing with sync [20:58:32] thanks James_F continuing [20:58:36] Thank you. [20:59:37] need me to run namespace dupes? Or are you already on it? [21:00:28] I don't think it'll trigger anything. [21:00:38] But if you could run just in case that'd be great. [21:03:53] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:962082|[foundationwiki] Add Endowment, Agenda, Committee, and Memory namespaces (T347762 T347822 T348268)]], [[gerrit:963801|[foundationwiki] Provide 'translationadmin' group with 'edit-legal' right (T346187)]] (duration: 09m 56s) [21:04:12] T347762: Add "Endowment" namespace to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T347762 [21:04:13] T348268: Add "Memory" namespace to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T348268 [21:04:13] T347822: Add "Agenda" and "Committee" namespaces to Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T347822 [21:04:13] T346187: Give Translate Admin group on edit-legal rights on Foundation Governance Wiki (foundation.wikimedia.org) - https://phabricator.wikimedia.org/T346187 [21:04:19] James_F: 0 pages to fix, 0 were resolvable. All done! [21:04:24] Ta! [21:07:32] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2020.codfw.wmnet: Maybe cleanup leaked file descriptors(?) 
- eevans@cumin1001 [21:14:21] (PS2) JHathaway: puppet agent: protect against a missing client bucket path [puppet] - https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) [21:15:03] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959225 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway) [21:17:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2020.codfw.wmnet: Maybe cleanup leaked file descriptors(?) - eevans@cumin1001 [21:27:47] (PS1) Ladsgroup: Set virtual domain mapping for url shortener [mediawiki-config] - https://gerrit.wikimedia.org/r/963837 (https://phabricator.wikimedia.org/T330590) [21:30:23] (CR) Volans: [C: -1] "Look sane to me, minor things to fix inline." [cookbooks] - https://gerrit.wikimedia.org/r/953262 (https://phabricator.wikimedia.org/T340739) (owner: Jbond) [21:32:36] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [21:32:43] SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed w... [21:34:50] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [21:34:56] SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed w... 
[21:39:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:44:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:44:47] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (ATsay-WMF) Resolved→Open Hello, I'd like to request access to analytics-privatedata-users as well. Thanks! [21:57:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:02:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:37:19] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [22:37:29] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host 
cloudelastic1007.... [22:37:53] (PS1) Cwhite: opensearch: disable shard size check on logging opensearch [puppet] - https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) [22:38:18] (CR) CI reject: [V: -1] opensearch: disable shard size check on logging opensearch [puppet] - https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) (owner: Cwhite) [22:40:16] (PS2) Cwhite: opensearch: disable shard size check on logging opensearch [puppet] - https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) [22:42:47] (CR) CI reject: [V: -1] opensearch: disable shard size check on logging opensearch [puppet] - https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) (owner: Cwhite) [22:47:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:48:55] (PS1) Urbanecm: cswiki: Remove engineer group [mediawiki-config] - https://gerrit.wikimedia.org/r/963843 (https://phabricator.wikimedia.org/T348279) [22:52:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:53:32] (JobUnavailable) firing: (2) Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:58:07] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1004.mgmt.eqiad.wmnet 
with reboot policy FORCED [22:58:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED [22:59:29] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye [22:59:41] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudelastic1007.eqia... [23:00:08] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED [23:00:22] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (Papaul) @bking I tried to do the re-images on cloudelastic1007, the re-image finished with the OS install without a... 
[23:00:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 208 probes of 709 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:02:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-master1004.mgmt.eqiad.wmnet with reboot policy FORCED [23:02:41] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [23:02:47] SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [23:12:50] (PS3) Cwhite: opensearch: disable shard size check on logging opensearch [puppet] - https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) [23:17:29] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 89 probes of 709 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:19:26] (CR) Cwhite: "PCC OK: https://puppet-compiler.wmflabs.org/output/962244/43928/" [puppet] - https://gerrit.wikimedia.org/r/962244 (https://phabricator.wikimedia.org/T348262) (owner: Cwhite) [23:19:54] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host an-master1003.mgmt.eqiad.wmnet with reboot policy FORCED [23:22:14] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-master1003.mgmt.eqiad.wmnet with reboot policy FORCED [23:22:30] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [23:22:37] 
SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [23:29:49] (CR) Tim Starling: [C: -1] thumbor: add imagemagick policy file (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: Hnowlan) [23:33:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:33:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:34:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:35:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:36:43] (CR) Tim Starling: [C: -1] thumbor: add imagemagick policy file (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: Hnowlan) [23:39:51] (CR) Tim Starling: [C: -1] thumbor: add imagemagick policy file (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: Hnowlan) [23:43:22] (PS1) Jclark-ctr: correct roles for an-master1003,4 [puppet] - https://gerrit.wikimedia.org/r/963845 (https://phabricator.wikimedia.org/T342291) [23:43:52] (CR) Jclark-ctr: [C: +2] correct roles for an-master1003,4 [puppet] - https://gerrit.wikimedia.org/r/963845 (https://phabricator.wikimedia.org/T342291) (owner: 
Jclark-ctr) [23:48:25] RECOVERY - Disk space on restbase2020 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase2020&var-datasource=codfw+prometheus/ops