[00:01:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 43.2% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:17:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:17:31] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:21:44] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[00:39:09] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008910
[00:39:12] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008910 (owner: TrainBranchBot)
[00:40:58] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[00:41:05] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[00:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:02:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1008910 (owner: TrainBranchBot)
[01:04:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 46.88% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:09:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 44.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:12:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 40.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:22:02] (CR) Krinkle: [C: +2] Profiler: Silence "RedisException: Connection timed out" (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1008752 (https://phabricator.wikimedia.org/T348756) (owner: Krinkle)
[01:22:50] (Merged) jenkins-bot: Profiler: Silence "RedisException: Connection timed out" (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/1008752 (https://phabricator.wikimedia.org/T348756) (owner: Krinkle)
[01:22:51] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[01:27:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 47.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:28:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 38.82% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:28:41] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:34:16] !log krinkle@deploy2002 Synchronized src/Profiler.php: I101a80a (duration: 10m 48s)
[01:36:29] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[01:36:35] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[01:39:33] (PS1) Stoyofuku-wmf: Rename `--color-link--visited` to `--color-visited` [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008986 (https://phabricator.wikimedia.org/T356928)
[01:52:30] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.59% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[01:58:41] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:00:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 44.12% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[02:02:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:02:31] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:05:03] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:05:10] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:06:09] (PS1) BCornwall: cdn: Fix site var for ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1008981
[02:08:22] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:08:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:08:54] here, and looking at the ATSBackendErrorsHigh/restbase.discovery alerts
[02:10:10] seems like something happened ~12 hours ago that resulted in steady increase in 500s - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus%2Fops&var-cluster=text&var-origin=restbase.discovery.wmnet&from=now-12h&to=now&var-site=eqiad
[02:10:52] "14:40 a.kosiaris: remove all but 1 host from parsoid@eqiad T358752 " maybe?
[02:10:53] T358752: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752
[02:18:01] (CR) Jdlrobson: [C: +1] Rename `--color-link--visited` to `--color-visited` [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008986 (https://phabricator.wikimedia.org/T356928) (owner: Stoyofuku-wmf)
[02:18:45] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:18:51] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:19:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:22:25] based on a sampling of the logstash errors, it seems like it's en wiktionary, and that it's the same error we had before, a missing content-language header that's causing restbase to except
[02:25:12] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:50] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[02:27:57] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[02:37:27] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:00:12] (JobUnavailable) firing: (4) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 45.4% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[03:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:14:27] RECOVERY - cassandra-c CQL 10.64.16.35:9042 on restbase1038 is OK: TCP OK - 0.030 second response time on 10.64.16.35 port 9042 https://phabricator.wikimedia.org/T93886
[03:17:51] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:18:21] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[03:21:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:22:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:22:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.380 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:22:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51595 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:30:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:31:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[03:31:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[03:44:31] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:46:59] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[03:47:35] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:48:10] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:51:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:53:10] (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:33] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:57:13] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 37.04 ms
[03:57:49] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms
[04:00:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:00:50] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:04:25] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:04:31] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:09:37] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:09:44] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:14:53] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:14:59] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:21:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:36:12] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:36:18] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[04:59:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[04:59:30] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:06:34] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:06:41] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:08:53] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:09:00] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:10:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:11:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:13:00] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:13:06] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:20:04] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:20:11] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:25:05] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:25:12] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:28:23] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:28:30] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:34:21] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:34:28] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:38:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:38:51] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:41:10] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:41:16] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:43:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:43:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:45:18] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:45:24] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[05:47:22] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[05:47:29] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[06:02:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P58532 and previous config saved to /var/cache/conftool/dbconfig/20240306-060239-root.json
[06:06:35] SRE, SRE-Access-Requests, Data-Platform-SRE (2024.03.04 - 2024.03.24): Requesting access to kubernetes deployment for tjones - https://phabricator.wikimedia.org/T359092#9604574 (Marostegui)
[06:10:37] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for bdgreenlee - https://phabricator.wikimedia.org/T359123#9604576 (Marostegui) Open→Resolved a:Marostegui bdgreenlee added to WMF group.
[06:13:39] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208#9604580 (Marostegui) p:Triage→Medium @FBellamy-WMF we'd need your manager to approve this.
[06:16:04] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9604584 (Marostegui) @bdgreenlee please follow the ticket template at https://phabricator.wikimedia.org/maniphest/task/edit/form/8/
[06:16:57] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9604585 (Marostegui) p:Triage→Medium
[06:17:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P58533 and previous config saved to /var/cache/conftool/dbconfig/20240306-061744-root.json
[06:22:18] (PS1) Marostegui: es1025: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1009152 (https://phabricator.wikimedia.org/T358746)
[06:22:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1025', diff saved to https://phabricator.wikimedia.org/P58534 and previous config saved to /var/cache/conftool/dbconfig/20240306-062221-root.json
[06:23:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:20] (CR) Marostegui: [C: +2] es1025: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1009152 (https://phabricator.wikimedia.org/T358746) (owner: Marostegui)
[06:28:21] (PS1) Marostegui: installserver: Do not reimage db2196 [puppet] - https://gerrit.wikimedia.org/r/1009154
[06:29:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58535 and previous config saved to /var/cache/conftool/dbconfig/20240306-062919-root.json
[06:32:34] (CR) Marostegui: [C: +2] installserver: Do not reimage db2196 [puppet] - https://gerrit.wikimedia.org/r/1009154 (owner: Marostegui)
[06:32:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P58536 and previous config saved to /var/cache/conftool/dbconfig/20240306-063249-root.json
[06:44:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58537 and previous config saved to /var/cache/conftool/dbconfig/20240306-064424-root.json
[06:47:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P58538 and previous config saved to /var/cache/conftool/dbconfig/20240306-064754-root.json
[06:59:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58539 and previous config saved to /var/cache/conftool/dbconfig/20240306-065929-root.json
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T0700)
[07:00:12] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:03:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P58540 and previous config saved to /var/cache/conftool/dbconfig/20240306-070259-root.json
[07:14:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58541 and previous config saved to /var/cache/conftool/dbconfig/20240306-071435-root.json
[07:18:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P58542 and previous config saved to /var/cache/conftool/dbconfig/20240306-071804-root.json
[07:28:01] (PS3) Alexandros Kosiaris: Switch more eqiad parsoid hosts to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006897 (https://phabricator.wikimedia.org/T357392)
[07:28:03] (PS3) Alexandros Kosiaris: Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392)
[07:28:06] (PS3) Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392)
[07:28:14] (PS3) Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392)
[07:28:22] (PS3) Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392)
[07:29:29] (CR) CI reject: [V: -1] Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: Alexandros Kosiaris)
[07:29:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58543 and previous config saved to /var/cache/conftool/dbconfig/20240306-072940-root.json
[07:30:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:30:32] (CR) Alexandros Kosiaris: [C: +2] Switch more eqiad parsoid hosts to mw-parsoid [puppet] - https://gerrit.wikimedia.org/r/1006897 (https://phabricator.wikimedia.org/T357392) (owner: Alexandros Kosiaris)
[07:35:06] (CR) Slyngshede: "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1008893 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[07:37:17] (CR) Muehlenhoff: [C: +2] Point apt discovery records to apt1002/apt2002 (new bookworm hosts) [puppet] - https://gerrit.wikimedia.org/r/1008893 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[07:37:25] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9604693 (akosiaris)
[07:41:27] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse1014.eqiad.wmnet with OS bullseye
[07:41:41] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye
[07:44:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58544 and previous config saved to /var/cache/conftool/dbconfig/20240306-074445-root.json
[07:49:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:21] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:51:27] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:51:39] (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[07:53:10] (SystemdUnitFailed) firing: (2) mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:27] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1014.eqiad.wmnet with reason: host reimage
[07:55:27] !log akosiaris@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1356.eqiad.wmnet|mw1357.eqiad.wmnet|parse1002.eqiad.wmnet|parse1003.eqiad.wmnet|parse1004.eqiad.wmnet|parse1005.eqiad.wmnet|parse1006.eqiad.wmnet|parse1007.eqiad.wmnet|parse1008.eqiad.wmnet|parse1009.eqiad.wmnet|parse1010.eqiad.wmnet|parse1011.eqiad.wmnet|parse1012.eqiad.wmnet|parse1013.eqiad.wmnet|parse1014.eqiad.wmnet|parse1015.eqiad.
[07:55:27] wmnet|parse1016.eqiad.wmnet|parse1017.eqiad.wmnet|parse1018.eqiad.wmnet|parse1019.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[07:58:17] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1014.eqiad.wmnet with reason: host reimage
[07:59:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58545 and previous config saved to /var/cache/conftool/dbconfig/20240306-075950-root.json
[08:00:04] Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T0800). Please do the needful.
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:08:53] (CR) Arnaudb: [C: +2] mariadb: toggle notifications for db2217 [puppet] - https://gerrit.wikimedia.org/r/1008084 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb)
[08:12:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58546 and previous config saved to /var/cache/conftool/dbconfig/20240306-081244-arnaudb.json
[08:17:11] !log depool parse2008.codfw.wmnet,parse2009.codfw.wmnet,parse2010.codfw.wmnet,parse2011.codfw.wmnet,parse2012.codfw.wmnet,parse2013.codfw.wmnet,parse2014.codfw.wmnet,parse2015.codfw.wmnet from parsoid. T358752
[08:27:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58547 and previous config saved to /var/cache/conftool/dbconfig/20240306-082749-arnaudb.json
[08:33:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:37:25] seems like a spike ^, already dropping
[08:37:31] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1014.eqiad.wmnet with OS bullseye
[08:38:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58548 and previous config saved to /var/cache/conftool/dbconfig/20240306-083804-arnaudb.json
[08:38:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:38:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58549 and previous config saved to /var/cache/conftool/dbconfig/20240306-083822-arnaudb.json
[08:38:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 5%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58550 and previous config saved to /var/cache/conftool/dbconfig/20240306-083829-arnaudb.json
[08:39:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:42:44] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2008.codfw.wmnet with OS bullseye
[08:42:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 15%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58551 and previous config saved to /var/cache/conftool/dbconfig/20240306-084254-arnaudb.json
[08:43:15] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2009.codfw.wmnet with OS bullseye
[08:43:30] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:43:51] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2010.codfw.wmnet with OS bullseye
[08:44:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:44:34] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2011.codfw.wmnet with OS bullseye
[08:45:05] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2012.codfw.wmnet with OS bullseye
[08:45:52] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2013.codfw.wmnet with OS bullseye
[08:46:49] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2014.codfw.wmnet with OS bullseye
[08:47:29] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2015.codfw.wmnet with OS bullseye
[08:50:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2205.codfw.wmnet with reason: Silence for cloning
[08:51:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2205.codfw.wmnet with reason: Silence for cloning
[08:51:18] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:51:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2105.codfw.wmnet with reason: Silence for cloning
[08:51:25] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:51:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2105.codfw.wmnet with reason: Silence for cloning
[08:51:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: provisionning db2205.codfw.wmnet - T355422
[08:51:52] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[08:51:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: provisionning db2205.codfw.wmnet - T355422
[08:51:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: provisionning db2205.codfw.wmnet - T355422
[08:52:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: provisionning db2205.codfw.wmnet - T355422
[08:53:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2105 in db2205 for T355422', diff saved to https://phabricator.wikimedia.org/P58552 and previous config saved to /var/cache/conftool/dbconfig/20240306-085318-arnaudb.json
[08:53:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58553 and previous config saved to /var/cache/conftool/dbconfig/20240306-085322-arnaudb.json
[08:53:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58554 and previous config saved to /var/cache/conftool/dbconfig/20240306-085327-arnaudb.json
[08:53:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 10%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58555 and previous config saved to /var/cache/conftool/dbconfig/20240306-085334-arnaudb.json
[08:54:17] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2105.codfw.wmnet onto db2205.codfw.wmnet
[08:56:28] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:56:35] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:57:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: provisionning db2206.codfw.wmnet - T355422
[08:57:57] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[08:58:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58556 and previous config saved to /var/cache/conftool/dbconfig/20240306-085759-arnaudb.json
[08:58:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: 
provisionning db2206.codfw.wmnet - T355422 [08:58:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: provisionning db2206.codfw.wmnet - T355422 [08:58:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: provisionning db2206.codfw.wmnet - T355422 [08:58:58] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2009.codfw.wmnet with reason: host reimage [08:59:00] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2008.codfw.wmnet with reason: host reimage [08:59:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2106 in db2206 for T355422', diff saved to https://phabricator.wikimedia.org/P58557 and previous config saved to /var/cache/conftool/dbconfig/20240306-085924-arnaudb.json [08:59:54] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2010.codfw.wmnet with reason: host reimage [09:00:05] jnuche and dduvall: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T0900) [09:00:06] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2011.codfw.wmnet with reason: host reimage [09:00:30] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2106.codfw.wmnet onto db2206.codfw.wmnet [09:00:32] morning, the train is currently blocked by T359290 [09:00:42] T359290: ArgumentCountError: Too few arguments to function MediaWiki\Extension\Gadgets\GadgetRepo::titleWithoutPrefix(), 1 passed in /srv/mediawiki/php-1.42.0-wmf.21/extensions/Gadgets/includes/GadgetResourceLoaderModule.php on line 80 - https://phabricator.wikimedia.org/T359290 [09:00:45] for the moment I'm going to backport a fix for a different blocker T359229 [09:00:47] T359229: Regression: Visited links on mobile appearing as black again - https://phabricator.wikimedia.org/T359229 [09:00:57] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2012.codfw.wmnet with reason: host reimage [09:01:24] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2013.codfw.wmnet with reason: host reimage [09:01:44] akosiaris: hi there, looks like I should wait for you to finish [09:01:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2009.codfw.wmnet with reason: host reimage [09:02:28] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2014.codfw.wmnet with reason: host reimage [09:03:45] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2015.codfw.wmnet with reason: host reimage [09:03:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: provisionning db2208.codfw.wmnet - T355422 [09:03:56] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [09:04:00] !log akosiaris@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2008.codfw.wmnet with reason: host reimage [09:04:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: provisionning db2208.codfw.wmnet - T355422 [09:04:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: provisionning db2208.codfw.wmnet - T355422 [09:04:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: provisionning db2208.codfw.wmnet - T355422 [09:05:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2108 in db2208 for T355422', diff saved to https://phabricator.wikimedia.org/P58558 and previous config saved to /var/cache/conftool/dbconfig/20240306-090524-arnaudb.json [09:06:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2013.codfw.wmnet with reason: host reimage [09:06:26] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2108.codfw.wmnet onto db2208.codfw.wmnet [09:08:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 15%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58559 and previous config saved to /var/cache/conftool/dbconfig/20240306-090827-arnaudb.json [09:08:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 15%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58560 and previous config saved to /var/cache/conftool/dbconfig/20240306-090833-arnaudb.json [09:08:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 15%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58561 and previous config saved to /var/cache/conftool/dbconfig/20240306-090839-arnaudb.json [09:08:43] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
parse2015.codfw.wmnet with reason: host reimage [09:11:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2012.codfw.wmnet with reason: host reimage [09:13:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58562 and previous config saved to /var/cache/conftool/dbconfig/20240306-091304-arnaudb.json [09:13:39] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2014.codfw.wmnet with reason: host reimage [09:16:44] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2010.codfw.wmnet with reason: host reimage [09:20:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2011.codfw.wmnet with reason: host reimage [09:20:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2009.codfw.wmnet with OS bullseye [09:23:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2008.codfw.wmnet with OS bullseye [09:23:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58563 and previous config saved to /var/cache/conftool/dbconfig/20240306-092332-arnaudb.json [09:23:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58564 and previous config saved to /var/cache/conftool/dbconfig/20240306-092337-arnaudb.json [09:23:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 20%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58565 and previous config saved to /var/cache/conftool/dbconfig/20240306-092343-arnaudb.json [09:24:56] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host parse2013.codfw.wmnet with OS bullseye [09:25:37] akosiaris: it looks like the downtiming cookbooks are done? can I go ahead? [09:27:07] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2015.codfw.wmnet with OS bullseye [09:27:52] jnuche: you should be fine to go ahead [09:28:00] jnuche: the hosts are removed from dsh [09:28:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 50%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58566 and previous config saved to /var/cache/conftool/dbconfig/20240306-092809-arnaudb.json [09:28:20] claime: thx! [09:29:55] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2012.codfw.wmnet with OS bullseye [09:32:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2014.codfw.wmnet with OS bullseye [09:35:33] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2010.codfw.wmnet with OS bullseye [09:38:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58567 and previous config saved to /var/cache/conftool/dbconfig/20240306-093837-arnaudb.json [09:38:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58568 and previous config saved to /var/cache/conftool/dbconfig/20240306-093842-arnaudb.json [09:38:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 25%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58569 and previous config saved to /var/cache/conftool/dbconfig/20240306-093849-arnaudb.json [09:39:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2011.codfw.wmnet with OS bullseye [09:42:48] wikibugs is stuck? on strike perhaps? 
[09:42:58] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 181 probes of 737 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:43:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58570 and previous config saved to /var/cache/conftool/dbconfig/20240306-094314-arnaudb.json [09:46:24] (PS1) Arnaudb: mariadb: toggle notifications for db2203/2204 [puppet] - https://gerrit.wikimedia.org/r/1008911 (https://phabricator.wikimedia.org/T355422) [09:46:39] godog: just a hard morning apparently [09:46:44] (PS1) Muehlenhoff: Add an motd for the old buster reposority server [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613) [09:46:52] (CR) CI reject: [V: -1] Add an motd for the old buster reposority server [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff) [09:46:59] claime: understandable [09:47:00] (PS1) Alexandros Kosiaris: Move parse2008-parse2015 to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1009200 (https://phabricator.wikimedia.org/T358752) [09:47:08] (PS2) Muehlenhoff: Add an motd for the old buster reposority server [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613) [09:47:10] i restarted the redis->irc listener, seems like it's back [09:47:24] (CR) Slyngshede: [C: +2] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede) [09:47:32] (CR) Alexandros Kosiaris: [C: +2] Move parse2008-parse2015 to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1009200 (https://phabricator.wikimedia.org/T358752) (owner: Alexandros Kosiaris) 
[09:47:49] (Merged) jenkins-bot: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede) [09:47:50] taavi: many thanks [09:47:57] (CR) Marostegui: [C: +1] mariadb: toggle notifications for db2203/2204 [puppet] - https://gerrit.wikimedia.org/r/1008911 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb) [09:47:58] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 40 probes of 737 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:48:05] (PS1) Arnaudb: mariadb: toggle notifications for db2196 [puppet] - https://gerrit.wikimedia.org/r/1008912 (https://phabricator.wikimedia.org/T355422) [09:48:13] (CR) Marostegui: [C: +1] mariadb: toggle notifications for db2196 [puppet] - https://gerrit.wikimedia.org/r/1008912 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb) [09:48:45] (CR) Arnaudb: [C: +2] mariadb: toggle notifications for db2203/2204 [puppet] - https://gerrit.wikimedia.org/r/1008911 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb) [09:48:53] (CR) Arnaudb: [C: +2] mariadb: toggle notifications for db2196 [puppet] - https://gerrit.wikimedia.org/r/1008912 (https://phabricator.wikimedia.org/T355422) (owner: Arnaudb) [09:49:17] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604826 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse1014.eqiad.wmnet with OS bullseye comp... 
[09:50:09] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604842 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2008.codfw.wmnet with OS bullseye [09:50:29] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604847 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2009.codfw.wmnet with OS bullseye [09:50:57] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2010.codfw.wmnet with OS bullseye [09:51:17] (PS1) Volans: validators: improve IPs DNS name validation [software/netbox-extras] - https://gerrit.wikimedia.org/r/1009202 [09:51:25] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604849 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2011.codfw.wmnet with OS bullseye [09:51:45] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff) [09:51:53] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604850 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2012.codfw.wmnet with OS bullseye [09:52:29] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* 
hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604852 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2013.codfw.wmnet with OS bullseye [09:52:39] !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1008986|Rename `--color-link--visited` to `--color-visited` (T356928)]] [09:52:43] T356928: Regression: Visited links on mobile appearing as black - https://phabricator.wikimedia.org/T356928 [09:53:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 50%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58571 and previous config saved to /var/cache/conftool/dbconfig/20240306-095342-arnaudb.json [09:53:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 50%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58572 and previous config saved to /var/cache/conftool/dbconfig/20240306-095347-arnaudb.json [09:53:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 50%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58573 and previous config saved to /var/cache/conftool/dbconfig/20240306-095354-arnaudb.json [09:54:13] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604859 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2014.codfw.wmnet with OS bullseye [09:54:42] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604860 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2015.codfw.wmnet with OS bullseye [09:55:02] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) 
(owner: Slyngshede) [09:55:10] (CR) Muehlenhoff: [C: +2] Add an motd for the old buster reposority server [puppet] - https://gerrit.wikimedia.org/r/1009199 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff) [09:55:34] (CR) Slyngshede: [V: +1 C: +2] C:tomcat10 Sync server.xml with default from Tomcat10/Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede) [09:56:39] !log jnuche@deploy2002 jnuche and toyofuku: Backport for [[gerrit:1008986|Rename `--color-link--visited` to `--color-visited` (T356928)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:56:50] (PS1) Muehlenhoff: Fix resource header [puppet] - https://gerrit.wikimedia.org/r/1009203 [09:57:02] (CR) Filippo Giunchedi: [C: +1] cdn: Fix site var for ATSBackendErrorsHigh [alerts] - https://gerrit.wikimedia.org/r/1008981 (owner: BCornwall) [09:57:05] !log jnuche@deploy2002 jnuche and toyofuku: Continuing with sync [09:57:11] (PS2) Slyngshede: Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff) [09:57:19] (PS3) Slyngshede: Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff) [09:57:27] (CR) Muehlenhoff: [C: +2] Fix resource header [puppet] - https://gerrit.wikimedia.org/r/1009203 (owner: Muehlenhoff) [09:57:35] (PS4) Muehlenhoff: Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) [09:58:15] (CR) Muehlenhoff: [C: +2] Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff) 
[09:58:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58574 and previous config saved to /var/cache/conftool/dbconfig/20240306-095820-arnaudb.json [09:59:51] (CR) Filippo Giunchedi: [C: +1] "LGTM, thank you for expanding on the rationale/context!" [puppet] - https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: Herron) [10:00:15] (PS1) Volans: validators: add field name to fail messages [software/netbox-extras] - https://gerrit.wikimedia.org/r/1009206 [10:02:21] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604968 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2009.codfw.wmnet with OS bullseye comp... [10:03:25] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9604974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2008.codfw.wmnet with OS bullseye comp... [10:03:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:03:57] SRE, SRE Observability (FY2023/2024-Q3): ircecho doesn't attempt to open log files created after startup - https://phabricator.wikimedia.org/T359292 (fgiunchedi) [10:04:25] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2013.codfw.wmnet with OS bullseye comp... 
[10:04:55] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:05] sre-alert-triage, SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293 (LSobanski) [10:06:23] sre-alert-triage, SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293#9605030 (LSobanski) Same set of alerts is firing for centrallog2002. [10:06:33] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605029 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2015.codfw.wmnet with OS bullseye comp... [10:07:17] (CR) TrainBranchBot: [C: +2] "Approved by jnuche@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008986 (https://phabricator.wikimedia.org/T356928) (owner: Stoyofuku-wmf) [10:07:42] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605037 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2012.codfw.wmnet with OS bullseye comp... 
[10:08:15] !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1008986|Rename `--color-link--visited` to `--color-visited` (T356928)]] (duration: 15m 35s) [10:08:19] T356928: Regression: Visited links on mobile appearing as black - https://phabricator.wikimedia.org/T356928 [10:08:52] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [10:08:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 75%: Cloning done', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240306-100847-arnaudb.json [10:08:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 75%: Cloning done', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240306-100853-arnaudb.json [10:09:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 75%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58576 and previous config saved to /var/cache/conftool/dbconfig/20240306-100859-arnaudb.json [10:09:05] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [10:09:14] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605060 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2014.codfw.wmnet with OS bullseye comp... [10:09:45] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [10:10:07] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605070 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2010.codfw.wmnet with OS bullseye comp... 
[10:10:09] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [10:10:27] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1007335 (owner: Majavah) [10:10:35] (PS1) Fabfur: cache: fix benthos conffile variable interpolation [puppet] - https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) [10:10:51] (CR) Clément Goubert: [C: +1] api-gateway: make ratelimit timeout a value, set to .5s [deployment-charts] - https://gerrit.wikimedia.org/r/1008933 (owner: Hnowlan) [10:11:19] (PS2) Fabfur: cache: fix benthos conffile variable interpolation [puppet] - https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) [10:11:25] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [10:11:27] SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9605090 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2011.codfw.wmnet with OS bullseye comp... 
[10:11:50] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [10:12:19] (CR) Majavah: [V: +1 C: +2] ldap: fix sssd socket activation on bookworm [puppet] - https://gerrit.wikimedia.org/r/1007335 (owner: Majavah) [10:13:10] (SystemdUnitFailed) firing: (3) httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:15] (PS3) Fabfur: cache: fix benthos conffile variable interpolation [puppet] - https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) [10:13:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:15:06] SRE, SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9605115 (KCVelaga_WMF) As @MoritzMuehlenhoff suggested, I have updated my email to kcvelaga+old@wikimedia.org at idm.wikimedia.org, which is now being... [10:15:18] (PS4) Fabfur: cache: fix benthos conffile variable interpolation [puppet] - https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) [10:15:52] (PS17) Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/737859 [10:16:20] (Merged) jenkins-bot: Rename `--color-link--visited` to `--color-visited` [skins/MinervaNeue] (wmf/1.42.0-wmf.21) - https://gerrit.wikimedia.org/r/1008986 (https://phabricator.wikimedia.org/T356928) (owner: Stoyofuku-wmf) [10:17:08] (CR) Thiemo Kreuz (WMDE): "I made the patch much smaller (again) in patchset 17." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/737859 (owner: Thiemo Kreuz (WMDE)) [10:18:34] (CR) Filippo Giunchedi: "Untested but LGTM" [puppet] - https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [10:18:43] (CR) Filippo Giunchedi: [C: +1] cache: fix benthos conffile variable interpolation [puppet] - https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [10:18:59] (CR) Majavah: [C: +2] hieradata: update test VM without floating IP [puppet] - https://gerrit.wikimedia.org/r/1008892 (owner: Majavah) [10:19:15] (CR) Ayounsi: [C: +1] "Overall lgtm, I worry we complexity it too much, but it's not too bad so far :)" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1009202 (owner: Volans) [10:19:55] (CR) Majavah: [C: +2] Convert remaining images to shell webservice-runner [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1005952 (https://phabricator.wikimedia.org/T293552) (owner: Majavah) [10:20:09] (Merged) jenkins-bot: Convert remaining images to shell webservice-runner [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1005952 (https://phabricator.wikimedia.org/T293552) (owner: Majavah) [10:20:41] (CR) Fabfur: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1589/console" [puppet] - https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) (owner: Fabfur) [10:21:14] (CR) Ayounsi: [C: +1] "TIL :)" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1009206 (owner: Volans) [10:21:22] (CR) JMeybohm: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1008583 (https://phabricator.wikimedia.org/T359130) (owner: RLazarus) [10:21:32] (CR) Fabfur: [V: +1 C: +2] cache: fix benthos conffile variable interpolation [puppet] - 
10https://gerrit.wikimedia.org/r/1009210 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:21:40] (03CR) 10Clément Goubert: [C: 03+2] ferm: Check ferm.service status in ferm_status.py (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1005978 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [10:21:56] (03CR) 10Hnowlan: [C: 03+2] api-gateway: make ratelimit timeout a value, set to .5s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008933 (owner: 10Hnowlan) [10:22:20] (03Merged) 10jenkins-bot: api-gateway: make ratelimit timeout a value, set to .5s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008933 (owner: 10Hnowlan) [10:23:10] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:23:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58577 and previous config saved to /var/cache/conftool/dbconfig/20240306-102357-arnaudb.json [10:24:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58578 and previous config saved to /var/cache/conftool/dbconfig/20240306-102402-arnaudb.json [10:24:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 100%: Cloning done', diff saved to https://phabricator.wikimedia.org/P58579 and previous config saved to /var/cache/conftool/dbconfig/20240306-102404-arnaudb.json [10:25:25] (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [10:28:18] 06SRE: Improve automation for the vendor maintenance 
calendar - https://phabricator.wikimedia.org/T357630#9605262 (10ayounsi) Thanks for looking into it ! I worry about re-writing an in house library to parse vendor emails, as those emails come in all shapes and forms and change regularly, from attached ICS, to... [10:28:58] (03PS1) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) [10:29:14] (03CR) 10CI reject: [V: 04-1] cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:30:28] (03CR) 10Volans: "addressing comments" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans) [10:30:52] (03PS2) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) [10:31:44] (03PS3) 10Fabfur: cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) [10:33:58] (03CR) 10Filippo Giunchedi: [C: 03+1] cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:34:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2105.codfw.wmnet onto db2205.codfw.wmnet [10:35:59] (03CR) 10Fabfur: [C: 03+2] cache: fix benthos conffile variable interpolation [puppet] - 10https://gerrit.wikimedia.org/r/1009216 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:37:32] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli) [10:37:39] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:37:46] !log 
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:38:37] !log akosiaris@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(parse2008.codfw.wmnet|parse2009.codfw.wmnet|parse2010.codfw.wmnet|parse2011.codfw.wmnet|parse2012.codfw.wmnet|parse2013.codfw.wmnet|parse2014.codfw.wmnet|parse2015.codfw.wmnet),cluster=kubernetes,service=kubesvc [10:38:51] (03PS1) 10Marostegui: data.yaml: Add bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) [10:40:22] (03Merged) 10jenkins-bot: mw-mcrouter: update namespace resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008498 (owner: 10Effie Mouzeli) [10:41:01] (03PS1) 10Jgiannelos: mobileapps: Use upper case for request methods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009220 [10:41:04] (03PS1) 10Fabfur: cache: fix benthos typo [puppet] - 10https://gerrit.wikimedia.org/r/1009221 (https://phabricator.wikimedia.org/T358109) [10:42:02] (03CR) 10Effie Mouzeli: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [10:42:27] (03CR) 10Clément Goubert: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [10:43:24] (03CR) 10Effie Mouzeli: [C: 03+2] sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [10:44:57] (03CR) 10Fabfur: [C: 03+2] cache: fix benthos typo [puppet] - 10https://gerrit.wikimedia.org/r/1009221 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:46:21] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[10:46:55] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:47:37] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:47:57] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Restart mw-jobrunner pods in DC_FROM [cookbooks] - 10https://gerrit.wikimedia.org/r/1008842 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [10:48:04] (03CR) 10Muehlenhoff: data.yaml: Add bdgreenlee (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) (owner: 10Marostegui) [10:48:54] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:49:23] (03PS2) 10Marostegui: data.yaml: Add bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) [10:49:32] (03CR) 10Marostegui: data.yaml: Add bdgreenlee (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) (owner: 10Marostegui) [10:49:47] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: sync [10:49:55] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: sync [10:50:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Host has already been cloned, there was 2 candidate master', diff saved to https://phabricator.wikimedia.org/P58580 and previous config saved to /var/cache/conftool/dbconfig/20240306-105007-arnaudb.json [10:50:12] (JobUnavailable) firing: (3) Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:50:17] (03PS1) 10Jgiannelos: mobileapps: Use upper case method name for requests to rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) [10:51:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" 
[puppet] - 10https://gerrit.wikimedia.org/r/1008926 (https://phabricator.wikimedia.org/T359031) (owner: 10Btullis) [10:52:05] (03CR) 10Hnowlan: [C: 03+1] restbase: Start moving mwapi calls to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005756 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [10:52:45] !og installing gnutls28 security updates on bullseye [10:58:40] (03CR) 10Muehlenhoff: [C: 03+2] openstack::base::pdns::recursor::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1006525 (owner: 10Muehlenhoff) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1100) [11:04:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2108.codfw.wmnet onto db2208.codfw.wmnet [11:04:43] (03PS1) 10Alexandros Kosiaris: mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) [11:05:45] (03CR) 10Hnowlan: [C: 03+1] mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:05:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:06:15] (03CR) 10Clément Goubert: [C: 03+1] mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:07:27] (03Merged) 10jenkins-bot: mw-parsoid: replicas x2 for hopefully the last time [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009227 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros 
Kosiaris) [11:08:33] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [11:08:59] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [11:10:08] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [11:10:52] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [11:10:58] (03Abandoned) 10Jgiannelos: mobileapps: Use upper case for request methods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009220 (owner: 10Jgiannelos) [11:13:00] (03PS1) 10Clément Goubert: Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009229 (https://phabricator.wikimedia.org/T351074) [11:14:34] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008914 [11:14:36] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008915 [11:14:57] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008914 (owner: 10PipelineBot) [11:15:40] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1008914 (owner: 10PipelineBot) [11:15:47] (03PS4) 10Alexandros Kosiaris: Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) [11:15:49] (03PS4) 10Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392) [11:15:56] (03PS4) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392) [11:16:04] (03PS4) 10Alexandros Kosiaris: services_proxy: Remove 
parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392) [11:17:00] (03CR) 10CI reject: [V: 04-1] Switch restbase102[6789], restbase103[0123], restbase202[89], restbase203[01234] to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:17:03] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:17:31] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:17:37] (03PS2) 10Clément Goubert: restbase: Start moving mwapi calls to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1005756 (https://phabricator.wikimedia.org/T358213) [11:19:02] (03PS5) 10Alexandros Kosiaris: Switch restbase1026-1033, restbase20289-2034 to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) [11:19:04] (03PS5) 10Alexandros Kosiaris: restbase: Switch the default to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006898 (https://phabricator.wikimedia.org/T357392) [11:19:07] (03PS5) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T357392) [11:19:15] (03PS5) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T357392) [11:19:54] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [11:20:56] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:21:04] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:21:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2106.codfw.wmnet onto db2206.codfw.wmnet [11:21:53] !log 
jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:21:55] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9605592 (10cmooney) @kcvelaga_wmf great news! I think the next steps would be to move any files you have. I can do this for the stats boxes or other... [11:24:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch restbase1026-1033, restbase20289-2034 to mw-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/1006896 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [11:26:28] (03CR) 10Hnowlan: [C: 03+1] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009229 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [11:28:03] jouncebot: nowandnext [11:28:03] For the next 0 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1100) [11:28:04] In 2 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400) [11:28:07] !log Disabling puppet on deployment servers for canary api_appserver move - T351074 [11:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:11] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:30:21] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009229 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [11:31:16] !log 
Disabling puppet on mw2374.codfw.wmnet,mw2376.codfw.wmnet,mw2283.codfw.wmnet,mw2284.codfw.wmnet,mw2371.codfw.wmnet,mw2372.codfw.wmnet,mw2373.codfw.wmnet,mw2375.codfw.wmnet for canary api_appserver move - T351074 [11:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:30] (03PS1) 10Jaime Nuche: Add missing function argument to titleWithoutPrefix call [extensions/Gadgets] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009231 (https://phabricator.wikimedia.org/T359290) [11:31:49] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:31:55] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:32:01] (03CR) 10Clément Goubert: [C: 03+2] Move 6 codfw appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009229 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [11:33:41] !log Enabling and running puppet on new canaries mw2283.codfw.wmnet,mw2284.codfw.wmnet - T351074 [11:33:44] (03CR) 10MSantos: [C: 03+1] mobileapps: Use upper case method name for requests to rest.php [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos) [11:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:45] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:37:46] !log Enabling and running puppet on deployment servers - T351074 [11:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:55] (SystemdUnitFailed) firing: (3) otelcol-contrib.service on mw2283:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:40:37] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, 
analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9605712 (10KCVelaga_WMF) @cmooney I have moved over the files from stat1005:kcv-wikimf to stat1008:kcvelaga, and everything is working fine. After a co... [11:40:38] !log pooling new canaries - T351074 [11:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:42] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [11:41:18] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: cluster=api_appserver,service=canary,dc=codfw [11:41:27] !log cgoubert@cumin2002 conftool action : set/weight=30; selector: cluster=api_appserver,service=canary,dc=codfw [11:42:36] !log Depooling mw2371.codfw.wmnet,mw2372.codfw.wmnet,mw2373.codfw.wmnet,mw2374.codfw.wmnet,mw2375.codfw.wmnet,mw2376.codfw.wmnet for reimage to kubernetes - T351074 [11:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:10] (SystemdUnitFailed) firing: (3) otelcol-contrib.service on mw2283:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:20] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:43:26] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:43:27] ^lies [12:15:38] PROBLEM - Check whether ferm is active by checking the default input chain on mw2310 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:17:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [12:17:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [12:17:56] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2375.codfw.wmnet with reason: host reimage [12:18:00] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2371.codfw.wmnet with reason: host reimage [12:18:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T352010)', diff saved to https://phabricator.wikimedia.org/P58581 and previous config saved to /var/cache/conftool/dbconfig/20240306-121800-ladsgroup.json [12:18:04] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2373.codfw.wmnet with reason: host reimage [12:18:10] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2374.codfw.wmnet with reason: host reimage [12:18:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:18:22] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2376.codfw.wmnet with reason: host reimage [12:18:26] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2372.codfw.wmnet with reason: host reimage [12:19:41] jouncebot: nowandnext [12:19:41] No deployments scheduled for the next 1 hour(s) and 40 minute(s) [12:19:41] In 1 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400) [12:19:53] claime: okay if I deploy mw? 
[12:19:59] Amir1: check with jnuche [12:20:12] cool thanks [12:20:16] he's backporting something, then rolling the train forward [12:20:17] (03PS2) 10Jgiannelos: mobileapps: Use upper case method names in request templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) [12:20:26] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2375.codfw.wmnet with reason: host reimage [12:20:26] ah I see [12:20:30] so maybe you can squeeze your backport in between [12:20:30] idk [12:20:43] yeah, I wait for it to finish [12:21:33] 10ops-eqiad, 06DC-Ops: Inconsistent data in Netbox for some msw device - https://phabricator.wikimedia.org/T359326 (10Volans) [12:21:47] jnuche: please let me know once you're done with your magic. [12:21:51] (03PS2) 10Volans: validators: improve IPs DNS name validation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 [12:21:53] (03PS2) 10Volans: validators: add field name to fail messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206 [12:21:57] (03PS1) 10Volans: validators: fix existing bugs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009240 [12:22:04] Amir1: hi there, what's your patch? 
maybe we can merge it ahead of time to go faster [12:22:21] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1008503 [12:22:28] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2372.codfw.wmnet with reason: host reimage [12:22:35] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: reduce cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009238 (owner: 10Effie Mouzeli) [12:23:00] (03CR) 10Clément Goubert: [C: 03+1] sre.switchdc.mediawiki: update mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [12:23:33] (03Merged) 10jenkins-bot: mw-mcrouter: reduce cpu limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009238 (owner: 10Effie Mouzeli) [12:23:47] Amir1: looks like it needs a rebase, should I just do it from the UI? [12:23:58] yeah, that's a lie [12:24:17] (any edit on IS.php triggers merge conflict) [12:24:29] (03CR) 10Ayounsi: [C: 03+1] validators: improve IPs DNS name validation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans) [12:24:34] (03CR) 10Clément Goubert: [C: 03+1] Move parse2002-parse2007 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009239 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris) [12:24:53] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2371.codfw.wmnet with reason: host reimage [12:24:58] yeah, oversensitivity of gerrit with file modification [12:25:03] (03CR) 10Ayounsi: [C: 03+1] validators: add field name to fail messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206 (owner: 10Volans) [12:25:21] (03PS2) 10Jaime Nuche: Set two more wikis to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008503 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [12:25:24] (03CR) 
10Volans: [C: 03+2] validators: improve IPs DNS name validation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans) [12:25:32] (03CR) 10Volans: [C: 03+2] validators: add field name to fail messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206 (owner: 10Volans) [12:25:57] (03Merged) 10jenkins-bot: validators: improve IPs DNS name validation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009202 (owner: 10Volans) [12:26:02] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [12:26:03] (03Merged) 10jenkins-bot: validators: add field name to fail messages [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009206 (owner: 10Volans) [12:26:30] will merge it in a sec, waiting for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Gadgets/+/1009231 to merge to avoid issues [12:27:15] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2376.codfw.wmnet with reason: host reimage [12:27:48] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) (owner: 10Marostegui) [12:27:58] (03Merged) 10jenkins-bot: Add missing function argument to titleWithoutPrefix call [extensions/Gadgets] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009231 (https://phabricator.wikimedia.org/T359290) (owner: 10Jaime Nuche) [12:28:16] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add bdgreenlee [puppet] - 10https://gerrit.wikimedia.org/r/1009219 (https://phabricator.wikimedia.org/T359123) (owner: 10Marostegui) [12:28:23] !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1009231|Add missing function argument to titleWithoutPrefix call (T359290)]] [12:28:27] T359290: ArgumentCountError: Too few arguments to function MediaWiki\Extension\Gadgets\GadgetRepo::titleWithoutPrefix(), 1 passed in 
/srv/mediawiki/php-1.42.0-wmf.21/extensions/Gadgets/includes/GadgetResourceLoaderModule.php on line 80 - https://phabricator.wikimedia.org/T359290 [12:29:06] (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [12:29:09] (03CR) 10Jaime Nuche: [C: 03+2] Set two more wikis to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008503 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [12:29:12] (03CR) 10Ayounsi: [C: 03+1] validators: fix existing bugs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009240 (owner: 10Volans) [12:29:50] (03Merged) 10jenkins-bot: Set two more wikis to read new for pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008503 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [12:30:00] !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1009231|Add missing function argument to titleWithoutPrefix call (T359290)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:30:03] (03CR) 10Volans: [C: 03+2] validators: fix existing bugs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009240 (owner: 10Volans) [12:30:13] !log jnuche@deploy2002 jnuche: Continuing with sync [12:30:21] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2373.codfw.wmnet with reason: host reimage [12:30:36] (03Merged) 10jenkins-bot: validators: fix existing bugs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1009240 (owner: 10Volans) [12:32:39] (03CR) 10Bartosz Dziewoński: "(WMPL team asked me to review)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008830 (https://phabricator.wikimedia.org/T358379) (owner: 10Urbanecm) [12:33:10] (SystemdUnitFailed) firing: (3) ferm.service on mw2310:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:33:22] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2374.codfw.wmnet with reason: host reimage [12:33:29] !log volans@cumin2002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [12:33:48] !log volans@cumin2002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [12:34:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:35:02] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:37:10] (03PS3) 10Jgiannelos: mobileapps: Use upper case method names for rest.php requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) [12:37:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:37:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:37:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] Move parse2002-parse2007 to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009239 (https://phabricator.wikimedia.org/T358752) (owner: 10Alexandros Kosiaris) [12:39:29] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2375.codfw.wmnet with OS bullseye [12:39:34] !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1009231|Add missing function argument to titleWithoutPrefix call (T359290)]] (duration: 11m 10s) [12:39:39] T359290: ArgumentCountError: Too few arguments to function MediaWiki\Extension\Gadgets\GadgetRepo::titleWithoutPrefix(), 1 passed in /srv/mediawiki/php-1.42.0-wmf.21/extensions/Gadgets/includes/GadgetResourceLoaderModule.php on line 80 - 
https://phabricator.wikimedia.org/T359290 [12:40:31] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2002.codfw.wmnet with OS bullseye [12:40:44] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2002.codfw.wmnet with OS bullseye [12:40:48] Amir1: backporting your change [12:41:17] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2003.codfw.wmnet with OS bullseye [12:41:22] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2372.codfw.wmnet with OS bullseye [12:41:32] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2003.codfw.wmnet with OS bullseye [12:41:34] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [12:41:51] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2004.codfw.wmnet with OS bullseye [12:42:04] !log jnuche@deploy2002 Started scap: Backport for [[gerrit:1008503|Set two more wikis to read new for pagelinks migration (T351237)]] [12:42:05] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2004.codfw.wmnet with OS bullseye [12:42:18] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [12:42:26] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage 
for host parse2005.codfw.wmnet with OS bullseye [12:42:36] jnuche: thanks! [12:42:41] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2005.codfw.wmnet with OS bullseye [12:42:59] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2006.codfw.wmnet with OS bullseye [12:43:08] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2371.codfw.wmnet with OS bullseye [12:43:14] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606073 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2006.codfw.wmnet with OS bullseye [12:43:28] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host parse2007.codfw.wmnet with OS bullseye [12:43:52] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606080 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host parse2007.codfw.wmnet with OS bullseye [12:45:36] !log jnuche@deploy2002 jnuche and ladsgroup: Backport for [[gerrit:1008503|Set two more wikis to read new for pagelinks migration (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:45:38] RECOVERY - Check whether ferm is active by checking the default input chain on mw2310 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:46:04] !log jnuche@deploy2002 jnuche and ladsgroup: Continuing with sync [12:46:26] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
mw2376.codfw.wmnet with OS bullseye [12:46:52] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [12:47:53] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:49:25] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2373.codfw.wmnet with OS bullseye [12:49:34] PROBLEM - RPKI Validator RTR port on rpki2002 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [12:49:36] PROBLEM - Routinator process on rpki2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [12:52:18] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2374.codfw.wmnet with OS bullseye [12:52:53] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:53:10] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:10] (SystemdUnitFailed) firing: (4) ferm.service on kubernetes2033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:54] !log Running homer 'cr*codfw*' commit 'T351074' [12:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:10] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [12:54:19] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:54:26] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:54:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [12:55:12] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:55:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2218.codfw.wmnet with reason: Maintenance [12:55:24] !log jnuche@deploy2002 Finished scap: Backport for [[gerrit:1008503|Set two more wikis to read new for pagelinks migration (T351237)]] (duration: 13m 20s) [12:55:29] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [12:55:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58582 and previous config saved to /var/cache/conftool/dbconfig/20240306-125529-arnaudb.json [12:55:33] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:55:34] RECOVERY - RPKI Validator RTR port on rpki2002 is OK: TCP OK - 0.001 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [12:55:36] RECOVERY - Routinator process on rpki2002 is OK: PROCS OK: 1 process with 
command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process [12:56:00] Amir1: done! [12:56:09] awesome. Thank you! [12:56:11] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2002.codfw.wmnet with reason: host reimage [12:56:21] I noticed the blocker (T359290) errors happen consistently at 10 minutes past the top of the hour [12:56:22] T359290: ArgumentCountError: Too few arguments to function MediaWiki\Extension\Gadgets\GadgetRepo::titleWithoutPrefix(), 1 passed in /srv/mediawiki/php-1.42.0-wmf.21/extensions/Gadgets/includes/GadgetResourceLoaderModule.php on line 80 - https://phabricator.wikimedia.org/T359290 [12:56:29] (03PS1) 10Muehlenhoff: routinator: Drop --tal-dir [puppet] - 10https://gerrit.wikimedia.org/r/1009247 [12:56:34] so I'm going to wait a bit until 13:10 UTC to verify the backport fixed the problem before rolling forward the train [12:57:04] jouncebot: next [12:57:04] In 1 hour(s) and 2 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400) [12:57:20] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2003.codfw.wmnet with reason: host reimage [12:57:25] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2004.codfw.wmnet with reason: host reimage [12:57:44] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2005.codfw.wmnet with reason: host reimage [12:58:10] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:10] (SystemdUnitFailed) firing: (7) ferm.service on kubernetes2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:24] !log robh@cumin1002 START - Cookbook sre.dns.netbox [12:58:40] (KubernetesRsyslogDown) firing: rsyslog on mw2436:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2436 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:58:47] (HelmReleaseBadStatus) firing: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:58:49] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2006.codfw.wmnet with reason: host reimage [12:58:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:59:00] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2002.codfw.wmnet with reason: host reimage [12:59:01] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:59:27] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on parse2007.codfw.wmnet with reason: host reimage [12:59:55] (SystemdUnitFailed) firing: (7) ferm.service on kubernetes2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:00:16] !log robh@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fixing incorrect asset tags - robh@cumin1002" [13:00:56] (03CR) 10Ayounsi: [C: 03+1] routinator: Drop --tal-dir [puppet] - 10https://gerrit.wikimedia.org/r/1009247 (owner: 
10Muehlenhoff) [13:01:08] !log robh@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fixing incorrect asset tags - robh@cumin1002" [13:01:08] !log robh@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:01:26] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2005.codfw.wmnet with reason: host reimage [13:01:43] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [13:01:49] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [13:02:58] (03CR) 10Elukey: [C: 03+1] "Sure makes sense, I'd write a comment to the puppet class to highlight this decision though, otherwise it may be missed at first." [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron) [13:03:10] (SystemdUnitFailed) resolved: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2003.codfw.wmnet with reason: host reimage [13:03:47] (HelmReleaseBadStatus) resolved: Helm release mw-mcrouter/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:05:13] (03CR) 10Elukey: [V: 03+1 C: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1590/co" [puppet] - 10https://gerrit.wikimedia.org/r/1008535 
(https://phabricator.wikimedia.org/T358870) (owner: 10Herron) [13:05:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58583 and previous config saved to /var/cache/conftool/dbconfig/20240306-130542-arnaudb.json [13:05:52] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:06:05] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2006.codfw.wmnet with reason: host reimage [13:06:41] !log Pooling and uncordoning mw2372.codfw.wmnet mw2373.codfw.wmnet mw2374.codfw.wmnet mw2375.codfw.wmnet mw2376.codfw.wmnet - T351074 [13:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:50] !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw2371.codfw.wmnet|mw2372.codfw.wmnet|mw2373.codfw.wmnet|mw2374.codfw.wmnet|mw2375.codfw.wmnet|mw2376.codfw.wmnet),cluster=kubernetes,service=kubesvc [13:06:56] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [13:07:45] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9606215 (10MoritzMuehlenhoff) [13:08:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:08:23] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2007.codfw.wmnet with reason: host reimage [13:09:04] (03PS2) 10Muehlenhoff: routinator: Drop --tal-dir [puppet] - 10https://gerrit.wikimedia.org/r/1009247 [13:11:20] !log akosiaris@cumin1002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse2004.codfw.wmnet with reason: host reimage [13:12:27] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:12:52] the blocker error is not reproducing anymore, rolling train forward [13:13:01] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:13:07] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:13:10] (SystemdUnitFailed) firing: (7) ferm.service on kubernetes2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:13:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:13:30] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009250 (https://phabricator.wikimedia.org/T354439) [13:13:32] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009250 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [13:14:16] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1009250 (https://phabricator.wikimedia.org/T354439) (owner: 10TrainBranchBot) [13:16:32] !log jiji@deploy2002 helmfile [eqiad] START 
helmfile.d/services/mw-mcrouter: apply [13:16:38] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [13:16:41] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [13:16:47] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [13:17:41] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2002.codfw.wmnet with OS bullseye [13:17:56] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2002.codfw.wmnet with OS bullseye comp... [13:20:21] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2005.codfw.wmnet with OS bullseye [13:20:36] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606285 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2005.codfw.wmnet with OS bullseye comp... [13:20:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P58585 and previous config saved to /var/cache/conftool/dbconfig/20240306-132048-arnaudb.json [13:21:17] 07sre-alert-triage, 06SRE Observability: Alert in need of triage: ProbeDown (instance centrallog1002:6514) - https://phabricator.wikimedia.org/T359293#9606310 (10fgiunchedi) Thank you @LSobanski ! 
Those are known, I've silenced the alerts for now, leaving the task open as a reminder [13:23:31] akosiaris, claime: four of the K8s parse hosts failed to pull the latest multiversion image during the train rollout, presumably due to the reimaging: [13:23:36] https://www.irccloud.com/pastebin/SOxQUtd6/ [13:24:00] will they pull the latest version of the image once they get put back in the rotation? [13:24:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2003.codfw.wmnet with OS bullseye [13:24:38] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2006.codfw.wmnet with OS bullseye [13:24:38] jnuche: they shouldn't have even tried [13:24:43] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606341 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2003.codfw.wmnet with OS bullseye comp... [13:24:44] they're not parse anymore [13:24:47] proceed [13:24:55] alright, thank you [13:24:56] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2006.codfw.wmnet with OS bullseye comp... [13:25:03] almost everything has been migrated to mw-parsoid now [13:25:09] 06SRE, 10SRE Observability (FY2023/2024-Q3): ircecho doesn't attempt to open log files created after startup - https://phabricator.wikimedia.org/T359292#9606343 (10fgiunchedi) Logs from `ircecho.service` ` Mar 05 15:14:33 alert2001 ircecho[1136326]: Failed to open file: /var/log/icinga/irc-analytics.log Mar 0... 
[13:25:42] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:25:45] (03CR) 10Muehlenhoff: [C: 03+2] routinator: Drop --tal-dir [puppet] - 10https://gerrit.wikimedia.org/r/1009247 (owner: 10Muehlenhoff) [13:25:49] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:27:06] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2007.codfw.wmnet with OS bullseye [13:27:22] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606371 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2007.codfw.wmnet with OS bullseye comp... [13:27:37] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.21 refs T354439 [13:27:41] T354439: 1.42.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T354439 [13:28:10] (03PS1) 10Fabfur: haproxy: send errored messages to separate (deadletter) topic [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109) [13:30:09] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse2004.codfw.wmnet with OS bullseye [13:30:22] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606405 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host parse2004.codfw.wmnet with OS bullseye comp... 
[13:35:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P58586 and previous config saved to /var/cache/conftool/dbconfig/20240306-133555-arnaudb.json [13:36:11] (03CR) 10Btullis: [C: 03+2] Restrict the set of URLS serviced by Archiva [puppet] - 10https://gerrit.wikimedia.org/r/1008926 (https://phabricator.wikimedia.org/T359031) (owner: 10Btullis) [13:37:27] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:17] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1591/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [13:45:13] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:49:12] (03CR) 10Muehlenhoff: Routed Ganeti: use per tap interface dhcrelay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:49:25] (03PS1) 10Filippo Giunchedi: icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) [13:50:12] (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:50:36] (03CR) 10CI reject: 
[V: 04-1] icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi) [13:50:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/995032 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [13:51:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T357189)', diff saved to https://phabricator.wikimedia.org/P58587 and previous config saved to /var/cache/conftool/dbconfig/20240306-135102-arnaudb.json [13:51:07] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:52:09] (03PS2) 10Filippo Giunchedi: icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) [13:53:21] (03CR) 10Dreamy Jazz: [C: 03+1] throttle: Allow for overriding temp account creation limits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008112 (https://phabricator.wikimedia.org/T357777) (owner: 10Kosta Harlan) [13:54:03] (03CR) 10Jgiannelos: "I suggest we keep only uppercase the references to rest.php so we make sure that while this [1] patch is not deployed in production we don" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos) [13:55:04] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208#9606600 (10SLopes-WMF) Approved. Please go ahead. 
[13:55:23] (03CR) 10Muehlenhoff: icinga: create ircecho log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi) [13:55:58] (03CR) 10CI reject: [V: 04-1] icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi) [13:58:46] (03CR) 10Ssingh: [C: 03+1] cdn: Fix site var for ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1008981 (owner: 10BCornwall) [13:59:06] (03PS2) 10Reedy: CommonSettings: Add $wgSecurePollExcludedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1008997 (https://phabricator.wikimedia.org/T303135) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:37] agree :D [14:02:43] (03PS3) 10Herron: profile::kafka::broker: set cert renewal at 1 month [puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) [14:02:57] urbanecm: I see you're already working on a fix for T359216. Do you think the user impact is bad enough to rollback while you work on the fix? 
[14:02:58] T359216: [testwiki - wmf.21] Bad request for page/summary and user-impact - https://phabricator.wikimedia.org/T359216 [14:03:41] jnuche: just saw your comment on the task, replied there [14:03:45] (03CR) 10Filippo Giunchedi: icinga: create ircecho log files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi) [14:03:51] (03PS3) 10Filippo Giunchedi: icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) [14:04:21] (03PS1) 10Marostegui: data.yaml: Add FebinBellamy [puppet] - 10https://gerrit.wikimedia.org/r/1009259 (https://phabricator.wikimedia.org/T359208) [14:04:35] urbanecm: thx 👍 [14:05:03] jnuche: my "fix" reverts bunch of other things, not sure what exactly those commits change. i pinged Daniel in #engineering-all at Slack, let's see what happens. [14:05:19] ack [14:05:43] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208#9606622 (10Marostegui) a:03Marostegui [14:06:49] (03CR) 10Herron: [C: 03+2] "done!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1008535 (https://phabricator.wikimedia.org/T358870) (owner: 10Herron) [14:07:15] (03PS1) 10Effie Mouzeli: mw-mcrouter: lower memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009260 [14:08:22] (03PS1) 10Ssingh: P:dns::auth: skipping running authdns-update on host if not pooled [puppet] - 10https://gerrit.wikimedia.org/r/1009261 (https://phabricator.wikimedia.org/T347054) [14:09:22] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: lower memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009260 (owner: 10Effie Mouzeli) [14:10:17] (03Merged) 10jenkins-bot: mw-mcrouter: lower memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009260 (owner: 10Effie Mouzeli) [14:10:19] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1592/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009261 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:11:01] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [14:11:25] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [14:11:31] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [14:11:49] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [14:12:25] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.mediawiki: update mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:13:10] (03CR) 10Bking: [C: 03+1] "Giving my +1 so we can merge and test this today." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:14:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] sre.switchdc.mediawiki: update mediawiki services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:17:46] (03CR) 10David Caro: [C: 03+1] "LGTM, if you don't mind, be quite verbose on irc when you deploy this in codfw (in case anyone is doing any tests)." [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [14:18:47] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [14:20:08] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9606851 (10akosiaris) [14:20:55] !log akosiaris@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(parse2002.codfw.wmnet|parse2003.codfw.wmnet|parse2004.codfw.wmnet|parse2005.codfw.wmnet|parse2006.codfw.wmnet|parse2007.codfw.wmnet),cluster=kubernetes,service=kubesvc [14:21:01] (03CR) 10Effie Mouzeli: [C: 03+2] sre.switchdc.mediawiki: update mediawiki services (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:22:09] (03PS1) 10David Caro: bullseye-standalone: add logrotate [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1009264 (https://phabricator.wikimedia.org/T357567) [14:23:30] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606911 (10akosiaris) Almost all parsoid hosts have been reimaged as kubernetes nodes. 
Scandium, testreduce1002, parse1001 and parse1002 being the exce... [14:24:32] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9606936 (10akosiaris) [14:25:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009259 (https://phabricator.wikimedia.org/T359208) (owner: 10Marostegui) [14:26:15] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752#9606934 (10akosiaris) 05Open→03Resolved [14:27:04] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add FebinBellamy [puppet] - 10https://gerrit.wikimedia.org/r/1009259 (https://phabricator.wikimedia.org/T359208) (owner: 10Marostegui) [14:28:22] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: update mediawiki services [cookbooks] - 10https://gerrit.wikimedia.org/r/1009233 (https://phabricator.wikimedia.org/T359154) (owner: 10Effie Mouzeli) [14:28:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks "good" to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi) [14:30:36] (03Abandoned) 10David Caro: bullseye-standalone: add logrotate [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1009264 (https://phabricator.wikimedia.org/T357567) (owner: 10David Caro) [14:30:40] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for FebinBellamy - https://phabricator.wikimedia.org/T359208#9606987 (10Marostegui) 05Open→03Resolved This is all done [14:31:40] (03PS1) 10Clément Goubert: Move 6 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) [14:32:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool to reimage T358642', diff saved to https://phabricator.wikimedia.org/P58588 and previous config saved to /var/cache/conftool/dbconfig/20240306-143204-arnaudb.json [14:32:21] T358642: Upgrade x1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358642 [14:33:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2131.codfw.wmnet with reason: Silence for reimaging [14:33:38] (03CR) 10Hnowlan: [C: 03+1] mobileapps: Use upper case method names for rest.php requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos) [14:33:44] !log installing nftables bugfix updates from bullseye point release [14:33:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2131.codfw.wmnet with reason: Silence for reimaging [14:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:00] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Use upper case method names for rest.php requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos) [14:34:18] (03CR) 10Andrea Denisse: 
[C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi) [14:34:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2131.codfw.wmnet with OS bookworm [14:34:32] (03PS2) 10Fabfur: haproxy: send errored messages to separate (deadletter) topic [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109) [14:34:33] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 3 others: Create parsoid mediawiki deployment and migrate parsoid-php.discovery.wmnet traffic to it - https://phabricator.wikimedia.org/T357392#9607078 (10akosiaris) 05In progress→03Resolved [14:35:09] 06SRE, 10MW-on-K8s, 06Traffic, 06serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9607081 (10akosiaris) [14:35:54] (03Merged) 10jenkins-bot: mobileapps: Use upper case method names for rest.php requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009224 (https://phabricator.wikimedia.org/T359306) (owner: 10Jgiannelos) [14:37:27] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:41] (03PS1) 10Jgiannelos: mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009267 [14:38:15] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 60% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) [14:38:29] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009267 (owner: 10Jgiannelos) [14:39:25] (03Merged) 10jenkins-bot: mobileapps: Bump chart version [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1009267 (owner: 10Jgiannelos) [14:39:35] (03PS1) 10Clément Goubert: trafficserver: move 60% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1009269 (https://phabricator.wikimedia.org/T357508) [14:40:17] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:40:22] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:40:35] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:40:39] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:41:04] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:41:41] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:41:43] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:42:15] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:42:41] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:42:54] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:44:15] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:44:25] (03CR) 10Hnowlan: [C: 04-1] Move 6 eqiad appservers to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:44:53] (03PS1) 10Muehlenhoff: Add Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1009270 [14:45:41] !log installing postgres 13 security updates [14:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:17] (03PS2) 10Clément Goubert: Move 6 eqiad appservers to kubernetes [puppet] - 
10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) [14:46:39] (03CR) 10Clément Goubert: Move 6 eqiad appservers to kubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:47:25] (03CR) 10Hnowlan: Create a shellbox deployment for videoscalers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [14:47:28] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: create ircecho log files [puppet] - 10https://gerrit.wikimedia.org/r/1009256 (https://phabricator.wikimedia.org/T359292) (owner: 10Filippo Giunchedi) [14:48:02] (03CR) 10Hnowlan: [C: 03+1] Move 6 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:51:00] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [14:51:15] RECOVERY - Kafka broker TLS certificate validity on kafka-logging2001 is OK: SSL OK - Certificate kafka-logging2001.codfw.wmnet valid until 2025-03-01 20:58:00 +0000 (expires in 360 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:51:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2131.codfw.wmnet with reason: host reimage [14:51:44] jouncebot: nowandnext [14:51:44] For the next 0 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1400) [14:51:44] In 0 hour(s) and 8 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1500) [14:52:36] !log Depooling mw1441.eqiad.wmnet,mw1442.eqiad.wmnet,mw1451.eqiad.wmnet,mw1452.eqiad.wmnet,mw1454.eqiad.wmnet,mw1455.eqiad.wmnet for reimage to kubernetes - T351074 
[14:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:40] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [14:53:02] (03CR) 10Clément Goubert: [C: 03+2] Move 6 eqiad appservers to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1009266 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [14:53:45] 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9607287 (10fgiunchedi) [14:54:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2131.codfw.wmnet with reason: host reimage [14:55:06] 06SRE, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): ircecho doesn't attempt to open log files created after startup - https://phabricator.wikimedia.org/T359292#9607285 (10fgiunchedi) 05Open→03Resolved Calling this done, albeit with an hack [14:55:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but idea LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:56:33] (03CR) 10Hnowlan: [C: 03+1] trafficserver: move 60% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1009269 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [14:56:37] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet [14:56:39] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) [14:57:26] (03CR) 10Hnowlan: [C: 04-1] mw-web, mw-api-ext: Raise replicas for 60% traffic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [14:57:27] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:02] (03CR) 10Majavah: [V: 03+1 C: 03+2] openstack: neutron: add API support for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1008462 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1500) [15:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [15:02:19] (03CR) 10Fabfur: [C: 03+2] haproxy: send errored messages to separate (deadletter) topic [puppet] - 10https://gerrit.wikimedia.org/r/1009255 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [15:04:54] (03PS1) 10Clément Goubert: Add missing node definition [puppet] - 10https://gerrit.wikimedia.org/r/1009273 [15:05:00] (03CR) 10Volans: "question inline, I'm happy either way" [puppet] - 10https://gerrit.wikimedia.org/r/1009270 (owner: 10Muehlenhoff) [15:06:52] (03CR) 10Clément Goubert: [C: 03+2] Add missing node definition [puppet] - 10https://gerrit.wikimedia.org/r/1009273 (owner: 10Clément Goubert) [15:08:30] (03PS2) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 60% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) [15:08:38] (03CR) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 60% traffic (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [15:09:55] (SystemdUnitFailed) firing: (2) ferm.service on mw1367:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:32] (03PS19) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [15:11:10] (03PS2) 10Muehlenhoff: Add Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1009270 [15:11:15] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches [15:11:21] (03CR) 10Muehlenhoff: Add Cumin alias for routed Ganeti (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009270 (owner: 10Muehlenhoff) [15:11:39] !log jiji@cumin1002 END (FAIL) - Cookbook sre.switchdc.mediawiki.00-optional-warmup-caches (exit_code=99) [15:12:01] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl [15:12:54] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1441.eqiad.wmnet with OS bullseye [15:12:57] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1442.eqiad.wmnet with OS bullseye [15:13:00] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1451.eqiad.wmnet with OS bullseye [15:13:03] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1452.eqiad.wmnet with OS bullseye [15:13:05] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1009270 (owner: 10Muehlenhoff) [15:13:06] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1454.eqiad.wmnet with OS bullseye [15:13:08] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1455.eqiad.wmnet with OS bullseye [15:13:52] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [15:14:13] claime: I would suggest to increase a bit the sleep between starts... 
this will bottleneck on running puppet on the alert host for the downtime [15:15:19] volans: it's not a sleep it's me starting them too fast and then cursing myself every time [15:15:30] (manually I mean) [15:15:41] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9607433 (10MoritzMuehlenhoff) [15:15:46] lol [15:15:57] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9607425 (10bking) 05Resolved→03Open @VRiley-WMF `wdqs1025` is failing to reimage. I can't see any disks in the DRAC interface, are you... [15:16:14] (03PS1) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) [15:16:14] see this as a stress test of the locking mechanism >:) [15:17:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2131.codfw.wmnet with OS bookworm [15:17:46] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) [15:18:08] :D [15:19:15] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:21:12] (03PS1) 10Muehlenhoff: Move the old apt servers to insetup::buster role [puppet] - 10https://gerrit.wikimedia.org/r/1009281 (https://phabricator.wikimedia.org/T331613) [15:21:16] (03PS1) 10Muehlenhoff: Move nginx/Puppet settings for new apt hosts to the role Hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1009282 (https://phabricator.wikimedia.org/T331613) [15:21:35] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Depool to clone on db2131 T358642', diff saved to https://phabricator.wikimedia.org/P58589 and previous config saved to /var/cache/conftool/dbconfig/20240306-152130-arnaudb.json [15:21:43] T358642: Upgrade x1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358642 [15:21:45] jouncebot: now [15:21:45] For the next 0 hour(s) and 38 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1500) [15:22:24] PROBLEM - Check whether ferm is active by checking the default input chain on mw1367 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:22:58] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki viwiki --current --all --touched-after=20230613000000 --start '["8661638"]' 2>&1 | tee ~/T315510-viwiki-2 # in tmux [15:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:10] (SystemdUnitFailed) firing: (3) ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:23:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2196.codfw.wmnet with reason: provisionning db2131.codfw.wmnet - T355422 [15:23:55] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [15:24:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2196.codfw.wmnet with reason: provisionning db2131.codfw.wmnet - T355422 [15:24:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: provisionning db2131.codfw.wmnet - T355422 [15:24:15] !log arnaudb@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: provisionning db2131.codfw.wmnet - T355422 [15:24:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:25:42] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2196.codfw.wmnet onto db2131.codfw.wmnet [15:27:02] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1441.eqiad.wmnet with reason: host reimage [15:27:08] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1451.eqiad.wmnet with reason: host reimage [15:27:25] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1454.eqiad.wmnet with reason: host reimage [15:27:32] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1442.eqiad.wmnet with reason: host reimage [15:27:56] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1452.eqiad.wmnet with reason: host reimage [15:28:08] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks [15:28:10] (SystemdUnitFailed) firing: (3) ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:28:11] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1455.eqiad.wmnet with reason: host reimage [15:28:26] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) [15:29:35] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 2:00:00 on mw1441.eqiad.wmnet with reason: host reimage [15:31:13] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye [15:31:38] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1454.eqiad.wmnet with reason: host reimage [15:31:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [15:32:47] (03PS5) 10Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560) [15:32:49] (03PS5) 10Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560) [15:32:53] (03PS5) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560) [15:33:01] (03PS5) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560) [15:34:04] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1452.eqiad.wmnet with reason: host reimage [15:34:32] (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [15:34:36] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance [15:34:47] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) [15:36:41] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1442.eqiad.wmnet with reason: host reimage [15:36:59] (03PS20) 
10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [15:39:29] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1455.eqiad.wmnet with reason: host reimage [15:42:27] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1451.eqiad.wmnet with reason: host reimage [15:43:45] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly [15:43:45] !log jiji@cumin1002 [DRY-RUN] MediaWiki read-only period starts at: 2024-03-06 15:43:44.970687 [15:43:47] (03PS1) 10Btullis: Allow the lilypond packages to be installed on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) [15:43:48] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:44:01] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) [15:44:02] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:44:15] PROBLEM - cassandra-a CQL 10.64.16.39:9042 on restbase1039 is CRITICAL: connect to address 10.64.16.39 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:44:16] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1039.eqiad.wmnet with reason: Bootstrapping — T354560 [15:44:18] eevans@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:44:18] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [15:44:31] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1039.eqiad.wmnet with reason: Bootstrapping — T354560 [15:44:33] eevans@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[15:44:37] (03CR) 10Hnowlan: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 60% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009268 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [15:45:16] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki [15:45:17] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:45:30] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) [15:45:31] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:46:12] (03CR) 10Bking: "Per IRC conversation with volans, we're going to wait until after the offsite before merging, so we have time to address some of these con" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [15:46:49] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [15:46:50] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:46:52] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [15:46:53] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:47:32] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1009270 (owner: 10Muehlenhoff) [15:47:52] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite [15:47:53] jiji@cumin1002: Failed to log message to wiki. Somebody should check the error logs. [15:47:56] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1441.eqiad.wmnet with OS bullseye [15:47:56] cgoubert@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[15:48:02] !log jiji@cumin1002 [DRY-RUN] MediaWiki read-only period ends at: 2024-03-06 15:48:02.718097 [15:48:03] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) [15:48:33] (03CR) 10ArielGlenn: [C: 03+2] Allow the lilypond packages to be installed on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [15:48:43] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner [15:48:43] !log root@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [15:48:43] !log root@deploy2002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [15:48:44] !log root@deploy2002 helmfile [eqiad] [main] FAIL helmfile.d/services/mw-jobrunner : sync [15:48:44] !log root@deploy2002 helmfile [eqiad] [canary] FAIL helmfile.d/services/mw-jobrunner : sync [15:48:45] !log jiji@cumin1002 END (FAIL) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=99) [15:49:17] (03CR) 10Muehlenhoff: "I don't think this is needed/correct? Bullseye should have a recent enough Lilypond version by itself?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [15:50:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [15:50:04] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1454.eqiad.wmnet with OS bullseye [15:50:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [15:51:23] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1452.eqiad.wmnet with OS bullseye [15:52:25] RECOVERY - Check whether ferm is active by checking the default input chain on mw1367 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:54:35] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance [15:55:31] (03PS1) 10Brouberol: Superset: migrate external services egress to Calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009290 (https://phabricator.wikimedia.org/T359411) [15:55:45] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1442.eqiad.wmnet with OS bullseye [15:55:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2005.mgmt.codfw.wmnet with reboot policy FORCED [15:55:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbprov2006.mgmt.codfw.wmnet with reboot policy FORCED [15:56:53] (03PS2) 10Brouberol: Add template rendering external services egress NetworkPolicy resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1009279 (https://phabricator.wikimedia.org/T331894) [15:57:05] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) [15:57:29] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host mw1455.eqiad.wmnet with OS bullseye [15:58:05] (03CR) 10Btullis: "Oh right, yes it has 2.22.0-10 in the bullseye repos but 2.22.1-2~bpo11+1 in bullseye-backports. I had assumed that the backported one wou" [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [15:59:27] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl [15:59:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) [16:00:21] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1451.eqiad.wmnet with OS bullseye [16:00:44] !log jiji@cumin1002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters [16:05:07] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:05:14] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:05:35] (03CR) 10Muehlenhoff: "Either is fine I guess, we can also just keep it as-is." 
[puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [16:14:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T352010)', diff saved to https://phabricator.wikimedia.org/P58590 and previous config saved to /var/cache/conftool/dbconfig/20240306-161546-ladsgroup.json [16:15:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:18:35] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1009294 [16:19:13] (03PS3) 10Brouberol: global_config: add presto/druid/IDP node IPs to the k8s global config [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [16:19:20] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [16:21:03] (03PS4) 10Brouberol: global_config: add presto/druid/IDP node IPs to the k8s global config [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) [16:21:24] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1009292 (https://phabricator.wikimedia.org/T359411) (owner: 10Brouberol) [16:21:28] (03PS6) 10Alexandros Kosiaris: Clean up all the RESTBase hosts's parsoid uri changes [puppet] - 10https://gerrit.wikimedia.org/r/1006899 (https://phabricator.wikimedia.org/T359387) [16:21:36] (03PS6) 10Alexandros Kosiaris: services_proxy: Remove parsoid-php, parsoid-async [puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) 
[16:26:03] !log Disable meta-monitoring for alert1001 - T333615 [16:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:07] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [16:27:26] (03CR) 10Btullis: "Oh it's a bit noisy. puppet is displaying a notice for each package that uses this format, on both buster and bullseye." [puppet] - 10https://gerrit.wikimedia.org/r/1009288 (https://phabricator.wikimedia.org/T325228) (owner: 10Btullis) [16:28:07] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417 (10bdgreenlee) [16:29:07] (03PS2) 10Fabfur: haproxy: enable log to benthos socket [puppet] - 10https://gerrit.wikimedia.org/r/1009293 (https://phabricator.wikimedia.org/T358109) [16:29:22] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v8.4.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1009294 (owner: 10Volans) [16:30:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P58591 and previous config saved to /var/cache/conftool/dbconfig/20240306-163053-ladsgroup.json [16:31:27] (03PS1) 10Volans: Upstream release v8.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1009297 [16:31:30] (03CR) 10BCornwall: [C: 03+2] cdn: Fix site var for ATSBackendErrorsHigh [alerts] - 10https://gerrit.wikimedia.org/r/1008981 (owner: 10BCornwall) [16:31:33] (03PS1) 10EoghanGaffney: [gitlab] Failover test of gitlab replica hosts [puppet] - 10https://gerrit.wikimedia.org/r/1009298 (https://phabricator.wikimedia.org/T358559) [16:32:31] (03PS1) 10Elukey: slo_template: update SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009299 [16:34:39] (03CR) 10Subramanya Sastry: "How does this impact scandium and our use of that server for round-trip testing which we run weekly?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1006900 (https://phabricator.wikimedia.org/T359387) (owner: 10Alexandros Kosiaris) [16:34:54] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:35:01] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1009293 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:35:01] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:36:08] !log denisse@cumin2002 START - Cookbook sre.hosts.reimage for host alert1001.wikimedia.org with OS bookworm [16:36:38] (03PS1) 10EoghanGaffney: [gitlab] Failover test of gitlab replica hosts [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559) [16:36:50] 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9607819 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host alert1001.wikimedia.org with OS bookworm [16:36:54] !log Running homer 'cr*eqiad*' commit 'T351074' [16:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:58] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:37:22] (03CR) 10CI reject: [V: 04-1] [gitlab] Failover test of gitlab replica hosts [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559) (owner: 10EoghanGaffney) [16:38:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:38:19] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:38:58] (03PS1) 10Filippo Giunchedi: cumin: fix ganeti-all alias [puppet] - 
10https://gerrit.wikimedia.org/r/1009302 [16:39:54] (03CR) 10Filippo Giunchedi: "I'm assuming the current ganeti-all version is what you meant" [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi) [16:40:32] (03CR) 10Muehlenhoff: cumin: fix ganeti-all alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi) [16:40:53] (03CR) 10Volans: [C: 03+2] Upstream release v8.4.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1009297 (owner: 10Volans) [16:41:57] (03PS2) 10EoghanGaffney: [gitlab] Failover test of gitlab replica hosts [dns] - 10https://gerrit.wikimedia.org/r/1009300 (https://phabricator.wikimedia.org/T358559) [16:42:05] (03PS2) 10Filippo Giunchedi: cumin: fix ganeti-all alias [puppet] - 10https://gerrit.wikimedia.org/r/1009302 [16:42:13] (03CR) 10Filippo Giunchedi: cumin: fix ganeti-all alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi) [16:43:40] (03PS2) 10Ssingh: P:dns::auth: skipping running authdns-update on host if not pooled [puppet] - 10https://gerrit.wikimedia.org/r/1009261 (https://phabricator.wikimedia.org/T347054) [16:44:22] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9607891 (10odimitrijevic) Approved [16:44:51] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:44:58] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:45:11] (03PS53) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [16:45:13] (JobUnavailable) firing: (4) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:45:13] (03PS1) 10AOkoth: vrts: disable vrts-cache-cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/1009303 (https://phabricator.wikimedia.org/T354422) [16:46:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P58592 and previous config saved to /var/cache/conftool/dbconfig/20240306-164559-ladsgroup.json [16:46:56] (03PS2) 10AOkoth: vrts: disable vrts-cache-cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/1009303 (https://phabricator.wikimedia.org/T354422) [16:48:54] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on alert1001.wikimedia.org with reason: host reimage [16:49:02] !log uploaded spicerack_8.4.1 to apt.wikimedia.org bullseye-wikimedia [16:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2196.codfw.wmnet onto db2131.codfw.wmnet [16:50:13] (JobUnavailable) firing: (4) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:50:58] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi) [16:51:16] (03CR) 10Filippo Giunchedi: [C: 03+2] cumin: fix ganeti-all alias [puppet] - 10https://gerrit.wikimedia.org/r/1009302 (owner: 10Filippo Giunchedi) [16:52:03] (03CR) 10Ssingh: [C: 03+1] haproxy: enable log to benthos socket [puppet] - 10https://gerrit.wikimedia.org/r/1009293 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:52:43] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on alert1001.wikimedia.org with reason: 
host reimage [16:52:44] !log Pooling and uncordoning mw1441.eqiad.wmnet,mw1442.eqiad.wmnet,mw1451.eqiad.wmnet,mw1452.eqiad.wmnet,mw1454.eqiad.wmnet,mw1455.eqiad.wmnet - T351074 [16:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:52] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:52:55] !log cgoubert@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=(mw1441.eqiad.wmnet|mw1442.eqiad.wmnet|mw1451.eqiad.wmnet|mw1452.eqiad.wmnet|mw1454.eqiad.wmnet|mw1455.eqiad.wmnet),cluster=kubernetes,service=kubesvc [16:54:12] jouncebot: nowandnext [16:54:12] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [16:54:13] In 0 hour(s) and 5 minute(s): Alert hosts failover alert2001 -> alert1001 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1700) [16:54:24] (03PS1) 10Btullis: Enable the MarketingCampaignsReporting plugin for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013) [16:54:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 25%: Clone source repooling', diff saved to https://phabricator.wikimedia.org/P58593 and previous config saved to /var/cache/conftool/dbconfig/20240306-165439-arnaudb.json [16:55:06] not sure what the alert hosts failover is about – I was about to do a MW backport for a train blocker, but I can wait until after the failover if required? 
[16:55:13] (JobUnavailable) firing: (4) Reduced availability for job alertmanager in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:55:17] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1595/console" [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013) (owner: 10Btullis) [16:55:32] denisse: maybe you know, as it seems like the host's reimaging right now? [16:55:39] (03PS2) 10Btullis: Enable the MarketingCampaignsReporting plugin for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013) [16:56:06] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:56:13] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:57:01] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1596/co" [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013) (owner: 10Btullis) [16:57:13] (03CR) 10Btullis: [V: 03+1 C: 03+2] Enable the MarketingCampaignsReporting plugin for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/1009305 (https://phabricator.wikimedia.org/T319013) (owner: 10Btullis) [16:57:29] Hi @urbanecm, we're upgrading our Alert hosts instances. We're just doing the reimage of the passive host and plan on doing the failover at 17 UTC. [16:58:26] We're still waiting for the re-image to finish so please proceed with your backport. 
[16:58:40] (KubernetesRsyslogDown) firing: rsyslog on mw2436:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2436 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:58:42] Do you have an estimate of how long it is going to take? [16:59:40] denisse: thanks for the info. since it's a core patch, it might take ~35 mins due to CI. not sure when the reimage might finish. [16:59:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:02] but maybe i can +2 it now, wait for the failover and then finish it? [17:00:05] Deploy window Alert hosts failover alert2001 -> alert1001 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1700) [17:00:27] urbanecm: Thanks Martin, due to the time it would take we would greatly appreciate it if you could merge it after the failover. [17:00:37] I'll let you know ASAP when we finish. [17:00:41] okay, no problem. will wait for the ping from you then. 
[17:01:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T352010)', diff saved to https://phabricator.wikimedia.org/P58594 and previous config saved to /var/cache/conftool/dbconfig/20240306-170106-ladsgroup.json [17:01:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [17:01:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [17:01:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P58595 and previous config saved to /var/cache/conftool/dbconfig/20240306-170125-ladsgroup.json [17:01:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:02:35] !log restart rsyslog on mw2436 [17:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:40] (KubernetesRsyslogDown) resolved: rsyslog on mw2436:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw2436 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:06:08] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:06:15] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:09:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 50%: Clone source repooling', diff saved to https://phabricator.wikimedia.org/P58596 and previous config saved to /var/cache/conftool/dbconfig/20240306-170944-arnaudb.json [17:10:12] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:10:19] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:13:01] PROBLEM - MariaDB 
Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:13:47] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9608001 (10bdgreenlee) Done: https://phabricator.wikimedia.org/T359417 [17:15:02] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host alert1001.wikimedia.org with OS bookworm [17:15:20] 06SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9608025 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin2002 for host alert1001.wikimedia.org with OS bookworm completed: - alert1001 (**WARN**) - Remo... [17:17:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:17:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:18:03] !log failing over from alert2001 to alert1001 [17:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:26] (03CR) 10Andrea Denisse: [C: 03+2] Revert "alert: Failover Icinga and Alertmanager to alert2001" [puppet] - 10https://gerrit.wikimedia.org/r/1008761 (owner: 10Andrea Denisse) [17:21:03] (03PS2) 10Andrea Denisse: Revert "alert: Resolve alerts DNS queries to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/1008759 [17:21:19] (03CR) 10Andrea Denisse: [C: 03+2] Revert "wikimedia.org: failover icinga to alert2001 too" [dns] - 10https://gerrit.wikimedia.org/r/1008760 (owner: 10Andrea Denisse) [17:21:43] (03PS3) 10Andrea Denisse: Revert "alert: Resolve alerts DNS queries to alert2001" [dns] - 10https://gerrit.wikimedia.org/r/1008759 [17:23:27] (03CR) 10Andrea Denisse: [C: 03+2] Revert "alert: Resolve alerts DNS queries to alert2001" [dns] - 
10https://gerrit.wikimedia.org/r/1008759 (owner: 10Andrea Denisse) [17:23:31] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359197#9608094 (10Marostegui) [17:23:39] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608096 (10Marostegui) [17:24:04] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608100 (10Marostegui) @odimitrijevic I assume you are also their manager and hence approving for manager and analytics group? [17:24:12] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608101 (10Marostegui) p:05Triage→03Medium [17:24:47] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:24:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 75%: Clone source repooling', diff saved to https://phabricator.wikimedia.org/P58597 and previous config saved to /var/cache/conftool/dbconfig/20240306-172449-arnaudb.json [17:24:54] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:33:11] (03PS1) 10Hnowlan: kubernetes: migrate 5 eqiad appservers to k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/1009309 (https://phabricator.wikimedia.org/T351074) [17:35:17] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9608157 (10VRiley-WMF) @dr0ptp4kt would you be able to try to reimage this unit again? I have ran it through a power cycle and that can he... [17:37:09] @urbanecm : Hi, we've finished with the Alert hosts failover. 
[17:37:12] (JobUnavailable) firing: (3) Reduced availability for job icinga-am in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:37:17] denisse: ack, thanks! [17:37:47] (03PS1) 10Urbanecm: JS REST: make POST default to empty object [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009326 (https://phabricator.wikimedia.org/T359216) [17:37:53] (03CR) 10Urbanecm: [C: 03+2] JS REST: make POST default to empty object [core] (wmf/1.42.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1009326 (https://phabricator.wikimedia.org/T359216) (owner: 10Urbanecm) [17:39:33] jnuche: fyi i plan to deploy that patch myself (so i can test it). i can ping you once done if that'd be helpful. [17:39:43] just waiting on CI rn [17:39:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2196 (re)pooling @ 100%: Clone source repooling', diff saved to https://phabricator.wikimedia.org/P58598 and previous config saved to /var/cache/conftool/dbconfig/20240306-173954-arnaudb.json [17:39:58] urbanecm: sounds great, thanks a lot [17:41:18] No problem. 
[17:44:39] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2518.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:47:12] (JobUnavailable) firing: (3) Reduced availability for job icinga-am in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:51:39] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:53:33] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1800) [18:02:12] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9608284 (10bking) @VRiley-WMF Unfortunately, I'm still getting errors [[ https://ewr1.vultrobjects.com/work/disk_errors_wdqs1025.png | (... 
[18:07:37] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1009326|JS REST: make POST default to empty object (T359216)]] [18:07:53] T359216: [testwiki - wmf.21] Bad request for page/summary and user-impact - https://phabricator.wikimedia.org/T359216 [18:11:39] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1009326|JS REST: make POST default to empty object (T359216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:12:02] !log urbanecm@deploy2002 urbanecm: Continuing with sync [18:21:56] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1009326|JS REST: make POST default to empty object (T359216)]] (duration: 14m 19s) [18:22:04] T359216: [testwiki - wmf.21] Bad request for page/summary and user-impact - https://phabricator.wikimedia.org/T359216 [18:22:16] * urbanecm done [18:32:22] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#9608338 (10cmooney) p:05Medium→03Low [18:33:05] 06SRE, 10ops-eqiad, 10Wikidata, 10wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9608340 (10VRiley-WMF) Swapped cable with a new one (same port), shut down the unit and reseated the drives as well. Powered the unit back on [18:45:48] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@af71f6e] (releasing): (no justification provided) [18:46:29] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@af71f6e] (releasing): (no justification provided) (duration: 00m 41s) [18:49:19] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608395 (10odimitrijevic) Yes, that's correct! 
Approve x 2 [18:57:45] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608411 (10Marostegui) [18:59:01] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye [18:59:12] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bdgreenlee - https://phabricator.wikimedia.org/T359417#9608415 (10Marostegui) [18:59:43] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs1025'] [18:59:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs1025'] [19:00:05] jnuche and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1900). [19:00:05] jnuche and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T1900). 
[19:00:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [19:04:46] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:04:53] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:10:46] (03CR) 10RLazarus: [C: 03+1] slo_template: update SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1009299 (owner: 10Elukey) [19:31:55] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:31] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2271.codfw.wmnet, mw2438.codfw.wmnet, mw2336.codfw.wmnet, mw2331.codfw.wmnet, mw2415.codfw.wmnet, mw2276.codfw.wmnet, mw2393.codfw.wmnet, mw2413.codfw.wmnet, mw2329.codfw.wmnet, mw2325.codfw.wmnet, mw2414.codfw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2408.codfw.wmnet, mw2269.codfw.wmnet, mw2361.codfw.wmnet, mw [19:36:31] fw.wmnet, mw2270.codfw.wmnet, mw2441.codfw.wmnet, mw2337.codfw.wmnet, mw2274.codfw.wmnet, mw2277.codfw.wmnet, mw2272.codfw.wmnet, mw2407.codfw.wmnet, mw2268.codfw.wmnet, mw2273.codfw.wmnet, mw2333.codfw.wmnet, mw2432.codfw.wmnet, mw2303.codfw.wmnet, mw2439.codfw.wmnet, mw2389.codfw.wmnet, mw2390.codfw.wmnet, mw2412.codfw.wmnet are marked down but pooled: mw-web_4450: Servers mw2424.codfw.wmnet, kubernetes2046.codfw.wmnet, mw2317.codfw.wmn [19:36:31] rnetes2045.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2301.codfw.wmnet, mw2377.codfw.wmnet, mw2447.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2034.codfw.wmnet, mw2422.codfw.wmnet, pa https://wikitech.wikimedia.org/wiki/PyBal [19:36:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - 
CRITICAL - appservers-https_443: Servers mw2271.codfw.wmnet, mw2409.codfw.wmnet, mw2438.codfw.wmnet, mw2392.codfw.wmnet, mw2393.codfw.wmnet, mw2338.codfw.wmnet, mw2325.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2269.codfw.wmnet, mw2408.codfw.wmnet, mw2327.codfw.wmnet, mw2433.codfw.wmnet, mw2270.codfw.wmnet, mw2441.codfw.wmnet, mw2339.codfw.wmnet, mw [19:36:31] fw.wmnet, mw2277.codfw.wmnet, mw2388.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2407.codfw.wmnet, mw2268.codfw.wmnet, mw2336.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2432.codfw.wmnet, mw2303.codfw.wmnet, mw2391.codfw.wmnet, mw2309.codfw.wmnet, mw2439.codfw.wmnet, mw2390.codfw.wmnet, mw2412.codfw.wmnet are marked down but pooled: mw-web_4450: Servers mw2424.codfw.wmnet, mw2292.codfw.wmnet, mw2350.codfw.wmnet, kube [19:36:32] 60.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2426.codfw.wmnet, kubernetes2007.codfw.wmnet, mw2267.codfw.wmnet, mw2420.codfw.wmnet, parse2010.codfw.wmnet, mw2294.codfw.wmnet, parse2006. https://wikitech.wikimedia.org/wiki/PyBal [19:36:57] (ProbeDown) firing: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:15] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 14.3% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:38:15] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=codfw&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:38:54] shit [19:39:31] any idea what's up here ^ ? 
[19:39:35] o/ [19:39:44] (HaproxyUnavailable) firing: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [19:40:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:41:15] (MediaWikiLatencyExceeded) firing: (2) Average latency high: codfw api_appserver GET/200: 0.4919868694431104s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:41:15] (MediaWikiLatencyExceeded) firing: p75 latency high: codfw mw-parsoid (k8s) 14.7s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:41:57] (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:42:15] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=appserver - 
https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:42:29] Tons of timeouts while accessing the database?! [19:42:51] (SwaggerProbeHasFailures) firing: (4) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:43:15] (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver POST/504: 430.3384974393115s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:43:16] yeah I don't see an outside traffic spike at first glance [19:43:30] hi, I'm here, sorry [19:45:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:45:26] I don't even see a 5xx spike from the edge POV [19:45:33] just a drop in traffic [19:46:02] and only on the codfw side of the world (codfw+ulsfo+eqsin) [19:46:15] (HttpdUnreachable) firing: httpd unavailable for deployment mw-web at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [19:46:15] (MediaWikiLatencyExceeded) firing: (3) p75 latency high: codfw mw-api-ext (k8s) 1.799s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:46:57] (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:51] (SwaggerProbeHasFailures) firing: (4) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:49:51] (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:50:15] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:51:15] (HttpdUnreachable) resolved: httpd unavailable for deployment mw-web at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [19:51:15] (MediaWikiLatencyExceeded) firing: (3) p75 latency high: codfw mw-api-ext (k8s) 1.031s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:51:58] (ProbeDown) firing: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:52:21] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:52:51] (SwaggerProbeHasFailures) firing: (3) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:53:15] (MediaWikiLatencyExceeded) firing: (2) Average latency high: codfw appserver GET/200: 6.358808350662945s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:53:15] (HttpdUnreachable) firing: httpd unavailable for deployment mw-web at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [19:54:51] (ATSBackendErrorsHigh) firing: (14) ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:55:24] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:55:30] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:56:15] (MediaWikiLatencyExceeded) firing: (2) p75 latency high: codfw mw-parsoid (k8s) 21.46s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:56:58] (ProbeDown) resolved: (12) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:57:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - 
https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=codfw&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [19:57:21] (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:57:51] (SwaggerProbeHasFailures) resolved: (3) Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:58:15] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=codfw&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:58:15] (MediaWikiLatencyExceeded) firing: (3) Average latency high: codfw appserver GET/200: 71.20180196214734s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:58:15] (HttpdUnreachable) resolved: httpd unavailable for deployment mw-web at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [19:59:05] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:59:12] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:59:33] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [19:59:35] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:59:44] (HaproxyUnavailable) resolved: (2) HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [19:59:51] (ATSBackendErrorsHigh) resolved: (15) ATS: elevated 5xx errors from appservers-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:00:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:00:15] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:01:15] (MediaWikiLatencyExceeded) resolved: (2) Average latency high: codfw api_appserver GET/200: 0.23083468048045475s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:01:15] (MediaWikiLatencyExceeded) resolved: (2) p75 latency high: codfw mw-parsoid (k8s) 3.74s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:01:55] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:15] (PHPFPMTooBusy) resolved: (3) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 20.18% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:02:21] (ProbeDown) resolved: (14) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:03:15] (MediaWikiLatencyExceeded) resolved: (3) Average latency high: codfw appserver GET/200: 71.20180196214734s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:03:59] (PS1) Majavah: Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 [20:04:52] (CR) CI reject: [V: -1] Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 (owner: Majavah) [20:04:54] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:05:00] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:05:21] (PS2) Majavah: Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 [20:05:57] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 (owner: Majavah) [20:06:46] (Merged) jenkins-bot: Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 (owner: Majavah) [20:07:11] !log taavi@deploy2002 Started scap: 
Backport for [[gerrit:1009325|Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki]] [20:08:49] !log taavi@deploy2002 taavi: Backport for [[gerrit:1009325|Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:09:24] !log taavi@deploy2002 taavi: Continuing with sync [20:10:55] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:11:02] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:14:05] RECOVERY - cassandra-a CQL 10.64.16.39:9042 on restbase1039 is OK: TCP OK - 0.000 second response time on 10.64.16.39 port 9042 https://phabricator.wikimedia.org/T93886 [20:15:27] PROBLEM - Check whether ferm is active by checking the default input chain on mw1390 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:16:55] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:19:12] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:1009325|Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki]] (duration: 12m 01s) [20:20:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1025.eqiad.wmnet with OS bullseye [20:25:26] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:25:33] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:45:26] (PS1) Majavah: Undeploy Striker from codfw1dev [puppet] - https://gerrit.wikimedia.org/r/1009350 [20:45:27] RECOVERY - Check whether ferm is active by checking the default input chain on 
mw1390 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:46:55] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:50:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:50:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P58599 and previous config saved to /var/cache/conftool/dbconfig/20240306-205006-ladsgroup.json [20:50:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T2100). [21:00:04] No Gerrit patches in the queue for this window AFAICS. 
[21:01:13] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:01:20] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:01:59] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:21] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wdqs1025 [21:04:24] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wdqs1025 [21:05:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P58600 and previous config saved to /var/cache/conftool/dbconfig/20240306-210512-ladsgroup.json [21:13:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:18:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [21:19:09] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:19:16] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:20:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P58601 and previous config saved to 
/var/cache/conftool/dbconfig/20240306-212019-ladsgroup.json [21:25:09] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:25:15] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:27:31] (CR) BryanDavis: [C: +1] "Fine with me, but I would defer to Andrew's opinion. At this point I actually don't remember which things made fully deploying it difficul" [puppet] - https://gerrit.wikimedia.org/r/1009350 (owner: Majavah) [21:35:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P58604 and previous config saved to /var/cache/conftool/dbconfig/20240306-213525-ladsgroup.json [21:35:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [21:35:41] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:35:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [21:35:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [21:35:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [21:36:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T352010)', diff saved to https://phabricator.wikimedia.org/P58605 and previous config saved to /var/cache/conftool/dbconfig/20240306-213603-ladsgroup.json [21:40:48] (CR) Ladsgroup: [C: +1] Set wgFlaggedRevsHandleIncludes to FR_INCLUDES_CURRENT on ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1009325 (owner: Majavah) [21:47:12] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240306T2200) [22:25:27] (PS1) Bking: WIP: Add Flink alerts for Cirrus Streaming Updater [alerts] - https://gerrit.wikimedia.org/r/1009359 (https://phabricator.wikimedia.org/T359213) [22:28:01] SRE, SRE-Access-Requests: Requesting access to mwmaint for rkhan / Himejijo - https://phabricator.wikimedia.org/T359490 (Himejijo) NEW [22:33:22] SRE, ops-eqiad, Wikidata, wmde-wikidata-tech, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9609487 (Jclark-ctr) @bking was puppet and site.pp updated? unfortunately me and Valerie do not have access to push updates and has be... [22:33:47] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/1006974/1597/gitlab2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn) [22:34:48] (PS2) BBlack: Make auth NSID distinct from recdns on same host [puppet] - https://gerrit.wikimedia.org/r/1009316 [22:34:51] (CR) Dzahn: [V: +1] "Resources only in the new catalog" [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn) [22:35:10] (CR) BBlack: Make auth NSID distinct from recdns on same host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1009316 (owner: BBlack) [22:36:30] (PS5) Dzahn: phabricator: setup scap bin link in migration class [puppet] - https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) [22:38:48] (CR) Dzahn: [C: +2] phabricator: setup scap bin link in migration class [puppet] - 
https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: Dzahn) [22:59:36] (PS1) Bking: site.pp: Add wdqs1025 host [puppet] - https://gerrit.wikimedia.org/r/1009361 (https://phabricator.wikimedia.org/T358727) [23:02:21] PROBLEM - Thanos swift https on thanos-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [23:02:21] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [23:04:11] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Thanos [23:04:11] RECOVERY - Thanos swift https on thanos-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.391 second response time https://wikitech.wikimedia.org/wiki/Thanos [23:08:34] (CR) Ryan Kemper: [C: +1] site.pp: Add wdqs1025 host [puppet] - https://gerrit.wikimedia.org/r/1009361 (https://phabricator.wikimedia.org/T358727) (owner: Bking) [23:09:20] (CR) Bking: [C: +2] site.pp: Add wdqs1025 host [puppet] - https://gerrit.wikimedia.org/r/1009361 (https://phabricator.wikimedia.org/T358727) (owner: Bking) [23:10:21] SRE, MW-on-K8s, Scap, serviceops, and 2 others: Adapt scap's testing strategy to mw-on-k8s - https://phabricator.wikimedia.org/T358117#9609616 (CodeReviewBot) thcipriani merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/230 scap sync-world: Add support for testserver checks [23:16:35] SRE, ops-eqiad, Wikidata, wmde-wikidata-tech, and 3 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9609622 (bking) @Jclark-ctr Thanks for the tip, I've added a patch and will try the reimage again. [23:16:51] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye