[00:02:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P49313 and previous config saved to /var/cache/conftool/dbconfig/20230609-000226-ladsgroup.json [00:16:39] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:17:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T336886)', diff saved to https://phabricator.wikimedia.org/P49314 and previous config saved to /var/cache/conftool/dbconfig/20230609-001732-ladsgroup.json [00:17:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [00:17:37] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [00:17:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [00:17:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:18:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:18:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T336886)', diff saved to https://phabricator.wikimedia.org/P49315 and previous config saved to /var/cache/conftool/dbconfig/20230609-001821-ladsgroup.json [00:22:31] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q4): Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10colewhite) Lag //can// indicate problems with Logstash but as we have seen the past two quarters, it 
usually indicates some log pro... [00:23:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pki-root1002.mgmt.eqiad.wmnet with reboot policy FORCED [00:24:41] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pki-root1002'] [00:24:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['pki-root1002'] [00:25:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pki-root1002'] [00:25:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pki-root1002'] [00:26:25] PROBLEM - aqs endpoints health on aqs2003 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [00:32:00] (03PS1) 10Papaul: Add pki-root1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/928689 (https://phabricator.wikimedia.org/T334401) [00:32:49] (03CR) 10Papaul: [C: 03+2] Add pki-root1002 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/928689 (https://phabricator.wikimedia.org/T334401) (owner: 10Papaul) [00:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T336886)', diff saved to https://phabricator.wikimedia.org/P49316 and previous config saved to /var/cache/conftool/dbconfig/20230609-003406-ladsgroup.json [00:34:10] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [00:34:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host pki-root1002.eqiad.wmnet with OS bullseye [00:34:38] 10SRE, 10ops-eqiad, 10DC-Ops, 
10Infrastructure-Foundations, 10Patch-For-Review: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host pki-root1002.eqiad.wmnet with OS bullseye [00:39:39] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/927781 [00:39:45] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/927781 (owner: 10TrainBranchBot) [00:47:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye [00:47:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err... [00:48:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pki-root1002.eqiad.wmnet with reason: host reimage [00:49:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P49317 and previous config saved to /var/cache/conftool/dbconfig/20230609-004912-ladsgroup.json [00:51:28] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1011.eqiad.wmnet with OS bullseye [00:51:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye executed with err... 
[00:51:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki-root1002.eqiad.wmnet with reason: host reimage [00:51:56] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [00:51:58] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1011.eqiad.wmnet with OS bullseye [00:52:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [00:52:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye [00:59:52] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/927781 (owner: 10TrainBranchBot) [01:04:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P49318 and previous config saved to /var/cache/conftool/dbconfig/20230609-010418-ladsgroup.json [01:08:05] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:19:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T336886)', diff saved to https://phabricator.wikimedia.org/P49319 and previous config saved to /var/cache/conftool/dbconfig/20230609-011924-ladsgroup.json [01:19:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance [01:19:29] 
T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [01:19:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance [01:19:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T336886)', diff saved to https://phabricator.wikimedia.org/P49320 and previous config saved to /var/cache/conftool/dbconfig/20230609-011945-ladsgroup.json [01:29:16] I could use a bit of mediawiki admin help if anyone is around. wikitech-static has a ton of files in /srv/mediawiki/images/wikitech/archive but deleteArchivedFiles.php --delete says there's nothing to delete. [01:29:24] Am I misunderstanding what the dir is for? [01:29:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [01:29:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pki-root1002.eqiad.wmnet with OS bullseye [01:29:32] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host pki-root1002.eqiad.wmnet with OS bullseye completed: - pki-root1002 (**... 
[01:35:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T336886)', diff saved to https://phabricator.wikimedia.org/P49321 and previous config saved to /var/cache/conftool/dbconfig/20230609-013515-ladsgroup.json [01:35:20] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [01:48:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye [01:48:20] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1011.eqiad.wmnet with OS bullseye [01:48:22] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err... [01:48:25] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye executed with err... 
[01:50:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P49322 and previous config saved to /var/cache/conftool/dbconfig/20230609-015021-ladsgroup.json [01:53:02] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (Papaul) [01:55:45] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (Papaul) Open→Resolved @jbond this is complete [02:00:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudswift1002.eqiad.wmnet with OS bullseye [02:00:29] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudswift1002.eqiad.wmnet with O... 
[02:00:36] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:38] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudswift1002.eqiad.wmnet with reason: host reimage [02:04:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudswift1002.eqiad.wmnet with reason: host reimage [02:04:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P49323 and previous config saved to /var/cache/conftool/dbconfig/20230609-020528-ladsgroup.json [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudswift1002.eqiad.wmnet with OS bullseye [02:12:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - 
https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bu... [02:13:53] SRE, SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (thcipriani) >>! In T338468#8915689, @KFrancis wrote: > The NDA is complete. Please proceed with the access request. Awesome! >>! In T338468#8914512, @Matthew... [02:20:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T336886)', diff saved to https://phabricator.wikimedia.org/P49324 and previous config saved to /var/cache/conftool/dbconfig/20230609-022034-ladsgroup.json [02:20:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [02:20:38] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [02:20:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [02:20:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T336886)', diff saved to https://phabricator.wikimedia.org/P49325 and previous config saved to /var/cache/conftool/dbconfig/20230609-022054-ladsgroup.json [02:21:52] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) [02:22:34] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Papaul) Stalled→Resolved This is complete [02:26:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T336886)', diff saved to https://phabricator.wikimedia.org/P49326 and previous config saved to /var/cache/conftool/dbconfig/20230609-023548-ladsgroup.json [02:35:52] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [02:40:05] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frav1003 - https://phabricator.wikimedia.org/T334400 (Papaul) @Dwisehaupt it looks like the server doesn't exist on the switch ` papaul@fasw-c-eqiad# run show interfaces descriptions | match frav100* ge-0/0/17 up up f... 
[02:43:00] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2021 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:43:46] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:45:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:49:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P49327 and previous config saved to /var/cache/conftool/dbconfig/20230609-025054-ladsgroup.json [03:00:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:05:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:06:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P49328 and previous config saved to /var/cache/conftool/dbconfig/20230609-030600-ladsgroup.json [03:21:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T336886)', diff saved to https://phabricator.wikimedia.org/P49329 and previous config saved to 
/var/cache/conftool/dbconfig/20230609-032106-ladsgroup.json [03:21:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance [03:21:12] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [03:21:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1218.eqiad.wmnet with reason: Maintenance [03:21:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1218 (T336886)', diff saved to https://phabricator.wikimedia.org/P49330 and previous config saved to /var/cache/conftool/dbconfig/20230609-032127-ladsgroup.json [03:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:34:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:37:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T336886)', diff saved to https://phabricator.wikimedia.org/P49331 and previous config saved to /var/cache/conftool/dbconfig/20230609-033727-ladsgroup.json [03:37:32] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [03:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:50:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:52:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to 
https://phabricator.wikimedia.org/P49332 and previous config saved to /var/cache/conftool/dbconfig/20230609-035233-ladsgroup.json [04:07:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P49333 and previous config saved to /var/cache/conftool/dbconfig/20230609-040739-ladsgroup.json [04:22:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T336886)', diff saved to https://phabricator.wikimedia.org/P49334 and previous config saved to /var/cache/conftool/dbconfig/20230609-042246-ladsgroup.json [04:22:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [04:22:50] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [04:23:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1219.eqiad.wmnet with reason: Maintenance [04:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T336886)', diff saved to https://phabricator.wikimedia.org/P49335 and previous config saved to /var/cache/conftool/dbconfig/20230609-042306-ladsgroup.json [04:30:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:06] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:34:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T336886)', diff saved 
to https://phabricator.wikimedia.org/P49336 and previous config saved to /var/cache/conftool/dbconfig/20230609-043756-ladsgroup.json [04:38:01] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [04:45:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:46:14] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:50:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:53:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P49337 and previous config saved to /var/cache/conftool/dbconfig/20230609-045302-ladsgroup.json [05:08:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P49338 and previous config saved to /var/cache/conftool/dbconfig/20230609-050809-ladsgroup.json [05:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:16:53] (03CR) 10Muehlenhoff: C:IDM Add ldap group settings. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928641 (owner: 10Slyngshede) [05:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/928586 (owner: 10Slyngshede) [05:23:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T336886)', diff saved to https://phabricator.wikimedia.org/P49339 and previous config saved to /var/cache/conftool/dbconfig/20230609-052315-ladsgroup.json [05:23:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [05:23:19] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [05:23:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [05:36:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [05:36:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [05:41:15] (03CR) 10Muehlenhoff: [C: 03+2] Fully manage /etc/nftables/ in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/928463 (owner: 10Muehlenhoff) [05:49:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [05:49:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2102.codfw.wmnet with reason: Maintenance [05:50:39] !log installing cpio security updates [05:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:14] RECOVERY - Check systemd 
state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:01:55] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/928699 [06:04:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [06:04:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [06:04:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T336886)', diff saved to https://phabricator.wikimedia.org/P49340 and previous config saved to /var/cache/conftool/dbconfig/20230609-060438-ladsgroup.json [06:04:43] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [06:04:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:58] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/928699 (owner: 10Muehlenhoff) [06:13:00] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:07] (03PS1) 10Muehlenhoff: krb5: Use delaycompress for KDC logs [puppet] - 10https://gerrit.wikimedia.org/r/928701 (https://phabricator.wikimedia.org/T337906) [06:15:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:19:10] (03PS2) 10Muehlenhoff: krb5: Use delaycompress for KDC logs [puppet] - 10https://gerrit.wikimedia.org/r/928701 (https://phabricator.wikimedia.org/T337906) [06:19:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/928701 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [06:19:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T336886)', diff saved to https://phabricator.wikimedia.org/P49341 and previous config saved to /var/cache/conftool/dbconfig/20230609-061941-ladsgroup.json [06:19:45] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [06:20:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:34:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P49342 and previous config saved to /var/cache/conftool/dbconfig/20230609-063447-ladsgroup.json [06:44:19] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: Setup in progress [06:44:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: Setup in progress [06:48:25] (03CR) 10Elukey: [C: 03+1] krb5: Use delaycompress for KDC logs [puppet] - 10https://gerrit.wikimedia.org/r/928701 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [06:48:58] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: prometheus3001.esams.wmnet [06:48:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: prometheus3001.esams.wmnet [06:49:02] (03CR) 10Klausman: [C: 03+2] ml-services: update 
docker images for revertrisk-wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/928554 (https://phabricator.wikimedia.org/T333125) (owner: 10AikoChou) [06:49:27] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: puppetmaster1005.eqiad.wmnet [06:49:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: puppetmaster1005.eqiad.wmnet [06:49:37] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: puppetmaster2005.codfw.wmnet [06:49:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: puppetmaster2005.codfw.wmnet [06:49:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: puppetmaster1005.eqiad.wmnet [06:49:48] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: puppetmaster2005.codfw.wmnet [06:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P49343 and previous config saved to /var/cache/conftool/dbconfig/20230609-064953-ladsgroup.json [06:50:24] !log installing wireshark security updates [06:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:34] (03Merged) 10jenkins-bot: ml-services: update docker images for revertrisk-wikidata [deployment-charts] - 10https://gerrit.wikimedia.org/r/928554 (https://phabricator.wikimedia.org/T333125) (owner: 10AikoChou) [07:01:34] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a 
response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, [07:01:34] th aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [07:02:10] PROBLEM - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [07:02:10] PROBLEM - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [07:03:04] PROBLEM - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [07:05:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T336886)', diff saved to https://phabricator.wikimedia.org/P49344 and previous config saved to /var/cache/conftool/dbconfig/20230609-070459-ladsgroup.json [07:05:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [07:05:04] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [07:05:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2116.codfw.wmnet with reason: Maintenance [07:05:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T336886)', diff saved to https://phabricator.wikimedia.org/P49345 and previous config saved to /var/cache/conftool/dbconfig/20230609-070520-ladsgroup.json [07:06:10] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [07:07:05] (03CR) 10Elukey: [C: 03+1] Remove dse 
mediawiki-page-content-change-enrichment and stream-enrichment-poc ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/927224 (https://phabricator.wikimedia.org/T325303) (owner: 10Ottomata) [07:11:00] PROBLEM - Restbase root url on restbase2018 is CRITICAL: connect to address 10.192.48.120 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [07:11:36] (03CR) 10Filippo Giunchedi: [C: 03+1] krb5: Use delaycompress for KDC logs [puppet] - 10https://gerrit.wikimedia.org/r/928701 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [07:13:27] (03PS2) 10Majavah: ldap: inline yamlconfig [puppet] - 10https://gerrit.wikimedia.org/r/924984 [07:13:40] (03PS2) 10Majavah: ldap::client::sssd: use strongly typed parameters [puppet] - 10https://gerrit.wikimedia.org/r/924985 [07:14:12] (03PS3) 10Majavah: ldap: inline yamlconfig [puppet] - 10https://gerrit.wikimedia.org/r/924984 [07:14:14] (03PS3) 10Majavah: ldap::client::sssd: use strongly typed parameters [puppet] - 10https://gerrit.wikimedia.org/r/924985 [07:14:20] PROBLEM - Host mw1492 is DOWN: PING CRITICAL - Packet loss = 100% [07:14:30] (03CR) 10Majavah: ldap::client::sssd: use strongly typed parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924985 (owner: 10Majavah) [07:19:44] (03PS4) 10Slyngshede: C:IDM Add ldap group settings. [puppet] - 10https://gerrit.wikimedia.org/r/928641 [07:19:47] !log powercycling restbase2018 (kernel hung following what looks like I/O errors) [07:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:37] (03CR) 10Slyngshede: C:IDM Add ldap group settings. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928641 (owner: 10Slyngshede) [07:21:05] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] LDAPBackend: Add mail and a default shell. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/928586 (owner: 10Slyngshede) [07:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T336886)', diff saved to https://phabricator.wikimedia.org/P49346 and previous config saved to /var/cache/conftool/dbconfig/20230609-072118-ladsgroup.json [07:21:22] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [07:22:26] PROBLEM - cassandra-c service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:22:30] PROBLEM - cassandra-b SSL 10.192.48.125:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [07:22:40] PROBLEM - cassandra-b service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:22:46] PROBLEM - cassandra-a service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:22:58] PROBLEM - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [07:23:08] RECOVERY - Restbase root url on restbase2018 is OK: HTTP OK: HTTP/1.1 200 - 17613 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/RESTBase [07:23:12] PROBLEM - cassandra-a SSL 10.192.48.124:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [07:25:32] RECOVERY - cassandra-c service on restbase2018 
is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:25:46] RECOVERY - cassandra-b service on restbase2018 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:25:52] RECOVERY - cassandra-a service on restbase2018 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:27:26] RECOVERY - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.126 port 9042 https://phabricator.wikimedia.org/T93886 [07:27:38] RECOVERY - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-c valid until 2024-08-31 00:11:10 +0000 (expires in 448 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [07:27:52] RECOVERY - cassandra-a SSL 10.192.48.124:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-a valid until 2024-08-31 00:11:05 +0000 (expires in 448 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [07:28:08] RECOVERY - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.124 port 9042 https://phabricator.wikimedia.org/T93886 [07:28:08] RECOVERY - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is OK: TCP OK - 0.033 second response time on 10.192.48.125 port 9042 https://phabricator.wikimedia.org/T93886 [07:28:12] RECOVERY - cassandra-b SSL 10.192.48.125:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-b valid until 2024-08-31 00:11:07 +0000 (expires in 448 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [07:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:32] !log jmm@puppetmaster1001 conftool action : 
set/pooled=inactive; selector: name=mw1492.eqiad.wmnet [07:34:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:35:55] 10SRE, 10ops-eqiad, 10serviceops: mw1492 is down - https://phabricator.wikimedia.org/T338566 (10MoritzMuehlenhoff) [07:36:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P49347 and previous config saved to /var/cache/conftool/dbconfig/20230609-073624-ladsgroup.json [07:49:25] (03CR) 10Joal: [C: 03+1] "Ok for the version number and syntax, this needs to be deployed AFTER the new refinery-source package has been released within refinery" [puppet] - 10https://gerrit.wikimedia.org/r/928525 (https://phabricator.wikimedia.org/T335308) (owner: 10Snwachukwu) [07:51:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P49348 and previous config saved to /var/cache/conftool/dbconfig/20230609-075130-ladsgroup.json [07:55:34] (03PS1) 10Muehlenhoff: Disable cadvisor for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/928782 [07:55:43] (03PS2) 10Muehlenhoff: Disable cadvisor for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/928782 [08:02:09] (03CR) 10Filippo Giunchedi: "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/928782 (owner: 10Muehlenhoff) [08:02:14] (03CR) 10Filippo Giunchedi: [C: 03+1] Disable cadvisor for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/928782 (owner: 10Muehlenhoff) [08:06:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T336886)', diff saved to https://phabricator.wikimedia.org/P49349 and previous config saved to /var/cache/conftool/dbconfig/20230609-080637-ladsgroup.json [08:06:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:06:41] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [08:07:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:07:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T336886)', diff saved to https://phabricator.wikimedia.org/P49350 and previous config saved to /var/cache/conftool/dbconfig/20230609-080708-ladsgroup.json [08:11:28] PROBLEM - SSH on wdqs2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:12:28] (03Abandoned) 10Ilias Sarantopoulos: ml-services: debug HIP for AMD GPU usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/928085 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [08:15:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:56] (03CR) 10Muehlenhoff: [C: 03+2] Disable cadvisor for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/928782 (owner: 10Muehlenhoff) [08:15:57] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10SLyngshede-WMF) [08:18:50] 
(03CR) 10Daniel Kinzler: "I want to start with enwiki because if we see it works, the rest should also work." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [08:20:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T336886)', diff saved to https://phabricator.wikimedia.org/P49351 and previous config saved to /var/cache/conftool/dbconfig/20230609-082310-ladsgroup.json [08:23:15] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [08:24:28] (03CR) 10Muehlenhoff: [C: 03+2] krb5: Use delaycompress for KDC logs [puppet] - 10https://gerrit.wikimedia.org/r/928701 (https://phabricator.wikimedia.org/T337906) (owner: 10Muehlenhoff) [08:28:33] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [08:30:06] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10SLyngshede-WMF) User has been added to the LDAP NDA group; we're holding off processing the rest until after training. [08:31:38] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/928641 (owner: 10Slyngshede) [08:37:02] (03CR) 10Slyngshede: [C: 03+2] C:IDM Add ldap group settings.
[puppet] - 10https://gerrit.wikimedia.org/r/928641 (owner: 10Slyngshede) [08:38:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P49352 and previous config saved to /var/cache/conftool/dbconfig/20230609-083816-ladsgroup.json [08:39:52] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:53:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P49353 and previous config saved to /var/cache/conftool/dbconfig/20230609-085322-ladsgroup.json [09:00:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [09:01:48] (03PS1) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 [09:02:07] (03CR) 10CI reject: [V: 04-1] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (owner: 10Ayounsi) [09:04:41] (03PS2) 10Arturo Borrero Gonzalez: labstore: Add glamwikidashboard project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/926599 (https://phabricator.wikimedia.org/T338063) (owner: 10Milimetric) [09:05:14] PROBLEM - 
Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:22] PROBLEM - Check systemd state on wdqs2021 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service,wdqs-blazegraph.service,wdqs-updater.service,wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labstore: Add glamwikidashboard project to dumps mounts [puppet] - 10https://gerrit.wikimedia.org/r/926599 (https://phabricator.wikimedia.org/T338063) (owner: 10Milimetric) [09:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T336886)', diff saved to https://phabricator.wikimedia.org/P49354 and previous config saved to /var/cache/conftool/dbconfig/20230609-090829-ladsgroup.json [09:08:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [09:08:33] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [09:08:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2141.codfw.wmnet with reason: Maintenance [09:09:16] RECOVERY - SSH on wdqs2021 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:09:31] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10SLyngshede-WMF) a:03SLyngshede-WMF I think we're done, but just cleaning up a few comments and old naming.
[09:11:18] (03PS2) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 [09:12:54] (03PS3) 10Ayounsi: NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) [09:13:00] (03CR) 10CI reject: [V: 04-1] NetboxInventory: use GraphQL and save ~30s at each run [software/homer] - 10https://gerrit.wikimedia.org/r/928795 (https://phabricator.wikimedia.org/T310577) (owner: 10Ayounsi) [09:18:25] Has someone reported that beta wiki functions is dying already? [09:19:41] Request ID: ZILuK1Z0RVllJxE91Yeu@AAAAAk [09:19:41] Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki/php-master/includes/config/EtcdConfig.php:229 [09:19:53] I think that’s https://phabricator.wikimedia.org/T338495 [09:20:03] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/927782 [09:20:29] (03PS3) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: refresh VIPs [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) [09:21:06] It is; thanks Lucas_WMDE [09:21:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:21:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:21:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T336886)', diff saved to https://phabricator.wikimedia.org/P49355 and previous config saved to /var/cache/conftool/dbconfig/20230609-092141-ladsgroup.json [09:21:45] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [09:21:56] (03PS1) 10Majavah: ssh: support listening on multiple 
ports [puppet] - 10https://gerrit.wikimedia.org/r/928797 (https://phabricator.wikimedia.org/T337241) [09:23:48] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/928508/41639/" [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:24:06] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] cloudservices: codfw1dev: refresh VIPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:24:39] (03PS1) 10Lucas Werkmeister (WMDE): [wikidatawiki] Add pagelang to wikidata-staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928798 (https://phabricator.wikimedia.org/T337760) [09:26:48] (03CR) 10Majavah: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/928797/41640/" [puppet] - 10https://gerrit.wikimedia.org/r/928797 (https://phabricator.wikimedia.org/T337241) (owner: 10Majavah) [09:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:33] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:34:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:57] (03CR) 10Cathal Mooney: [C: 03+1] cloudservices: codfw1dev: refresh VIPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:36:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T336886)', diff saved to https://phabricator.wikimedia.org/P49356 and previous config saved to /var/cache/conftool/dbconfig/20230609-093638-ladsgroup.json [09:36:42] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [09:37:10] 10SRE, 10ops-eqiad, 10serviceops: mw1492 is down - https://phabricator.wikimedia.org/T338566 (10elukey) ` elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "mw1492.mgmt.eqiad.wmnet" -U root -E mc reset cold Unable to read password from environment Password: Error: Unable to establish IPMI v2 / RMCP+ session `... 
[09:41:20] (03PS4) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: refresh VIPs [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) [09:41:36] (03CR) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: refresh VIPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:43:20] (03PS1) 10Marco Fossati: ImageSuggestions: add help link to 4 new languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) [09:45:13] (03PS3) 10Ilias Sarantopoulos: ml-services: deploy LLM model falcon-7b-instruct with GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/927611 (https://phabricator.wikimedia.org/T333861) [09:45:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:00] (03CR) 10Cathal Mooney: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:48:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices: codfw1dev: refresh VIPs [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [09:49:20] (03PS4) 10Ilias Sarantopoulos: ml-services: deploy LLM model falcon-7b-instruct with GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/927611 (https://phabricator.wikimedia.org/T333861) [09:50:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:24] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928645 (owner: 10JHathaway) [09:51:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P49357 and previous config saved to /var/cache/conftool/dbconfig/20230609-095144-ladsgroup.json [09:52:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927796 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [09:53:27] (03CR) 10Elukey: [C: 03+2] ml-services: deploy LLM model falcon-7b-instruct with GPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/927611 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [09:53:51] (03PS4) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers [dns] - 10https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) [09:54:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928642 (owner: 10JHathaway) [09:54:25] !log installing jupyter-core security updates on bullseye [09:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:54:32] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:54:33] (03Abandoned) 10Arturo Borrero Gonzalez: profile::bird::anycast: add template parameter [puppet] - 10https://gerrit.wikimedia.org/r/904518 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [09:55:39] (03CR) 10Jbond: "lgtm excluding the comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [09:56:43] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:57:08] !log increase {eqiad,codfw}.change-prop.transcludes.resource-change topic partitions (3->5) on kafka main clusters - T338357 [09:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:11] T338357: Pushing jobs to jobqueue is slow again - https://phabricator.wikimedia.org/T338357 [09:59:47] (03CR) 10Matthias Mullie: [C: 03+1] "LGTM; ready for deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928801 (https://phabricator.wikimedia.org/T331036) (owner: 10Marco Fossati) [10:03:03] (03CR) 10Jbond: "lgtm but see comment" [puppet] - 10https://gerrit.wikimedia.org/r/928661 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:04:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928644 (owner: 10JHathaway) [10:06:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P49358 and previous config saved to /var/cache/conftool/dbconfig/20230609-100650-ladsgroup.json [10:07:26] (03CR) 10Jbond: "lgtm circa the namespacing issue but i think we can deal with that separately; if it works then it's just semantics at the end of the day" [puppet] -
10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:08:01] (03CR) 10Jbond: [C: 03+1] dev env: avoid kernel tweaks when in a container [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:08:49] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [10:09:37] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [10:10:28] (03PS3) 10Clément Goubert: Backport preStop sleep and draining changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T331609) [10:12:00] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [10:12:29] (03CR) 10Jbond: [C: 03+1] "LGTM see nit" [puppet] - 10https://gerrit.wikimedia.org/r/928663 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:12:51] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [10:15:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:21] (03PS1) 10Zabe: admin: Add zabe to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/928803 (https://phabricator.wikimedia.org/T337703) [10:17:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Drop disabling removed Datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928600 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [10:17:38] (03PS1) 10Ssingh: Revert "Revert "lvs2014: commission new LVS host (codfw hardware refresh)"" [puppet] - 10https://gerrit.wikimedia.org/r/928610 [10:17:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "should be ok to deploy once T332724 is ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928601 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) 
[10:18:08] (03CR) 10Jbond: "see comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/928662 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:18:59] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10kostajh) [10:19:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:40] (03CR) 10Jbond: dev env: have ssh server use the dev environment ssh configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928664 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:20:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928669 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:21:04] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "lvs2014: commission new LVS host (codfw hardware refresh)"" [puppet] - 10https://gerrit.wikimedia.org/r/928610 (owner: 10Ssingh) [10:21:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2014.codfw.wmnet with OS bullseye [10:21:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye [10:21:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T336886)', diff saved to https://phabricator.wikimedia.org/P49359 and previous config saved to /var/cache/conftool/dbconfig/20230609-102156-ladsgroup.json [10:21:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [10:22:00] T336886: Add user_is_temp field to the user table in 
MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:22:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2146.codfw.wmnet with reason: Maintenance [10:22:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T336886)', diff saved to https://phabricator.wikimedia.org/P49360 and previous config saved to /var/cache/conftool/dbconfig/20230609-102217-ladsgroup.json [10:23:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928671 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:23:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928670 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:24:05] (03PS1) 10Majavah: bird: prune unmanaged anycast-healthchecker checks [puppet] - 10https://gerrit.wikimedia.org/r/928804 [10:24:43] (03PS1) 10Slyngshede: Addresses feedback from testing [software/bitu] - 10https://gerrit.wikimedia.org/r/928805 [10:25:17] (03CR) 10Jbond: [C: 03+1] "ahh great i think it would be better to use this file to do things like disable puppet agent and alter how we manage ssh."
[puppet] - 10https://gerrit.wikimedia.org/r/928672 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:28:25] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: recursor: add new BGP VIP [puppet] - 10https://gerrit.wikimedia.org/r/928806 (https://phabricator.wikimedia.org/T307357) [10:28:46] (03CR) 10CI reject: [V: 04-1] openstack: designate: recursor: add new BGP VIP [puppet] - 10https://gerrit.wikimedia.org/r/928806 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] bird: prune unmanaged anycast-healthchecker checks [puppet] - 10https://gerrit.wikimedia.org/r/928804 (owner: 10Majavah) [10:30:30] (03CR) 10Jbond: dev env: don't pull firewall rules from etcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928662 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [10:31:13] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM good stuff :)" [puppet] - 10https://gerrit.wikimedia.org/r/928804 (owner: 10Majavah) [10:31:19] (03PS4) 10Clément Goubert: Backport preStop sleep and draining changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/928791 (https://phabricator.wikimedia.org/T338210) [10:32:38] (03CR) 10Muehlenhoff: "Looks good, one comment inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/928805 (owner: 10Slyngshede) [10:33:03] (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: recursor: add new BGP VIP [puppet] - 10https://gerrit.wikimedia.org/r/928806 (https://phabricator.wikimedia.org/T307357) [10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:34:24] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Superpes15) >>! In T338468#8916922, @thcipriani wrote: > Now that @Superpes15 has NDA, they can attend a deployment training slot (note: use [[ https://phabricato... [10:35:35] (03PS1) 10Majavah: pdns_server: require bullseye [puppet] - 10https://gerrit.wikimedia.org/r/928808 [10:35:37] (03PS1) 10Majavah: pdns_server: allow listening on multiple addresses [puppet] - 10https://gerrit.wikimedia.org/r/928809 [10:35:39] (03PS1) 10Majavah: pdns_server: use dns_auth_soa_name instead of a hardcoded value [puppet] - 10https://gerrit.wikimedia.org/r/928810 [10:37:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2014.codfw.wmnet with reason: host reimage [10:37:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T336886)', diff saved to https://phabricator.wikimedia.org/P49361 and previous config saved to /var/cache/conftool/dbconfig/20230609-103711-ladsgroup.json [10:37:16] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [10:37:21] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41641/console" [puppet] - 10https://gerrit.wikimedia.org/r/928810 (owner: 10Majavah) [10:38:28] (03PS2) 10Majavah: pdns_server: allow listening on multiple addresses [puppet] - 10https://gerrit.wikimedia.org/r/928809 [10:38:29] (03PS2) 10Majavah: pdns_server: use dns_auth_soa_name instead of a hardcoded value [puppet] - 10https://gerrit.wikimedia.org/r/928810 [10:38:59] (03CR) 10Ayounsi: "Adding sukhe as reviewer as it might impact DNS and I don't 
know enough those "file" parameter." [puppet] - 10https://gerrit.wikimedia.org/r/928804 (owner: 10Majavah) [10:39:32] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41642/console" [puppet] - 10https://gerrit.wikimedia.org/r/928810 (owner: 10Majavah) [10:40:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] pdns_server: require bullseye [puppet] - 10https://gerrit.wikimedia.org/r/928808 (owner: 10Majavah) [10:40:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2014.codfw.wmnet with reason: host reimage [10:44:40] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:45:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:21] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/output/928809/41643/" [puppet] - 10https://gerrit.wikimedia.org/r/928809 (owner: 10Majavah) [10:48:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:30] (03PS1) 10Ilias Sarantopoulos: ml-services: remove falcon llm deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/928812 [10:51:02] (03CR) 10Muehlenhoff: "I'm wondering if can't simply replace this with a debconf::set, wouldn't that also have the same effect? 
Does this settng cause any issues" [puppet] - 10https://gerrit.wikimedia.org/r/928644 (owner: 10JHathaway) [10:51:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [10:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P49362 and previous config saved to /var/cache/conftool/dbconfig/20230609-105217-ladsgroup.json [10:53:11] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10jbond) 05Open→03Resolved i think you are right ` $ cumin R:cron... [10:53:20] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) [10:53:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2014.codfw.wmnet with OS bullseye [10:53:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye completed:... 
[10:57:50] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [11:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:57] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add new LVS host lvs2014 (codfw hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/928113 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [11:02:26] !log homer "cr*-codfw*" commit "Gerrit: 928113 add new LVS host lvs2014 [11:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P49363 and previous config saved to /var/cache/conftool/dbconfig/20230609-110723-ladsgroup.json [11:07:58] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10MoritzMuehlenhoff) > ill close this no need to keep it around to change comments, great work all It's not just comments, there's al... 
[11:08:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] pdns_server: allow listening on multiple addresses [puppet] - 10https://gerrit.wikimedia.org/r/928809 (owner: 10Majavah) [11:08:48] (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post hardware refresh (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/928818 (https://phabricator.wikimedia.org/T326767) [11:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [11:13:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T338333 (10phaultfinder) [11:14:43] !log sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080 [11:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:10] (03PS1) 10Muehlenhoff: Remove access for fsero [puppet] - 10https://gerrit.wikimedia.org/r/928819 [11:17:59] (03CR) 10Elukey: [C: 03+2] ml-services: remove falcon llm deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/928812 (owner: 10Ilias Sarantopoulos) [11:19:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:44] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for fsero [puppet] - 10https://gerrit.wikimedia.org/r/928819 (owner: 10Muehlenhoff) [11:20:22] !log pcc-db1001: sudo systemctl start pcc_facts_processor.service [11:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T336886)', diff 
saved to https://phabricator.wikimedia.org/P49364 and previous config saved to /var/cache/conftool/dbconfig/20230609-112229-ladsgroup.json [11:22:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [11:22:34] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:22:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2153.codfw.wmnet with reason: Maintenance [11:22:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T336886)', diff saved to https://phabricator.wikimedia.org/P49365 and previous config saved to /var/cache/conftool/dbconfig/20230609-112250-ladsgroup.json [11:23:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [11:26:05] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/928818/41648/" [puppet] - 10https://gerrit.wikimedia.org/r/928818 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [11:27:06] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:27:24] topranks: XioNoX, is that anycast alert above expected ? 
[11:27:34] I see in cloud-admin there was some discussion around bird changes [11:27:42] claime: cloudsw1-b1-codfw is their dev/test env [11:27:50] so even if not expected you can ignore [11:27:55] a'ight [11:27:56] thanks [11:28:04] arturo: ^ [11:29:08] yep it's understood arturo is working on some patches to fix it up [11:29:12] thanks claime for the heads up [11:29:37] I get twitchy around network alerts :p [11:29:50] thanks for the quick answer [11:30:32] yes, working on it, thanks for the heads up [11:30:33] topranks: glad to see the alert working as expected too [11:30:43] as it's a recent check [11:30:55] I'm also happy introducing some rule for the monitoring system to don't alert about that device [11:32:46] I'll leave that up to the network SREs :) [11:35:49] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [11:36:21] (03CR) 10Arturo Borrero Gonzalez: [V: 04-1] "PCC detected a subtle change: https://puppet-compiler.wmflabs.org/output/928810/41647/ that should be further investigated" [puppet] - 10https://gerrit.wikimedia.org/r/928810 (owner: 10Majavah) [11:37:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T336886)', diff saved to https://phabricator.wikimedia.org/P49366 and previous config saved to /var/cache/conftool/dbconfig/20230609-113724-ladsgroup.json [11:37:30] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [11:38:15] (03CR) 10Majavah: [C: 04-1] openstack: designate: recursor: add new BGP VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928806 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:38:47] icinga alert above is related to user removal [11:39:45] (03PS3) 10Arturo Borrero Gonzalez: openstack: designate: recursor: add new BGP VIP [puppet] - 10https://gerrit.wikimedia.org/r/928806 
(https://phabricator.wikimedia.org/T307357) [11:41:41] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/928806/41649/" [puppet] - 10https://gerrit.wikimedia.org/r/928806 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:43:06] (03PS1) 10Majavah: openstack: designate: auth: allow configuring ips to listen on [puppet] - 10https://gerrit.wikimedia.org/r/928822 (https://phabricator.wikimedia.org/T307357) [11:43:34] (03CR) 10Majavah: [C: 03+1] openstack: designate: recursor: add new BGP VIP [puppet] - 10https://gerrit.wikimedia.org/r/928806 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:43:59] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: designate: recursor: add new BGP VIP [puppet] - 10https://gerrit.wikimedia.org/r/928806 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [11:44:20] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41650/console" [puppet] - 10https://gerrit.wikimedia.org/r/928822 (https://phabricator.wikimedia.org/T307357) (owner: 10Majavah) [11:44:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:45:04] (03PS2) 10Slyngshede: Addresses feedback from testing [software/bitu] - 10https://gerrit.wikimedia.org/r/928805 [11:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:19] (03PS2) 10Majavah: openstack: designate: auth: allow configuring ips to listen on [puppet] - 10https://gerrit.wikimedia.org/r/928822 
(https://phabricator.wikimedia.org/T307357) [11:46:26] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41651/console" [puppet] - 10https://gerrit.wikimedia.org/r/928822 (https://phabricator.wikimedia.org/T307357) (owner: 10Majavah) [11:46:34] (03CR) 10Slyngshede: Addresses feedback from testing (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/928805 (owner: 10Slyngshede) [11:46:56] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Addresses feedback from testing [software/bitu] - 10https://gerrit.wikimedia.org/r/928805 (owner: 10Slyngshede) [11:47:02] (03PS1) 10Muehlenhoff: Remove fsero from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/928823 [11:47:43] (03PS1) 10Urbanecm: [Growth] Enable user impact refresh for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928824 (https://phabricator.wikimedia.org/T336203) [11:47:45] (03PS1) 10Urbanecm: [Growth] Enable new Impact module for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928825 (https://phabricator.wikimedia.org/T336203) [11:48:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: designate: auth: allow configuring ips to listen on [puppet] - 10https://gerrit.wikimedia.org/r/928822 (https://phabricator.wikimedia.org/T307357) (owner: 10Majavah) [11:49:21] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Fsero out of all services on: 1262 hosts [11:49:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:49:55] (03CR) 
10Clément Goubert: [C: 03+1] Remove fsero from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/928823 (owner: 10Muehlenhoff) [11:50:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Fsero out of all services on: 1262 hosts [11:50:18] (03CR) 10Muehlenhoff: [C: 03+2] Remove fsero from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/928823 (owner: 10Muehlenhoff) [11:52:13] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Fsero out of all services on: 778 hosts [11:52:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P49367 and previous config saved to /var/cache/conftool/dbconfig/20230609-115230-ladsgroup.json [11:52:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Fsero out of all services on: 778 hosts [11:56:24] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [11:59:08] (03PS1) 10Majavah: hieradata: update codfw1dev dns healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/928826 [12:00:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: update codfw1dev dns healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/928826 (owner: 10Majavah) [12:00:25] (03PS1) 10Slyngshede: C:IDM Enable tooltips for ldap attributes. 
[puppet] - 10https://gerrit.wikimedia.org/r/928827 [12:01:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [12:03:30] (Not accepting/receiving prefixes from anycast BGP peer) resolved: Device cloudsw1-b1-codfw.mgmt.codfw.wmnet recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [12:07:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P49368 and previous config saved to /var/cache/conftool/dbconfig/20230609-120737-ladsgroup.json [12:12:10] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10BTullis) I think this can be resolved, although I haven't categorically proven that it's... [12:13:15] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Shared-Data-Infrastructure, 10User-MoritzMuehlenhoff: Hadoop MapReduce port range cannot be configured to a fixed range - https://phabricator.wikimedia.org/T111433 (10BTullis) 05Open→03Resolved [12:13:39] !log aborrero@cumin2002 START - Cookbook sre.dns.netbox [12:13:53] (03CR) 10Muehlenhoff: C:IDM Enable tooltips for ldap attributes. 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928827 (owner: 10Slyngshede) [12:15:35] !log aborrero@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw2003-dev - aborrero@cumin2002" [12:16:39] !log aborrero@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw2003-dev - aborrero@cumin2002" [12:16:39] !log aborrero@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:17:14] (03PS2) 10Slyngshede: C:IDM Enable tooltips for ldap attributes. [puppet] - 10https://gerrit.wikimedia.org/r/928827 [12:18:31] (03PS3) 10Slyngshede: C:IDM Enable tooltips for ldap attributes. [puppet] - 10https://gerrit.wikimedia.org/r/928827 [12:20:10] (03CR) 10Slyngshede: C:IDM Enable tooltips for ldap attributes. (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928827 (owner: 10Slyngshede) [12:22:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T336886)', diff saved to https://phabricator.wikimedia.org/P49369 and previous config saved to /var/cache/conftool/dbconfig/20230609-122243-ladsgroup.json [12:22:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [12:22:47] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:22:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2167.codfw.wmnet with reason: Maintenance [12:23:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T336886)', diff saved to https://phabricator.wikimedia.org/P49370 and previous config saved to /var/cache/conftool/dbconfig/20230609-122303-ladsgroup.json [12:27:14] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:28:06] (03CR) 10Muehlenhoff: "Couple more nits." 
[puppet] - 10https://gerrit.wikimedia.org/r/928827 (owner: 10Slyngshede) [12:29:10] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-add DNS for cloud-hosts-codfw vlan. - cmooney@cumin1001" [12:30:10] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-add DNS for cloud-hosts-codfw vlan. - cmooney@cumin1001" [12:30:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:30:18] (03PS4) 10Slyngshede: C:IDM Enable tooltips for ldap attributes. [puppet] - 10https://gerrit.wikimedia.org/r/928827 [12:30:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:48] (03CR) 10Slyngshede: C:IDM Enable tooltips for ldap attributes. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928827 (owner: 10Slyngshede) [12:32:45] (03CR) 10Muehlenhoff: [C: 03+1] "One final nit" [puppet] - 10https://gerrit.wikimedia.org/r/928827 (owner: 10Slyngshede) [12:33:26] (03PS5) 10Slyngshede: C:IDM Enable tooltips for ldap attributes. [puppet] - 10https://gerrit.wikimedia.org/r/928827 [12:34:08] (03CR) 10Slyngshede: C:IDM Enable tooltips for ldap attributes. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928827 (owner: 10Slyngshede) [12:34:10] (03CR) 10Slyngshede: [C: 03+2] C:IDM Enable tooltips for ldap attributes. 
[puppet] - 10https://gerrit.wikimedia.org/r/928827 (owner: 10Slyngshede) [12:35:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:57] (03PS1) 10Slyngshede: Extra " [software/bitu] - 10https://gerrit.wikimedia.org/r/928833 [12:37:25] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Extra " [software/bitu] - 10https://gerrit.wikimedia.org/r/928833 (owner: 10Slyngshede) [12:37:32] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:33] (03PS2) 10Krinkle: Profiler: Replace copy of ExcimerClient.php with git submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926574 (https://phabricator.wikimedia.org/T337873) [12:40:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T336886)', diff saved to https://phabricator.wikimedia.org/P49371 and previous config saved to /var/cache/conftool/dbconfig/20230609-124002-ladsgroup.json [12:40:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [12:41:04] * Krinkle locks deploy1002 for testing on mwdebug1002 [12:42:21] (03CR) 10Krinkle: [C: 03+2] Profiler: Replace copy of ExcimerClient.php with git submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926574 (https://phabricator.wikimedia.org/T337873) (owner: 10Krinkle) [12:43:23] (03Merged) 10jenkins-bot: Profiler: Replace copy of ExcimerClient.php with git submodule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926574 (https://phabricator.wikimedia.org/T337873) (owner: 10Krinkle) [12:43:31] !log krinkle@deploy1002 Started scap: I385d28d2edacb37 [12:45:03] 
(03PS1) 10Krinkle: Profiler: Include the hostname in the URL for Excimer UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 [12:47:53] (03PS1) 10Cathal Mooney: Add private cloud-vips range to prefix list for cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/928838 (https://phabricator.wikimedia.org/T307357) [12:50:30] !log krinkle@deploy1002 Finished scap: I385d28d2edacb37 (duration: 06m 59s) [12:51:24] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Add private cloud-vips range to prefix list for cloudsw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/928838 (https://phabricator.wikimedia.org/T307357) (owner: 10Cathal Mooney) [12:53:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add private cloud-vips range to prefix list for cloudsw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/928838 (https://phabricator.wikimedia.org/T307357) (owner: 10Cathal Mooney) [12:53:34] (03CR) 10Cathal Mooney: Add private cloud-vips range to prefix list for cloudsw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/928838 (https://phabricator.wikimedia.org/T307357) (owner: 10Cathal Mooney) [12:53:36] (03CR) 10Cathal Mooney: [C: 03+2] Add private cloud-vips range to prefix list for cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/928838 (https://phabricator.wikimedia.org/T307357) (owner: 10Cathal Mooney) [12:55:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P49373 and previous config saved to /var/cache/conftool/dbconfig/20230609-125508-ladsgroup.json [12:55:32] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2014 [12:55:33] (03CR) 10Cathal Mooney: [C: 03+2] Add private cloud-vips range to prefix list for cloudsw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/928838 (https://phabricator.wikimedia.org/T307357) (owner: 10Cathal Mooney) [12:56:01] !log 
sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2014 [12:56:02] (03Merged) 10jenkins-bot: Add private cloud-vips range to prefix list for cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/928838 (https://phabricator.wikimedia.org/T307357) (owner: 10Cathal Mooney) [12:57:32] (03PS1) 10Muehlenhoff: Create additional nftables directories [puppet] - 10https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) [12:57:45] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2014 [12:57:54] (03CR) 10CI reject: [V: 04-1] Create additional nftables directories [puppet] - 10https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:57:59] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2014 [12:59:33] (03PS2) 10Muehlenhoff: Create additional nftables directories [puppet] - 10https://gerrit.wikimedia.org/r/928839 (https://phabricator.wikimedia.org/T336497) [12:59:51] !log sudo cumin 'A:lvs and A:codfw' 'disable-puppet "CR 928818"' [12:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:40] (03CR) 10Ssingh: [C: 03+2] hiera: lvs/balancer: unify hiera post hardware refresh (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/928818 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [13:01:50] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2014 [13:02:27] !log sudo cumin 'A:lvs and A:codfw' 'enable-puppet "CR 928818"' [13:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs2014 [13:05:39] (03PS1) 10Ssingh: depool codfw (emergency patch, do not merge, testing new LVS host) [dns] - 
10https://gerrit.wikimedia.org/r/928840 [13:07:41] !log stop pybal on lvs2013 to test lvs2014 [13:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:10:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P49376 and previous config saved to /var/cache/conftool/dbconfig/20230609-131014-ladsgroup.json [13:10:26] claime: expected ^ [13:10:32] if it breaks, I will fix it so on me :) [13:12:32] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:12:54] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:13:11] ^ expected [13:13:17] (03CR) 10Krinkle: [C: 04-1] "Doesn't work as intended. 
Fixing in https://gerrit.wikimedia.org/r/c/performance/excimer-ui-client/+/928841" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928836 (owner: 10Krinkle) [13:13:24] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:16:30] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [13:24:50] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:00] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T336886)', diff saved to https://phabricator.wikimedia.org/P49377 and previous config saved to /var/cache/conftool/dbconfig/20230609-132520-ladsgroup.json [13:25:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:25:25] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:25:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:25:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T336886)', diff saved to https://phabricator.wikimedia.org/P49378 and previous config saved to /var/cache/conftool/dbconfig/20230609-132541-ladsgroup.json [13:25:57] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [13:26:21] (03CR) 10Elukey: [V: 03+1 C: 03+2] Move 
cp4037's varnishkafka instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825) (owner: 10Elukey) [13:28:54] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [13:29:04] !log start pybal on lvs2013 [13:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on cp4037.ulsfo.wmnet with reason: Working on vk [13:29:18] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:29:38] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:30:08] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:06] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:54] (03PS1) 10Elukey: profile::cache::kafka::certificate: use root instead of the kafka user [puppet] - 10https://gerrit.wikimedia.org/r/928846 (https://phabricator.wikimedia.org/T337825) [13:32:22] (03CR) 10Ssingh: "Looks good in theory: we manage the hc-vip* files under anycast-healthchecker.d ourselves but like Arzhel said, please run PCC and happy t" [puppet] - 10https://gerrit.wikimedia.org/r/928804 (owner: 10Majavah) [13:33:06] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 77 connections established with conf2004.codfw.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [13:33:27] (03CR) 
Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41652/console" [puppet] - https://gerrit.wikimedia.org/r/928846 (https://phabricator.wikimedia.org/T337825) (owner: Elukey) [13:34:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:58] (CR) Ssingh: [C: +1] profile::cache::kafka::certificate: use root instead of the kafka user [puppet] - https://gerrit.wikimedia.org/r/928846 (https://phabricator.wikimedia.org/T337825) (owner: Elukey) [13:35:32] (CR) Elukey: [V: +1 C: +2] profile::cache::kafka::certificate: use root instead of the kafka user [puppet] - https://gerrit.wikimedia.org/r/928846 (https://phabricator.wikimedia.org/T337825) (owner: Elukey) [13:35:38] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:31] * Krinkle more testing on mwdebug1002 [13:37:08] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:34] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T336886)', diff saved to https://phabricator.wikimedia.org/P49379 and previous config saved to /var/cache/conftool/dbconfig/20230609-134137-ladsgroup.json [13:41:41] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [13:45:27] SRE, Wikimedia-Planet: Find a replacement for RSS aggregator for 
planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (MoritzMuehlenhoff) Another option which will be used for Planet Debian: https://grep.be/blog//en/computer/ptlink/Planet_Debian_rendered_with_PtLink/ [13:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:50:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:56:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P49380 and previous config saved to /var/cache/conftool/dbconfig/20230609-135643-ladsgroup.json [13:57:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.229 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:57:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:06:56] (PS1) Elukey: Revert "Move cp4037's varnishkafka instances to PKI" [puppet] - https://gerrit.wikimedia.org/r/928616 [14:07:33] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:55] 
(PS1) Samtar: cloud pki: add (new) add deployment-prep agents as authorised clients [puppet] - https://gerrit.wikimedia.org/r/928851 (https://phabricator.wikimedia.org/T338495) [14:09:24] PROBLEM - Check systemd state on cp4037 is CRITICAL: CRITICAL - degraded: The following units failed: varnishkafka-webrequest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:26] PROBLEM - Webrequests Varnishkafka log producer on cp4037 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:11:19] (CR) Elukey: [C: +2] Revert "Move cp4037's varnishkafka instances to PKI" [puppet] - https://gerrit.wikimedia.org/r/928616 (owner: Elukey) [14:11:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P49381 and previous config saved to /var/cache/conftool/dbconfig/20230609-141149-ladsgroup.json [14:13:34] RECOVERY - Webrequests Varnishkafka log producer on cp4037 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [14:14:08] RECOVERY - Check systemd state on cp4037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:21] SRE, Data-Engineering, Data-Platform-SRE, Traffic: Move varnishkafka to PKI - https://phabricator.wikimedia.org/T337825 (elukey) ` Jun 09 14:05:42 cp4037 varnishkafka[3568251]: %3|1686319542.526|FAIL|varnishkafka#producer-1| [thrd:ssl://kafka-jumbo1009.eqiad.wmnet:9093/bootstrap]: ssl://kafka-jum... 
[14:14:40] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [14:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:32] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:18] SRE, ops-eqiad, DC-Ops, Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (Papaul) a:Jclark-ctr→Papaul [14:19:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:51] (PS2) Samtar: cloud pki: add (new) add deployment-prep agents as authorised clients [puppet] - https://gerrit.wikimedia.org/r/928851 (https://phabricator.wikimedia.org/T338495) [14:21:44] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:18] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:52] (CR) Jbond: [C: +1] "LGTM barring the commit message but lets not deploy on a friday" [puppet] - https://gerrit.wikimedia.org/r/927788 (https://phabricator.wikimedia.org/T338279) (owner: JHathaway) [14:25:20] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:42] (CR) Jbond: [C: +1] ldap::client::sssd: use strongly typed parameters [puppet] - https://gerrit.wikimedia.org/r/924985 (owner: Majavah) [14:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T336886)', diff saved to https://phabricator.wikimedia.org/P49382 and previous config saved to /var/cache/conftool/dbconfig/20230609-142655-ladsgroup.json [14:26:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [14:26:59] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:27:00] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:27:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2173.codfw.wmnet with reason: Maintenance [14:27:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:27:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:27:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T336886)', diff saved to https://phabricator.wikimedia.org/P49383 and previous config saved to /var/cache/conftool/dbconfig/20230609-142731-ladsgroup.json [14:29:15] (PS1) Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) [14:29:40] (CR) CI reject: [V: -1] profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 
(https://phabricator.wikimedia.org/T337825) (owner: Elukey) [14:29:42] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:19] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (Jclark-ctr) @papaul i did not have any luck last night imaging servers still failling [14:32:52] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:28] (PS2) Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) [14:34:48] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41654/console" [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: Elukey) [14:35:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:47] (CR) CI reject: [V: -1] profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: Elukey) [14:38:15] (PS1) Jbond: deployment-prep: add new puppet ca public cert [puppet] - https://gerrit.wikimedia.org/r/928856 (https://phabricator.wikimedia.org/T338495) [14:38:48] (CR) Jbond: [V: +2 C: +2] deployment-prep: add 
new puppet ca public cert [puppet] - https://gerrit.wikimedia.org/r/928856 (https://phabricator.wikimedia.org/T338495) (owner: Jbond) [14:38:56] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:49] (Abandoned) Samtar: cloud pki: add (new) add deployment-prep agents as authorised clients [puppet] - https://gerrit.wikimedia.org/r/928851 (https://phabricator.wikimedia.org/T338495) (owner: Samtar) [14:41:48] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T336886)', diff saved to https://phabricator.wikimedia.org/P49384 and previous config saved to /var/cache/conftool/dbconfig/20230609-144305-ladsgroup.json [14:43:10] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [14:43:28] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:33] (CR) Jbond: "ahh sorry missed this CR, currently on regional holiday so just skimmed the task" [puppet] - https://gerrit.wikimedia.org/r/928851 (https://phabricator.wikimedia.org/T338495) (owner: Samtar) [14:44:48] (CR) Samtar: cloud pki: add (new) add deployment-prep agents as authorised clients (1 comment) [puppet] - https://gerrit.wikimedia.org/r/928851 (https://phabricator.wikimedia.org/T338495) (owner: Samtar) [14:45:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:08] (CR) Majavah: [C: -1] wikimediacloud.org: refresh FQDNs 
for codfw1dev DNS servers (1 comment) [dns] - https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:50:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:28] (PS5) Arturo Borrero Gonzalez: wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers [dns] - https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) [14:50:57] (PS6) Arturo Borrero Gonzalez: wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers [dns] - https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) [14:51:20] (CR) Arturo Borrero Gonzalez: wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers (1 comment) [dns] - https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [14:53:54] (PS1) Majavah: P:puppet::agent: add script to locate unmanaged files [puppet] - https://gerrit.wikimedia.org/r/928857 [14:54:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:56:08] (CR) CI reject: [V: -1] P:puppet::agent: add script to locate unmanaged files [puppet] - https://gerrit.wikimedia.org/r/928857 (owner: Majavah) [14:57:09] (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41655/console" [puppet] - https://gerrit.wikimedia.org/r/928804 (owner: Majavah) [14:57:27] (PS2) Majavah: P:puppet::agent: add script to locate unmanaged files [puppet] - https://gerrit.wikimedia.org/r/928857 [14:58:11] (CR) Majavah: [V: +1] bird: prune unmanaged anycast-healthchecker checks (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/928804 (owner: Majavah) [14:58:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P49385 and previous config saved to /var/cache/conftool/dbconfig/20230609-145812-ladsgroup.json [14:59:46] (CR) Jbond: [C: +1] "lgtm will test a bit more and merge monday if no-one else has." [puppet] - https://gerrit.wikimedia.org/r/928857 (owner: Majavah) [15:01:38] (PS3) Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) [15:03:29] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41656/console" [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) (owner: Elukey) [15:04:04] (PS1) Hokwelum: Modify the global blocks script to accept output dir [puppet] - https://gerrit.wikimedia.org/r/928861 [15:04:16] (PS3) Majavah: P:puppet::agent: add script to locate unmanaged files [puppet] - https://gerrit.wikimedia.org/r/928857 [15:13:17] (PS4) Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) [15:13:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P49386 and previous config saved to /var/cache/conftool/dbconfig/20230609-151318-ladsgroup.json [15:13:19] (PS1) Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) [15:13:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:14:48] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41657/console" [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: Elukey) [15:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:19] (PS5) Elukey: profile::cache::kafka::certificate: fix client PKI config [puppet] - https://gerrit.wikimedia.org/r/928854 (https://phabricator.wikimedia.org/T337825) [15:16:21] (PS2) Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) [15:17:36] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41658/console" [puppet] - https://gerrit.wikimedia.org/r/928862 (https://phabricator.wikimedia.org/T337825) (owner: Elukey) [15:17:49] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:18:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:06] SRE, ops-eqiad, DC-Ops, Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (Papaul) [15:22:18] !log 
pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entry for snapshot101[6-7] - pt1979@cumin2002" [15:23:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entry for snapshot101[6-7] - pt1979@cumin2002" [15:23:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:27:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host snapshot1016.mgmt.eqiad.wmnet with reboot policy FORCED [15:27:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host snapshot1017.mgmt.eqiad.wmnet with reboot policy FORCED [15:27:54] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:28:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T336886)', diff saved to https://phabricator.wikimedia.org/P49387 and previous config saved to /var/cache/conftool/dbconfig/20230609-152824-ladsgroup.json [15:28:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [15:28:28] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:28:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2174.codfw.wmnet with reason: Maintenance [15:28:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T336886)', diff saved to https://phabricator.wikimedia.org/P49388 and previous config saved to /var/cache/conftool/dbconfig/20230609-152845-ladsgroup.json [15:29:45] (CR) JHathaway: [C: +2] wmflib::dir::mkdir_p: exclude FHS dirs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/928645 (owner: 
JHathaway) [15:29:56] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:39] !log wikitech-static: deleted everything in /srv/mediawiki/images/wikitech/archive for T338520 [15:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:42] T338520: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 [15:31:20] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:27] (CR) JHathaway: [C: +2] puppet7: re-add mailalias core [puppet] - https://gerrit.wikimedia.org/r/927796 (https://phabricator.wikimedia.org/T330490) (owner: JHathaway) [15:34:10] (CR) JHathaway: [C: +2] "thanks for reviewing" [puppet] - https://gerrit.wikimedia.org/r/928642 (owner: JHathaway) [15:34:30] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:52] (CR) Cathal Mooney: [C: +1] "LGTM. From what I understand should be ok to merge now. 
185.15.57.25 is answering auth requests for relevant domains fine from internet," [dns] - https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [15:36:35] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers [dns] - https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) (owner: Arturo Borrero Gonzalez) [15:38:54] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:44] (CR) JHathaway: tshark: use a preseed file, rather than debconf::seen (1 comment) [puppet] - https://gerrit.wikimedia.org/r/928644 (owner: JHathaway) [15:41:13] (PS1) Arturo Borrero Gonzalez: Revert "wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers" [dns] - https://gerrit.wikimedia.org/r/928619 [15:41:34] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:41:37] (CR) Arturo Borrero Gonzalez: [C: +2] Revert "wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers" [dns] - https://gerrit.wikimedia.org/r/928619 (owner: Arturo Borrero Gonzalez) [15:41:44] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:41:48] PROBLEM - Restbase edge drmrs on text-lb.drmrs.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [15:42:02] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is 
CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, [15:42:02] th aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [15:42:06] PROBLEM - cassandra-a CQL 10.64.0.209:9042 on restbase1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:42:08] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [15:42:08] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:42:52] PROBLEM - cassandra-b CQL 10.64.0.210:9042 on restbase1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:06] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:08] PROBLEM - cassandra-c CQL 10.64.0.211:9042 on restbase1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:12] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:16] RECOVERY - Restbase edge drmrs on 
text-lb.drmrs.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:43:36] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:36] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:43:40] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:28] (PS2) JHathaway: tshark: drop debconf::seen [puppet] - https://gerrit.wikimedia.org/r/928644 [15:44:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T336886)', diff saved to https://phabricator.wikimedia.org/P49390 and previous config saved to /var/cache/conftool/dbconfig/20230609-154428-ladsgroup.json [15:44:32] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [15:45:14] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:38] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [15:53:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:56:50] (PS3) JHathaway: apt: ensure profile::apt is applied before packages. 
[puppet] - https://gerrit.wikimedia.org/r/927788 (https://phabricator.wikimedia.org/T338279) [15:57:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host snapshot1016.mgmt.eqiad.wmnet with reboot policy FORCED [15:57:48] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/927788 (https://phabricator.wikimedia.org/T338279) (owner: JHathaway) [15:58:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:47] (Abandoned) JHathaway: DO NOT MERGE: apply profile::apt in separate stage [puppet] - https://gerrit.wikimedia.org/r/927789 (https://phabricator.wikimedia.org/T338279) (owner: JHathaway) [15:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P49391 and previous config saved to /var/cache/conftool/dbconfig/20230609-155934-ladsgroup.json [16:00:12] PROBLEM - Restbase root url on restbase1028 is CRITICAL: connect to address 10.64.0.208 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [16:02:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['snapshot1016'] [16:03:22] (PS1) Arturo Borrero Gonzalez: Revert "Revert "wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers"" [dns] - https://gerrit.wikimedia.org/r/928620 [16:04:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['snapshot1016'] [16:04:59] (PS2) Arturo Borrero Gonzalez: Revert "Revert "wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers"" [dns] - https://gerrit.wikimedia.org/r/928620 [16:05:06] (PS3) JHathaway: dev 
env: nrpe listen on all interfaces in a container [puppet] - https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) [16:05:21] (CR) JHathaway: dev env: nrpe listen on all interfaces in a container (1 comment) [puppet] - https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway) [16:05:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host snapshot1017.mgmt.eqiad.wmnet with reboot policy FORCED [16:05:40] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/927788 (https://phabricator.wikimedia.org/T338279) (owner: JHathaway) [16:05:42] (CR) Cathal Mooney: [C: +1] "LGTM!" [dns] - https://gerrit.wikimedia.org/r/928620 (owner: Arturo Borrero Gonzalez) [16:06:39] (CR) Bartosz Dziewoński: [C: +1] Remove wgDiscussionToolsEnable config [mediawiki-config] - https://gerrit.wikimedia.org/r/927632 (https://phabricator.wikimedia.org/T322497) (owner: Esanders) [16:06:51] (CR) Bartosz Dziewoński: [C: +1] Remove most DiscussionTools feature configs [mediawiki-config] - https://gerrit.wikimedia.org/r/927653 (https://phabricator.wikimedia.org/T322497) (owner: Esanders) [16:08:06] (CR) JHathaway: dev env: avoid kernel tweaks when in a container (1 comment) [puppet] - https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: JHathaway) [16:09:09] (CR) BBlack: [C: +1] "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/928620 (owner: 10Arturo Borrero Gonzalez) [16:12:47] (03PS2) 10JHathaway: dev env: allow setting $site via an env var [puppet] - 10https://gerrit.wikimedia.org/r/928663 (https://phabricator.wikimedia.org/T337972) [16:12:55] (03CR) 10JHathaway: dev env: allow setting $site via an env var (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928663 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:14:02] (03CR) 10JHathaway: [C: 03+2] dev env: add an insetup role for container builds [puppet] - 10https://gerrit.wikimedia.org/r/928670 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:14:04] (03CR) 10Cathal Mooney: [C: 03+2] Revert "Revert "wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers"" [dns] - 10https://gerrit.wikimedia.org/r/928620 (owner: 10Arturo Borrero Gonzalez) [16:14:17] (03CR) 10JHathaway: [C: 03+2] "thanks for reviewing" [puppet] - 10https://gerrit.wikimedia.org/r/928670 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:14:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P49392 and previous config saved to /var/cache/conftool/dbconfig/20230609-161440-ladsgroup.json [16:15:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:41] (03CR) 10JHathaway: [C: 03+2] "thanks for reviewing" [puppet] - 10https://gerrit.wikimedia.org/r/928671 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:16:04] PROBLEM - SSH on restbase1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:16:29] (03CR) 10JHathaway: [V: 03+2] "thanks for reviewing" [puppet] - 10https://gerrit.wikimedia.org/r/928669 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:16:35] 
(03CR) 10JHathaway: [V: 03+2 C: 03+2] dev env: get_config support for dev [puppet] - 10https://gerrit.wikimedia.org/r/928669 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:20:13] (03CR) 10JHathaway: dev env: add a basic puppet enc (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [16:20:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:23] !log powercycling restbase1028 [16:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:02] PROBLEM - confd service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:23:04] PROBLEM - cassandra-a service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:23:10] PROBLEM - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:23:10] PROBLEM - cassandra-b service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:23:24] PROBLEM - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:23:28] PROBLEM - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:23:28] PROBLEM - cassandra-c service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:23:38] RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 17613 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/RESTBase [16:23:46] RECOVERY - SSH on restbase1028 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:24:36] RECOVERY - confd service on restbase1028 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:26:06] RECOVERY - cassandra-a service on restbase1028 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:26:18] RECOVERY - cassandra-b service on restbase1028 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:26:38] RECOVERY - cassandra-c service on restbase1028 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:28:10] RECOVERY - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-a valid until 2024-08-30 21:25:17 +0000 (expires in 448 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:28:18] RECOVERY - cassandra-b CQL 10.64.0.210:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.210 port 9042 https://phabricator.wikimedia.org/T93886 [16:28:32] RECOVERY - cassandra-c CQL 10.64.0.211:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.211 port 9042 https://phabricator.wikimedia.org/T93886 [16:28:42] RECOVERY - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-c valid until 2024-08-30 
21:25:22 +0000 (expires in 448 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:28:54] RECOVERY - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-b valid until 2024-08-30 21:25:20 +0000 (expires in 448 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:29:02] RECOVERY - cassandra-a CQL 10.64.0.209:9042 on restbase1028 is OK: TCP OK - 0.001 second response time on 10.64.0.209 port 9042 https://phabricator.wikimedia.org/T93886 [16:29:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T336886)', diff saved to https://phabricator.wikimedia.org/P49393 and previous config saved to /var/cache/conftool/dbconfig/20230609-162946-ladsgroup.json [16:29:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:29:50] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:30:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2176.codfw.wmnet with reason: Maintenance [16:30:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T336886)', diff saved to https://phabricator.wikimedia.org/P49394 and previous config saved to /var/cache/conftool/dbconfig/20230609-163007-ladsgroup.json [16:34:32] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog-Deprecated, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Elitre) >>! In T261694#8906865, @Galessandroni wrote: > Hi. In Vikidia (an European Wikipedia for kids) we have sev... 
[16:45:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T336886)', diff saved to https://phabricator.wikimedia.org/P49395 and previous config saved to /var/cache/conftool/dbconfig/20230609-164644-ladsgroup.json [16:46:48] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:48:56] (03PS1) 10Papaul: Add snapshot101[6-7] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/928877 (https://phabricator.wikimedia.org/T334955) [16:49:43] (03CR) 10Papaul: [C: 03+2] Add snapshot101[6-7] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/928877 (https://phabricator.wikimedia.org/T334955) (owner: 10Papaul) [16:49:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS buster [16:54:06] (03PS1) 10Bartosz Dziewoński: Switch VisualEditor to not use RESTBase on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928879 (https://phabricator.wikimedia.org/T338388) [16:54:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations, 10Patch-For-Review: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with... 
[16:56:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations, 10Patch-For-Review: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) [17:01:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P49396 and previous config saved to /var/cache/conftool/dbconfig/20230609-170150-ladsgroup.json [17:02:05] (03PS1) 10Andrew Bogott: keystone: allow password auth for osstackcanary user [puppet] - 10https://gerrit.wikimedia.org/r/928880 [17:02:33] (03CR) 10CI reject: [V: 04-1] keystone: allow password auth for osstackcanary user [puppet] - 10https://gerrit.wikimedia.org/r/928880 (owner: 10Andrew Bogott) [17:04:17] (03PS2) 10Andrew Bogott: keystone: allow password auth for osstackcanary user [puppet] - 10https://gerrit.wikimedia.org/r/928880 (https://phabricator.wikimedia.org/T325773) [17:12:04] (03CR) 10Andrew Bogott: [C: 03+2] keystone: allow password auth for osstackcanary user [puppet] - 10https://gerrit.wikimedia.org/r/928880 (https://phabricator.wikimedia.org/T325773) (owner: 10Andrew Bogott) [17:12:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Papaul) @Jclark-ctr you started the reimage last night just after we did the puppet merge but we didn't run puppet on the apt server that might be the r... 
[17:16:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P49397 and previous config saved to /var/cache/conftool/dbconfig/20230609-171656-ladsgroup.json [17:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T336886)', diff saved to https://phabricator.wikimedia.org/P49398 and previous config saved to /var/cache/conftool/dbconfig/20230609-173202-ladsgroup.json [17:32:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:33:34] (Nonwrite HTTP requests with primary DB connections alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [17:34:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [17:34:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [17:45:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) Installing Buster on those 2 servers is giving the error below. Buster is not able to detect the driver and the controller used in those serve... 
[17:47:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host snapshot1016.eqiad.wmnet with OS buster [17:47:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS buster executed with e... [17:50:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host snapshot1016.eqiad.wmnet with OS bullseye [17:51:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host snapshot1016.eqiad.wmnet with OS bullseye [17:53:34] (Nonwrite HTTP requests with primary DB connections alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [17:58:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Product-Platform Operations: Q3:rack/setup/install snapshot1016 & snapshot1017 - https://phabricator.wikimedia.org/T334955 (10Papaul) I tried to install Bullseye on one of the nodes; it installed with no issues, so the problem is with our Debian Buster installer [18:07:06] (03PS1) 10Andrea Denisse: librenms: Change librenms path references for Debian package deployment [puppet] - 10https://gerrit.wikimedia.org/r/928890 (https://phabricator.wikimedia.org/T278309) [18:09:10] (03PS2) 10Andrea Denisse: librenms: Change librenms path references for Debian package deployment [puppet] - 10https://gerrit.wikimedia.org/r/928890 
(https://phabricator.wikimedia.org/T278309) [18:09:24] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/928680/41660/" [puppet] - 10https://gerrit.wikimedia.org/r/928680 (https://phabricator.wikimedia.org/T337388) (owner: 10Dzahn) [18:10:37] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "just CCed you guys to say "look, we are automating this" and to say "check out phabricator::logmail", that is what Andre might need a chan" [puppet] - 10https://gerrit.wikimedia.org/r/928680 (https://phabricator.wikimedia.org/T337388) (owner: 10Dzahn) [18:10:59] (03PS3) 10Andrea Denisse: librenms: Change librenms path references for Debian package deployment [puppet] - 10https://gerrit.wikimedia.org/r/928890 (https://phabricator.wikimedia.org/T278309) [18:17:32] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:18:47] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop on prod servers confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/928680 (https://phabricator.wikimedia.org/T337388) (owner: 10Dzahn) [18:20:32] (03PS4) 10Andrea Denisse: librenms: Change librenms path references for Debian package deployment [puppet] - 10https://gerrit.wikimedia.org/r/928890 (https://phabricator.wikimedia.org/T278309) [18:27:03] 10SRE, 10Wikimedia-Mailing-lists: Restore owners for WikiIT-l mailing list - https://phabricator.wikimedia.org/T338633 (10Legoktm) a:03Legoktm @Nemo_bis I've added you as an owner. Sannita is already an owner, but if it's the wrong address you should be able to fix it. Let me know if there's anything else yo... 
[18:28:41] 10SRE, 10Wikimedia-Mailing-lists: Restore owners for WikiIT-l mailing list - https://phabricator.wikimedia.org/T338633 (10M7) Comment: Nemo_bis (1) and Sannita (2) are the only people entitled by the it.wiki community to manage the Mailing list (3). Please check the administration status and restore their stat... [18:29:04] (03PS2) 10Hokwelum: Modify the global blocks script to accept output dir [puppet] - 10https://gerrit.wikimedia.org/r/928861 [18:30:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:20] (03CR) 10Dzahn: [C: 03+1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/928680/3 was deployed.this should now work :)" [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:36:35] (03PS2) 10Dzahn: Automate yearly Phabricator metrics for wikitech-l [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:37:02] (03CR) 10Dzahn: Automate yearly Phabricator metrics for wikitech-l [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:37:36] (03CR) 10Dzahn: [C: 03+1] "the bash code isn't all new, it's a copy of existing stuff that "just works" (tm)" [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:38:22] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/output/923367/41661/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:38:49] 
(03CR) 10Dzahn: [V: 03+1 C: 03+2] Automate yearly Phabricator metrics for wikitech-l [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:38:58] (03PS3) 10Hokwelum: Modify the global blocks script to accept output dir [puppet] - 10https://gerrit.wikimedia.org/r/928861 [18:40:43] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "on phab2002, it added the script and nothing else, on phab1004 it added the script and also the timer and service, as it should be, based " [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:44:44] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "Trigger: Mon 2024-01-01 00:00:00 UTC; 6 months 22 days left" [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:45:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:47:12] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "@Aklapper timer and everything looks good but one issue in the query: ERROR 1146 (42S02) at line 2: Table 'phabricator_maniphest.transacti" [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [18:50:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:56] (03CR) 10Herron: Add missing build dependencies for the Debian package (031 comment) [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: 10Andrea Denisse) [18:57:16] (03PS18) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [18:59:22] (03PS19) 10BCornwall: 
Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [18:59:30] (03CR) 10BCornwall: Create cookbook to upgrade Apache Traffic Server (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [19:05:09] (03PS1) 10Aklapper: Followup to "Automate yearly Phabricator metrics for wikitech-l" [puppet] - 10https://gerrit.wikimedia.org/r/928897 (https://phabricator.wikimedia.org/T337388) [19:05:38] (03CR) 10Aklapper: Automate yearly Phabricator metrics for wikitech-l (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [19:09:18] 10SRE, 10envoy, 10serviceops: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10BCornwall) [19:10:33] (03CR) 10Herron: [V: 03+1] service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [19:13:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10phaultfinder) [19:15:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:12] 10SRE, 10Wikimedia-Mailing-lists: Restore owners for WikiIT-l mailing list - https://phabricator.wikimedia.org/T338633 (10Nemo_bis) Thank you. I've removed the two extra owners (which seem to be the email addresses of two trusted it.wiki sysops) and contacted them separately to check their accounts have not be... 
[19:16:35] 10SRE, 10Wikimedia-Mailing-lists: Restore owners for WikiIT-l mailing list - https://phabricator.wikimedia.org/T338633 (10Nemo_bis) 05Open→03Resolved p:05Triage→03High [19:17:49] (03PS1) 10BCornwall: Remove leftover TODO item [dns] - 10https://gerrit.wikimedia.org/r/928900 (https://phabricator.wikimedia.org/T309074) [19:18:14] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10phaultfinder) [19:19:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:47] (03PS1) 10BCornwall: Add mastadon.wikimedia.org domain [dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) [19:27:08] 10SRE, 10DNS, 10Domains, 10Traffic, 10Patch-For-Review: Update DNS records for mastodon.wikimedia.org - https://phabricator.wikimedia.org/T337586 (10BCornwall) 05Open→03In progress p:05Triage→03Low a:03BCornwall [19:27:43] (03CR) 10CI reject: [V: 04-1] Add mastadon.wikimedia.org domain [dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: 10BCornwall) [19:29:54] (03PS2) 10BCornwall: Add mastadon.wikimedia.org domain [dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) [19:41:51] (03PS3) 10JHathaway: dev env: allow setting $site via an env var [puppet] - 10https://gerrit.wikimedia.org/r/928663 (https://phabricator.wikimedia.org/T337972) [19:44:24] (03PS2) 10JHathaway: dev env: don't pull firewall rules from etcd [puppet] - 10https://gerrit.wikimedia.org/r/928662 (https://phabricator.wikimedia.org/T337972) [19:45:26] (03PS2) 10JHathaway: dev env: don't manage resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/928661 (https://phabricator.wikimedia.org/T337972) [19:46:59] (03CR) 10CI reject: [V: 04-1] dev env: don't manage resolv.conf [puppet] - 
10https://gerrit.wikimedia.org/r/928661 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [19:48:59] (03PS2) 10JHathaway: dev env: avoid kernel tweaks when in a container [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) [19:50:33] (03CR) 10CI reject: [V: 04-1] dev env: avoid kernel tweaks when in a container [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [19:51:33] (03PS4) 10JHathaway: dev env: nrpe listen on all interfaces in a container [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) [19:57:14] (03CR) 10Dzahn: [C: 03+2] Followup to "Automate yearly Phabricator metrics for wikitech-l" [puppet] - 10https://gerrit.wikimedia.org/r/928897 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [19:57:29] (03PS1) 10JHathaway: dev env: make container facts structured [puppet] - 10https://gerrit.wikimedia.org/r/928903 (https://phabricator.wikimedia.org/T337972) [19:58:04] (03CR) 10CI reject: [V: 04-1] dev env: make container facts structured [puppet] - 10https://gerrit.wikimedia.org/r/928903 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [19:59:24] (03PS2) 10JHathaway: dev env: make container facts structured [puppet] - 10https://gerrit.wikimedia.org/r/928903 (https://phabricator.wikimedia.org/T337972) [19:59:41] (03CR) 10Dzahn: [C: 03+2] "the error is gone! unrelatedly some numbers appear to be missing though:" [puppet] - 10https://gerrit.wikimedia.org/r/928897 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [20:00:08] 10SRE, 10Wikimedia-Mailing-lists: Restore owners for WikiIT-l mailing list - https://phabricator.wikimedia.org/T338633 (10Aklapper) (NB also non-public T337757, just in case that there might be a possible relation.) [20:02:39] (03CR) 10Milimetric: "heh, problem is, there is no mediawiki_history_reduced_2023_05 datasource... 
the indexing task seems to run fine but nothing. I'm going t" [puppet] - 10https://gerrit.wikimedia.org/r/928558 (owner: 10Btullis) [20:04:04] (03PS1) 10Milimetric: Revert "Bump mediawiki_history_reduced version for aqs" [puppet] - 10https://gerrit.wikimedia.org/r/928927 [20:04:07] (03CR) 10Dzahn: [C: 04-1] "I don't think we can point .wikimedia.org subdomains to external IPs. This discussion has been had several times in the past. Just one iss" [dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: 10BCornwall) [20:04:09] (03CR) 10Milimetric: [C: 03+1] Revert "Bump mediawiki_history_reduced version for aqs" [puppet] - 10https://gerrit.wikimedia.org/r/928927 (owner: 10Milimetric) [20:04:27] (03CR) 10CI reject: [V: 04-1] Revert "Bump mediawiki_history_reduced version for aqs" [puppet] - 10https://gerrit.wikimedia.org/r/928927 (owner: 10Milimetric) [20:07:24] (03CR) 10Dzahn: [C: 04-1] "It feels to me like this whole issue keeps coming back at least every other year for the last decade. There is way more to it than might b" [dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: 10BCornwall) [20:15:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:13] (03CR) 10Dzahn: [C: 04-2] "strongly recommend that comms reaches out to SRE management with the overall plan for this project to determine what is possible and what " [dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: 10BCornwall) [20:17:17] (03CR) 10Bartosz Dziewoński: "Is this supposed to be "mastadon" or "mastodon"?" 
[dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: 10BCornwall) [20:18:10] (03CR) 10Andrea Denisse: [C: 03+1] Remove leftover TODO item [dns] - 10https://gerrit.wikimedia.org/r/928900 (https://phabricator.wikimedia.org/T309074) (owner: 10BCornwall) [20:19:14] (03CR) 10Btullis: [C: 03+2] Revert "Bump mediawiki_history_reduced version for aqs" [puppet] - 10https://gerrit.wikimedia.org/r/928927 (owner: 10Milimetric) [20:20:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:10] (03PS2) 10Btullis: Revert "Bump mediawiki_history_reduced version for aqs" [puppet] - 10https://gerrit.wikimedia.org/r/928927 (owner: 10Milimetric) [20:20:51] (03PS2) 10Andrea Denisse: Add missing build dependencies for the Debian package [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) [20:21:33] (03CR) 10Andrea Denisse: Add missing build dependencies for the Debian package (031 comment) [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) (owner: 10Andrea Denisse) [20:22:39] (03CR) 10Dzahn: [C: 04-2] "additionally there wasn't a response yet to the very legit comment that "social" would be much more in line with other existing services, " [dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: 10BCornwall) [20:23:26] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs [20:24:24] RECOVERY - aqs endpoints health on aqs2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [20:24:40] (03CR) 10Dzahn: [C: 04-2] "was there any discussion about hosting externally vs internally?" 
[dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: 10BCornwall) [20:29:04] RECOVERY - aqs endpoints health on aqs2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [20:31:31] We can ignore these aqs endpoint alerts for now, they should clear up. It's an artifact of the way that the new cookbook to restart aqs is working. [20:34:45] (03Abandoned) 10BCornwall: Add mastadon.wikimedia.org domain [dns] - 10https://gerrit.wikimedia.org/r/928901 (https://phabricator.wikimedia.org/T337586) (owner: 10BCornwall) [20:34:58] (03CR) 10Dzahn: [C: 03+1] "we have the month parameter now, should now work" [puppet] - 10https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) (owner: 10Aklapper) [20:38:27] !log btullis@cumin1001 END (ERROR) - Cookbook sre.aqs.roll-restart-reboot (exit_code=97) rolling restart_daemons on A:aqs [20:41:39] (03CR) 10Dzahn: [C: 03+2] Automate quarterly Phabricator metrics for Tech Community Newsletter [puppet] - 10https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) (owner: 10Aklapper) [20:42:04] (03CR) 10Dzahn: [C: 03+2] "this one needs a manual rebase now" [puppet] - 10https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) (owner: 10Aklapper) [20:43:45] (03CR) 10Dzahn: [C: 03+2] Phabricator monthly email: Improve Differential user activity stats [puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: 10Aklapper) [20:45:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:33] (03CR) 10Dzahn: [C: 03+2] "tested and no error but the result is 0" [puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: 10Aklapper) [20:48:33] (03CR) 10Dzahn: [C: 03+2] "which is good news 
to me if true :)" [puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: 10Aklapper) [20:49:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:53:48] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [20:53:50] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1011.eqiad.wmnet with OS bullseye [20:53:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [20:53:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye [21:12:13] (03CR) 10Aklapper: Phabricator monthly email: Improve Differential user activity stats (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: 10Aklapper) [21:24:30] (03PS1) 10Aklapper: Followfollowup to "Automate yearly Phabricator metrics for wikitech-l" [puppet] - 10https://gerrit.wikimedia.org/r/928955 (https://phabricator.wikimedia.org/T337388) [21:27:24] (03CR) 10Dzahn: [C: 03+2] Followfollowup to "Automate yearly Phabricator metrics for wikitech-l" [puppet] - 10https://gerrit.wikimedia.org/r/928955 (https://phabricator.wikimedia.org/T337388) (owner: 10Aklapper) [21:30:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:16] (03PS3) 10Dzahn: Automate quarterly Phabricator metrics for Tech Community Newsletter [puppet] - 10https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) (owner: 10Aklapper) [21:43:04] (03CR) 10Dzahn: [C: 03+2] Automate quarterly Phabricator metrics for Tech Community Newsletter [puppet] - 10https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) (owner: 10Aklapper) [21:45:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:50] (03PS1) 10Dzahn: Revert "Automate quarterly Phabricator metrics for Tech Community Newsletter" [puppet] - 10https://gerrit.wikimedia.org/r/928929 [21:47:01] (03CR) 10Dzahn: "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resour" [puppet] - 10https://gerrit.wikimedia.org/r/928929 (owner: 10Dzahn) [21:47:10] (03CR) 10Dzahn: [C: 03+2] Revert "Automate quarterly Phabricator metrics for Tech Community Newsletter" [puppet] - 10https://gerrit.wikimedia.org/r/928929 (owner: 10Dzahn) [21:48:43] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "Automate quarterly Phabricator metrics for Tech Community Newsletter" [puppet] - 10https://gerrit.wikimedia.org/r/928929 (owner: 10Dzahn) [21:50:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye [21:50:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1011.eqiad.wmnet with OS bullseye [21:50:13] 10SRE, 10ops-eqiad, 10DC-Ops, 
10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err... [21:50:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye executed with err... [21:53:26] (03CR) 10Dzahn: "sorry, I don't feel qualified to review this and no offense, but it feels to me like a bit of over-engineering when the alternative is an " [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [22:15:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:32] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:19:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:20:56] (03CR) 10Thcipriani: "I'll try to find some reviewers from Infrastructure Foundations." 
[puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [22:30:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:59] 10SRE, 10MediaWiki-Authentication-and-authorization, 10MediaWiki-User-login-and-signup, 10MediaWiki-extensions-CentralAuth, and 2 others: Account creation attempt on mobile Wikipedia domain leads user to desktop Special:CentralLogin/complete, often in logged-out st... - https://phabricator.wikimedia.org/T335125 [22:35:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:38:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Papaul) @Jclark-ctr when you launch the reimage cookbook make sure you have at least a terminal console open to see what's going on on that server. bacup... [23:08:44] (03CR) 10Tacsipacsi: "Are there absolutely no beta cluster wikis that would have RESTBase set up correctly? 
The beta cluster is a staging environment, so it sho" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928879 (https://phabricator.wikimedia.org/T338388) (owner: 10Bartosz Dziewoński) [23:13:23] (03PS5) 10Andrea Denisse: librenms: Change librenms path references for Debian package deployment [puppet] - 10https://gerrit.wikimedia.org/r/928890 (https://phabricator.wikimedia.org/T278309) [23:15:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:16] (03CR) 10Bartosz Dziewoński: Switch VisualEditor to not use RESTBase on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928879 (https://phabricator.wikimedia.org/T338388) (owner: 10Bartosz Dziewoński) [23:19:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state