[00:00:31] RECOVERY - Check systemd state on puppetserver1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:09] PROBLEM - Check systemd state on matomo1002 is CRITICAL: CRITICAL - degraded: The following units failed: matomo-archiver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T348183)', diff saved to https://phabricator.wikimedia.org/P53482 and previous config saved to /var/cache/conftool/dbconfig/20231115-000545-arnaudb.json [00:06:00] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [00:07:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:17:53] (PS1) BCornwall: readme: Update repo location of varnishkafka [software/varnish/varnishkafka/testing] - https://gerrit.wikimedia.org/r/974289 [00:22:28] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1012.eqiad.wmnet with OS bullseye [00:26:26] urbanecm: fwiw CentralAuth only cares about the username of the user object passed to AuthManager::revokeAccessForUser [00:26:49] tgr: good to know. so, `new UserIdentityValue(0, 'TempAccountName' )` should work? 
[00:27:04] it should [00:27:17] noted [00:35:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:54] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973423 [00:38:56] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973423 (owner: TrainBranchBot) [00:56:56] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973423 (owner: TrainBranchBot) [01:04:29] (PS1) Eevans: install_server: aqs: prompt to accept partitioning [puppet] - https://gerrit.wikimedia.org/r/974290 (https://phabricator.wikimedia.org/T347738) [01:06:26] (CR) Eevans: [C: +2] install_server: aqs: prompt to accept partitioning [puppet] - https://gerrit.wikimedia.org/r/974290 (https://phabricator.wikimedia.org/T347738) (owner: Eevans) [01:10:49] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:33] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [01:22:12] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1012.eqiad.wmnet with OS bullseye [01:39:49] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T351279 (phaultfinder) [02:08:09] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:38:55] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [03:08:55] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:13:10] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 48.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:18:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 48.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:41:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy 
[03:46:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 41.96% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:53:55] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:07:18] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:25:01] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [04:58:13] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 76.93 ms [05:05:15] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2607:fb58:9000:7::2) [05:10:39] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 77.00 ms [06:45:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:45:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:53:02] (PS1) Ayounsi: Don't send debug logs to LibreNMS [homer/public] - https://gerrit.wikimedia.org/r/974472 (https://phabricator.wikimedia.org/T349362) 
[06:55:03] SRE, DBA, Infrastructure-Foundations, netops, and 3 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (ayounsi) Resolved→Open Thanks @Ladsgroup yeah some devices got way too verbose at sending debug logs and we don't use debug level logs for alerting so the ab... [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T0700) [07:04:31] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [07:05:05] (CR) Marostegui: "<3" [homer/public] - https://gerrit.wikimedia.org/r/974472 (https://phabricator.wikimedia.org/T349362) (owner: Ayounsi) [07:05:15] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING WARNING - Packet loss = 77%, RTA = 22.30 ms [07:05:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 983 [07:05:44] (PS1) Marostegui: Revert "Revert "pc2013: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/974229 [07:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:05:59] jouncebot: next [07:06:00] In 0 hour(s) and 54 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T0800) [07:06:18] (PS1) Marostegui: Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master"" [mediawiki-config] - https://gerrit.wikimedia.org/r/974230 [07:06:28] (CR) Marostegui: [C: +2] Revert "Revert "pc2013: Disable notifications"" [puppet] - https://gerrit.wikimedia.org/r/974229 (owner: Marostegui) [07:06:42] !log 
ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 983 [07:07:01] (CR) Marostegui: [C: +2] Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master"" [mediawiki-config] - https://gerrit.wikimedia.org/r/974230 (owner: Marostegui) [07:07:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 40934 [07:07:45] (Merged) jenkins-bot: Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master"" [mediawiki-config] - https://gerrit.wikimedia.org/r/974230 (owner: Marostegui) [07:08:56] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:08:59] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:974230|Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master""]] [07:10:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 40934 [07:10:25] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:974230|Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master""]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:10:29] !log marostegui@deploy2002 marostegui: Continuing with sync [07:14:35] (CR) Marostegui: [C: +1] Update the contact info for the wikireplica servers [puppet] - https://gerrit.wikimedia.org/r/973203 (https://phabricator.wikimedia.org/T345698) (owner: Btullis) [07:15:53] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:974230|Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master""]] (duration: 06m 53s) [07:16:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with 
reason: Reimage [07:16:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Reimage [07:17:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2013.codfw.wmnet with OS bookworm [07:19:37] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:33] ^ known [07:34:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2013.codfw.wmnet with reason: host reimage [07:34:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: mariadb::misc::analytics::backup [07:35:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: pybaltest [07:36:42] (PS1) Muehlenhoff: Switch pybaltest to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974475 (https://phabricator.wikimedia.org/T349619) [07:37:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2013.codfw.wmnet with reason: host reimage [07:39:15] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:53] (CR) Muehlenhoff: [C: +2] Switch pybaltest to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974475 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [07:47:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: pybaltest [07:48:44] (CR) Muehlenhoff: [C: +2] Apply Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/974211 (https://phabricator.wikimedia.org/T346039) (owner: Muehlenhoff) [07:48:55] RECOVERY - Router interfaces on 
cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:50:55] (PS1) Marostegui: Revert "Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master""" [mediawiki-config] - https://gerrit.wikimedia.org/r/974232 [07:51:10] (PS1) Marostegui: Revert "Revert "Revert "pc2013: Disable notifications""" [puppet] - https://gerrit.wikimedia.org/r/974233 [07:51:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2013.codfw.wmnet with OS bookworm [07:51:49] jouncebot: next [07:51:49] In 0 hour(s) and 8 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T0800) [07:52:31] (CR) Marostegui: [C: +2] Revert "Revert "Revert "pc2013: Disable notifications""" [puppet] - https://gerrit.wikimedia.org/r/974233 (owner: Marostegui) [07:52:41] (CR) Marostegui: [C: +2] Revert "Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master""" [mediawiki-config] - https://gerrit.wikimedia.org/r/974232 (owner: Marostegui) [07:53:25] (Merged) jenkins-bot: Revert "Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master""" [mediawiki-config] - https://gerrit.wikimedia.org/r/974232 (owner: Marostegui) [07:54:20] !log marostegui@deploy2002 Started scap: Backport for [[gerrit:974232|Revert "Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master"""]] [07:54:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:55:45] !log marostegui@deploy2002 marostegui: Backport for [[gerrit:974232|Revert "Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master"""]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [07:55:55] !log marostegui@deploy2002 marostegui: Continuing with sync [07:56:19] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:39] (CR) Muehlenhoff: "Looks good, there are two final nits inline (and the last a change to AQS yesterday, but this needs rebasing anyway)." [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [07:59:23] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [08:00:04] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. 
[08:00:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'depool db1127', diff saved to https://phabricator.wikimedia.org/P53483 and previous config saved to /var/cache/conftool/dbconfig/20231115-080033-arnaudb.json [08:01:14] !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:974232|Revert "Revert "Revert "ProductionServices.php: Promote pc2014 to pc3 master"""]] (duration: 06m 54s) [08:01:56] (CR) Filippo Giunchedi: [C: +2] oauth2_proxy: new module [puppet] - https://gerrit.wikimedia.org/r/973740 (https://phabricator.wikimedia.org/T331512) (owner: Filippo Giunchedi) [08:02:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: cassandra_dev [08:02:54] (CR) Filippo Giunchedi: [C: +2] thanos: add oidc support via oauth2-proxy [puppet] - https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) (owner: Filippo Giunchedi) [08:03:18] (PS1) Elukey: services: bump cpu limits and Docker images for cp instances [deployment-charts] - https://gerrit.wikimedia.org/r/974476 (https://phabricator.wikimedia.org/T348950) [08:05:34] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:05:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:07:19] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:07:28] (CR) Muehlenhoff: thanos: add oidc support via oauth2-proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/973741 
(https://phabricator.wikimedia.org/T331512) (owner: Filippo Giunchedi) [08:10:35] (CR) Filippo Giunchedi: [C: +2] thanos: add oidc support via oauth2-proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) (owner: Filippo Giunchedi) [08:11:05] (PS1) Filippo Giunchedi: hieradata: enable Thanos OIDC SSO [puppet] - https://gerrit.wikimedia.org/r/974477 (https://phabricator.wikimedia.org/T331512) [08:13:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:14:39] (PS1) Filippo Giunchedi: Clean up cn=ops from mod_auth_cas config for o11y [puppet] - https://gerrit.wikimedia.org/r/974478 [08:14:46] (PS1) Muehlenhoff: Switch cassandra-dev to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974479 (https://phabricator.wikimedia.org/T349619) [08:15:39] (CR) Muehlenhoff: [C: +1] "Thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/974478 (owner: Filippo Giunchedi) [08:18:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:18:41] (PS71) Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) [08:18:54] (CR) Filippo Giunchedi: [C: +2] Clean up cn=ops from mod_auth_cas config for o11y [puppet] - https://gerrit.wikimedia.org/r/974478 (owner: Filippo Giunchedi) [08:19:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:19:42] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:20:04] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:20:20] (CR) Muehlenhoff: [C: +2] Switch cassandra-dev to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974479 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [08:21:40] (CR) Filippo Giunchedi: [C: +2] hieradata: enable Thanos OIDC SSO [puppet] - 
https://gerrit.wikimedia.org/r/974477 (https://phabricator.wikimedia.org/T331512) (owner: Filippo Giunchedi) [08:21:56] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:16] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:23:33] (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/473/con" [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [08:24:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:26:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:26:24] (PS4) MVernon: swift: migrate one node to envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) [08:26:28] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:26:45] (Traffic on tunnel link) firing: Alert for device cr4-ulsfo.wikimedia.org - Traffic on 
tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [08:27:05] (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: MVernon) [08:27:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cassandra_dev [08:30:27] (CR) Brouberol: "Latest diff: https://phabricator.wikimedia.org/P53293" [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [08:32:59] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: MVernon) [08:35:27] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:44] !log rolling restart of Cassandra in cassandra-dev following migration to Puppet 7 [08:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:45] (Traffic on tunnel link) resolved: Device cr4-ulsfo.wikimedia.org recovered from Traffic on tunnel link - https://alerts.wikimedia.org/?q=alertname%3DTraffic+on+tunnel+link [08:42:34] (CR) Muehlenhoff: [C: +1] "Looks good! 
I also doublechecked the latest diff, so let's merge before someone changes another config in the old-style config :-)" [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [08:46:41] (PS1) Brouberol: Define a wmflib function to compute the mask representation of a CIDR [puppet] - https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) [08:47:28] (CR) CI reject: [V: -1] Define a wmflib function to compute the mask representation of a CIDR [puppet] - https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [08:48:21] PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:27] (PS2) Brouberol: Define a wmflib function to compute the mask representation of a CIDR [puppet] - https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) [08:49:14] (CR) CI reject: [V: -1] Define a wmflib function to compute the mask representation of a CIDR [puppet] - https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [08:49:52] (Abandoned) Hashar: Second submission for realloc misusage. [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/708094 (owner: R4q3NWnUx2CEhVyr) [08:52:40] (CR) Brouberol: [C: +2] Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [08:53:27] (CR) Brouberol: [C: +2] "Thanks everyone for bearing with me while I level up on puppet, and for your thorough reviews!" 
[puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [08:58:41] (CR) Muehlenhoff: [C: +1] Generate the netboot.cfg file to avoid typos impacting everyone (1 comment) [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol) [09:00:05] jeena and jnuche: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T0900). [09:09:37] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev [09:10:01] (PS7) JMeybohm: Update api-gateway for cert-manager support [deployment-charts] - https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) [09:18:23] (PS1) JMeybohm: Add kubernetes2054 to codfw.k8s [homer/public] - https://gerrit.wikimedia.org/r/974484 (https://phabricator.wikimedia.org/T348436) [09:18:57] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [09:19:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup_noferm [09:21:05] (PS1) Muehlenhoff: Switch insetup_noferm to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974485 (https://phabricator.wikimedia.org/T349619) [09:21:57] (CR) Muehlenhoff: [C: +2] Switch insetup_noferm to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/974485 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [09:26:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup_noferm [09:27:31] (PS1) Elukey: profile::thanos: improve istio sli recording rule [puppet] - https://gerrit.wikimedia.org/r/974486 [09:27:49] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate 
roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [09:33:11] sre-alert-triage, Data-Platform-SRE: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (Gehel) p:Triage→High [09:33:21] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:34:01] (PS9) Ilias Sarantopoulos: team-ml: add alert for memory spike in inf services [alerts] - https://gerrit.wikimedia.org/r/963724 (https://phabricator.wikimedia.org/T346151) [09:36:33] (PS1) Giuseppe Lavagetto: admin: add deploy function to my bashrc [puppet] - https://gerrit.wikimedia.org/r/974487 [09:36:37] (CR) Slyngshede: NTP: alert on ntp/time errors (1 comment) [alerts] - https://gerrit.wikimedia.org/r/973306 (owner: Slyngshede) [09:37:31] !log imported php-igbinary 3.2.1+2.0.8-2+wmf1+bullseye1 to component/php74 for bullseye-wikimedia [09:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:48] (CR) Slyngshede: NTP: alert on ntp/time errors (1 comment) [alerts] - https://gerrit.wikimedia.org/r/973306 (owner: Slyngshede) [09:40:15] (CR) Slyngshede: [C: +2] Ensure that build directories are cleaned up [software/puppet-compiler] - https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: Slyngshede) [09:40:24] (CR) Giuseppe Lavagetto: [C: +2] admin: add deploy function to my bashrc [puppet] - https://gerrit.wikimedia.org/r/974487 (owner: Giuseppe Lavagetto) [09:42:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:cassandra-dev [09:43:02] SRE, SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - 
https://phabricator.wikimedia.org/T350918 (MatthewVernon) This is a request for access to `analytics-privatedata-users` now, which requires approval from @od... [09:43:42] (CR) Majavah: [C: +2] hieradata: migrate all cloudlb hosts to nftables [puppet] - https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087) (owner: Majavah) [09:43:51] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:08] (Merged) jenkins-bot: Ensure that build directories are cleaned up [software/puppet-compiler] - https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: Slyngshede) [09:53:37] (CR) Filippo Giunchedi: NTP: alert on ntp/time errors (1 comment) [alerts] - https://gerrit.wikimedia.org/r/973306 (owner: Slyngshede) [10:04:47] <_joe_> jouncebot: next [10:04:47] In 0 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T1100) [10:04:56] <_joe_> jouncebot: now [10:04:56] For the next 0 hour(s) and 55 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T0900) [10:05:04] <_joe_> oh the train [10:05:19] <_joe_> jnuche: is the train rolling today? 
if so, let me know when you're done [10:10:21] _joe_: hi, the train runs on US time this week, you can go ahead [10:10:32] <_joe_> ack thanks [10:14:29] (03PS2) 10Slyngshede: NTP: alert on ntp/time errors [alerts] - 10https://gerrit.wikimedia.org/r/973306 [10:22:13] (03PS19) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [10:22:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-api-int: double the number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/973183 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [10:22:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: miscweb [10:22:52] PROBLEM - Check systemd state on titan1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:58] (03CR) 10Slyngshede: NTP: alert on ntp/time errors (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/973306 (owner: 10Slyngshede) [10:23:12] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:23:26] PROBLEM - thanos.wikimedia.org requires authentication on titan1002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:23:26] (03Merged) 10jenkins-bot: mw-api-int: double the number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/973183 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [10:23:56] (03PS2) 10Btullis: Temporarily disable the production jobs that write to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/974172 (https://phabricator.wikimedia.org/T284150) [10:24:07] (ProbeDown) firing: Service thanos-query:443 has failed 
probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:09] (03PS2) 10Btullis: Re-enable the production pipelines that write to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/974173 (https://phabricator.wikimedia.org/T284150) [10:25:22] PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 403 Forbidden https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:25:26] RECOVERY - Check systemd state on titan1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:04] (03PS20) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [10:27:57] (03Abandoned) 10Hashar: (DO NOT SUBMIT) testing for CI (PS2) [puppet] - 10https://gerrit.wikimedia.org/r/973775 (owner: 10Hashar) [10:28:58] (03PS1) 10Muehlenhoff: Switch miscweb to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974495 (https://phabricator.wikimedia.org/T349619) [10:29:07] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:53] (03PS1) 10Elukey: profile::pyrra::filesystem: improve/fix lift wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974496 (https://phabricator.wikimedia.org/T302995) [10:30:50] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:31:29] !log oblivian@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/mw-api-int: apply [10:31:31] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:31:45] PROBLEM - thanos.wikimedia.org requires authentication on titan2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 403 Forbidden https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:34:26] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:34:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (10Clement_Goubert) @Eevans Could you take a look at this please, ditto {T349876}? [10:34:54] (03CR) 10Aqu: "Thanks for the review Filippo." [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [10:34:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349876 (10Clement_Goubert) @Eevans Could you take a look at this please? We should probably change https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions so that... 
[10:36:36] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974497 (https://phabricator.wikimedia.org/T351197) [10:37:04] (03CR) 10Tchanders: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974497 (https://phabricator.wikimedia.org/T351197) (owner: 10Kosta Harlan) [10:37:55] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974497 (https://phabricator.wikimedia.org/T351197) (owner: 10Kosta Harlan) [10:39:01] <_joe_> !log roll restart of mobileapps in codfw and eqiad [10:39:02] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: sync [10:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:47] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: sync [10:39:48] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync [10:39:54] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync [10:40:22] (03CR) 10Muehlenhoff: [C: 03+2] Switch miscweb to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974495 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:40:37] !log tchanders@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:40:57] (03PS1) 10Filippo Giunchedi: thanos: disable auth_cas when running in OIDC SSO mode [puppet] - 10https://gerrit.wikimedia.org/r/974498 (https://phabricator.wikimedia.org/T331512) [10:41:09] !log tchanders@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:41:58] (03CR) 10Btullis: [C: 03+2] Temporarily disable the production jobs that write to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/974172 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [10:42:16] (03CR) 10Ayounsi: [C: 03+1] Add kubernetes2054 to codfw.k8s [homer/public] - 10https://gerrit.wikimedia.org/r/974484 
(https://phabricator.wikimedia.org/T348436) (owner: 10JMeybohm) [10:42:59] !log tchanders@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:43:19] PROBLEM - thanos.wikimedia.org requires authentication on titan2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 403 Forbidden https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:44:03] !log tchanders@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:45:50] (03CR) 10Majavah: [C: 03+2] wiki-replicas: Update IP address for cloudcontrol1006 [puppet] - 10https://gerrit.wikimedia.org/r/964871 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [10:46:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: miscweb [10:46:31] (03CR) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [10:46:38] (03PS7) 10D3r1ck01: mc: Make it possible to use mcrouter server set by environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) [10:47:51] (03CR) 10D3r1ck01: "Krinkle, I'll schedule this for deploy tomorrow instead of today. A little bit unstable to do it today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01) [10:48:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) Well i have updated apt1001 to 8.2102.0-2~deb10u1 and i still see the problem so tha... 
[10:49:21] PROBLEM - Check systemd state on ganeti1010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:53] (03PS3) 10Slyngshede: P:url_downloader add blackbox exporter. [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) [10:50:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [10:51:15] (03PS21) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [10:52:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:52:21] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:53:17] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:56:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10MoritzMuehlenhoff) >>! In T351181#9333629, @jbond wrote: > Well i have updated apt1001 to 8... 
[10:57:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:58:32] (03PS22) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T1100) [11:00:43] (03PS1) 10Majavah: definitions: remove ntp.anycast [homer/public] - 10https://gerrit.wikimedia.org/r/974501 [11:00:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10MoritzMuehlenhoff) >>! In T351181#9333641, @MoritzMuehlenhoff wrote: >>>! In T351181#933362... [11:02:54] (03CR) 10Ayounsi: [C: 03+1] definitions: remove ntp.anycast [homer/public] - 10https://gerrit.wikimedia.org/r/974501 (owner: 10Majavah) [11:03:08] (03CR) 10Majavah: [C: 03+2] definitions: remove ntp.anycast [homer/public] - 10https://gerrit.wikimedia.org/r/974501 (owner: 10Majavah) [11:03:39] (03PS1) 10Muehlenhoff: Correct insetup role for lists2001 [puppet] - 10https://gerrit.wikimedia.org/r/974502 [11:03:42] (03Merged) 10jenkins-bot: definitions: remove ntp.anycast [homer/public] - 10https://gerrit.wikimedia.org/r/974501 (owner: 10Majavah) [11:05:17] (03CR) 10Btullis: [C: 03+2] Promote an-mariadb1001 to be the new primary for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/974167 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:05:23] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Clement_Goubert) [11:05:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - 
https://phabricator.wikimedia.org/T349873 (10Clement_Goubert) Hi, these will directly be used as kubernetes nodes. Current distribution of kubernetes nodes is |codfw row A|9| |codfw row B|12| |codfw row C|12| |codfw row D|... [11:05:37] (03CR) 10Btullis: [V: 03+1 C: 03+2] Use new mariadb server for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/972424 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:05:59] (PuppetFailure) firing: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:08:56] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:07] (03CR) 10Btullis: [C: 03+2] Switch datahub to use the new an-mariadb servers instead of an-coord [deployment-charts] - 10https://gerrit.wikimedia.org/r/972823 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:11:02] (03Merged) 10jenkins-bot: Switch datahub to use the new an-mariadb servers instead of an-coord [deployment-charts] - 10https://gerrit.wikimedia.org/r/972823 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:14:17] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [11:14:35] (03PS1) 10Arnaudb: mariadb: clone and upgrade mariadb [cookbooks] - 
10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) [11:15:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host thanos-fe2001.codfw.wmnet [11:15:51] (03CR) 10Muehlenhoff: [C: 03+2] Correct insetup role for lists2001 [puppet] - 10https://gerrit.wikimedia.org/r/974502 (owner: 10Muehlenhoff) [11:17:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Clement_Goubert) [11:17:26] (03PS1) 10Muehlenhoff: Switch thanos-fe2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974503 (https://phabricator.wikimedia.org/T349619) [11:17:29] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [11:18:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Clement_Goubert) a:05Clement_Goubert→03RobH [11:18:34] (03CR) 10Muehlenhoff: [C: 03+2] Switch thanos-fe2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974503 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:18:53] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [11:18:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Clement_Goubert) a:05Clement_Goubert→03RobH As in {T349873}, these nodes will be used directly as kubernetes nodes. Current distribution of kubernetes nodes in eqiad is |eqia... [11:19:40] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:20:09] !log btullis@cumin1001 END (ERROR) - Cookbook sre.druid.roll-restart-workers (exit_code=97) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[11:20:59] (PuppetFailure) resolved: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:21:02] !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [11:23:51] (03CR) 10Btullis: [C: 03+2] Re-enable the production pipelines that write to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/974173 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:24:24] !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [11:24:52] !log update cr*-{codfw,eqiad} firewall policy via homer to update cloudcontrol1006 addressing [11:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host thanos-fe2001.codfw.wmnet [11:27:24] PROBLEM - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:28:17] ^ ack - that's expected. 
I had meant to silence it before migrating [11:28:46] RECOVERY - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:30:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:33:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:34:12] PROBLEM - analytics-meta MySQL instance on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta [11:34:16] PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:35:34] RECOVERY - analytics-meta MySQL instance on an-coord1002 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta [11:35:38] RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:41:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:41:46] (03PS1) 10Jbond: apt1001: use ossl for rsyslog 
[puppet] - 10https://gerrit.wikimedia.org/r/974509 (https://phabricator.wikimedia.org/T351181) [11:42:15] (03PS1) 10Muehlenhoff: insetup::unowned: Fix name of firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/974510 [11:42:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/479/console" [puppet] - 10https://gerrit.wikimedia.org/r/974509 (https://phabricator.wikimedia.org/T351181) (owner: 10Jbond) [11:42:26] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:42:38] (03PS2) 10Jbond: apt1001: use ossl for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/974509 (https://phabricator.wikimedia.org/T351181) [11:43:17] (03CR) 10Brouberol: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [11:43:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/480/con" [puppet] - 10https://gerrit.wikimedia.org/r/974509 (https://phabricator.wikimedia.org/T351181) (owner: 10Jbond) [11:43:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:45:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974282 (owner: 10JHathaway) [11:45:18] (03CR) 10Muehlenhoff: [C: 03+2] insetup::unowned: Fix name of firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/974510 (owner: 10Muehlenhoff) [11:46:54] RECOVERY - Check systemd state on 
ganeti1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:26] (03PS2) 10Btullis: Promote an-mariadb1001 to be the new primary for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/974167 (https://phabricator.wikimedia.org/T284150) [11:47:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 63, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:47:42] (03CR) 10Btullis: [C: 03+2] Enable notifications for new analytics_meta hosts [puppet] - 10https://gerrit.wikimedia.org/r/974165 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:48:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::unowned [11:48:21] (03CR) 10Btullis: [C: 03+2] Promote an-mariadb1001 to be the new primary for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/974167 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:49:24] (03CR) 10Majavah: [C: 03+2] acme_chief: remove backwards compat [puppet] - 10https://gerrit.wikimedia.org/r/957721 (owner: 10Majavah) [11:50:42] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:50:56] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:51:45] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:52:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::unowned [11:53:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - 
https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:54:18] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/974472 (https://phabricator.wikimedia.org/T349362) (owner: 10Ayounsi) [11:55:00] (03CR) 10Jbond: [C: 04-1] "-1 is on the reload let's first try with the api. I'm hoping that is less disruptive than a reload which does some additional steps[1]" [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [11:55:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] apt1001: use ossl for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/974509 (https://phabricator.wikimedia.org/T351181) (owner: 10Jbond) [11:56:19] (03PS1) 10Btullis: Update the database host for superset-next [puppet] - 10https://gerrit.wikimedia.org/r/974512 (https://phabricator.wikimedia.org/T284150) [11:56:51] !log stevemunene@deploy2002 Started deploy [airflow-dags/wmde@91810bc]: (no justification provided) [11:57:01] !log stevemunene@deploy2002 Finished deploy [airflow-dags/wmde@91810bc]: (no justification provided) (duration: 00m 10s) [11:57:08] (03CR) 10Btullis: [C: 03+2] Update the database host for superset-next [puppet] - 10https://gerrit.wikimedia.org/r/974512 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [11:58:55] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:07:19] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:08:03] (03CR) 10Cathal Mooney: [C: 03+2] Adjust reimage cookbook config for DHCP binding clear 
workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [12:08:26] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) i have tested using openssl and that works so ill prepare a patch to switch all buster to openssl [12:08:31] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 20% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/974514 (https://phabricator.wikimedia.org/T348122) [12:09:13] (03PS2) 10Clément Goubert: trafficserver: move 20% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964448 (https://phabricator.wikimedia.org/T348122) [12:09:19] (03PS3) 10Clément Goubert: trafficserver: move 20% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964448 (https://phabricator.wikimedia.org/T348122) [12:12:19] (03Merged) 10jenkins-bot: Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [12:13:36] (03CR) 10Volans: "question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:15:04] (03PS1) 10Btullis: Clean up hadoop coordinator roles by removing analytics_meta DB [puppet] - 10https://gerrit.wikimedia.org/r/974516 (https://phabricator.wikimedia.org/T284150) [12:15:51] (03PS5) 10Majavah: Add wiki replica backends to conftool [puppet] - 10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) [12:15:53] (03PS5) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427) [12:15:55] (03PS11) 10Majavah: Add wiki replicas to cloudlb [puppet] - 
10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) [12:15:57] (03PS1) 10Majavah: wiki-replicas: document existing haproxy grants [puppet] - 10https://gerrit.wikimedia.org/r/974517 [12:15:59] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974516 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [12:19:24] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [12:19:35] (03CR) 10Cathal Mooney: [C: 03+2] Adjust reimage cookbook config for DHCP binding clear workaround (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [12:24:34] (03CR) 10Hnowlan: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 20% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/974514 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [12:25:32] (03CR) 10Marostegui: [C: 03+1] wiki-replicas: document existing haproxy grants [puppet] - 10https://gerrit.wikimedia.org/r/974517 (owner: 10Majavah) [12:26:46] (03CR) 10Majavah: [C: 03+2] wiki-replicas: document existing haproxy grants [puppet] - 10https://gerrit.wikimedia.org/r/974517 (owner: 10Majavah) [12:27:50] (03PS2) 10Arnaudb: mariadb: clone and upgrade mariadb [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) [12:28:33] (03PS1) 10Jbond: remote_syslog: force rsyslog-openssl on buster [puppet] - 10https://gerrit.wikimedia.org/r/974520 (https://phabricator.wikimedia.org/T351181) [12:29:21] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/974520 (https://phabricator.wikimedia.org/T351181) (owner: 10Jbond) [12:31:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS 
(CORE_DIFF 4 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/974520 (https://phabricator.wikimedia.org/T351181) (owner: 10Jbond) [12:33:18] (03CR) 10Jbond: [C: 03+1] "lgtm optional nits" [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [12:33:42] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bullseye [12:33:49] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye [12:37:14] (03PS2) 10Jbond: remote_syslog: force rsyslog-openssl on buster [puppet] - 10https://gerrit.wikimedia.org/r/974520 (https://phabricator.wikimedia.org/T351181) [12:39:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/974520 (https://phabricator.wikimedia.org/T351181) (owner: 10Jbond) [12:40:21] (03PS1) 10Clément Goubert: mediawiki: Fix rsyslog errorlog ruleset [deployment-charts] - 10https://gerrit.wikimedia.org/r/974521 (https://phabricator.wikimedia.org/T350430) [12:41:20] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:43:05] (03PS2) 10Majavah: cr-labs: permit cloudlb to wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/973769 (https://phabricator.wikimedia.org/T300427) [12:48:24] (03CR) 10Clément Goubert: [C: 03+1] wmnet: fix typo [dns] - 10https://gerrit.wikimedia.org/r/973746 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) 
[12:49:52] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [12:51:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/974286 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [12:52:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [12:52:54] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 20% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/974514 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [12:53:55] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 20% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/974514 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [12:54:25] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:54:26] (03CR) 10Marostegui: mariadb: clone and upgrade mariadb (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [12:54:41] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:55:02] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:55:44] (03PS5) 10Arnaudb: mariadb: clone and upgrade mariadb [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) [12:56:06] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [12:56:34] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [12:56:59] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:57:06] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:57:15] !log cgoubert@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-web: apply [12:57:24] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:57:36] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:57:41] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [12:57:49] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [12:58:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you for looking into this" [puppet] - 10https://gerrit.wikimedia.org/r/974520 (https://phabricator.wikimedia.org/T351181) (owner: 10Jbond) [12:58:53] (03PS6) 10Arnaudb: mariadb: clone and upgrade mariadb [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) [13:00:43] (03PS7) 10Arnaudb: mariadb: clone and upgrade mariadb [cookbooks] - 10https://gerrit.wikimedia.org/r/973424 (https://phabricator.wikimedia.org/T343674) [13:00:56] (03CR) 10Jbond: [C: 03+1] "lgtm minor nit on the doc string, thanks 😊" [puppet] - 10https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:01:02] (03CR) 10Hnowlan: [C: 03+2] wmnet: fix typo [dns] - 10https://gerrit.wikimedia.org/r/973746 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [13:05:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [13:05:49] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [13:10:51] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [13:13:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Jhancock.wm) I'll try but row A is low on space. 
thank you for your help! [13:13:21] (03PS5) 10Brouberol: Define a wmflib function to compute the mask representation of a CIDR [puppet] - 10https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) [13:13:23] (03PS8) 10Brouberol: Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) [13:13:25] (03CR) 10Brouberol: Define a wmflib function to compute the mask representation of a CIDR (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:14:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1001" [13:14:12] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2003.codfw.wmnet with OS bullseye [13:14:18] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye... 
[13:16:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/484/con" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:16:44] 10SRE, 10Infrastructure-Foundations, 10netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (10cmooney) p:05Triage→03High [13:16:57] (03CR) 10Ayounsi: [C: 03+1] cr-labs: permit cloudlb to wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/973769 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [13:17:05] !log resetting FPC1 card in cr1-esams which has a major error and gone offline (T351304) [13:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:10] T351304: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 [13:17:23] (03PS6) 10ArielGlenn: use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) [13:18:23] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [13:18:29] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/974472 (https://phabricator.wikimedia.org/T349362) (owner: 10Ayounsi) [13:18:33] (03CR) 10Jbond: "nice work some minor nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:18:41] (03PS9) 10Brouberol: Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) [13:18:47] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [13:19:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] remote_syslog: force rsyslog-openssl on buster [puppet] - 10https://gerrit.wikimedia.org/r/974520 (https://phabricator.wikimedia.org/T351181) (owner: 10Jbond) [13:19:14] ACKNOWLEDGEMENT - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn ayounsi https://phabricator.wikimedia.org/T351304 - The acknowledgement expires at: 2023-11-22 13:18:45. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:14] ACKNOWLEDGEMENT - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn ayounsi https://phabricator.wikimedia.org/T351304 - The acknowledgement expires at: 2023-11-22 13:18:45. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:26] (03CR) 10CI reject: [V: 04-1] Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:20:06] (03CR) 10Marostegui: "Once this is merged, please ping me on the task (or here) so I can truncate the table" [homer/public] - 10https://gerrit.wikimedia.org/r/974472 (https://phabricator.wikimedia.org/T349362) (owner: 10Ayounsi) [13:21:44] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:49] (03CR) 10Ladsgroup: [C: 03+1] use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [13:21:54] !log sfaci@deploy2002 Started deploy [airflow-dags/analytics_test@be05071]: Regular analytics weekly train [airflow/analytics_test@c203642a] [13:21:56] (03CR) 10Ayounsi: [C: 03+2] Don't send debug logs to LibreNMS [homer/public] - 10https://gerrit.wikimedia.org/r/974472 (https://phabricator.wikimedia.org/T349362) (owner: 10Ayounsi) [13:22:00] !log sfaci@deploy2002 Finished deploy [airflow-dags/analytics_test@be05071]: Regular analytics weekly train [airflow/analytics_test@c203642a] (duration: 00m 06s) [13:24:58] (03PS10) 10Brouberol: Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) [13:25:44] (03CR) 10Brouberol: "Addressed" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:25:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
nice" [alerts] - 10https://gerrit.wikimedia.org/r/973306 (owner: 10Slyngshede) [13:26:25] (03PS1) 10JMeybohm: Add kubernetes2054 to codfw wikikube cluster [puppet] - 10https://gerrit.wikimedia.org/r/974528 (https://phabricator.wikimedia.org/T34843) [13:26:46] (03PS4) 10Hnowlan: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823 [13:27:19] (03CR) 10CI reject: [V: 04-1] Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:28:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: etcd::v3::kubernetes [13:29:04] (03PS6) 10Brouberol: Define a wmflib function to compute the mask representation of a CIDR [puppet] - 10https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) [13:29:06] (03PS11) 10Brouberol: Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) [13:29:17] (03PS7) 10Jbond: Define a wmflib function to compute the mask representation of a CIDR [puppet] - 10https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:29:21] !log sfaci@deploy2002 Started deploy [airflow-dags/analytics@5a47584]: Regular analytics weekly train [airflow/analytics@5a475842] [13:29:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, let's try" [puppet] - 10https://gerrit.wikimedia.org/r/974496 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [13:29:49] !log sfaci@deploy2002 Finished deploy [airflow-dags/analytics@5a47584]: Regular analytics weekly train [airflow/analytics@5a475842] (duration: 00m 27s) [13:29:53] 10SRE, 10Infrastructure-Foundations, 10netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (10cmooney) Reset completed, the card came back up briefly but quickly failed again ` 
cmooney@re0.cr1-esams> show chassis fpc 1 detail Slot 1 information: State... [13:30:04] (03PS1) 10Muehlenhoff: Switch etcd::v3::kubernetes to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974529 (https://phabricator.wikimedia.org/T349619) [13:30:39] (03CR) 10Filippo Giunchedi: [C: 03+1] P:url_downloader add blackbox exporter. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [13:31:14] !log sfaci@deploy2002 Started deploy [airflow-dags/analytics_test@5a47584]: Regular analytics weekly train [airflow/analytics_test@5a475842] [13:31:23] (03PS8) 10Jbond: Define a wmflib function to compute the mask representation of a CIDR [puppet] - 10https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:31:29] !log sfaci@deploy2002 Finished deploy [airflow-dags/analytics_test@5a47584]: Regular analytics weekly train [airflow/analytics_test@5a475842] (duration: 00m 14s) [13:32:13] (03CR) 10Jbond: [C: 03+1] "nice work looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:32:24] (03CR) 10CI reject: [V: 04-1] Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:32:31] (03CR) 10Filippo Giunchedi: Send metrics from Airflow analytics test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [13:33:19] (03CR) 10Muehlenhoff: [C: 03+2] Switch etcd::v3::kubernetes to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974529 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:34:55] (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: improve/fix lift wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974496 (https://phabricator.wikimedia.org/T302995) 
(owner: 10Elukey) [13:35:32] (03CR) 10Brouberol: [C: 03+2] Define a wmflib function to compute the mask representation of a CIDR [puppet] - 10https://gerrit.wikimedia.org/r/974483 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:36:20] (03CR) 10Jbond: "fyi ci is failing because you are missing" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:38:21] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:39:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: etcd::v3::kubernetes [13:40:26] (03PS12) 10Brouberol: Automatically generate autoinstall subnet DHCP config files [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) [13:40:56] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:41:14] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:42:29] 10SRE, 10Infrastructure-Foundations, 10netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (10cmooney) Some logs following the issue of the "request chassis fpc online slot 1" command: {F41507770} [13:42:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:42:57] (03CR) 10Majavah: [C: 03+2] cr-labs: permit cloudlb to wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/973769 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [13:44:04] (03Merged) 10jenkins-bot: cr-labs: permit cloudlb to wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/973769 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [13:44:14] PROBLEM - Check systemd state on ganeti1033 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:22] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units 
failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:25] !log joal@deploy2002 Started deploy [analytics/refinery@3e9df5d]: Regular analytics weekly train - HOTFIX [analytics/refinery@3e9df5d8] [13:44:43] (03PS1) 10Arnaudb: decommission: removing db1127 from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/973425 (https://phabricator.wikimedia.org/T351063) [13:45:03] (03CR) 10Cathal Mooney: [C: 03+1] "This is ok for now. I think in general the cloudlb should not load-balance to things outside the cloud realm, but this should be consider" [homer/public] - 10https://gerrit.wikimedia.org/r/973769 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [13:48:31] (03CR) 10Brouberol: Automatically generate autoinstall subnet DHCP config files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:48:38] (03PS1) 10Tchanders: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974532 (https://phabricator.wikimedia.org/T351300) [13:48:41] 10SRE, 10Infrastructure-Foundations, 10netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (10ayounsi) JTAC case 2023-1115-011066 opened. 
[13:48:48] 10SRE, 10Infrastructure-Foundations, 10netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (10ayounsi) a:03ayounsi [13:49:41] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/485/con" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:49:52] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:50:58] !log deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/973769/ core routers [13:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:41] (03CR) 10Majavah: [C: 03+1] use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn) [13:52:41] !log joal@deploy2002 Finished deploy [analytics/refinery@3e9df5d]: Regular analytics weekly train - HOTFIX [analytics/refinery@3e9df5d8] (duration: 08m 16s) [13:53:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add kubernetes2054 to codfw wikikube cluster [puppet] - 10https://gerrit.wikimedia.org/r/974528 (https://phabricator.wikimedia.org/T34843) (owner: 10JMeybohm) [13:54:14] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [13:54:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) 05Open→03Resolved a:03jbond i have rolled out a change so that buster machines use openss... 
[13:54:48] (03CR) 10Kosta Harlan: [C: 03+1] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974532 (https://phabricator.wikimedia.org/T351300) (owner: 10Tchanders) [13:54:52] (03PS2) 10Muehlenhoff: Add kubernetes2054 to codfw wikikube cluster [puppet] - 10https://gerrit.wikimedia.org/r/974528 (https://phabricator.wikimedia.org/T34843) (owner: 10JMeybohm) [13:54:55] (03PS1) 10Cathal Mooney: Remove includes for subnets from cloud-support1-a-eqiad [dns] - 10https://gerrit.wikimedia.org/r/974534 (https://phabricator.wikimedia.org/T346947) [13:55:56] !log disable peering/transit on cr1-esams for linecard reboot - T346779 [13:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:01] T346779: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 [13:58:40] (03CR) 10Majavah: [C: 04-1] "the removal seems ok, but see inline" [dns] - 10https://gerrit.wikimedia.org/r/974534 (https://phabricator.wikimedia.org/T346947) (owner: 10Cathal Mooney) [13:58:50] (03CR) 10Brouberol: "With the `pcc-netboot.cfg.diff` file containing the diff related to the netboot.cfg diff, we see that most lines are just re-ordered, but " [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [13:59:51] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::control [13:59:52] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host thanos-be2001.codfw.wmnet [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T1400). [14:00:04] Daimona and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] o/ [14:00:18] hello [14:00:53] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [14:00:53] (03CR) 10Brouberol: Automatically generate autoinstall subnet DHCP config files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [14:02:22] (03PS1) 10Muehlenhoff: Switch thanos-be2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974535 (https://phabricator.wikimedia.org/T349619) [14:02:34] (03PS1) 10Jbond: wmcs::openstack::codfw1dev::control: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974536 (https://phabricator.wikimedia.org/T349619) [14:03:00] !log reboot fpc0 on cr1-esams - T346779 [14:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:07] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::control: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974536 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:03:10] T346779: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 [14:03:11] (03CR) 10Muehlenhoff: [C: 03+2] Switch thanos-be2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974535 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:03:12] !log joal@deploy2002 Started deploy [analytics/refinery@3e9df5d]: Regular analytics weekly train - HOTFIX [analytics/refinery@3e9df5d8] [14:03:14] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE, AS1299/IPv4: Idle - Telia, AS1299/IPv6: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:03:19] !log joal@deploy2002 Finished deploy [analytics/refinery@3e9df5d]: Regular analytics weekly train - HOTFIX [analytics/refinery@3e9df5d8] (duration: 00m 06s) [14:03:25] (03PS2) 10Muehlenhoff: 
Switch thanos-be2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974535 (https://phabricator.wikimedia.org/T349619) [14:03:28] Any deployers around? [14:03:32] (03CR) 10JMeybohm: [C: 03+2] Add kubernetes2054 to codfw.k8s [homer/public] - 10https://gerrit.wikimedia.org/r/974484 (https://phabricator.wikimedia.org/T348436) (owner: 10JMeybohm) [14:03:46] !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host an-druid1003.eqiad.wmnet with OS bullseye [14:03:48] (03CR) 10JMeybohm: [C: 03+2] Add kubernetes2054 to codfw wikikube cluster [puppet] - 10https://gerrit.wikimedia.org/r/974528 (https://phabricator.wikimedia.org/T34843) (owner: 10JMeybohm) [14:04:11] (03Merged) 10jenkins-bot: Add kubernetes2054 to codfw.k8s [homer/public] - 10https://gerrit.wikimedia.org/r/974484 (https://phabricator.wikimedia.org/T348436) (owner: 10JMeybohm) [14:04:55] I restarted the linecard on the wrong router... [14:05:00] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:19] Daimona: o/ I can deploy [14:05:39] XioNoX: does that explain why I don't seem to be able to reach any of our sites? 
[14:05:42] Many things down [14:05:46] PROBLEM - Host ncredir3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:47] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:05:47] PROBLEM - Host cr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:47] PROBLEM - Host doh3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:47] PROBLEM - Host doh3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:47] PROBLEM - Host install3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:49] (^) [14:05:52] PROBLEM - Host ncredir3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:58] PROBLEM - Host netflow3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:00] PROBLEM - Host durum3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:00] Ah yeah I can't deploy, then ;-) [14:06:02] PROBLEM - Host asw1-bw27-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:04] PROBLEM - Host asw1-by27-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:13] PROBLEM - Host cr2-esams #page is DOWN: PING CRITICAL - Packet loss = 100% [14:06:13] PROBLEM - Host durum3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:14] PROBLEM - Host prometheus3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:06:17] ouch [14:06:22] Uh, yeah, maybe not the best time to deploy stuff :O [14:06:34] let's depool esams? [14:06:35] RECOVERY - Host cr2-esams #page is UP: PING OK - Packet loss = 0%, RTA = 80.58 ms [14:06:43] vgutierrez: +1 [14:06:46] PROBLEM - Host asw1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:06:46] PROBLEM - Host asw1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:06:46] Just a switch reboot? 
[14:06:52] PROBLEM - Host ps1-by27-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:06:52] cr-2 is a core router [14:06:54] RECOVERY - Host asw1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 82.31 ms [14:06:54] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 78.51 ms [14:06:56] RECOVERY - Host install3003 is UP: PING OK - Packet loss = 0%, RTA = 78.70 ms [14:06:56] RECOVERY - Host doh3003 is UP: PING OK - Packet loss = 0%, RTA = 78.61 ms [14:06:56] RECOVERY - Host doh3004 is UP: PING OK - Packet loss = 0%, RTA = 78.49 ms [14:06:58] RECOVERY - Host asw1-bw27-esams is UP: PING OK - Packet loss = 0%, RTA = 82.62 ms [14:07:00] topranks: [14:07:04] PROBLEM - Host 2a02:ec80:300:2:185:15:59:34 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:06] PROBLEM - Host cp3067 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:06] PROBLEM - Host cp3069 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:06] PROBLEM - Host cp3071 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:06] PROBLEM - Host cp3075 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:06] PROBLEM - Host cp3073 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:06] PROBLEM - Host cp3077 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:07] PROBLEM - Host cp3079 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:07] vgutierrez: it is recovering now [14:07:07] PROBLEM - Host cp3081 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:08] PROBLEM - Host ganeti3005 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:08] doing [14:07:08] PROBLEM - Host ganeti3007 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:09] PROBLEM - Host lvs3009 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:09] PROBLEM - Host asw1-bw27-esams.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:07:10] PROBLEM - Host asw1-by27-esams.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:07:11] PROBLEM - Host cr2-esams.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:07:11] PROBLEM - Host scs-by27-esams is 
DOWN: PING CRITICAL - Packet loss = 100% [14:07:11] k [14:07:12] PROBLEM - Host re0.cr1-esams.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:07:12] PROBLEM - Host cp3066 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:12] looks like it's back faster than a depool is needed [14:07:12] PROBLEM - Host cp3068 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:13] PROBLEM - Host cp3070 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:13] PROBLEM - Host cp3072 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:14] PROBLEM - Host cp3074 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:14] PROBLEM - Host cp3076 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:15] PROBLEM - Host cp3078 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:15] PROBLEM - Host cp3080 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:16] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:16] PROBLEM - Host ganeti3008 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:17] PROBLEM - Host lvs3010 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:17] PROBLEM - Host lvs3008 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:18] PROBLEM - Host 2a02:ec80:300:1:185:15:59:2 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:20] PROBLEM - Host ps1-bw27-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:07:23] but many things may be messy? 
[14:07:28] PROBLEM - Host ripe-atlas-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:40] (03PS1) 10Ssingh: depool esams [dns] - 10https://gerrit.wikimedia.org/r/974537 [14:08:03] I can't reach text-lb.esams.wikimedia.org from here FWIW [14:08:05] (03PS1) 10Ayounsi: Depool esams [dns] - 10https://gerrit.wikimedia.org/r/974538 [14:08:09] (03CR) 10Ssingh: [V: 03+2 C: 03+2] depool esams [dns] - 10https://gerrit.wikimedia.org/r/974537 (owner: 10Ssingh) [14:08:12] https://gerrit.wikimedia.org/r/c/operations/dns/+/974538 [14:08:18] XioNoX: already merged by sukhe [14:08:19] XioNoX: running authdns-update [14:08:22] PROBLEM - Wikidough DoT Check -IPv6- on doh3004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [14:08:34] yeah, better doing it anyway [14:08:40] !log running authdns-update to depool esams [14:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:52] wow, missing PTR error [14:08:54] what a timing [14:09:03] (03CR) 10CI reject: [V: 04-1] Depool esams [dns] - 10https://gerrit.wikimedia.org/r/974538 (owner: 10Ayounsi) [14:09:04] RECOVERY - Host cp3078 is UP: PING OK - Packet loss = 0%, RTA = 78.25 ms [14:09:04] in past cases things don't end up super healthy [14:09:04] RECOVERY - Host cp3068 is UP: PING OK - Packet loss = 0%, RTA = 78.01 ms [14:09:04] RECOVERY - Host cp3074 is UP: PING OK - Packet loss = 0%, RTA = 78.00 ms [14:09:04] RECOVERY - Host durum3004 is UP: PING OK - Packet loss = 0%, RTA = 78.46 ms [14:09:06] RECOVERY - Host lvs3010 is UP: PING OK - Packet loss = 0%, RTA = 78.13 ms [14:09:06] RECOVERY - Host cp3070 is UP: PING OK - Packet loss = 0%, RTA = 78.06 ms [14:09:06] RECOVERY - Host cp3066 is UP: PING OK - Packet loss = 0%, RTA = 78.07 ms [14:09:06] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 79.67 ms [14:09:06] RECOVERY - Host ncredir3004 is UP: PING OK - Packet loss = 0%, RTA = 78.51 
ms [14:09:06] RECOVERY - Host cp3067 is UP: PING OK - Packet loss = 0%, RTA = 78.02 ms [14:09:07] RECOVERY - Host cp3072 is UP: PING OK - Packet loss = 0%, RTA = 78.09 ms [14:09:07] RECOVERY - Host cp3076 is UP: PING OK - Packet loss = 0%, RTA = 78.05 ms [14:09:08] RECOVERY - Host cp3080 is UP: PING OK - Packet loss = 0%, RTA = 78.04 ms [14:09:08] RECOVERY - Host cp3073 is UP: PING OK - Packet loss = 0%, RTA = 78.02 ms [14:09:09] RECOVERY - Host cp3079 is UP: PING OK - Packet loss = 0%, RTA = 78.06 ms [14:09:09] (03PS1) 10Cathal Mooney: De-pool esams due to edge router offline status [dns] - 10https://gerrit.wikimedia.org/r/974539 (https://phabricator.wikimedia.org/T351304) [14:09:10] RECOVERY - Host cp3069 is UP: PING OK - Packet loss = 0%, RTA = 78.05 ms [14:09:10] RECOVERY - Host cp3081 is UP: PING OK - Packet loss = 0%, RTA = 78.07 ms [14:09:10] RECOVERY - Host cp3077 is UP: PING OK - Packet loss = 0%, RTA = 78.01 ms [14:09:11] RECOVERY - Host cp3071 is UP: PING OK - Packet loss = 0%, RTA = 78.01 ms [14:09:11] RECOVERY - Host ganeti3008 is UP: PING OK - Packet loss = 0%, RTA = 78.10 ms [14:09:12] RECOVERY - Host ganeti3007 is UP: PING OK - Packet loss = 0%, RTA = 78.20 ms [14:09:12] RECOVERY - Host cp3075 is UP: PING OK - Packet loss = 0%, RTA = 78.04 ms [14:09:13] RECOVERY - Host lvs3009 is UP: PING OK - Packet loss = 0%, RTA = 79.05 ms [14:09:13] RECOVERY - Host ganeti3005 is UP: PING OK - Packet loss = 0%, RTA = 78.08 ms [14:09:14] RECOVERY - Host ps1-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 79.07 ms [14:09:14] ok then [14:09:14] RECOVERY - Host ps1-bw27-esams is UP: PING WARNING - Packet loss = 33%, RTA = 78.96 ms [14:09:15] RECOVERY - Host prometheus3003 is UP: PING OK - Packet loss = 0%, RTA = 78.49 ms [14:09:15] RECOVERY - Host ncredir3003 is UP: PING OK - Packet loss = 0%, RTA = 78.47 ms [14:09:20] RECOVERY - Host asw1-bw27-esams.mgmt is UP: PING OK - Packet loss = 0%, RTA = 78.47 ms [14:09:24] RECOVERY - Host durum3003 is UP: PING OK - 
Packet loss = 0%, RTA = 78.53 ms [14:09:26] RECOVERY - Host netflow3003 is UP: PING OK - Packet loss = 0%, RTA = 78.62 ms [14:09:26] I'm still unable to reach anything, fwiw [14:09:30] RECOVERY - Host asw1-by27-esams.mgmt is UP: PING OK - Packet loss = 0%, RTA = 78.44 ms [14:09:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host thanos-be2001.codfw.wmnet [14:09:36] taavi: same here [14:09:36] RECOVERY - Host lvs3008 is UP: PING OK - Packet loss = 0%, RTA = 78.02 ms [14:09:36] (ProbeDown) firing: (6) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:45] /tmp/dns-check.rudydea2/zones/netbox/4.64.10.in-addr.arpa [14:09:48] who was working on this? [14:09:52] (03Abandoned) 10Cathal Mooney: De-pool esams due to edge router offline status [dns] - 10https://gerrit.wikimedia.org/r/974539 (https://phabricator.wikimedia.org/T351304) (owner: 10Cathal Mooney) [14:09:52] we need to fix this to fix authdns-update [14:10:00] sukhe: hey that was me [14:10:02] PROBLEM - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:10:10] PROBLEM - Wikidough DoH Check -IPv6- on doh3004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [14:10:12] let's just merge https://gerrit.wikimedia.org/r/c/operations/dns/+/974534 [14:10:13] topranks: can you please fix or revert :) [14:10:20] text-lb is reachable again [14:10:20] (03CR) 10Cathal Mooney: [C: 03+2] Remove includes for subnets from cloud-support1-a-eqiad [dns] - 10https://gerrit.wikimedia.org/r/974534 (https://phabricator.wikimedia.org/T346947) (owner: 10Cathal Mooney) 
[14:10:21] enwiki is back for me now [14:10:22] got it [14:10:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:10:30] yep, back up for me too [14:10:33] ok then, topranks skipping it! [14:10:36] RECOVERY - Host 2a02:ec80:300:2:185:15:59:34 is UP: PING OK - Packet loss = 0%, RTA = 78.07 ms [14:10:36] taavi -1'd cos I quoted wrong task in bug [14:10:44] I'm merging now [14:10:52] yeah probes are recovering too [14:10:54] topranks: wait [14:10:54] (03PS2) 10Cathal Mooney: Remove includes for subnets from cloud-support1-a-eqiad [dns] - 10https://gerrit.wikimedia.org/r/974534 (https://phabricator.wikimedia.org/T346947) [14:10:58] RECOVERY - Wikidough DoT Check -IPv6- on doh3004 is OK: TCP OK - 0.161 second response time on 2a02:ec80:300:1:185:15:59:4 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [14:11:00] you will have my change as well then [14:11:01] Daimona: I'm happy to push this config, if you're ready? 
[14:11:13] (03PS1) 10Ssingh: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/974236 [14:11:14] sukhe: ok holding off [14:11:15] text-lb.esams.wikimedia.org seems to be back to life [14:11:18] I am reverting [14:11:24] RECOVERY - Wikidough DoH Check -IPv6- on doh3004 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.323 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [14:11:26] RECOVERY - Host 2a02:ec80:300:1:185:15:59:2 is UP: PING OK - Packet loss = 0%, RTA = 79.59 ms [14:11:26] RECOVERY - Host asw1-bw27-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.42 ms [14:11:28] RECOVERY - Host asw1-by27-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.35 ms [14:11:32] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams01_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:40] RECOVERY - Host cr2-esams.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.81 ms [14:11:40] RECOVERY - Host scs-by27-esams is UP: PING OK - Packet loss = 0%, RTA = 78.52 ms [14:11:40] RECOVERY - Host re0.cr1-esams.mgmt is UP: PING OK - Packet loss = 0%, RTA = 78.35 ms [14:11:47] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/974236 (owner: 10Ssingh) [14:11:51] NEL is still high [14:11:59] topranks: all yours! [14:12:11] XioNoX: topranks: can we get a summary of the issues with the core routers? [14:12:23] now going down, it was just monitoring lag probably [14:12:23] linecard broken, linecard on the other router restarted [14:12:28] jbond: ^^ [14:12:29] awight: yup, I'm ready, as long as the fire has been extinguished... 
:) [14:12:33] ahh thanks i missed that [14:12:36] (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:12:40] Daimona: ack [14:12:44] (ProbeDown) firing: (12) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:12:52] RECOVERY - Host ripe-atlas-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.80 ms [14:13:19] (03CR) 10Ssingh: [C: 03+1] Remove includes for subnets from cloud-support1-a-eqiad [dns] - 10https://gerrit.wikimedia.org/r/974534 (https://phabricator.wikimedia.org/T346947) (owner: 10Cathal Mooney) [14:13:35] Daimona: The commit message is about a different config variable than the patch changes. Can you confirm for me? [14:14:07] Oh dear. That's definitely wrong, thanks for spotting that [14:14:11] Let me quickly fix [14:14:13] (03PS3) 10Cathal Mooney: Remove includes for subnets from cloud-support1-a-eqiad [dns] - 10https://gerrit.wikimedia.org/r/974534 (https://phabricator.wikimedia.org/T346947) [14:14:13] ty [14:14:21] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:14:36] (ProbeDown) resolved: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:14:49] alright apologies about that, thanks to everyone who responded [14:14:53] (03PS2) 10Daimona Eaytoy: prod: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/974200 (https://phabricator.wikimedia.org/T347607) [14:14:54] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 79.49 ms [14:14:56] Fixed now! [14:15:22] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:15:25] the NEL alert is a bit lagged, I am guessing because it averages over a window [14:16:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:16:46] RECOVERY - Host cr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 79.58 ms [14:16:56] Daimona: It's okay that I don't understand, but I don't see this variable getting used anywhere. Is it for the Campaigns extension? [14:17:25] (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:17:33] It's used by the CampaignEvents extension: https://codesearch.wmcloud.org/deployed/?q=CampaignEventsEnableParticipantQuestions&files=&excludeFiles=&repos=mediawiki%2Fextensions%2FCampaignEvents [14:17:33] (ProbeDown) firing: (12) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:42] Daimona: ty! [14:17:52] deploying... 
[14:18:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974200 (https://phabricator.wikimedia.org/T347607) (owner: 10Daimona Eaytoy) [14:18:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host kubernetes2054.codfw.wmnet [14:18:22] um [14:18:29] are we really in a position to deploy already again? [14:18:30] (03PS3) 10Awight: prod: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974200 (https://phabricator.wikimedia.org/T347607) (owner: 10Daimona Eaytoy) [14:19:02] (03CR) 10Cathal Mooney: Remove includes for subnets from cloud-support1-a-eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/974534 (https://phabricator.wikimedia.org/T346947) (owner: 10Cathal Mooney) [14:19:12] Lucas_WMDE: I'm not sure, tbh. [14:19:13] (03CR) 10Andrew Bogott: [C: 03+2] Remove mentions of decom'd cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/973829 (https://phabricator.wikimedia.org/T351010) (owner: 10Andrew Bogott) [14:19:44] I think you're good [14:19:50] :+1: [14:19:56] sukhe: what's current status of esams in terms of DNS pooling? 
[14:20:11] (03CR) 10TrainBranchBot: "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974200 (https://phabricator.wikimedia.org/T347607) (owner: 10Daimona Eaytoy) [14:20:46] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage [14:20:52] (03Merged) 10jenkins-bot: prod: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974200 (https://phabricator.wikimedia.org/T347607) (owner: 10Daimona Eaytoy) [14:21:15] !log awight@deploy2002 Started scap: Backport for [[gerrit:974200|prod: Enable $wgCampaignEventsEnableParticipantQuestions (T347607)]] [14:21:17] (NELHigh) resolved: (2) Elevated Network Error Logging events (tcp.address_unreachable) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:21:19] T347607: Enable Participant Questions in production - https://phabricator.wikimedia.org/T347607 [14:22:21] sukhe: this line is in the local repo on dns1004 for instance [14:22:25] cmooney@dns1004:~$ grep esams /srv/authdns/git/admin_state [14:22:25] geoip/generic-map/esams => DOWN [14:22:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:22:39] !log awight@deploy2002 daimona and awight: Backport for [[gerrit:974200|prod: Enable $wgCampaignEventsEnableParticipantQuestions (T347607)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:22:47] topranks: ^ lumen is back up [14:22:49] Running authdns-update wants to remove that line. 
Which I think is ok but I'm not proceeding [14:22:55] Daimona: ready to test [14:23:05] !log joal@deploy2002 Started deploy [analytics/refinery@3e9df5d] (thin): Regular analytics weekly train - THIN - HOTFIX [analytics/refinery@3e9df5d8] [14:23:12] !log joal@deploy2002 Finished deploy [analytics/refinery@3e9df5d] (thin): Regular analytics weekly train - THIN - HOTFIX [analytics/refinery@3e9df5d8] (duration: 00m 07s) [14:23:15] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage [14:23:17] XioNoX: wow, I was not expecting anything to go our way today given luck so far :) [14:23:18] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:23:25] topranks: checking [14:23:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host kubernetes2054.codfw.wmnet [14:23:36] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10fgiunchedi) Thank you for looking into this and fixing the issue, I can confirm the errors I'm seeing... [14:23:36] HouseOfM: ^^ [14:23:38] !log joal@deploy2002 Started deploy [analytics/refinery@3e9df5d] (hadoop-test): Regular analytics weekly train - TEST - HOTFIX [analytics/refinery@3e9df5d8] [14:23:43] topranks: at least HEAD says the revert has been completed [14:23:52] I'm going to create an event now, can you try registering? [14:24:04] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:15] sukhe: yes, running authdns-update is gonna remove the DOWN for esams in the local repo [14:24:25] which I think is correct right? we're not depooling [14:24:28] are you running it? 
yes [14:24:29] and yes [14:24:37] ok I'll proceed. thanks for the sanity check :) [14:25:01] Here it is: https://test.wikipedia.org/wiki/Event:T347607 [14:25:07] I guess the local repo updated when that was in, but update failed cos of my bit [14:25:48] OK - authdns-update successful on all nodes! [14:25:55] thanks! [14:26:04] cmooney@dns1004:~$ grep esams /srv/authdns/git/admin_state [14:26:04] cmooney@dns1004:~$ [14:26:43] (03PS3) 10Andrew Bogott: Remove mentions of decom'd cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/973829 (https://phabricator.wikimedia.org/T351010) [14:26:52] !log joal@deploy2002 Finished deploy [analytics/refinery@3e9df5d] (hadoop-test): Regular analytics weekly train - TEST - HOTFIX [analytics/refinery@3e9df5d8] (duration: 03m 13s) [14:26:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10MoritzMuehlenhoff) I'll also open a separate task to eventually also move Bullseye and Bookworm hosts... [14:27:22] Daimona: I received an email FWIW, and it looks correct. Very exciting new feature! [14:27:34] (KubernetesCalicoDown) firing: kubernetes2054.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2054.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:28:01] @Daimona I got the registration email, but nothing else, did you send a custom email? [14:28:20] Were you asked to answer questions when you registered? 
[14:28:25] I was [14:29:03] I was not [14:29:23] Uhm [14:30:13] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 20% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/964448 (https://phabricator.wikimedia.org/T348122) (owner: 10Clément Goubert) [14:30:17] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye [14:30:24] RECOVERY - Check unit status of netbox_ganeti_esams01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:30:36] I can see them in https://test.wikipedia.org/wiki/Special:RegisterForEvent/138 [14:30:55] Sorry, my bad, chrome extension issues [14:30:58] we are good [14:31:11] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye [14:31:20] Nice :) [14:31:23] +1 great, continuing with deployment in that case [14:31:39] !log awight@deploy2002 daimona and awight: Continuing with sync [14:31:57] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10Ladsgroup) Thanks for the patch! I hope it'll make a dent, I'll monitor it. While I was monitoring it, I tried this: ` root@db1217.eqiad.wmnet[librenms]> select * f... [14:32:24] (03CR) 10Andrew Bogott: [C: 03+2] Remove mentions of decom'd cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/973829 (https://phabricator.wikimedia.org/T351010) (owner: 10Andrew Bogott) [14:32:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Clement_Goubert) Can you do 2 in A, 1 in B, 1 in C? [14:33:37] sergi0: Shall I begin deploying the GrowthExperiments config? [14:34:30] awight: sure. 
We won't be able to test much, it is a run toggle for a scheduled maintenance script [14:34:51] ah okay, I'll just check basic health on the test servers, then. [14:35:10] !log Raised mw-on-k8s to 20% of external traffic, rollout will happen over the next half hour - T348122 [14:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:15] T348122: Move 25% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T348122 [14:37:24] !log awight@deploy2002 Finished scap: Backport for [[gerrit:974200|prod: Enable $wgCampaignEventsEnableParticipantQuestions (T347607)]] (duration: 16m 09s) [14:37:28] T347607: Enable Participant Questions in production - https://phabricator.wikimedia.org/T347607 [14:37:38] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10MatthewVernon) [14:37:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974169 (https://phabricator.wikimedia.org/T308142) (owner: 10Sergio Gimeno) [14:38:44] (03Merged) 10jenkins-bot: GrowthExperiments: enable AddLink backend for 16,17th rounds of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974169 (https://phabricator.wikimedia.org/T308142) (owner: 10Sergio Gimeno) [14:38:56] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:11] !log awight@deploy2002 Started scap: Backport for [[gerrit:974169|GrowthExperiments: enable AddLink backend for 16,17th rounds of wikis (T308142 T308143)]] [14:39:18] T308142: Deploy "add a link" to 16th round of wikis - 
https://phabricator.wikimedia.org/T308142 [14:39:18] T308143: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 [14:40:18] (03PS1) 10Ottomata: Refinery job - bump jar versions for refine, test refine, and producecanaryevents [puppet] - 10https://gerrit.wikimedia.org/r/974545 (https://phabricator.wikimedia.org/T321854) [14:40:51] (03PS2) 10Ottomata: Refinery job - bump jar versions for refine and test refine [puppet] - 10https://gerrit.wikimedia.org/r/974545 (https://phabricator.wikimedia.org/T321854) [14:41:42] RECOVERY - Check systemd state on ganeti1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:54] !log awight@deploy2002 sgimeno and awight: Backport for [[gerrit:974169|GrowthExperiments: enable AddLink backend for 16,17th rounds of wikis (T308142 T308143)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:41:56] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:57] !log awight@deploy2002 sgimeno and awight: Continuing with sync [14:42:33] awight: thank you :) [14:42:34] (KubernetesCalicoDown) resolved: kubernetes2054.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2054.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:43:01] Daimona: Keep up the great work :-) [14:43:25] (03CR) 10Ottomata: [C: 03+2] Refinery job - bump jar versions for refine and test refine [puppet] - 10https://gerrit.wikimedia.org/r/974545 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [14:43:31] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 
(10Marostegui) @ayounsi okay to truncate that table? [14:44:57] (03PS1) 10Ottomata: canary_events - bump refinery-job to version to pick up retry logic [puppet] - 10https://gerrit.wikimedia.org/r/974600 (https://phabricator.wikimedia.org/T326002) [14:45:05] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: wmcs::openstack::codfw1dev::control [14:47:11] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1003.eqiad.wmnet with OS bullseye [14:47:27] !log awight@deploy2002 Finished scap: Backport for [[gerrit:974169|GrowthExperiments: enable AddLink backend for 16,17th rounds of wikis (T308142 T308143)]] (duration: 08m 16s) [14:47:33] T308142: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142 [14:47:34] T308143: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143 [14:47:38] sergi0: deployed! [14:48:28] awight: cool! Thank you for the assistance :) [14:50:13] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bullseye [14:50:20] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye [14:50:47] 10SRE, 10Infrastructure-Foundations, 10netops: FPC1 Failure on cr1-esams - https://phabricator.wikimedia.org/T351304 (10ayounsi) [14:51:08] (03CR) 10Brouberol: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/974516 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [14:51:20] (03CR) 10Ottomata: [C: 03+2] canary_events - bump refinery-job to version to pick up retry logic [puppet] - 10https://gerrit.wikimedia.org/r/974600 (https://phabricator.wikimedia.org/T326002) (owner: 10Ottomata) [14:51:40] (03CR) 10Marostegui: [C: 03+1] decommission: removing db1127 from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/973425 (https://phabricator.wikimedia.org/T351063) (owner: 10Arnaudb) [14:51:54] (03CR) 10Btullis: [C: 04-1] "Should we abandon this change then, or is there anything potentially useful from this change that we want to implement?" [puppet] - 10https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:52:06] (03CR) 10Btullis: [C: 03+2] Clean up hadoop coordinator roles by removing analytics_meta DB [puppet] - 10https://gerrit.wikimedia.org/r/974516 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [14:53:18] 10SRE-tools, 10Infrastructure-Foundations, 10homer: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415 (10ayounsi) FYI, this limitation is becoming more and more problematic for deploying a change to the whole infra. 
[14:53:56] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:59] (PuppetFailure) firing: Puppet has failed on ganeti1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:56:20] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: vrts [14:57:50] (03PS1) 10Jbond: policies/cr-labs: Add a rule so host can reach puppetservers [homer/public] - 10https://gerrit.wikimedia.org/r/974605 (https://phabricator.wikimedia.org/T349619) [14:57:58] (03CR) 10Stevemunene: Disable WMDE misc jobs on stat1007 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:58:46] (03PS1) 10Muehlenhoff: Switch vrts to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974606 (https://phabricator.wikimedia.org/T349619) [14:59:13] (03Abandoned) 10Stevemunene: Disable WMDE misc jobs on stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [14:59:43] (03CR) 10Muehlenhoff: [C: 03+2] Switch vrts to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974606 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:00:56] (03CR) 10Ayounsi: "Probably cleaner to merge it with the puppetmaster term (even rename it)." 
[homer/public] - 10https://gerrit.wikimedia.org/r/974605 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:01:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T1500) [15:01:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10Southparkfan) [15:01:44] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10Southparkfan) [15:02:20] (03PS2) 10Jbond: policies/cr-labs: Add a rule so host can reach puppetservers [homer/public] - 10https://gerrit.wikimedia.org/r/974605 (https://phabricator.wikimedia.org/T349619) [15:02:29] (03CR) 10Jbond: policies/cr-labs: Add a rule so host can reach puppetservers (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/974605 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:05:00] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [15:05:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: vrts [15:05:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10Southparkfan) Production migration from the gnutls driver to the openssl driver can be tracked in T324... 
[15:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:05:59] (PuppetFailure) resolved: Puppet has failed on ganeti1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:06:39] (03CR) 10Ayounsi: [C: 03+1] policies/cr-labs: Add a rule so host can reach puppetservers [homer/public] - 10https://gerrit.wikimedia.org/r/974605 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:06:51] (03CR) 10Jbond: [C: 03+2] policies/cr-labs: Add a rule so host can reach puppetservers [homer/public] - 10https://gerrit.wikimedia.org/r/974605 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:07:25] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [15:07:40] (03Merged) 10jenkins-bot: policies/cr-labs: Add a rule so host can reach puppetservers [homer/public] - 10https://gerrit.wikimedia.org/r/974605 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:08:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [15:09:43] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [15:09:51] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
cmooney@cumin1001 for host sretest1001.eqiad.wmnet with OS bullseye [15:12:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1024.eqiad.wmnet [15:13:02] (03CR) 10Arnaudb: [C: 03+2] decommission: removing db1127 from eqiad [puppet] - 10https://gerrit.wikimedia.org/r/973425 (https://phabricator.wikimedia.org/T351063) (owner: 10Arnaudb) [15:13:54] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1127.eqiad.wmnet [15:14:40] (03PS1) 10Muehlenhoff: Switch restbase1024 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974607 (https://phabricator.wikimedia.org/T349619) [15:15:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch restbase1024 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974607 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:16:02] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1012.eqiad.wmnet with OS bullseye [15:16:53] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10MoritzMuehlenhoff) [15:18:07] 10ops-esams: Merge cr3-esams (WMF10020) with cr1-esams (WMF4200) - https://phabricator.wikimedia.org/T351319 (10ayounsi) [15:18:38] 10SRE, 10Cloud-VPS, 10cloud-services-team, 10observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10MoritzMuehlenhoff) As part of the Puppet migration we already switched all Buster clients (where version of GNUTLS had problems with the new cert) toward... 
[15:19:37] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:21:39] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1127.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:22:41] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1127.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:22:41] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1127.eqiad.wmnet [15:23:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1024.eqiad.wmnet [15:25:21] (03CR) 10Ayounsi: Automatically generate autoinstall subnet DHCP config files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [15:25:48] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host durum1001.eqiad.wmnet [15:26:02] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [15:26:33] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission db1127.eqiad.wmnet - https://phabricator.wikimedia.org/T351063 (10ABran-WMF) [15:26:45] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:26:49] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff 
https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:26:56] arnaudb: ^ [15:27:08] (03PS1) 10Muehlenhoff: Switch durum1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974608 (https://phabricator.wikimedia.org/T349619) [15:27:08] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission db1127.eqiad.wmnet - https://phabricator.wikimedia.org/T351063 (10ABran-WMF) a:05ABran-WMF→03None [15:28:18] (03CR) 10Muehlenhoff: [C: 03+2] Switch durum1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974608 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:28:26] (03PS1) 10Hnowlan: api-gateway: specify config path [deployment-charts] - 10https://gerrit.wikimedia.org/r/974609 (https://phabricator.wikimedia.org/T324130) [15:28:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'depool db1127', diff saved to https://phabricator.wikimedia.org/P53485 and previous config saved to /var/cache/conftool/dbconfig/20231115-152836-arnaudb.json [15:28:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [15:28:58] (03PS1) 10Jbond: cfssl: ensure we b64decode the responses before dumping [puppet] - 10https://gerrit.wikimedia.org/r/974610 [15:29:46] (03CR) 10Jbond: [C: 03+2] cfssl: ensure we b64decode the responses before dumping [puppet] - 10https://gerrit.wikimedia.org/r/974610 (owner: 10Jbond) [15:31:38] 10ops-eqdfw: aqs1012: reseat SSD (/dev/sdh)? - https://phabricator.wikimedia.org/T351320 (10Eevans) [15:31:45] 10ops-eqdfw: aqs1012: reseat SSD (/dev/sdh)? 
- https://phabricator.wikimedia.org/T351320 (10Eevans) p:05Triage→03High [15:31:46] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:31:50] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:33:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host durum1001.eqiad.wmnet [15:36:18] (03PS1) 10Arnaudb: mariadb: decommission db1130 [puppet] - 10https://gerrit.wikimedia.org/r/974627 (https://phabricator.wikimedia.org/T351067) [15:37:59] (PuppetFailure) firing: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:38:21] 10ops-eqdfw: aqs1012: reseat SSD (/dev/sdh)? 
- https://phabricator.wikimedia.org/T351320 (10Eevans) [15:39:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host doh6001.wikimedia.org [15:40:08] (03PS1) 10Muehlenhoff: Switch doh6001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974612 (https://phabricator.wikimedia.org/T349619) [15:40:38] !log bounce prometheus@ops on prometheus4002 [15:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:14] (03CR) 10Muehlenhoff: [C: 03+2] Switch doh6001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974612 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:41:22] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:24] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:41:32] !log bounce prometheus-blackbox-exporter on prometheus4002 [15:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:15] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1106.eqiad.wmnet [15:43:15] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1106.eqiad.wmnet [15:43:58] 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (10Andrew) a:05Andrew→03None [15:44:51] !log swapped cp1106 <-> cp1081 (T349244) [15:44:54] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:56] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 
[15:44:58] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:45:59] (03CR) 10Ladsgroup: [C: 03+1] "I see it's depooled. Good to go I think." [puppet] - 10https://gerrit.wikimedia.org/r/974627 (https://phabricator.wikimedia.org/T351067) (owner: 10Arnaudb) [15:46:53] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1107.eqiad.wmnet [15:46:53] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1107.eqiad.wmnet [15:46:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host doh6001.wikimedia.org [15:47:59] (PuppetFailure) resolved: Puppet has failed on bast2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:48:06] !log swapped cp1107 <-> cp1082 (T349244) [15:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:35] (03CR) 10Arnaudb: [C: 03+2] mariadb: decommission db1130 [puppet] - 10https://gerrit.wikimedia.org/r/974627 (https://phabricator.wikimedia.org/T351067) (owner: 10Arnaudb) [15:49:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [15:49:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1130.eqiad.wmnet [15:54:50] (03PS1) 10Jbond: pki: correct paths [puppet] - 10https://gerrit.wikimedia.org/r/974613 (https://phabricator.wikimedia.org/T350688) [15:55:09] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:55:16] (03CR) 10Jbond: [C: 03+2] pki: correct paths [puppet] - 10https://gerrit.wikimedia.org/r/974613 (https://phabricator.wikimedia.org/T350688) (owner: 10Jbond) [15:56:16] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host lvs6003.drmrs.wmnet [15:56:47] 
(03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/487/con" [puppet] - 10https://gerrit.wikimedia.org/r/974613 (https://phabricator.wikimedia.org/T350688) (owner: 10Jbond) [15:57:14] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1130.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:58:03] (03PS1) 10Muehlenhoff: Switch lvs6003 to to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974614 (https://phabricator.wikimedia.org/T349619) [15:58:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1130.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:58:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:58:16] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1130.eqiad.wmnet [15:58:56] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:00:01] (03CR) 10Muehlenhoff: [C: 03+2] Switch lvs6003 to to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974614 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [16:00:04] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission db1130.eqiad.wmnet - https://phabricator.wikimedia.org/T351067 (10ABran-WMF) a:05ABran-WMF→03None [16:00:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.32% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:02:24] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10MatthewVernon) [16:02:37] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10MatthewVernon) ssh pubkey confirmed OOB. [16:02:41] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] Exclude bot accounts from applying antivandalism thresholds [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/970742 (https://phabricator.wikimedia.org/T350245) (owner: 10Aklapper) [16:02:57] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:02:59] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:03:50] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] Avoid trailing newline in qqq.json [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/969515 (https://phabricator.wikimedia.org/T294754) (owner: 10Pppery) [16:03:59] 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10Marostegui) 05Open→03Resolved Per my chat with Arzhel in irc, table truncated! 
`root@db1119.eqiad.wmnet[librenms]> truncate table syslog; Query OK, 0 rows affect... [16:04:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host lvs6003.drmrs.wmnet [16:05:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:05:39] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:06:38] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] Update source strings from Phrabricator [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/969518 (https://phabricator.wikimedia.org/T318763) (owner: 10Pppery) [16:06:42] (03PS1) 10JMeybohm: k8s: Make kubelet register new nodes as unschedulable [puppet] - 10https://gerrit.wikimedia.org/r/974615 [16:06:52] (03CR) 10Brennen Bearnes: [V: 03+2 C: 03+2] Update arcanist translations too [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/969520 (https://phabricator.wikimedia.org/T318763) (owner: 10Pppery) [16:08:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/488/con" [puppet] - 10https://gerrit.wikimedia.org/r/974615 (owner: 10JMeybohm) [16:09:49] (03PS2) 10JMeybohm: k8s: Make kubelet register new nodes as unschedulable [puppet] - 10https://gerrit.wikimedia.org/r/974615 [16:11:03] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/489/con" [puppet] - 
10https://gerrit.wikimedia.org/r/974615 (owner: 10JMeybohm) [16:16:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'depool db1130', diff saved to https://phabricator.wikimedia.org/P53486 and previous config saved to /var/cache/conftool/dbconfig/20231115-161600-arnaudb.json [16:18:15] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:18:19] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:18:33] (03PS1) 10Btullis: Stop oozie server and remove some resources [puppet] - 10https://gerrit.wikimedia.org/r/974618 (https://phabricator.wikimedia.org/T341893) [16:18:35] (03PS1) 10Btullis: Remove the oozie client [puppet] - 10https://gerrit.wikimedia.org/r/974619 (https://phabricator.wikimedia.org/T341893) [16:18:37] (03PS1) 10Btullis: Stop applying oozie profiles in most places [puppet] - 10https://gerrit.wikimedia.org/r/974620 (https://phabricator.wikimedia.org/T341893) [16:19:26] !log depooling cp1102 for BIOS options fix [16:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:48] 10SRE, 10ops-codfw: Q1:rack/setup/install kubernetes2054 - https://phabricator.wikimedia.org/T345650 (10JMeybohm) [16:21:19] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cp1102.eqiad.wmnet with reason: BIOS settings fix [16:21:32] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp1102.eqiad.wmnet with reason: BIOS settings fix [16:21:50] fabfur, jbond: two VO alerts are still open. Are they good to resolve? [16:22:12] Also, anything specific to keep an eye out for re: the fun you had earlier as things are handed off? 
[16:23:03] (PuppetFailure) firing: Puppet has failed on ping1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:23:23] brett: the issue is resolve, I can ack the alerts [16:23:27] *resolved [16:23:38] pass you the gdoc link in pvt [16:25:16] !log reload thanos-rule on titan[12]001 to pick up new pyrra generated configs [16:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:33] !log fabfur@cumin1001 START - Cookbook sre.hosts.provision for host cp1102.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:32:59] (PuppetFailure) resolved: Puppet has failed on ping1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:34:39] (03PS1) 10Btullis: Remove the oozie integration from hue [puppet] - 10https://gerrit.wikimedia.org/r/974646 (https://phabricator.wikimedia.org/T341893) [16:34:41] (03PS1) 10Btullis: Remove oozie configuration from core hadoop configuration files [puppet] - 10https://gerrit.wikimedia.org/r/974647 (https://phabricator.wikimedia.org/T341893) [16:34:43] (03PS1) 10Btullis: Update our kerberos scripts to remove oozie customisation [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) [16:34:45] (03PS1) 10Btullis: Remove all remaining references to oozie and clean up [puppet] - 10https://gerrit.wikimedia.org/r/974649 (https://phabricator.wikimedia.org/T341893) [16:34:49] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:35:25] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1102.mgmt.eqiad.wmnet with reboot policy GRACEFUL [16:35:58] (03CR) 10Btullis: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/974618 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:36:16] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1102.eqiad.wmnet [16:36:26] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974619 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:36:54] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974620 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:37:33] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974646 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:37:46] (03CR) 10Brouberol: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [16:37:48] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974647 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:38:01] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974648 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:38:13] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/492/con" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [16:38:15] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974649 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:38:21] (03CR) 10CI reject: [V: 04-1] Remove all remaining references to oozie and clean up [puppet] - 10https://gerrit.wikimedia.org/r/974649 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:38:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:40:09] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:42:57] (03CR) 10Ssingh: [C: 03+1] "Cathal is merging this today, heads-up" [puppet] - 10https://gerrit.wikimedia.org/r/971490 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [16:43:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:44:24] (03CR) 10Btullis: [C: 03+2] Stop oozie server and remove some resources [puppet] - 10https://gerrit.wikimedia.org/r/974618 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:45:11] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1102.eqiad.wmnet [16:48:03] (03PS2) 10Btullis: Remove the oozie client [puppet] - 10https://gerrit.wikimedia.org/r/974619 (https://phabricator.wikimedia.org/T341893) [16:48:14] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974619 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:49:03] PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [16:52:49] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1102.eqiad.wmnet [16:52:50] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1102.eqiad.wmnet [16:57:11] (03PS1) 10Btullis: Remove the oozie keytab from the hadoop coordinator role [puppet] - 
10https://gerrit.wikimedia.org/r/974650 (https://phabricator.wikimedia.org/T341893) [16:58:33] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/493/con" [puppet] - 10https://gerrit.wikimedia.org/r/974650 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [16:58:55] (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove the oozie keytab from the hadoop coordinator role [puppet] - 10https://gerrit.wikimedia.org/r/974650 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [17:01:54] 10SRE, 10ops-eqdfw: aqs1012: reseat SSD (/dev/sdh)? - https://phabricator.wikimedia.org/T351320 (10Eevans) [17:04:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service: move mw-jobrunner to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/973825 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:05:10] (03CR) 10Brouberol: "Relying on network/data.yaml data, we only introduce the following subnets in netboot.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [17:05:28] (03PS1) 10Btullis: Stop installing the oozie shared library for spark2 [puppet] - 10https://gerrit.wikimedia.org/r/974651 (https://phabricator.wikimedia.org/T341893) [17:07:15] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974651 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [17:07:31] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:08:54] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [17:10:50] (03CR) 10Btullis: [C: 03+2] Stop installing the oozie shared library for spark2 [puppet] - 10https://gerrit.wikimedia.org/r/974651 
(https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [17:13:53] (03CR) 10Herron: "Seeing timeouts with the new recording rules:" [puppet] - 10https://gerrit.wikimedia.org/r/974496 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey) [17:16:36] (03CR) 10Hnowlan: [C: 03+2] service: move mw-jobrunner to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/973825 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:16:43] 10SRE, 10Phabricator, 10Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (10Aklapper) p:05Medium→03Low [17:18:45] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T349796) [17:18:53] T349796: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 [17:19:54] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T349796) [17:20:49] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974619 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [17:22:25] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:23:04] ^ expected, hnowlan [17:23:23] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T349796) [17:23:23] (03CR) 10Btullis: [C: 03+2] Remove the oozie client [puppet] - 10https://gerrit.wikimedia.org/r/974619 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [17:23:55] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 82 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [17:24:15] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 79 
connections established with conf2004.codfw.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [17:25:47] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:26:05] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T349796) [17:26:19] T349796: Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 [17:27:26] !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T349796) [17:28:07] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 80 connections established with conf2004.codfw.wmnet:4001 (min=80) https://wikitech.wikimedia.org/wiki/PyBal [17:28:07] (03CR) 10Btullis: [C: 03+2] Stop applying oozie profiles in most places [puppet] - 10https://gerrit.wikimedia.org/r/974620 (https://phabricator.wikimedia.org/T341893) (owner: 10Btullis) [17:28:09] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:29:19] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 83 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [17:29:21] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:29:44] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T349796) [17:31:52] (03CR) 10Dzahn: [C: 03+2] php: add templates to support php8.2 on bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/974286 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [17:36:05] (03CR) 10Dzahn: wmflib: add function to return PHP version based on distro version (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [17:36:31] (03PS3) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 [17:37:29] (03PS1) 10Btullis: Send recovery emails to data-engineering-alerts [puppet] - 10https://gerrit.wikimedia.org/r/974652 (https://phabricator.wikimedia.org/T346438) [17:37:45] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974652 (https://phabricator.wikimedia.org/T346438) (owner: 10Btullis) [17:44:59] PROBLEM - Check systemd state on kafka-jumbo1011 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:13] (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:50:37] in the middle of an interview at the moment [17:51:10] Emperor: should we do a rolling restart? 
I haven't looked at the errors [17:51:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10RobH) a:05RobH→03Jhancock.wm [17:51:32] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [17:51:46] jynus: codfw cluster seems to be impacted [17:51:54] per the impact on the CDN [17:52:39] !log bking@wdqs1024 reboot host to hopefully reduce data reload failures T349011 [17:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:58] T349011: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 [17:53:01] 504s and now 502s [17:54:15] waiting a bit to confirm it is not transient (rate is not super high) to run sre.swift.roll-restart-reboot-swift-ms-proxies [17:54:17] jynus: regular behavior prior to a rolling restart [17:54:52] actually, it has been like that for a while, so going with it [17:54:56] 2% of 5xx is quite bad IMHO :) [17:55:13] (ATSBackendErrorsHigh) firing: (3) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:55:45] jynus: are you triggering it? should I? [17:55:52] * vgutierrez wondering where the on call folks are :) [17:55:55] I am doing it [17:56:03] jynus: thx [17:56:10] 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10kzimmerman) Approved, thanks! 
[17:56:23] !log root@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [17:56:36] vgutierrez: here and looking at dashboards and runbooks [17:56:50] ^ see log [17:56:57] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:57:04] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [17:57:10] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [17:58:39] how is it going, getting any better? [17:59:38] Seems to have turned around? https://grafana-rw.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-6h&to=now-1m&viewPanel=37 [17:59:56] 504s fell off [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T1800) [18:00:12] it is the 502 that worried us [18:00:26] going down as well [18:01:01] !log root@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [18:01:12] !log All restart_daemons were successful [18:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:18] (for codfw) [18:02:45] cwhite: I followed https://wikitech.wikimedia.org/wiki/Service_restarts#Swift [18:03:16] left a screen in cumin2002 in case more debugging is needed [18:03:19] thank you! 
:) [18:03:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) 05Resolved→03In progress a:05Jclark-ctr→03bking [18:04:29] cwhite: also, if you could find- there is probably a ticket open about these recurring issue and leave a note [18:04:33] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [18:04:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10RobH) a:05RobH→03None [18:04:57] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [18:05:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T348183)', diff saved to https://phabricator.wikimedia.org/P53488 and previous config saved to /var/cache/conftool/dbconfig/20231115-180503-arnaudb.json [18:05:11] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:05:13] (ATSBackendErrorsHigh) resolved: (3) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:05:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) Reopening as cloudelastic1008-1010 don't appear to have reimaged properly, and we may need them for T350826 . [18:07:04] jynus: found it https://phabricator.wikimedia.org/T322424 [18:09:17] 10SRE, 10SRE-swift-storage: Commons/multimedia errors, caused by repeated swift (cascading?) 
failures, late 2022 - https://phabricator.wikimedia.org/T322424 (10colewhite) @jcrespo roll-restarted swift proxies today using `sre.swift.roll-restart-reboot-swift-ms-proxies` cookbook in response to high 502s and 504... [18:09:56] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:12:49] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.wikimedia.org with OS bullseye [18:13:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:14:34] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::db [18:15:13] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1009.wikimedia.org with OS bullseye [18:16:26] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1010.wikimedia.org with OS bullseye [18:16:54] (03PS1) 10Jbond: wmcs::openstack::codfw1dev::db: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974658 (https://phabricator.wikimedia.org/T349619) [18:17:19] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:17:32] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::db: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974658 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [18:25:08] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::codfw1dev::db [18:29:54] 
(03PS2) 10Cathal Mooney: Remove specific TTL values from server BGP groups [homer/public] - 10https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488) [18:31:35] (03CR) 10Cathal Mooney: Remove specific TTL values from server BGP groups (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [18:31:53] (03CR) 10Cathal Mooney: [C: 03+2] Remove specific TTL values from server BGP groups [homer/public] - 10https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [18:32:29] (03Merged) 10jenkins-bot: Remove specific TTL values from server BGP groups [homer/public] - 10https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [18:33:43] (03PS1) 10Btullis: Enable the GeoIP2 plugin for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/974659 (https://phabricator.wikimedia.org/T351242) [18:33:48] (03PS1) 10Dzahn: phabricator: install python3-phabricator if bullseye or newer [puppet] - 10https://gerrit.wikimedia.org/r/974660 (https://phabricator.wikimedia.org/T351333) [18:34:16] (03CR) 10CI reject: [V: 04-1] phabricator: install python3-phabricator if bullseye or newer [puppet] - 10https://gerrit.wikimedia.org/r/974660 (https://phabricator.wikimedia.org/T351333) (owner: 10Dzahn) [18:35:15] (03PS2) 10Dzahn: phabricator: install python3-phabricator if bullseye or newer [puppet] - 10https://gerrit.wikimedia.org/r/974660 (https://phabricator.wikimedia.org/T351333) [18:35:22] (03CR) 10Btullis: [C: 03+2] Enable the GeoIP2 plugin for Matomo [puppet] - 10https://gerrit.wikimedia.org/r/974659 (https://phabricator.wikimedia.org/T351242) (owner: 10Btullis) [18:35:29] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10KFrancis) Hi all, I was finally granted access to see the signature confirmation 
page. I can confirm https://phabricator.wikimedia.org/p/Xqt/ has signed. We are still rese... [18:36:17] !log remove TTL setting on server-facing BGP peerings on cr3-ulsfo T350488 [18:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:22] T350488: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 [18:36:48] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [18:36:57] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Dzahn) 05Stalled→03Open [18:38:41] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Dzahn) @Xqt Would you like us to keep your real name out of public repos or you don't mind? [18:41:19] RECOVERY - Check systemd state on kafka-jumbo1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:34] !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host cloudgw2002-dev.codfw.wmnet [18:42:53] !log Reset BGP to lvs4010 from cr3-ulsfo to validate new config T350488 [18:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:57] T350488: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 [18:43:07] (03PS1) 10Majavah: hieradata: move cloudgw2002-dev to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974661 [18:43:41] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Dzahn) Thanks @KFrancis should we expect Xqt to be on the "NDA and MOU.." Google doc? Can we just use the "known to legal" string as real name regardless if we see it there? 
[18:43:48] (03PS1) 10Hnowlan: envoy: use ENTRYPOINT instead of CMD [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/974662 (https://phabricator.wikimedia.org/T300033) [18:43:51] (03CR) 10Majavah: [C: 03+2] hieradata: move cloudgw2002-dev to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974661 (owner: 10Majavah) [18:44:48] (03PS1) 10Btullis: Add a missing reference to the GeoIP2 plugin [puppet] - 10https://gerrit.wikimedia.org/r/974663 (https://phabricator.wikimedia.org/T351242) [18:45:38] (03PS1) 10Jbond: cfssl: update to serve ocsp responses directly from the database [puppet] - 10https://gerrit.wikimedia.org/r/974664 (https://phabricator.wikimedia.org/T350688) [18:45:44] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::net [18:47:01] (03CR) 10Btullis: [C: 03+2] Add a missing reference to the GeoIP2 plugin [puppet] - 10https://gerrit.wikimedia.org/r/974663 (https://phabricator.wikimedia.org/T351242) (owner: 10Btullis) [18:47:19] (03PS1) 10Dzahn: aphlict: migrate aphlict2001 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974665 (https://phabricator.wikimedia.org/T349619) [18:47:21] (03PS2) 10Jbond: cfssl: update to serve ocsp responses directly from the database [puppet] - 10https://gerrit.wikimedia.org/r/974664 (https://phabricator.wikimedia.org/T350688) [18:48:25] (03PS1) 10Dzahn: aphlict: migrate role to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974666 (https://phabricator.wikimedia.org/T349619) [18:48:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/501/con" [puppet] - 10https://gerrit.wikimedia.org/r/974664 (https://phabricator.wikimedia.org/T350688) (owner: 10Jbond) [18:48:45] (03PS1) 10Jbond: wmcs::openstack::codfw1dev::net: migrate puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974667 (https://phabricator.wikimedia.org/T349619) [18:49:52] !log 
taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudgw2002-dev.codfw.wmnet [18:49:55] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::net: migrate puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974667 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [18:50:44] (03PS1) 10Dzahn: aphlict: move 'puppet-controlled config' to role-level [puppet] - 10https://gerrit.wikimedia.org/r/974668 [18:51:08] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10KFrancis) I think the 'known to legal' is okay for now. Now that I have access to the 'signed' page, you can always check with me as well. [18:51:22] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::services [18:52:08] (03CR) 10Dzahn: "...or we could change the default value to true and drop it?" [puppet] - 10https://gerrit.wikimedia.org/r/974668 (owner: 10Dzahn) [18:52:36] (03PS1) 10Jbond: wmcs::openstack::codfw1dev::services: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974669 (https://phabricator.wikimedia.org/T349619) [18:52:52] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::services: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974669 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [18:54:10] !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host cloudgw2003-dev.codfw.wmnet [18:54:10] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-host for host aphlict2001.codfw.wmnet [18:54:18] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::codfw1dev::net [18:54:32] (03PS1) 10Majavah: hieradata: move cloudgw2003-dev to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974670 [18:54:34] (03CR) 10Dzahn: [C: 03+2] aphlict: migrate aphlict2001 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974665 
(https://phabricator.wikimedia.org/T349619) (owner: 10Dzahn) [18:54:37] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::virt_ceph [18:55:11] (03CR) 10Majavah: [C: 03+2] hieradata: move cloudgw2003-dev to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974670 (owner: 10Majavah) [18:55:53] (03PS1) 10Jbond: wmcs::openstack::codfw1dev::virt_ceph: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974672 (https://phabricator.wikimedia.org/T349619) [18:56:07] (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::virt_ceph: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974672 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [18:56:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [18:56:49] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1001.eqiad.wmnet with OS bullseye... [18:58:51] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: wmcs::openstack::codfw1dev::virt_ceph [18:59:01] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::virt_ceph [18:59:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host aphlict2001.codfw.wmnet [18:59:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::codfw1dev::services [19:00:00] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudgw2003-dev.codfw.wmnet [19:00:05] jeena and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T1900). [19:00:05] jeena and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T1900). [19:01:24] !log taavi@cumin1001 START - Cookbook sre.puppet.migrate-host for host cloudlb2001-dev.codfw.wmnet [19:02:27] (03PS1) 10Majavah: hieradata: move cloudlb2001-dev to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974675 [19:03:01] (03CR) 10Majavah: [C: 03+2] hieradata: move cloudlb2001-dev to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974675 (owner: 10Majavah) [19:05:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [19:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:05:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::codfw1dev::virt_ceph [19:06:20] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Dzahn) @Xqt Can we publish the email address associated with your LDAP/Wikitech account? Is it accurate or would you like to use a different one? 
[19:07:16] !log aphlict2001 - restart aphlict service after puppet 7 upgrade [19:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:38] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudlb2001-dev.codfw.wmnet [19:09:27] !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-role for role: aphlict [19:09:44] (03CR) 10Jbond: wmflib: add function to return PHP version based on distro version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [19:09:46] (03CR) 10Dzahn: [C: 03+2] aphlict: migrate role to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974666 (https://phabricator.wikimedia.org/T349619) (owner: 10Dzahn) [19:09:49] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974676 (https://phabricator.wikimedia.org/T350081) [19:09:51] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974676 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot) [19:09:53] (03PS2) 10Dzahn: aphlict: migrate role to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974666 (https://phabricator.wikimedia.org/T349619) [19:10:02] (03CR) 10Cathal Mooney: [C: 03+2] Change Bird multihop command to use default system TTL [puppet] - 10https://gerrit.wikimedia.org/r/971490 (https://phabricator.wikimedia.org/T350488) (owner: 10Cathal Mooney) [19:10:31] !log merging patch to remove TTL restriction on Bird Anycast BGP peerings (T350488) [19:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:35] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974676 (https://phabricator.wikimedia.org/T350081) (owner: 10TrainBranchBot) [19:10:40] T350488: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 [19:11:53] 
10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Dzahn) [19:12:53] (03PS2) 10Dzahn: aphlict: move 'puppet-controlled config' to role-level [puppet] - 10https://gerrit.wikimedia.org/r/974668 [19:15:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: aphlict [19:16:11] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:17:45] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.5 refs T350081 [19:17:50] T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081 [19:17:58] topranks: ^ just a flap? [19:18:17] (03PS4) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 [19:18:19] (03CR) 10Dzahn: wmflib: add function to return PHP version based on distro version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [19:18:31] sukhe: looking [19:18:57] sukhe: yes would seem so [19:19:35] (03CR) 10Muehlenhoff: wmflib: add function to return PHP version based on distro version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [19:19:43] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:53] doh4001 flapped, I think unrelated as doh1001 didn't flap when it updated the config [19:20:31] ok! 
[19:23:38] !log jhuneidi@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.5 refs T350081 (duration: 05m 52s) [19:23:42] T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081 [19:25:48] (03PS5) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 [19:25:54] (03CR) 10Dzahn: wmflib: add function to return PHP version based on distro version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [19:26:19] (03PS6) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 [19:34:37] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.wikimedia.org with OS bullseye [19:36:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1009.wikimedia.org with OS bullseye [19:37:26] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1010.wikimedia.org with OS bullseye [19:38:49] (03CR) 10Jbond: [C: 03+1] "lgtm assuming outputs look good but see inline for possible improvement" [puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [19:39:11] !log re-enabling puppet on DNS hosts to adjust TTL setting in BIRD (T350488) [19:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:16] T350488: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 [19:58:56] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:13:57] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:14:27] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:17:47] (03CR) 10Klausman: [C: 03+1] profile::thanos: improve istio sli recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974486 (owner: 10Elukey) [20:18:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:22:23] (03PS2) 10Dzahn: WIP: planet: Update for rawdog v3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm) [20:29:46] (03PS3) 10Dzahn: WIP: planet: Update for rawdog v3 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm) [20:31:18] (03CR) 10Dzahn: "I amended with the intention to make it compatible with both versions as the previous commit message said. 
The intention is that it's noop" [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm) [20:32:21] (03CR) 10Dzahn: "assuming the old version ignores the new values in config templates or at least it doesnt break anything so the .erb files can be changed" [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm) [20:34:08] (03PS1) 10Jcrespo: [WIP]Prepare for release [software/transferpy] - 10https://gerrit.wikimedia.org/r/974683 [20:37:05] (03CR) 10Jbond: [C: 03+1] "lgtm assuming pcc is good" [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn) [20:46:41] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:48:59] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:59:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10bking) 05In progress→03Resolved Not sure what happened, but the cloudelastic1008-1010 hosts are up after a reim... [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T2100). [21:00:04] No Gerrit patches in the queue for this window AFAICS. [21:12:34] (03CR) 10Brouberol: [V: 03+1] "Thanks for the code suggestion jbond!" 
[puppet] - 10https://gerrit.wikimedia.org/r/974500 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol) [21:16:06] (03PS1) 10Andrew Bogott: Codfw1dev galera: switch to using private IPs for syncing [puppet] - 10https://gerrit.wikimedia.org/r/974688 (https://phabricator.wikimedia.org/T351281) [21:17:59] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T351144 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [21:19:08] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Jhancock.wm) yes, I can do that. ty! [21:24:14] (03CR) 10Andrew Bogott: [C: 03+2] Codfw1dev galera: switch to using private IPs for syncing [puppet] - 10https://gerrit.wikimedia.org/r/974688 (https://phabricator.wikimedia.org/T351281) (owner: 10Andrew Bogott) [21:35:17] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:36:27] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:36:46] (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:41:46] (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [21:46:41] (03CR) 10Muehlenhoff: WIP: planet: Update for rawdog v3 on bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/964176 (https://phabricator.wikimedia.org/T348392) (owner: 10Legoktm) [22:00:05] Deploy window Wikifunction Services UTC Late 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231115T2200) [22:09:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:10:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:10:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:23] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:11:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.866 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:16:55] (03PS1) 10Ryan Kemper: cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - 10https://gerrit.wikimedia.org/r/974693 [22:17:19] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:17:23] (03CR) 10CI reject: [V: 04-1] cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - 10https://gerrit.wikimedia.org/r/974693 (owner: 10Ryan Kemper) [22:17:44] (03PS2) 10Bking: cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - 10https://gerrit.wikimedia.org/r/974693 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) 
[22:18:05] (03CR) 10Bking: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/974693 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [22:18:43] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:18:58] (03CR) 10Bking: [C: 03+1] cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - 10https://gerrit.wikimedia.org/r/974693 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [22:19:13] (03CR) 10Ryan Kemper: [C: 03+2] cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - 10https://gerrit.wikimedia.org/r/974693 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [22:20:05] !log T351354 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/974693; running puppet on hosts [22:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:11] T351354: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 [22:23:23] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:25:52] (03PS1) 10Ryan Kemper: cloudelastic: hosts need racking info [puppet] - 10https://gerrit.wikimedia.org/r/974694 [22:27:28] (03CR) 10Ryan Kemper: [C: 03+2] cloudelastic: hosts need racking info [puppet] - 10https://gerrit.wikimedia.org/r/974694 (owner: 10Ryan Kemper) [22:27:40] (03CR) 10Bking: [C: 03+1] cloudelastic: hosts need racking info [puppet] - 10https://gerrit.wikimedia.org/r/974694 (owner: 10Ryan Kemper) [22:27:46] (03PS2) 10Ryan Kemper: cloudelastic: hosts need racking info [puppet] - 10https://gerrit.wikimedia.org/r/974694 (https://phabricator.wikimedia.org/T351354) [22:28:09] (03CR) 10Ryan Kemper: [V: 03+2] cloudelastic: hosts need racking 
info [puppet] - 10https://gerrit.wikimedia.org/r/974694 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [22:28:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:28:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:30:31] (03PS2) 10JHathaway: puppetserver: cache code [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) [22:31:49] (03CR) 10JHathaway: puppetserver: cache code (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [22:33:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 50% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:35:27] (03PS3) 10JHathaway: puppetserver: cache code [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) [22:38:42] (SystemdUnitFailed) firing: (2) nginx.service Failed on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:39:39] (03CR) 10Awight: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/974695 (owner: 10Awight) [22:41:07] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cloudelastic[1007-1010].wikimedia.org with reason: new cloudelastic hosts TT351354 [22:41:22] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudelastic[1007-1010].wikimedia.org with reason: new cloudelastic hosts TT351354 [22:41:28] (03CR) 10Jbond: [C: 03+1] "LGTM, please test puppet-merge after merging. i normally add a cr to the README file and self +2 and merge that to test puppet-merge" [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [22:44:14] (03PS2) 10Awight: Drop config which is the same as the default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974695 [22:45:33] (03CR) 10Jbond: [C: 03+1] puppetserver: cache code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway) [22:48:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 7h 49m 42s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [22:54:24] (03PS1) 10Ryan Kemper: cloudelastic: switch new hosts back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/974696 (https://phabricator.wikimedia.org/T351354) [22:55:39] (03CR) 10Bking: [C: 03+1] cloudelastic: switch new hosts back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/974696 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [22:57:48] !log bking@cumin2002 START - Cookbook sre.puppet.renew-cert for cloudelastic1008.wikimedia.org: Renew puppet certificate - bking@cumin2002 [22:57:54] !log bking@cumin2002 END (FAIL) - Cookbook 
sre.puppet.renew-cert (exit_code=99) for cloudelastic1008.wikimedia.org: Renew puppet certificate - bking@cumin2002 [22:58:49] !log bking@cumin2002 START - Cookbook sre.puppet.renew-cert for cloudelastic1007.wikimedia.org: Renew puppet certificate - bking@cumin2002 [22:59:28] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cloudelastic1007.wikimedia.org: Renew puppet certificate - bking@cumin2002 [22:59:30] (03PS1) 10Ryan Kemper: nit: s/see's/sees [cookbooks] - 10https://gerrit.wikimedia.org/r/974697 [23:00:34] (03PS2) 10Ryan Kemper: nit: s/see's/sees [cookbooks] - 10https://gerrit.wikimedia.org/r/974697 [23:00:47] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] nit: s/see's/sees [cookbooks] - 10https://gerrit.wikimedia.org/r/974697 (owner: 10Ryan Kemper) [23:01:02] (03CR) 10Ryan Kemper: [C: 03+2] cloudelastic: switch new hosts back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/974696 (https://phabricator.wikimedia.org/T351354) (owner: 10Ryan Kemper) [23:04:14] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1008.wikimedia.org with OS bullseye [23:05:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T348183)', diff saved to https://phabricator.wikimedia.org/P53490 and previous config saved to /var/cache/conftool/dbconfig/20231115-230504-arnaudb.json [23:05:13] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [23:05:49] (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:09:53] 10SRE, 10ops-eqiad: aqs1012: reseat SSD (/dev/sdh)? 
- https://phabricator.wikimedia.org/T351320 (RobH)
[23:20:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P53491 and previous config saved to /var/cache/conftool/dbconfig/20231115-232010-arnaudb.json
[23:22:31] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:23:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:26:35] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:29:28] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:35:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P53492 and previous config saved to /var/cache/conftool/dbconfig/20231115-233516-arnaudb.json
[23:50:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T348183)', diff saved to https://phabricator.wikimedia.org/P53493 and previous config saved to /var/cache/conftool/dbconfig/20231115-235023-arnaudb.json
[23:50:25] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[23:50:28] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[23:50:38] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[23:50:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T348183)', diff saved to https://phabricator.wikimedia.org/P53494 and previous config saved to /var/cache/conftool/dbconfig/20231115-235044-arnaudb.json
[23:58:56] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure