[00:00:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:00:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.630 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:02:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.313 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:02:13] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:03:37] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:12:00] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[16,19-21,28,31].eqiad.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[00:12:08] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[00:35:21] (CR) Krinkle: [C: +2] noc: Fix highlight.php to not append .txt to dblist URLs (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/591459 (https://phabricator.wikimedia.org/T250852) (owner: Urbanecm)
[00:38:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/957807
[00:38:25] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/957807 (owner: TrainBranchBot)
[00:42:49] RECOVERY - Check systemd state on logstash1026
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:49] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/957807 (owner: TrainBranchBot)
[00:58:11] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:59:37] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:07:30] SRE, ops-codfw, DC-Ops, cloud-services-team: need private IPs for cloudvirt200[4-6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T346410 (Andrew)
[01:09:24] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt2004-dev.codfw.wmnet with OS bookworm
[01:09:25] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt2006-dev.codfw.wmnet with OS bookworm
[01:09:27] (PS4) Krinkle: Remove old origin-with-crossorigin referrer policy [mediawiki-config] - https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) (owner: TheDJ)
[01:09:28] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt2005-dev.codfw.wmnet with OS bookworm
[01:09:41] (CR) Krinkle: [C: +2] Remove old origin-with-crossorigin referrer policy [mediawiki-config] - https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) (owner:
TheDJ)
[01:10:34] (Merged) jenkins-bot: Remove old origin-with-crossorigin referrer policy [mediawiki-config] - https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) (owner: TheDJ)
[01:12:06] (CR) TrainBranchBot: [C: +2] "Approved by krinkle@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/927279 (https://phabricator.wikimedia.org/T338183) (owner: TheDJ)
[01:12:28] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:927279|Remove old origin-with-crossorigin referrer policy (T338183)]]
[01:12:31] T338183: Remove old origin-when-crossorigin Safari misspelling of referrer policy - https://phabricator.wikimedia.org/T338183
[01:13:50] !log krinkle@deploy1002 krinkle and hartman: Backport for [[gerrit:927279|Remove old origin-with-crossorigin referrer policy (T338183)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[01:14:27] !log krinkle@deploy1002 krinkle and hartman: Continuing with sync
[01:20:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:20:44] !log krinkle@deploy1002 Finished scap: Backport for [[gerrit:927279|Remove old origin-with-crossorigin referrer policy (T338183)]] (duration: 08m 16s)
[01:20:48] T338183: Remove old origin-when-crossorigin Safari misspelling of referrer policy - https://phabricator.wikimedia.org/T338183
[01:25:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:44:30] !log rolling Cassandra restart, RESTBase/eqiad/row-B — T331713
[01:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:34] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[01:44:42] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[17,22-24,29,32].eqiad.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[02:07:33] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:47] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm)
[02:22:33] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:33] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:36] (PS2) Krinkle: mediawiki: Extend /portals max-age from 24h to 1 year [puppet] - https://gerrit.wikimedia.org/r/817409 (https://phabricator.wikimedia.org/T313881)
[02:40:08] (PS4) Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/820245
[02:40:15] (CR) Krinkle: Remove legacy wgRC2UDPPrefix overrides for private wikis (1 comment)
[mediawiki-config] - https://gerrit.wikimedia.org/r/820245 (owner: Krinkle)
[02:43:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[17,22-24,29,32].eqiad.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001
[02:43:09] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713
[02:49:27] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:50:53] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:53:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:56:19] (PS2) Krinkle: eventlogging: Remove unused FeaturePolicyViolation schema [puppet] - https://gerrit.wikimedia.org/r/908382 (https://phabricator.wikimedia.org/T209572)
[02:58:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:08:08] (PS3) Krinkle: eventlogging: Remove obsolete FeaturePolicyViolation schema [puppet] - https://gerrit.wikimedia.org/r/908382 (https://phabricator.wikimedia.org/T209572)
[03:16:34] (PS4) Krinkle: eventlogging: Remove obsolete FeaturePolicyViolation schema [puppet] -
https://gerrit.wikimedia.org/r/908382 (https://phabricator.wikimedia.org/T209572)
[03:20:59] (CR) Krinkle: [BETA HACK] Attempt to secure Puppet DB better (1 comment) [puppet] - https://gerrit.wikimedia.org/r/941476 (owner: Krinkle)
[03:46:23] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:47:45] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:53:21] (PS1) Zoranzoki21: Enable WikiLove on arwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/957842 (https://phabricator.wikimedia.org/T346391)
[04:47:15] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:48:39] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:30:01] (CR) Muehlenhoff: Extend the maps restart cookbook to also handle reboots (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855) (owner: Muehlenhoff)
[05:30:21] (PS3) Muehlenhoff: Extend the maps restart cookbook to also handle reboots [cookbooks] - https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855)
[05:36:27] (CR) Muehlenhoff: [C: +2] Extend the maps restart cookbook to also handle reboots [cookbooks] - https://gerrit.wikimedia.org/r/957696 (https://phabricator.wikimedia.org/T317855) (owner: Muehlenhoff)
[05:40:43] (CR)
Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/956968 (owner: BCornwall)
[05:44:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org
[05:48:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:48:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:50:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[05:50:13] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[05:50:57] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1592 days) https://wikitech.wikimedia.org/wiki/Logs
[05:50:57] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1108 days) https://wikitech.wikimedia.org/wiki/Logs
[05:53:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:53:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:53:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org
[05:56:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet
[05:57:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230915T0600)
[06:00:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet
[06:01:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet
[06:02:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:06:04] !log jmm@cumin2002 END (PASS) - Cookbook
sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet
[06:18:20] (PS1) Filippo Giunchedi: librenms: fix permissions on logs and 'lnms' [puppet] - https://gerrit.wikimedia.org/r/957846 (https://phabricator.wikimedia.org/T344136)
[06:18:24] (PS1) Filippo Giunchedi: librenms: use timer name in journal [puppet] - https://gerrit.wikimedia.org/r/957847 (https://phabricator.wikimedia.org/T344136)
[06:18:28] (PS1) Filippo Giunchedi: librenms: refactor ensure_packages [puppet] - https://gerrit.wikimedia.org/r/957848 (https://phabricator.wikimedia.org/T344136)
[06:18:56] (CR) CI reject: [V: -1] librenms: refactor ensure_packages [puppet] - https://gerrit.wikimedia.org/r/957848 (https://phabricator.wikimedia.org/T344136) (owner: Filippo Giunchedi)
[06:25:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org
[06:36:54] (PS1) Filippo Giunchedi: prometheus: support setting owner/group in assemble-config [puppet] - https://gerrit.wikimedia.org/r/957850 (https://phabricator.wikimedia.org/T346335)
[06:36:56] (PS1) Filippo Giunchedi: prometheus: snmp-exporter support for assemble-config [puppet] - https://gerrit.wikimedia.org/r/957851 (https://phabricator.wikimedia.org/T346335)
[06:36:58] (PS1) Filippo Giunchedi: prometheus: use assemble-config for snmp-exporter [puppet] - https://gerrit.wikimedia.org/r/957852 (https://phabricator.wikimedia.org/T346335)
[06:37:48] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:45:40] (PS1) Elukey: ml-services: raise enwiki articlequality's min replicas [deployment-charts] - https://gerrit.wikimedia.org/r/957853 (https://phabricator.wikimedia.org/T344058)
[06:50:38] rzl / volans / _joe_ thanks for
fixing my mess yesterday! I must admit I only checked etcdmirror logs and icinga :/
[06:56:32] SRE, Infrastructure-Foundations, Stewards-and-global-tools, vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (MoritzMuehlenhoff) Since this is a single VM which can run in either DC, please create in codfw. we currently have way more space there.
[06:56:38] jayme: no prob :) I think some follow up is still needed, a patch to be merged in etcd mirror and to make a new release AFAICT
[06:56:42] (CR) Elukey: [C: +1] kubernetes::master: Switch to PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[06:57:11] volans: yeah - saw that one
[06:59:36] (CR) Elukey: [C: +2] ml-services: raise enwiki articlequality's min replicas [deployment-charts] - https://gerrit.wikimedia.org/r/957853 (https://phabricator.wikimedia.org/T344058) (owner: Elukey)
[07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230915T0700)
[07:01:13] SRE, Infrastructure-Foundations, Traffic, netops: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (MoritzMuehlenhoff) >>! In T345809#9168091, @cmooney wrote: > Do we have any way to measure it's impact? I had a quick look at available promethues metrics a...
[07:03:29] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:04:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:05:59] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:59] (PS2) JMeybohm: Python3 fixes: return bytes from render_GET, and accept a bytes path [software/etcd-mirror] - https://gerrit.wikimedia.org/r/957784 (https://phabricator.wikimedia.org/T332010) (owner: RLazarus)
[07:09:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org
[07:09:29] SRE, Infrastructure-Foundations, Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (JMeybohm)
[07:10:01] SRE, serviceops, Patch-For-Review: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (JMeybohm) Resolved→Open SRE was paged due to EtcdReplicationDown. Turns out the etcdmirror webinterface does not work with python3 on bullseye
[07:15:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org
[07:21:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[07:21:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[07:21:25] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:22:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica2007.wikimedia.org
[07:22:17] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:22:51] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:24:21] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by
cookbooks.sre.dns.netbox: Add records for VM ldap-replica2007.wikimedia.org - jmm@cumin2002"
[07:25:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica2007.wikimedia.org - jmm@cumin2002"
[07:25:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:25:08] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica2007.wikimedia.org on all recursors
[07:25:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica2007.wikimedia.org on all recursors
[07:25:36] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica2007.wikimedia.org - jmm@cumin2002"
[07:26:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica2007.wikimedia.org - jmm@cumin2002"
[07:29:01] (PS1) Slyngshede: P:idm tweaks to settings for deb install. [puppet] - https://gerrit.wikimedia.org/r/957854
[07:33:37] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43314/console" [puppet] - https://gerrit.wikimedia.org/r/957854 (owner: Slyngshede)
[07:35:11] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43315/console" [puppet] - https://gerrit.wikimedia.org/r/957854 (owner: Slyngshede)
[07:38:23] (PS2) Slyngshede: P:idm tweaks to settings for deb install.
[puppet] - https://gerrit.wikimedia.org/r/957854
[07:38:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-replica2007.wikimedia.org with OS bookworm
[07:38:58] SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica2007.wikimedia.org with OS bookworm
[07:39:37] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43316/console" [puppet] - https://gerrit.wikimedia.org/r/957854 (owner: Slyngshede)
[07:40:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:40:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:40:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:42:10] (PS3) Slyngshede: P:idm tweaks to settings for deb install. [puppet] - https://gerrit.wikimedia.org/r/957854
[07:43:16] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43317/console" [puppet] - https://gerrit.wikimedia.org/r/957854 (owner: Slyngshede)
[07:44:55] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43318/console" [puppet] - https://gerrit.wikimedia.org/r/957854 (owner: Slyngshede)
[07:45:23] (CR) Slyngshede: [V: +1 C: +2] P:idm tweaks to settings for deb install.
[puppet] - https://gerrit.wikimedia.org/r/957854 (owner: Slyngshede)
[07:45:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:47:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 2.718 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:47:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.299 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:51:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-replica2007.wikimedia.org with reason: host reimage
[07:54:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:54:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-replica2007.wikimedia.org with reason: host reimage
[07:55:06] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:55:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:56:08] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:03:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:05:07] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (cmooney) @ayounsi I'm in two minds as to whether it makes sense to make this change for the EVPN switches. In terms of the traffic between s...
[08:06:08] (PS3) Cathal Mooney: Adjust hashing algo for QFX5000 series l3_switches [homer/public] - https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852)
[08:06:13] (PS1) Giuseppe Lavagetto: trafficserver: move 6.5% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/957857 (https://phabricator.wikimedia.org/T346422)
[08:06:15] (PS1) Giuseppe Lavagetto: trafficserver: move 8% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/957858 (https://phabricator.wikimedia.org/T346422)
[08:06:17] (PS1) Giuseppe Lavagetto: trafficserver: move 10% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/957859 (https://phabricator.wikimedia.org/T346422)
[08:06:39] (CR) CI reject: [V: -1] Adjust hashing algo for QFX5000 series l3_switches [homer/public] - https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852) (owner: Cathal Mooney)
[08:06:47] (PS4) Cathal Mooney: Adjust hashing algo for QFX5000 series l3_switches [homer/public] - https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852)
[08:07:02] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.702 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:08:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-replica2007.wikimedia.org with OS bookworm
[08:08:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica2007.wikimedia.org
[08:08:06] SRE,
Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica2007.wikimedia.org with OS bookworm completed: - ldap-rep...
[08:09:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:10:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-replica2008.wikimedia.org
[08:10:47] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:10:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.057 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:12:47] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica2008.wikimedia.org - jmm@cumin2002"
[08:13:28] (PS1) Brouberol: Configure kafka-jumbo101[0-5].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957861
[08:14:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:15:10] (PS1) Stevemunene: Remove mention of an-test-client1001 [puppet] - https://gerrit.wikimedia.org/r/957862 (https://phabricator.wikimedia.org/T329363)
[08:15:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 5.035 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:15:33] (PS1) Ayounsi: Add include for esams sandbox1-by27-esams v4/v6 [dns] - https://gerrit.wikimedia.org/r/957863 (https://phabricator.wikimedia.org/T307021)
[08:18:41] (CR) Cathal Mooney: [C: +1] "LGTM (but more importantly CI :P)" [dns] - https://gerrit.wikimedia.org/r/957863
(https://phabricator.wikimedia.org/T307021) (owner: Ayounsi)
[08:19:09] (CR) Brouberol: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/957861 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[08:20:02] (CR) Ayounsi: [C: +2] Add include for esams sandbox1-by27-esams v4/v6 [dns] - https://gerrit.wikimedia.org/r/957863 (https://phabricator.wikimedia.org/T307021) (owner: Ayounsi)
[08:24:36] (PS1) Elukey: services: increase concurrency in ORESFetchScoreJob's changeprop cfg [deployment-charts] - https://gerrit.wikimedia.org/r/957864 (https://phabricator.wikimedia.org/T346175)
[08:26:40] (PS2) Elukey: services: increase concurrency in ORESFetchScoreJob's changeprop cfg [deployment-charts] - https://gerrit.wikimedia.org/r/957864 (https://phabricator.wikimedia.org/T346175)
[08:26:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-replica2008.wikimedia.org - jmm@cumin2002"
[08:26:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:26:42] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-replica2008.wikimedia.org on all recursors
[08:26:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-replica2008.wikimedia.org on all recursors
[08:27:09] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica2008.wikimedia.org - jmm@cumin2002"
[08:27:29] (PS2) Ayounsi: Add esams RIPE Atlas to monitoring [puppet] - https://gerrit.wikimedia.org/r/957330 (https://phabricator.wikimedia.org/T307021)
[08:27:37] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/957330 (https://phabricator.wikimedia.org/T307021) (owner: Ayounsi)
[08:27:58] !log jmm@cumin2002 END (PASS) - Cookbook
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-replica2008.wikimedia.org - jmm@cumin2002" [08:28:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] services: increase concurrency in ORESFetchScoreJob's changeprop cfg [deployment-charts] - 10https://gerrit.wikimedia.org/r/957864 (https://phabricator.wikimedia.org/T346175) (owner: 10Elukey) [08:29:16] (03CR) 10Ladsgroup: [C: 03+1] services: increase concurrency in ORESFetchScoreJob's changeprop cfg [deployment-charts] - 10https://gerrit.wikimedia.org/r/957864 (https://phabricator.wikimedia.org/T346175) (owner: 10Elukey) [08:29:20] (03PS3) 10Elukey: services: increase concurrency in ORESFetchScoreJob's changeprop cfg [deployment-charts] - 10https://gerrit.wikimedia.org/r/957864 (https://phabricator.wikimedia.org/T346175) [08:30:15] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney) [08:30:25] (03PS4) 10Elukey: services: increase concurrency in ORESFetchScoreJob's changeprop cfg [deployment-charts] - 10https://gerrit.wikimedia.org/r/957864 (https://phabricator.wikimedia.org/T346175) [08:30:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-replica2008.wikimedia.org with OS bookworm [08:30:58] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica2008.wikimedia.org with OS bookworm [08:32:34] (03PS1) 10Muehlenhoff: Move Ganeti to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/957865 [08:32:56] (03CR) 10Ilias Sarantopoulos: [C: 03+1] services: increase concurrency in ORESFetchScoreJob's changeprop cfg [deployment-charts] - 10https://gerrit.wikimedia.org/r/957864 
(https://phabricator.wikimedia.org/T346175) (owner: 10Elukey) [08:34:09] (03CR) 10Ayounsi: [C: 03+2] Add esams RIPE Atlas to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/957330 (https://phabricator.wikimedia.org/T307021) (owner: 10Ayounsi) [08:34:33] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (10ayounsi) Good point! That was done before the VXLAN deployment to have more predictability on the anycast traffic to the end servers. If we... [08:34:39] 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) [08:35:04] (03CR) 10Elukey: [C: 03+2] services: increase concurrency in ORESFetchScoreJob's changeprop cfg [deployment-charts] - 10https://gerrit.wikimedia.org/r/957864 (https://phabricator.wikimedia.org/T346175) (owner: 10Elukey) [08:38:13] (03CR) 10Phuedx: [C: 03+1] eventlogging: Remove obsolete FeaturePolicyViolation schema [puppet] - 10https://gerrit.wikimedia.org/r/908382 (https://phabricator.wikimedia.org/T209572) (owner: 10Krinkle) [08:39:06] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:39:14] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [08:39:29] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [08:39:49] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [08:40:34] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [08:40:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure ECMP hashing 
function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (10cmooney) >>! In T339852#9169159, @ayounsi wrote: > If we can't have different behavior for vxlan vs. servers it seems more important to me th... [08:41:17] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10dcaro) It was due to a non-responsive kubernetes worker node, rebooting it to force the pod to get rescheduled seemed to get the service back online, I'm looking a... [08:41:41] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:43:35] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Access port speed <= 100Mbps False positives - https://phabricator.wikimedia.org/T336511 (10ayounsi) 05Open→03Resolved a:03ayounsi I removed the alert as it was being problematic in {T346317} as well. [08:44:01] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10SRE Observability (FY2023/2024-Q1): Alert "access port speed less 100mbit" and librenms upgrade - https://phabricator.wikimedia.org/T346317 (10ayounsi) 05Open→03Resolved a:03ayounsi Thanks, that's related to {T336511} and I j... 
[08:46:02] (03CR) 10Ayounsi: [C: 03+1] Adjust hashing algo for QFX5000 series l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852) (owner: 10Cathal Mooney) [08:46:48] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [08:47:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-replica2008.wikimedia.org with reason: host reimage [08:47:23] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [08:47:34] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) @jbond from Juniper: [08:47:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957865 (owner: 10Muehlenhoff) [08:50:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-replica2008.wikimedia.org with reason: host reimage [08:56:02] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10SLyngshede-WMF) We might need to change the Juniper configuration in CAS from: ` "supportedResponseTypes": [ "java.util.HashSet", [ "code"... 
[08:56:17] (03PS4) 10Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) [08:57:18] (03PS5) 10Arturo Borrero Gonzalez: wikimediacloud.org: decom ns-recursor0.openstack.eqiad1.wikimediacloud.org [dns] - 10https://gerrit.wikimedia.org/r/957745 (https://phabricator.wikimedia.org/T307357) [08:57:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ldap-replica2008.wikimedia.org with OS bookworm [08:57:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ldap-replica2008.wikimedia.org [08:57:50] 10SRE, 10Infrastructure-Foundations, 10LDAP: Migrate the r/w LDAP servers to Bookworm and MDB storage - https://phabricator.wikimedia.org/T331699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica2008.wikimedia.org with OS bookworm executed with errors:... [09:01:20] (03CR) 10Ladsgroup: dbbackups: Add new check (focused on ES) of long running backups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [09:08:06] (03PS1) 10Marostegui: install_server: Do not reimage db2196 [puppet] - 10https://gerrit.wikimedia.org/r/957887 [09:08:54] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2196 [puppet] - 10https://gerrit.wikimedia.org/r/957887 (owner: 10Marostegui) [09:10:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [09:10:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [09:12:39] (03CR) 10Jelto: "I'm not really sure what's needed to add a new service. I'll loop in traffic folks." 
[puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [09:15:21] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (10aborrero) [09:15:35] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team: need private IPs for cloudvirt200[4-6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T346410 (10aborrero) [09:15:37] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (10aborrero) [09:16:56] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team: need private IPs for cloudvirt200[4-6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T346410 (10aborrero) I will allocate the records by hand, but the proper fix would be parent task {T346428}. [09:18:31] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43319/console" [puppet] - 10https://gerrit.wikimedia.org/r/957861 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [09:19:42] (03CR) 10Jcrespo: "Amir: Thanks. The script suggestions are quick and easy to apply (implementation details), and honestly I didn't give much thought to a on" [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [09:20:06] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team: need private IPs for cloudvirt200[4-6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T346410 (10aborrero) created: * https://netbox.wikimedia.org/ipam/ip-addresses/14680/ cloudvirt2004-dev.private.codfw.wikimedia.cloud 172.20.5.10/24 * https://netb... 
[09:20:27] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [09:20:44] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43320/console" [puppet] - 10https://gerrit.wikimedia.org/r/957861 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [09:20:45] 10SRE, 10ops-knams, 10DC-Ops: ripe-atlas-esams down - https://phabricator.wikimedia.org/T303242 (10ayounsi) [09:21:17] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10Brycehughes) @dcaro it works in my case. What a coincidence re the DNS stuff! Thanks all [09:22:18] (03CR) 10Ladsgroup: dbbackups: Add new check (focused on ES) of long running backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [09:22:41] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirts in codfw - aborrero@cumin1001" [09:28:20] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (10aborrero) [09:28:32] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (10aborrero) [09:28:46] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team: need private IPs for cloudvirt200[4-6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T346410 (10aborrero) 05Open→03Resolved this should be done. Please try again and reopen if required. 
[09:28:52] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (10aborrero) a:05aborrero→03cmooney [09:29:04] (03CR) 10Marostegui: "Why don't we use the proxies for this?" [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup) [09:29:07] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team: need private IPs for cloudvirt200[4-6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T346410 (10aborrero) a:05Jhancock.wm→03aborrero [09:38:28] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [09:40:38] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [09:45:19] (03CR) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [09:54:49] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:56:13] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:02:45] (Primary outbound port utilisation over 80% #page) firing: (3) Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Primary outbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:02:53] here we go [10:02:59] * volans here [10:03:14] what is it? [10:03:29] oh, eqsin [10:04:00] I don't see spikes of real traffic, seems internal marostegui [10:04:22] at first sight [10:04:30] Yeah, I am not seeing anything either on the typical dashboards [10:04:44] I am seeing something [10:04:54] XioNoX, topranks: by any chance did you change anything on eqsin that might explain ^^^ ? [10:04:58] https://w.wiki/7The [10:05:06] * volans checking librenms [10:05:16] but now filtering to see [10:05:35] volans: no nothing I was doing that could affect it I think [10:05:43] PROBLEM - Check systemd state on puppetdb1002 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetdb-microservice.service,wmf_auto_restart_uwsgi-puppetdb-microservice.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:46] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [10:05:59] no silent-drop either [10:06:09] https://grafana.wikimedia.org/d/pXnJdJ17k/all-clusters-network-traffic-traffic?orgId=1 network cache upload seems to be quite high [10:06:14] topranks: Any thoughts on what it could be? 
[10:07:17] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirts in codfw - aborrero@cumin1001" [10:07:17] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:07:45] (Primary outbound port utilisation over 80% #page) firing: (5) Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:09:00] marostegui: massive jump in traffic but I've no real clue as to why yet [10:09:15] k, thanks we are discussing in the private channel [10:09:35] PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi-puppetdb-microservice.service,wmf_auto_restart_uwsgi-puppetdb-microservice.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:12:45] (Primary outbound port utilisation over 80% #page) firing: (5) Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Primary outbound port utilisation over 80% 
#page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:15:46] (Primary inbound port utilisation over 80% #page) resolved: Device cr3-eqsin.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [10:17:45] (Primary outbound port utilisation over 80% #page) firing: (4) Alert for device cr1-esams.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:18:59] PROBLEM - Check systemd state on cloudgw1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:23] (03PS3) 10JMeybohm: Python3 fixes [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/957784 (https://phabricator.wikimedia.org/T332010) (owner: 10RLazarus) [10:20:06] (03CR) 10CI reject: [V: 04-1] Python3 fixes [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/957784 (https://phabricator.wikimedia.org/T332010) (owner: 10RLazarus) [10:20:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:22:23] (03PS4) 10JMeybohm: Python3 fixes [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/957784 (https://phabricator.wikimedia.org/T332010) (owner: 10RLazarus) [10:27:45] (Primary outbound port utilisation over 80% #page) resolved: (2) Device cr1-esams.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [10:32:17] (03PS2) 10Volans: external clouds: allow to get prefixes from RIPE [puppet] - 10https://gerrit.wikimedia.org/r/956955 (https://phabricator.wikimedia.org/T303534) [10:32:19] (03PS1) 10Volans: varnish upload: throttle very large image [puppet] - 10https://gerrit.wikimedia.org/r/957893 [10:32:27] (03PS2) 10Volans: varnish upload: throttle very large image [puppet] - 10https://gerrit.wikimedia.org/r/957893 [10:33:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] varnish upload: throttle very large image [puppet] - 10https://gerrit.wikimedia.org/r/957893 (owner: 10Volans) [10:36:14] (03CR) 10JMeybohm: [C: 03+1] miscweb: add static-codereview to wikikube (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [10:37:48] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:38:55] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: enable set auto-merge [puppet] - 
10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) [10:39:36] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) (owner: 10Arturo Borrero Gonzalez) [10:49:19] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:45] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:50:56] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: enable set auto-merge [puppet] - 10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) [10:51:39] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: enable set auto-merge [puppet] - 10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) [10:52:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Python3 fixes [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/957784 (https://phabricator.wikimedia.org/T332010) (owner: 10RLazarus) [10:52:33] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) (owner: 10Arturo Borrero Gonzalez) [10:52:55] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) (owner: 10Arturo Borrero Gonzalez) [10:53:39] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: enable set auto-merge [puppet] - 10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) [10:56:10] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1138.eqiad.wmnet with OS bullseye [10:56:30] (03CR) 10Arturo 
Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) (owner: 10Arturo Borrero Gonzalez) [11:01:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:01:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:04:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.110 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:04:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I'll also make a patch to add auto-merge to the puppetised sets" [puppet] - 10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) (owner: 10Arturo Borrero Gonzalez) [11:04:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:05:46] (03PS5) 10Giuseppe Lavagetto: Python3 fixes [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/957784 (https://phabricator.wikimedia.org/T332010) (owner: 10RLazarus) [11:06:18] (03PS1) 10Muehlenhoff: Add auto-merge to nftables sets where needed [puppet] - 10https://gerrit.wikimedia.org/r/957901 (https://phabricator.wikimedia.org/T346432) [11:07:07] (03CR) 10CI reject: [V: 04-1] Add auto-merge to nftables sets where needed [puppet] - 10https://gerrit.wikimedia.org/r/957901 (https://phabricator.wikimedia.org/T346432) (owner: 10Muehlenhoff) [11:08:09] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: route using NAT queries to legacy DNS recursors to the new [puppet] - 10https://gerrit.wikimedia.org/r/957902 (https://phabricator.wikimedia.org/T346426) [11:08:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: enable set auto-merge [puppet] - 
10https://gerrit.wikimedia.org/r/957895 (https://phabricator.wikimedia.org/T346432) (owner: 10Arturo Borrero Gonzalez) [11:08:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Python3 fixes [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/957784 (https://phabricator.wikimedia.org/T332010) (owner: 10RLazarus) [11:09:09] (03PS2) 10Muehlenhoff: Add auto-merge to nftables sets where needed [puppet] - 10https://gerrit.wikimedia.org/r/957901 (https://phabricator.wikimedia.org/T346432) [11:09:46] (03Merged) 10jenkins-bot: Python3 fixes [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/957784 (https://phabricator.wikimedia.org/T332010) (owner: 10RLazarus) [11:09:47] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: host reimage [11:11:03] RECOVERY - Check systemd state on cloudgw1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:51] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: host reimage [11:14:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:15:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add auto-merge to nftables sets where needed [puppet] - 10https://gerrit.wikimedia.org/r/957901 (https://phabricator.wikimedia.org/T346432) (owner: 10Muehlenhoff) [11:15:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.693 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:17:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957901 (https://phabricator.wikimedia.org/T346432) (owner: 10Muehlenhoff) [11:27:41] (03CR) 10JMeybohm: [C: 04-1] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (032 
comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [11:30:39] (03CR) 10JMeybohm: [C: 04-1] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [11:30:53] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Just best remove it at some stage" [puppet] - 10https://gerrit.wikimedia.org/r/957902 (https://phabricator.wikimedia.org/T346426) (owner: 10Arturo Borrero Gonzalez) [11:37:21] (03PS2) 10Jelto: miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) [11:37:48] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1138.eqiad.wmnet with OS bullseye [11:38:09] (03CR) 10CI reject: [V: 04-1] miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [11:39:20] (03PS3) 10Jelto: miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) [11:40:56] (03CR) 10CI reject: [V: 04-1] miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [11:41:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: route using NAT queries to legacy DNS recursors to the new [puppet] - 10https://gerrit.wikimedia.org/r/957902 (https://phabricator.wikimedia.org/T346426) (owner: 10Arturo Borrero Gonzalez) [11:45:00] (03PS1) 10Pikne: Fix white background for Wikibooks wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T338162) [11:46:35] (03PS4) 10Jelto: miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 
(https://phabricator.wikimedia.org/T346309) [11:47:02] (03CR) 10Cathal Mooney: [C: 03+2] Adjust hashing algo for QFX5000 series l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852) (owner: 10Cathal Mooney) [11:47:16] (03CR) 10CI reject: [V: 04-1] miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [11:47:19] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't restrict compat DNS NAT to VMs without floating IPs [puppet] - 10https://gerrit.wikimedia.org/r/957909 (https://phabricator.wikimedia.org/T346426) [11:47:35] (03Merged) 10jenkins-bot: Adjust hashing algo for QFX5000 series l3_switches [homer/public] - 10https://gerrit.wikimedia.org/r/957792 (https://phabricator.wikimedia.org/T339852) (owner: 10Cathal Mooney) [11:49:07] (03PS5) 10Jelto: miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) [11:49:14] (03CR) 10Cathal Mooney: [C: 03+1] cloudgw: don't restrict compat DNS NAT to VMs without floating IPs [puppet] - 10https://gerrit.wikimedia.org/r/957909 (https://phabricator.wikimedia.org/T346426) (owner: 10Arturo Borrero Gonzalez) [11:50:08] (03CR) 10Btullis: "I think I would be tempted to turn this into a chain of linked patches, rather than a single big-bang change." 
[puppet] - 10https://gerrit.wikimedia.org/r/957861 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [11:51:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: don't restrict compat DNS NAT to VMs without floating IPs [puppet] - 10https://gerrit.wikimedia.org/r/957909 (https://phabricator.wikimedia.org/T346426) (owner: 10Arturo Borrero Gonzalez) [11:53:38] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Switch to PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:54:04] <_joe_> !log updated etcd-mirror to 0.0.10 everywhere [11:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:50] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10JMeybohm) [11:56:05] 10SRE, 10serviceops: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10JMeybohm) 05Open→03Resolved Updated etcd-mirror package has been rolled out, resolving this again [11:56:16] (03PS6) 10Jelto: miscweb: add static-codereview to wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) [11:57:18] (03CR) 10Brouberol: [V: 03+1] Configure kafka-jumbo101[0-5].eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957861 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [11:57:55] (03CR) 10Muehlenhoff: [C: 03+2] Add auto-merge to nftables sets where needed [puppet] - 10https://gerrit.wikimedia.org/r/957901 (https://phabricator.wikimedia.org/T346432) (owner: 10Muehlenhoff) [11:59:04] (03PS1) 10Urbanecm: Link recommendations: prevent too large offsets in cirrus queries [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) [11:59:58] (03CR) 10JMeybohm: [C: 03+1] miscweb: add static-codereview 
to wikikube (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [12:01:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST issuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:02:42] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) >>! In T346042#9168336, @Brycehughes wrote: > @aborrero most (if not all) of the Toolforge tools are throwing 504's (T346126). Seems rel... [12:03:24] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10aborrero) 05Open→03Resolved a:03aborrero Thanks @dcaro for fixing the cluster! [12:05:39] (03CR) 10Btullis: "I would also look at doing PCC runs that covers more (if not all) kafka-jumbo consumers and producers as well. 
That would give you some pe" [puppet] - 10https://gerrit.wikimedia.org/r/957861 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [12:06:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST issuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:07:33] (03CR) 10Btullis: Configure kafka-jumbo101[0-5].eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957861 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [12:11:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST issuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:15:24] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) a:03Jclark-ctr hey @Jclark-ctr we are now ready to do this migration next week. Starting next monday 2023-09-18, when could you handle... [12:17:47] (03CR) 10CI reject: [V: 04-1] Link recommendations: prevent too large offsets in cirrus queries [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) (owner: 10Urbanecm) [12:23:03] (03PS1) 10Slyngshede: P:IDM Use package configuration for static files. 
[puppet] - 10https://gerrit.wikimedia.org/r/957915 [12:27:05] !log changing ECMP hashing algorithm on asw1-b12-drmrs T339852 [12:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:08] T339852: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 [12:27:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:28:02] (03PS2) 10Slyngshede: P:IDM Use package configuration for static files. [puppet] - 10https://gerrit.wikimedia.org/r/957915 [12:29:27] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43321/console" [puppet] - 10https://gerrit.wikimedia.org/r/957915 (owner: 10Slyngshede) [12:32:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:33:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/957915 (owner: 10Slyngshede) [12:38:36] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43322/console" [puppet] - 10https://gerrit.wikimedia.org/r/957915 (owner: 10Slyngshede) [12:41:56] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43323/console" [puppet] - 10https://gerrit.wikimedia.org/r/957915 (owner: 10Slyngshede) [12:44:38] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Use package configuration for static files. 
[puppet] - 10https://gerrit.wikimedia.org/r/957915 (owner: 10Slyngshede) [12:46:02] (03PS1) 10Brouberol: Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T346425) [12:46:04] (03PS1) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T346425) [12:46:06] (03PS1) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T346425) [12:46:08] (03PS1) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T346425) [12:46:10] (03PS1) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T346425) [12:46:12] (03PS1) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T346425) [12:46:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10aborrero) 05Open→03Resolved [12:46:29] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:46:32] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:46:45] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:46:55] (03Abandoned) 10Brouberol: Configure kafka-jumbo101[0-5].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957861 
(https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [12:47:01] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:47:03] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:47:22] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:48:15] btullis: the -1 comes from the fact that I've added a `Hosts: xxx.yyy.zzz` line in the commit. Could that line be in gerrit only, and not in the commit message? [12:49:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add /.well-known/apple-developer-merchantid-domain-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957744 (https://phabricator.wikimedia.org/T346055) (owner: 10Alexandros Kosiaris) [12:49:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (10cmooney) [12:50:19] (and that line is >100 chars because I'm testing w/ many hosts) [12:50:36] (03Merged) 10jenkins-bot: Add /.well-known/apple-developer-merchantid-domain-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957744 (https://phabricator.wikimedia.org/T346055) (owner: 10Alexandros Kosiaris) [12:50:49] !log changing ECMP hashing algorithm on drmrs, esams and cloud switches T339852 [12:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:54] T339852: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 [12:54:04] (03PS2) 10Brouberol: Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 
(https://phabricator.wikimedia.org/T346425) [12:54:06] (03PS2) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T346425) [12:54:08] (03PS2) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T346425) [12:54:11] (03PS2) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T346425) [12:54:14] 10SRE-OnFire, 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Review alerting around Wikidata Query Service update pipeline - https://phabricator.wikimedia.org/T336574 (10Gehel) [12:54:19] (03PS2) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T346425) [12:54:23] (03PS2) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T346425) [12:54:35] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:54:43] (03CR) 10jenkins-bot: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:54:47] (03CR) 10jenkins-bot: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:54:49] (03CR) 10Btullis: "You can use multiple Hosts lines in the footer and you can do some matching by role or other properties." 
[puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:54:55] (03CR) 10jenkins-bot: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:55:02] (03CR) 10jenkins-bot: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:55:22] (03CR) 10jenkins-bot: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:57:50] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [12:58:02] (03PS3) 10Brouberol: Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T346425) [12:58:04] (03PS3) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T346425) [12:58:06] (03PS3) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T346425) [12:58:08] (03PS3) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T346425) [12:58:10] (03PS3) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T346425) [12:58:12] (03PS3) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T346425) [12:58:28] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 
(https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:58:42] (03CR) 10jenkins-bot: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:58:44] (03CR) 10jenkins-bot: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:58:51] (03CR) 10jenkins-bot: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:59:12] (03CR) 10jenkins-bot: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:59:32] (03CR) 10jenkins-bot: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [12:59:39] (03CR) 10Brouberol: Configure kafka-jumbo1010.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T346425) (owner: 10Brouberol) [13:01:07] !log akosiaris@deploy1002 Synchronized docroot: (no justification provided) (duration: 08m 20s) [13:03:26] !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm-test1001.wikimedia.org with OS bookworm [13:03:36] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm [13:06:04] (03PS4) 10Brouberol: Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T346425) [13:06:06] (03PS4) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 
10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T346425) [13:06:08] (03PS4) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T346425) [13:06:10] (03PS4) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T346425) [13:06:12] (03PS4) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T346425) [13:06:14] (03PS4) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T346425) [13:08:59] (03PS2) 10Alexandros Kosiaris: donate: Move into dedicated docroot [puppet] - 10https://gerrit.wikimedia.org/r/957750 (https://phabricator.wikimedia.org/T346055) [13:10:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] donate: Move into dedicated docroot [puppet] - 10https://gerrit.wikimedia.org/r/957750 (https://phabricator.wikimedia.org/T346055) (owner: 10Alexandros Kosiaris) [13:13:10] RECOVERY - Check systemd state on netmon2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:33] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team: need private IPs for cloudvirt200[4-6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T346410 (10Andrew) Will this step need to be done by hand for all future cloudvirts, or is there a chance of this getting automated? 
[13:15:16] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/957848 (https://phabricator.wikimedia.org/T344136) (owner: 10Filippo Giunchedi) [13:15:21] (03PS5) 10Brouberol: Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T336041) [13:15:24] (03PS5) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) [13:15:26] (03PS5) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) [13:15:28] (03PS5) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) [13:15:30] (03PS5) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) [13:15:32] (03PS5) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) [13:15:51] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team: need private IPs for cloudvirt200[4-6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T346410 (10aborrero) >>! In T346410#9169848, @Andrew wrote: > Will this step need to be done by hand for all future cloudvirts, or is there a chance of this gettin... 
[13:16:58] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage [13:17:13] (03PS1) 10Cathal Mooney: Explicitly set hash mode type on QFX5100 devices for ECMP [homer/public] - 10https://gerrit.wikimedia.org/r/957925 (https://phabricator.wikimedia.org/T339852) [13:19:17] (03PS16) 10Slyngshede: Allow packing as a .deb [software/bitu] - 10https://gerrit.wikimedia.org/r/955160 [13:19:21] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2006-dev.codfw.wmnet with OS bookworm [13:19:23] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm-test1001.wikimedia.org with reason: host reimage [13:19:23] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2005-dev.codfw.wmnet with OS bookworm [13:19:24] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS bookworm [13:22:14] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [13:25:43] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (10cmooney) Added another patch above as on the QFX5100 you need to explicitly set the "hash mode" to layer2-payload (i.e. IP header), otherwise... 
[13:26:33] (03PS2) 10Filippo Giunchedi: librenms: refactor ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/957848 (https://phabricator.wikimedia.org/T344136) [13:26:40] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Migrate wikifeeds to mw-api-int - https://phabricator.wikimedia.org/T346447 (10Joe) [13:29:33] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Migrate wikifeeds to mw-api-int - https://phabricator.wikimedia.org/T346447 (10Joe) p:05Triage→03Medium [13:31:21] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [13:32:08] (03CR) 10Brouberol: "check_experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:33:31] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Migrate all eventgate installations to mw-api-int - https://phabricator.wikimedia.org/T346448 (10Joe) [13:33:49] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Migrate all eventgate installations to mw-api-int - https://phabricator.wikimedia.org/T346448 (10Joe) p:05Triage→03Medium a:05Clement_Goubert→03None [13:34:48] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Joe) [13:35:02] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Migrate wikifeeds to mw-api-int - https://phabricator.wikimedia.org/T346447 (10Joe) a:05Clement_Goubert→03None [13:35:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage [13:35:41] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [13:35:41] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime 
for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage [13:38:03] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage [13:38:31] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage [13:38:41] (03PS2) 10Filippo Giunchedi: prometheus: support setting owner/group in assemble-config [puppet] - 10https://gerrit.wikimedia.org/r/957850 (https://phabricator.wikimedia.org/T346335) [13:38:43] (03PS2) 10Filippo Giunchedi: prometheus: snmp-exporter support for assemble-config [puppet] - 10https://gerrit.wikimedia.org/r/957851 (https://phabricator.wikimedia.org/T346335) [13:38:45] (03PS2) 10Filippo Giunchedi: prometheus: use assemble-config for snmp-exporter [puppet] - 10https://gerrit.wikimedia.org/r/957852 (https://phabricator.wikimedia.org/T346335) [13:39:58] (03CR) 10Filippo Giunchedi: [C: 03+1] remove dispatch dns record [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [13:40:24] 10SRE, 10Wikimedia-Mailing-lists: Expose mailman3 internal REST API inside Wikimedia production network - https://phabricator.wikimedia.org/T279023 (10Aklapper) [13:40:37] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage [13:40:41] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: Remove temporary blackhole for wanobjectcache hex-like stats [puppet] - 10https://gerrit.wikimedia.org/r/957797 (https://phabricator.wikimedia.org/T178531) (owner: 10Krinkle) [13:41:18] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm-test1001.wikimedia.org with OS bookworm [13:41:28] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - 
https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm completed: - idm-t... [13:41:38] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org [13:43:39] (03PS12) 10Bking: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [13:44:06] (03CR) 10Filippo Giunchedi: [C: 03+2] Update RL alerts from performance-team-alerts@ to mediawiki-platform-team@ [puppet] - 10https://gerrit.wikimedia.org/r/957664 (https://phabricator.wikimedia.org/T345190) (owner: 10Hokwelum) [13:44:46] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:44:56] 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10Jhancock.wm) 05Open→03Resolved @Clement_Goubert this remained stable for a good while. You should be able to retool it now. Thank you for humoring my wait time. 
[13:45:48] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:46:11] (03CR) 10Filippo Giunchedi: [C: 03+1] dispatch::web: add ensure param and ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [13:46:52] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org [13:47:25] !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org [13:47:53] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:48:53] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:49:21] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [13:49:36] (03CR) 10Filippo Giunchedi: [C: 03+1] dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [13:51:18] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org [13:52:48] (03PS1) 10Giuseppe Lavagetto: eventgate: migrate to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/957933 (https://phabricator.wikimedia.org/T346448) [13:52:51] (03PS1) 10Giuseppe Lavagetto: eventstreams: migrate to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/957934 [13:52:53] (03PS1) 10Giuseppe Lavagetto: mw-api-int: increase replicas for movement of wikifeeds [deployment-charts] - 
10https://gerrit.wikimedia.org/r/957935 (https://phabricator.wikimedia.org/T346447) [13:52:57] (03PS1) 10Giuseppe Lavagetto: wikifeeds: add networkpolicy for egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/957936 (https://phabricator.wikimedia.org/T346447) [13:52:59] (03PS1) 10Giuseppe Lavagetto: wikifeeds: migrate to mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/957937 (https://phabricator.wikimedia.org/T346447) [13:55:27] (03PS1) 10Urbanecm: tests: Do not assume UTSysop exists [extensions/Flow] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957872 (https://phabricator.wikimedia.org/T346253) [13:55:44] (03CR) 10Urbanecm: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) (owner: 10Urbanecm) [13:57:31] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:58:07] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, 10SRE Observability (FY2023/2024-Q1): Alert "access port speed less 100mbit" and librenms upgrade - https://phabricator.wikimedia.org/T346317 (10fgiunchedi) Sweet, thank you @ayounsi ! 
[13:58:12] (03CR) 10Bking: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [13:59:59] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:05:42] !log repooling mw2444.codfw.wmnet - T345884 [14:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:46] T345884: mw2444 down - https://phabricator.wikimedia.org/T345884 [14:06:08] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw2444.codfw.wmnet [14:06:31] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:31] (03PS1) 10Slyngshede: P:IDM Ensure that logs are created with correct permissions. [puppet] - 10https://gerrit.wikimedia.org/r/957940 [14:06:38] 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10Clement_Goubert) Repooled, thank you! [14:08:02] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43324/console" [puppet] - 10https://gerrit.wikimedia.org/r/957940 (owner: 10Slyngshede) [14:09:36] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:41] (03CR) 10JMeybohm: [C: 03+1] "LGTM, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [14:10:24] (03PS1) 10David Caro: openstack: apply cli override cloud.yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/957942 [14:10:54] (03CR) 10CI reject: [V: 04-1] openstack: apply cli override cloud.yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [14:11:16] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [14:12:07] (03CR) 10Herron: [C: 03+1] librenms: fix permissions on logs and 'lnms' [puppet] - 10https://gerrit.wikimedia.org/r/957846 (https://phabricator.wikimedia.org/T344136) (owner: 10Filippo Giunchedi) [14:12:28] (03PS2) 10David Caro: openstack: apply cli override cloud.yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/957942 [14:12:52] (03PS4) 10Hashar: fix-staging-perms: set group name from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) [14:12:54] (03PS5) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) [14:12:56] (03PS6) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [14:13:13] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [14:13:19] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:13:23] (03CR) 10CI reject: [V: 04-1] scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:15:37] (03PS6) 10Hashar: scap3: stop defaulting deployment_group to 'wikidev' [puppet] - 10https://gerrit.wikimedia.org/r/927675 
(https://phabricator.wikimedia.org/T338205) [14:15:39] (03PS7) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [14:16:31] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:32] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43325/console" [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [14:18:27] (03CR) 10David Caro: [V: 03+1] "PCC looks good, I'll try to manually apply one patch to test the exec, feel free to review" [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [14:19:36] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:21:16] (03CR) 10Andrew Bogott: [C: 03+1] "I've never used a patch that touches multiple files like this but it should be harmless as long as it doesn't get upset about the test fil" [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [14:21:35] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:23:01] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:23:50] (03CR) 
10Hashar: "Rebasing to catchup with hieradata/role/common/gerrit.yaml being moved to hieradata/common/profile/gerrit.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [14:24:00] (03PS4) 10Hashar: gerrit: change scap user to gerrit-deploy [puppet] - 10https://gerrit.wikimedia.org/r/844998 (https://phabricator.wikimedia.org/T317412) [14:24:45] (03CR) 10Herron: [C: 03+1] "LGTM -- FWIW I think changing SyslogIdentifier=<%= @syslog_title %> in modules/systemd/templates/timer_service.erb would be a generalized " [puppet] - 10https://gerrit.wikimedia.org/r/957847 (https://phabricator.wikimedia.org/T344136) (owner: 10Filippo Giunchedi) [14:24:51] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2004-dev [14:25:10] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2004-dev [14:26:34] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2005-dev [14:26:52] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2005-dev [14:27:10] (03CR) 10Bking: [C: 03+2] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [14:27:28] !log aborrero@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt2006-dev [14:27:37] !log aborrero@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt2006-dev [14:28:27] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (10aborrero) [14:29:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudvirt2004-dev.codfw.wmnet with OS bookworm [14:30:43] (03CR) 10Btullis: [C: 03+1] Remove mention of an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/957862 (https://phabricator.wikimedia.org/T329363) (owner: 10Stevemunene) [14:31:21] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2005-dev.codfw.wmnet with OS bookworm [14:32:26] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:32:36] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:33:07] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2006-dev.codfw.wmnet with OS bookworm [14:33:38] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:33:51] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:34:03] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:35:20] !log rolling Cassandra restart, RESTBase/eqiad/row-D — T331713 [14:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:25] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [14:35:36] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[18,25-27,33].eqiad.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [14:36:54] (03PS3) 10David Caro: openstack: apply the patch to override cloud.yaml on the cli [puppet] - 10https://gerrit.wikimedia.org/r/957942 [14:37:05] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, 10User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (10cmooney) Thanks for the task @aborrero Yeah the goal here will be to extend the [[... [14:38:32] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[14:38:36] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:43:16] (03PS4) 10David Caro: openstack: apply the patch to override cloud.yaml on the cli [puppet] - 10https://gerrit.wikimedia.org/r/957942 [14:44:37] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43328/console" [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [14:47:25] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:49] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:55] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:51:27] (03PS1) 10Bking: dse-k8s: Add egress rules for flink-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/957951 (https://phabricator.wikimedia.org/T344614) [14:51:45] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:52:03] (03CR) 10DCausse: [C: 03+1] dse-k8s: Add egress rules for flink-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/957951 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [14:52:41] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T346450 (10phaultfinder) [14:54:43] (03CR) 10Bking: [C: 03+2] dse-k8s: Add egress rules for flink-operator [deployment-charts] - 
10https://gerrit.wikimedia.org/r/957951 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [14:54:47] (03CR) 10Ladsgroup: mariadb: Add grants for testreduce1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup) [14:56:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10VRiley-WMF) > kubernetes1036 > kubernetes1038 > kubernetes1047 kubernetes1036: Verified that is has correct Serial/Port/CableID. Testing new CableID/Port Old: kubernetes1036... [14:57:34] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:58:19] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:59:12] (03CR) 10Subramanya Sastry: mariadb: Add grants for testreduce1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup) [15:00:39] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:02:03] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:03:44] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:03:51] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:10:02] (03CR) 10David Caro: [V: 03+1] "Verified now that the execs work manually, will merge on monday" [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [15:18:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at 
eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:24:26] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[18,25-27,33].eqiad.wmnet: Maybe pickup missed topology changes — T331713 - eevans@cumin1001 [15:24:40] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [15:28:42] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T346450 (10phaultfinder) [15:30:07] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:32:21] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old esams ranges and includes - cmooney@cumin1001" [15:33:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old esams ranges and includes - cmooney@cumin1001" [15:33:53] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:39:11] (03PS1) 10Clément Goubert: mw-api-int: Raise replicas to cope with increased rps [deployment-charts] - 10https://gerrit.wikimedia.org/r/957954 [15:39:39] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:40:24] (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: Raise replicas to cope with increased rps [deployment-charts] - 10https://gerrit.wikimedia.org/r/957954 (owner: 10Clément Goubert) [15:41:24] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old esams ranges and includes - cmooney@cumin1001" [15:41:25] (03Merged) 10jenkins-bot: mw-api-int: Raise replicas to cope with increased rps [deployment-charts] - 10https://gerrit.wikimedia.org/r/957954 (owner: 10Clément Goubert) [15:42:14] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply 
[15:42:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old esams ranges and includes - cmooney@cumin1001" [15:42:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:42:48] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:43:00] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:43:53] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [15:44:10] (03PS3) 10Andrew Bogott: add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) [15:46:53] (03PS1) 10DCausse: rdf-streaming-updater: reduce concurrency from the dse-k8s test [deployment-charts] - 10https://gerrit.wikimedia.org/r/957955 [15:47:16] (03PS1) 10Clément Goubert: mw-api-int: Raise replicas to cope with increased rps [deployment-charts] - 10https://gerrit.wikimedia.org/r/957956 [15:48:45] (03PS1) 10Cathal Mooney: Add include for new netbox added file for cr1-eqiad to cr2-esams link [dns] - 10https://gerrit.wikimedia.org/r/957957 (https://phabricator.wikimedia.org/T346421) [15:48:54] (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: Raise replicas to cope with increased rps [deployment-charts] - 10https://gerrit.wikimedia.org/r/957956 (owner: 10Clément Goubert) [15:49:05] (03CR) 10Bking: [C: 03+1] rdf-streaming-updater: reduce concurrency from the dse-k8s test [deployment-charts] - 10https://gerrit.wikimedia.org/r/957955 (owner: 10DCausse) [15:49:23] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: reduce concurrency from the dse-k8s test [deployment-charts] - 10https://gerrit.wikimedia.org/r/957955 (owner: 10DCausse) [15:49:39] (03Merged) 10jenkins-bot: mw-api-int: Raise replicas to cope with increased rps [deployment-charts] - 10https://gerrit.wikimedia.org/r/957956 (owner: 
10Clément Goubert) [15:50:10] (03Merged) 10jenkins-bot: rdf-streaming-updater: reduce concurrency from the dse-k8s test [deployment-charts] - 10https://gerrit.wikimedia.org/r/957955 (owner: 10DCausse) [15:50:20] !log raising mw-api-int replicas to 12+2 to cope with wdqs backfill [15:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:25] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:50:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:50:48] (03CR) 10Cathal Mooney: [C: 03+2] Add include for new netbox added file for cr1-eqiad to cr2-esams link [dns] - 10https://gerrit.wikimedia.org/r/957957 (https://phabricator.wikimedia.org/T346421) (owner: 10Cathal Mooney) [15:50:59] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:51:34] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:51:34] (03CR) 10CI reject: [V: 04-1] add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [15:51:40] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [15:53:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:54:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Renumber esams-eqiad GTT link - https://phabricator.wikimedia.org/T346421 (10cmooney) 05Open→03Resolved a:03cmooney [15:57:13] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/957847 (https://phabricator.wikimedia.org/T344136) (owner: 10Filippo Giunchedi) [15:59:13] (03PS4) 10FNegri: add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [16:06:43] (03CR) 10CI reject: [V: 04-1] add domain param to openstack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [16:08:47] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/957848 (https://phabricator.wikimedia.org/T344136) (owner: 10Filippo Giunchedi) [16:09:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [16:12:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:12:32] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2002-dev.codfw.wmnet with OS bookworm [16:12:33] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt2003-dev.codfw.wmnet with OS bookworm [16:17:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:18:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid 
GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:22:31] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:28:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:32:11] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [16:32:19] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [16:33:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:35:14] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2002-dev.codfw.wmnet with reason: host reimage [16:36:34] (CirrusSearchJobQueueLagTooHigh) firing: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 9h 3m 28s - TODO - 
https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [16:37:31] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2003-dev.codfw.wmnet with reason: host reimage [16:42:38] (03PS2) 10Jdlrobson: Fix white background for Wikibooks wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957908 (https://phabricator.wikimedia.org/T341251) (owner: 10Pikne) [16:44:27] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:44:36] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:40] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:46:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:47:54] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:49:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[16:51:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:51:34] (CirrusSearchJobQueueLagTooHigh) resolved: CirrusSearch job cirrusSearchLinksUpdate lag is too high: 6h 26m 35s - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueLagTooHigh [16:53:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:56:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:56:44] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:58:00] (03CR) 10Herron: [V: 03+1] dispatch::web: add ensure param and ensure => absent (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [16:58:04] RECOVERY - restbase endpoints 
health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:58:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:06:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:08:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2003-dev.codfw.wmnet with OS bookworm [17:11:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:11:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2002-dev.codfw.wmnet with OS bookworm [17:23:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [17:24:05] !ack 4048 [17:24:05] 4048 (ACKED) HaproxyUnavailable cache_text global sre () [17:24:25] Looking at it. 
^ [17:25:24] Availability for text and upload seems to be recovering. [17:27:25] (03Abandoned) 10Bking: Elastic: Use OS major version for GC flags [puppet] - 10https://gerrit.wikimedia.org/r/791050 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [17:28:33] (03Abandoned) 10Bking: query_service: Allow query hosts to rsync data from clouddumps [puppet] - 10https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: 10Bking) [17:28:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [17:33:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:34:01] !incidents [17:34:01] 4048 (RESOLVED) HaproxyUnavailable cache_text global sre () [17:34:01] 4046 (RESOLVED) [3x] Primary outbound port utilisation over 80% (paged) global noc () [17:34:02] 4047 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr3-eqsin.wikimedia.org) [17:38:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:40:38] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve 
announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:00] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:43:58] !log initiate Cassandra bootstrap, restbase1030-a — T331713 [17:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:01] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [17:47:45] (03PS6) 10Herron: dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) [17:50:06] (03CR) 10Herron: dispatch: remove puppetization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [17:51:20] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:51:28] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:52:46] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:52:54] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:56:01] !log stopping Cassandra bootstrap, restbase1030-a — T331713 [17:56:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:05] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [17:58:53] (03PS1) 10Bking: rdf-streaming-updater: start adding per-env ZK path root [deployment-charts] - 10https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) [18:12:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:17:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:21:31] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:35:21] (03PS9) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [18:37:38] (03CR) 10CI reject: [V: 04-1] dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [18:38:14] (03PS3) 10Dduvall: gitlab: Fix Gemfile.local permissions and use absolute path in gitlab.rb [puppet] - 10https://gerrit.wikimedia.org/r/957803 
(https://phabricator.wikimedia.org/T337570) [18:39:14] (03PS10) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [18:42:27] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [18:42:30] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:45:18] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:48:30] (03PS11) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [18:48:46] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:50:10] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:01] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [18:57:40] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:59:06] RECOVERY - 
restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:59:27] (03CR) 10Jcrespo: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [19:23:16] (03CR) 10Milimetric: [C: 03+2] Fix typo in Jade content type name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: 10Ladsgroup) [19:23:59] (03Merged) 10jenkins-bot: Fix typo in Jade content type name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: 10Ladsgroup) [19:37:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Milimetric) approved! [19:38:07] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:43:07] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:54:02] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:55:26] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:56:52] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:57:24] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:58:16] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:58:48] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:17:50] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:20:40] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:34:03] (CR) Hashar: "I am cleaning my Gerrit dashboard, feel free to add me back as a reviewer when you resume work on this change :)" [puppet] - https://gerrit.wikimedia.org/r/929681 (https://phabricator.wikimedia.org/T182641) (owner: Jbond) [20:41:26] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200)
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:42:50] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:16] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:40] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:06] SRE, Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (Aklapper) I assume this task is superseded by the project tag #bitu and can be closed as invalid? [20:58:39] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye [20:58:46] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye [20:58:47] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1036.eqiad.wmnet with OS bullseye [20:58:48] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye [20:58:53] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[20:58:56] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye [20:58:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1038.eqiad.wmnet with OS bullseye [20:59:02] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... [20:59:05] !log removing 6 files for legal compliance [20:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:24] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye [20:59:32] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye [20:59:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1036.eqiad.wmnet with OS bullseye [20:59:38] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[21:03:10] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye [21:03:17] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1047.eqiad.wmnet with OS bullseye [21:03:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1047.eqiad.wmnet with OS bullseye [21:03:24] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1047.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... [21:23:02] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/957846 (https://phabricator.wikimedia.org/T344136) (owner: Filippo Giunchedi) [21:30:46] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (Ahoelzl) Thanks for the feedback. @Aklapper I connected my Wikimedia Developer (LDAP) account with the Phabricator account and subsequently removed the personal MediaWiki a...
[21:37:52] (PS1) Ejegg: Allow FundraiseUp scripts in Donatewiki CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) [21:38:30] (CR) CI reject: [V: -1] Allow FundraiseUp scripts in Donatewiki CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: Ejegg) [21:39:58] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (Ahoelzl) **Update:** "AHoelzl-WMF" is not disabled, I can log in here: https://office.wikimedia.org/wiki/User:AHoelzl-WMF https://phabricator.wikimedia.org/settings/user/Ah... [21:40:45] (PS2) Ejegg: Allow FundraiseUp scripts in Donatewiki CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) [21:41:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:46:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:48:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:48:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:50:16] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10
seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:55:44] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:56:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:57:10] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:21:31] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:30:48] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:32:12] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:49:50] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:51:16] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:46:48] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve 
announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:48:12] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase