[00:20:16] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:21:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.305 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:46] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954373
[00:38:52] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954373 (owner: TrainBranchBot)
[00:54:22] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/954373 (owner: TrainBranchBot)
[01:04:23] SRE-swift-storage, Commons, Internet-Archive: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (AntiCompositeNumber)
[01:50:22] RECOVERY - snapshot of s6 in eqiad on backupmon1001 is OK: Last snapshot for s6 at eqiad (db1225) taken on 2023-09-04 01:06:36 (505 GiB, +0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:08:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:49:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:51:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:56:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 8.918 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:00:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:01:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:05:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.336 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:10:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:12:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:12:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.220 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:17:42] (PS2) KartikMistry: Update MinT to 2023-08-31-061147-production [deployment-charts] - https://gerrit.wikimedia.org/r/954005 (https://phabricator.wikimedia.org/T336683)
[05:19:15] (PS2) KartikMistry: Enable Section and Content Translation in 7 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/953756 (https://phabricator.wikimedia.org/T343211)
[05:48:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:49:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:54:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 5.935 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:58:48] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:59:50] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:08:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:09:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:11:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.908 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:12:10] !log push new pfw policies - T345288
[06:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:17:50] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:19:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:20:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.015 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:20:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 4.359 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:25:58] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:26:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:26:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:27:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 3.332 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:27:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:28:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.359 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:33:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:43:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:43:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:45:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:48:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:49:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.366 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:52:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:53:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:54:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.316 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:55:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:56:28] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:56:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 3.912 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:06] Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T0700).
[07:00:06] Aca, kart_, and aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:15] * Aca waves
[07:00:19] * kart_ is here
[07:00:50] * o/
[07:00:56] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:28] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:01:28] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:01:29] Huh, that seems like a good incentive to break a wiki. /s
[07:02:41] Please ping me when Aca's patches are deployed.
[07:02:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:03:42] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:04:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 2.054 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:08:27] (CR) Deni: [C: +1] "Approved by sh.wiki community." [mediawiki-config] - https://gerrit.wikimedia.org/r/954240 (https://phabricator.wikimedia.org/T345513) (owner: Acamicamacaraca)
[07:09:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:13:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:14:05] (CR) Ayounsi: [C: +2] infra_devices: remove parents for multihomed devices [puppet] - https://gerrit.wikimedia.org/r/954278 (https://phabricator.wikimedia.org/T329272) (owner: Ayounsi)
[07:16:54] Anyone deploying config patches? Amir1 taavi urbanecm?
[07:17:14] umm seems like they are not here :O
[07:17:17] Good question indeed.
[07:18:17] :/
[07:19:02] It's Monday, rightfully so :')
[07:19:07] I can deploy but I have limited time today. If no one is around, please reschedule your patches for the next backport/config window.
[07:19:19] Aca: Agree :)
[07:21:18] Okiee. I'll be around, so if anyone else wants to deploy, please ping me. Otherwise, I'll reschedule my patches.
[07:22:49] !log failover ganeti masters in drmrs to ganeti6001/ganeti6002
[07:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:46] Let me deploy at least my patch :)
[07:26:26] PROBLEM - ganeti-wconfd running on ganeti6003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[07:26:44] PROBLEM - ganeti-wconfd running on ganeti6004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[07:28:44] kart_: Is it a regular occurrence that deployers are not showing up?
[07:29:05] imdeni: not really.
[07:29:49] (I have an issue logging in to the deployment server), so I also have to withdraw my patch.
[07:29:58] Hmm. I just find it odd that three people have signed up and no-one is here.
[07:31:58] kart_: Do you know who is responsible for putting together the schedule?
[07:32:56] logmsgbot
[07:33:01] err
[07:34:21] imdeni: Ah. Today is 'No deploy' day (US holiday)
[07:34:52] kart_: Oh, Labor Day. Where do you see this?
[07:35:35] !log restart tcpircbot-logmsgbot on alert1001
[07:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:19] I see it. https://wikitech.wikimedia.org/wiki/Deployments/Yearly_calendar
[07:37:13] Damn, didn't know that
[07:37:33] rescheduling
[07:39:13] This should really be fixed; I think between us we are in 3-4 different timezones.
[07:39:54] Tagging @thcipriani
[07:40:35] It would be great if you could take a look at changing the bot to account for this when you get a chance.
[07:43:02] (CR) Muehlenhoff: [C: +2] Remove obsolete OS conditionals [puppet] - https://gerrit.wikimedia.org/r/954279 (owner: Muehlenhoff)
[07:45:27] (CR) Muehlenhoff: [C: +2] prometheus mysqld_exporters: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/954220 (owner: Muehlenhoff)
[07:46:54] (CR) Muehlenhoff: [C: +2] Simplify IPMI check [puppet] - https://gerrit.wikimedia.org/r/954257 (owner: Muehlenhoff)
[07:47:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:47:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:47:43] (PS1) Elukey: Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394)
[07:47:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:49:20] (PS2) Elukey: Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394)
[07:50:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:50:51] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 6.850 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:51:02] (PS3) Elukey: Add new OAuth Rate Limiter tier for Wiki Education [mediawiki-config] - https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394)
[07:51:54] (CR) JMeybohm: [C: +1] etcd: Remove obsolete file [puppet] - https://gerrit.wikimedia.org/r/954276 (owner: Muehlenhoff)
[07:52:14] (PS1) Muehlenhoff: mariadb::packages_client: Remove obsolete OS check [puppet] - https://gerrit.wikimedia.org/r/954594
[07:53:26] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/954594 (owner: Muehlenhoff)
[07:54:31] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:54:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:55:00] (Abandoned) JMeybohm: confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341669) (owner: JMeybohm)
[07:55:41] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:56:01] (CR) Filippo Giunchedi: [C: +2] alertmanager: re-add space to irc messages [puppet] - https://gerrit.wikimedia.org/r/954355 (owner: Majavah)
[07:56:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 3.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:56:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.509 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:23] (CR) Muehlenhoff: [C: +2] etcd: Remove obsolete file [puppet] - https://gerrit.wikimedia.org/r/954276 (owner: Muehlenhoff)
[07:57:34] (PS2) Muehlenhoff: etcd: Remove obsolete file [puppet] - https://gerrit.wikimedia.org/r/954276
[07:59:08] (PS2) Muehlenhoff: lxc: Remove obsolete files [puppet] - https://gerrit.wikimedia.org/r/954274
[07:59:08] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter2004.codfw.wmnet
[07:59:45] (CR) JMeybohm: [C: +1] "Thanks for doing this!" [puppet] - https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: Jbond)
[08:00:17] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter2004.codfw.wmnet
[08:00:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:00:26] (CR) Muehlenhoff: [C: +2] lxc: Remove obsolete files [puppet] - https://gerrit.wikimedia.org/r/954274 (owner: Muehlenhoff)
[08:00:57] !log restart kubelet on ml-serve1002 to check if stale prometheus metrics are the cause of the stop_container alert
[08:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:12] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2001.codfw.wmnet
[08:03:33] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet
[08:04:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.423 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:08:40] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast4004.wikimedia.org
[08:09:12] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw
[08:10:06] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet
[08:10:24] (PS1) Muehlenhoff: bastion: Update canary in Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/954595
[08:10:42] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet
[08:11:20] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1002.eqiad.wmnet
[08:13:02] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:13:23] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode1001.eqiad.wmnet
[08:14:02] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet
[08:14:25] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw
[08:14:34] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad
[08:14:48] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet
[08:14:54] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter2003.codfw.wmnet
[08:15:25] (CR) Muehlenhoff: [C: +2] bastion: Update canary in Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/954595 (owner: Muehlenhoff)
[08:15:43] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster2001.codfw.wmnet
[08:15:43] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:15:54] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster2002.codfw.wmnet
[08:17:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast4004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:17:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:17:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast4004.wikimedia.org
[08:17:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode1001.eqiad.wmnet
[08:17:49] SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast4004.wikimedia.org` - bast4004.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertmanager - F...
[08:18:09] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host dragonfly-supernode2001.codfw.wmnet
[08:18:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:18:28] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast5003.wikimedia.org
[08:18:44] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet
[08:19:05] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad
[08:19:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:22:52] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:23:08] PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:23:33] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubestagemaster1002.eqiad.wmnet
[08:23:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dragonfly-supernode2001.codfw.wmnet
[08:24:28] PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:25:11] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw
[08:27:13] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:28:22] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kubestagemaster2002.codfw.wmnet
[08:30:46] RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:56] RECOVERY - Check whether ferm is active by checking the default input chain on kubestagemaster1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:31:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5003.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:31:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:31:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast5003.wikimedia.org
[08:31:18] SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast5003.wikimedia.org` - bast5003.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertmanager - F...
[08:31:32] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad
[08:31:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:33:40] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry1003.eqiad.wmnet
[08:33:44] !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-eqiad
[08:34:50] !log rename "ens5" to "ens13" on orespoolcounter2003's /etc/network/interfaces after a VM reboot
[08:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:55] lovely --^
[08:35:35] (CR) Arturo Borrero Gonzalez: [C: +2] cloudlb: haproxy: mysql: expose tcp port to cloud-private networks only [puppet] - https://gerrit.wikimedia.org/r/954317 (https://phabricator.wikimedia.org/T345240) (owner: Arturo Borrero Gonzalez)
[08:36:36] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast6002.wikimedia.org
[08:37:01] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad
[08:38:05] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1003.eqiad.wmnet
[08:38:29] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry1004.eqiad.wmnet
[08:39:09] !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-eqiad
[08:41:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:41:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter2003.codfw.wmnet
[08:41:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:42:29] SRE, ops-eqiad, DC-Ops, Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (aborrero) We are already working at service level with this box. We should coordinate reimage/reboots etc.
[08:43:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry1004.eqiad.wmnet
[08:44:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast6002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast6002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:45:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:45:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast6002.wikimedia.org
[08:45:21] SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast6002.wikimedia.org` - bast6002.wikimedia.org (**PASS**) - Downtimed host on Icinga/Alertmanager - F...
[08:45:56] !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw
[08:46:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:46:49] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry2003.codfw.wmnet
[08:46:57] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter1004.eqiad.wmnet
[08:49:56] SRE-swift-storage, Commons: File not found on commons - https://phabricator.wikimedia.org/T345522 (Aklapper) [Unrelated to MediaWiki software code but about file storage on Wikimedia server and thumbnails]
[08:50:16] (CR) Ladsgroup: "Generally looks fine, I think we need to keep ServiceOps in the loop." [mediawiki-config] - https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: Elukey)
[08:51:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2003.codfw.wmnet
[08:51:46] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host registry2004.codfw.wmnet
[08:53:59] (PS1) Muehlenhoff: Remove bast4004/bast5003/bast6002 [puppet] - https://gerrit.wikimedia.org/r/954597 (https://phabricator.wikimedia.org/T343515)
[08:54:13] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (MoritzMuehlenhoff)
[08:56:48] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host registry2004.codfw.wmnet
[08:56:52] (CR) Muehlenhoff: [C: +2] Remove bast4004/bast5003/bast6002 [puppet] - https://gerrit.wikimedia.org/r/954597 (https://phabricator.wikimedia.org/T343515) (owner: Muehlenhoff)
[08:57:25] !log rename "ens5" to "ens13" on orespoolcounter1004's /etc/network/interfaces after a VM reboot [08:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter1004.eqiad.wmnet [09:04:30] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter1003.eqiad.wmnet [09:07:11] (03CR) 10Elukey: Add new OAuth Rate Limiter tier for Wiki Education (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: 10Elukey) [09:09:11] !log rename "ens5" to "ens13" on orespoolcounter1003's /etc/network/interfaces after a VM reboot [09:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter1003.eqiad.wmnet [09:13:48] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [09:14:41] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1129.eqiad.wmnet with OS bullseye [09:17:13] (03PS1) 10JMeybohm: AQS2: Lower replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954600 [09:19:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot) [09:21:03] (03CR) 10Hnowlan: [C: 03+2] AQS2: Lower replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954600 (owner: 10JMeybohm) [09:22:04] (03Merged) 10jenkins-bot: AQS2: Lower replicas in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954600 (owner: 10JMeybohm) [09:25:07] (03CR) 10Alexandros Kosiaris: "Removing +2, digging a bit more into the history chain and currently deployed version, this is apparently already done" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot) [09:27:29] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [09:27:45] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [09:27:55] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1129.eqiad.wmnet with reason: host reimage [09:28:34] !log deploying mathoid to bump service mesh envoy version to 1.23.10-2-s2. No changes to the app. [09:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:07] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [09:29:07] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1129.eqiad.wmnet with reason: host reimage [09:29:50] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [09:30:04] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [09:30:20] 10ops-codfw, 10Data-Platform-SRE: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542 (10Vgutierrez) [09:32:50] mmhh datahub-mae-consumer on kubestage is spamming logstash hard, can we do sth about it ? [09:33:04] https://logstash.wikimedia.org/goto/ec41fa68ba494813fa14be88c941b807 [09:33:18] akosiaris jayme ^ ? [09:33:35] if that's new then it's probably related to me rebooting the cluster nodes [09:33:46] else cc btullis [09:33:55] I'd say so too, seems to have started with the reboots [09:34:04] around 8:38 [09:34:04] !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad [09:34:26] Looking now. Also stevemunene will probably want to know. 
[09:34:29] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [09:34:31] wow...really quite noisy [09:34:44] yeah, maxing out logstash :( [09:35:00] We can kill the pod. It's not ingesting anything. [09:35:09] should probably be fixed in some way still [09:35:54] if killing it/some other datahub component leads to this, that is problematic [09:36:18] Agreed. I was only speaking about short term management. [09:36:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I am removing my -1 and switch to +2 per comment. I also remove physikerwelt's -1 to let CI procced." [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot) [09:36:54] +1 to kill the pod as a mitigation, also +1 to decode the stack trace to see what's wrong [09:37:31] (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot) [09:37:52] jayme: what's the recommended way to stop datahub-mae-consumer in this case ? [09:37:55] ack. I've killed the pod [09:38:15] godog: no idea. I did "kubectl -n datahub delete po datahub-mae-consumer-main-ff6cb7484-pjsqd" [09:38:25] !log ladsgroup@deploy1002 ladsgroup: Add CP secret synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [09:39:04] jayme: ack, I'll check if the new pod still spams, I'd imagine it does though [09:39:16] Confused: I was running `btullis@deploy1002:~$ kubectl logs -f datahub-mae-consumer-main-ff6cb7484-pjsqd datahub-mae-consumer-main` and I didn't see a lot of logs. The latest I saw was at 09:25. Will check logstash. [09:39:43] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [09:40:22] btullis: see also the link I posted above in case you missed [09:40:28] "you missed it" even [09:40:34] Thanks. Just found and clicked. 
[09:40:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merging to not break the relation chain (and it's a bit easier than the manual rebase the dependent commit requires)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot) [09:40:47] (03CR) 10CI reject: [V: 04-1] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot) [09:41:11] btullis: would you take care of this? That would be nice [09:41:36] (03PS3) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot) [09:41:43] jayme: Yes I will. [09:41:45] reboots are done, to there should be no out band po deletions [09:41:48] cool, thanks! [09:42:04] thank you jayme btullis ! appreciate it [09:42:08] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [09:42:50] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [09:42:54] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1129.eqiad.wmnet with OS bullseye [09:43:01] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [09:43:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:43:46] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [09:44:15] (03PS3) 10Alexandros Kosiaris: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot) [09:44:30] (03CR) 10Hashar: [C: 04-1] taskgen: update for tox 4 syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954297 (https://phabricator.wikimedia.org/T345152) (owner: 10Majavah) [09:44:45] !log 
aborrero@cumin1001 START - Cookbook sre.dns.netbox [09:45:24] !log ladsgroup@deploy1002 Synchronized private/PrivateSettings.php: Add CP secret (duration: 15m 47s) [09:45:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot) [09:45:54] then /go jayme [09:46:24] (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot) [09:47:33] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001" [09:47:45] !log disable-puppet fleet wide "deploy confd change gerrit:954007" [09:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:02] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [09:48:18] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [09:48:25] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1006 - aborrero@cumin1001" [09:48:25] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:48:34] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [09:48:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:49:00] !log T345290. 
Update mathoid to 2023-05-13-192519-production [09:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:02] T345290: Deploy a more recent version of Mathoid to production than 2023-02-21 - https://phabricator.wikimedia.org/T345290 [09:49:07] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [09:49:20] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [09:49:22] (03CR) 10Jbond: [C: 03+2] confd: -prefix from confd cli to confd::file instances [puppet] - 10https://gerrit.wikimedia.org/r/954007 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [09:50:00] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [09:51:26] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: drop -next suffix from ns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/954605 (https://phabricator.wikimedia.org/T342621) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T1000) [10:01:38] 10SRE-swift-storage, 10Commons: File not found on commons - https://phabricator.wikimedia.org/T345522 (10MatthewVernon) @Shizhao the file looks OK to me, what's the problem with this image, please? [10:03:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: drop -next suffix from ns-recursor [puppet] - 10https://gerrit.wikimedia.org/r/954605 (https://phabricator.wikimedia.org/T342621) (owner: 10Arturo Borrero Gonzalez) [10:04:42] (03PS1) 10Elukey: ml-services: tune autoscaling for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954608 (https://phabricator.wikimedia.org/T344058) [10:05:31] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) @ppenloglou please let us know if you need help submitting a new SSH key for the production environment. 
Otherwise we will close this task [10:05:43] 10Puppet, 10SRE: run-puppet-agent --quiet fails - https://phabricator.wikimedia.org/T345548 (10Volans) p:05Triage→03High [10:06:20] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) p:05Triage→03Medium [10:09:30] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10ppenloglou) Dear @Ladsgroup, Thanks for letting me know about the misuse of my ssh key. Could you guide in the right direction for the following? Currently, I would like to be a able to:... [10:11:47] (03PS1) 10Volans: run-puppet-agent: fails with --quiet [puppet] - 10https://gerrit.wikimedia.org/r/954609 (https://phabricator.wikimedia.org/T345548) [10:12:02] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10BTullis) 05Open→03Resolved Apologies again for the delay @OSefu-WMF - As mentioned, I'll carry on investigating the missin... [10:12:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Volans) >>! In T342534#9137952, @Papaul wrote: > @jbond @Volans on 2027 - 2029 > puppet is failing with > ` > ----- OUTPUT of 'run-puppet-agent --quiet' -----... [10:13:01] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) @ppenloglou that's right. as stated in https://wikitech.wikimedia.org/wiki/People.wikimedia.org people.wm.o is part of the production environment and the SSH key can't be shar... 
[10:13:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6003.drmrs.wmnet [10:13:28] 10Puppet, 10SRE, 10Patch-For-Review: run-puppet-agent --quiet fails - https://phabricator.wikimedia.org/T345548 (10Volans) [10:14:55] (03PS1) 10Jbond: confd: Only notify the current instance [puppet] - 10https://gerrit.wikimedia.org/r/954610 (https://phabricator.wikimedia.org/T341669) [10:16:51] (03PS2) 10Majavah: openstack: Remove a bunch of Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294) [10:18:03] (03CR) 10Jbond: [C: 03+1] "lgtm Doh!" [puppet] - 10https://gerrit.wikimedia.org/r/954609 (https://phabricator.wikimedia.org/T345548) (owner: 10Volans) [10:18:52] (03CR) 10Volans: [C: 03+2] run-puppet-agent: fails with --quiet [puppet] - 10https://gerrit.wikimedia.org/r/954609 (https://phabricator.wikimedia.org/T345548) (owner: 10Volans) [10:19:08] (03CR) 10AikoChou: [C: 03+1] ml-services: tune autoscaling for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954608 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [10:19:11] (03CR) 10Majavah: [C: 03+2] openstack: Remove a bunch of Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/954102 (https://phabricator.wikimedia.org/T345294) (owner: 10Majavah) [10:19:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954609 (https://phabricator.wikimedia.org/T345548) (owner: 10Volans) [10:20:00] (03CR) 10Elukey: [C: 03+2] ml-services: tune autoscaling for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/954608 (https://phabricator.wikimedia.org/T344058) (owner: 10Elukey) [10:20:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-admins for cjming - https://phabricator.wikimedia.org/T345455 (10Vgutierrez) 05Open→03Stalled p:05Triage→03Medium a:03Vgutierrez Waiting for OOB validation [10:20:33] (03CR) 10Jbond: [V: 03+1] "PCC 
SUCCESS (CORE_DIFF 6 NOOP 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43128/console" [puppet] - 10https://gerrit.wikimedia.org/r/954610 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [10:20:58] (03CR) 10JMeybohm: [C: 03+1] confd: Only notify the current instance [puppet] - 10https://gerrit.wikimedia.org/r/954610 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [10:21:14] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs::kubeadm: remove version defaults [puppet] - 10https://gerrit.wikimedia.org/r/953577 (owner: 10Majavah) [10:21:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] confd: Only notify the current instance [puppet] - 10https://gerrit.wikimedia.org/r/954610 (https://phabricator.wikimedia.org/T341669) (owner: 10Jbond) [10:22:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [10:24:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet [10:25:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6003.drmrs.wmnet [10:29:05] !log enable-puppet fleet wide post "deploy confd change gerrit:954007" [10:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:18] (03PS1) 10Muehlenhoff: Fix use of more than one src/dst sets [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) [10:31:13] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10ppenloglou) Thank you @Vgutierrez for your reply, now it makes sense. I've created a new SSH key locally saved as "id_ed25519_wmprod.pub" so I can tell them apart. And it is: **ssh-ed25... 
[10:31:40] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila) [10:33:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:11] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTBase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) [10:34:21] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:34:24] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTBase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) 05Open→03Resolved [10:39:37] 10SRE-swift-storage, 10Commons, 10Internet-Archive: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10MatthewVernon) It's a little difficult to see what might have happened here, since you've overwritten the "original" path. The on... 
[10:40:43] (03Abandoned) 10Majavah: hieradata: drop ldap-labtest acme-chier cert [puppet] - 10https://gerrit.wikimedia.org/r/885026 (owner: 10Majavah) [10:45:13] (03Abandoned) 10Hnowlan: Add discovery records for device-analytics [dns] - 10https://gerrit.wikimedia.org/r/917306 (https://phabricator.wikimedia.org/T335505) (owner: 10Hnowlan) [10:45:42] (03PS1) 10JMeybohm: site.pp: Split wikikube workers per DC [puppet] - 10https://gerrit.wikimedia.org/r/954615 (https://phabricator.wikimedia.org/T342534) [10:46:39] (03CR) 10Kamila Součková: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954593 (https://phabricator.wikimedia.org/T345394) (owner: 10Elukey) [10:49:07] (03CR) 10Hnowlan: [C: 03+1] site.pp: Split wikikube workers per DC [puppet] - 10https://gerrit.wikimedia.org/r/954615 (https://phabricator.wikimedia.org/T342534) (owner: 10JMeybohm) [10:51:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti6004.drmrs.wmnet [10:52:01] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10Aklapper) @lojo_wmde: Could you please answer the last comment? Thanks in advance! [10:59:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [11:01:59] (03CR) 10Jbond: Fix use of more than one src/dst sets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:02:52] (03CR) 10JMeybohm: [C: 03+2] site.pp: Split wikikube workers per DC [puppet] - 10https://gerrit.wikimedia.org/r/954615 (https://phabricator.wikimedia.org/T342534) (owner: 10JMeybohm) [11:03:43] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10JMeybohm) >>! In T342534#9137952, @Papaul wrote: > [...] 
> so 2025 and 2026 nodes had 2 roles, insetup and kubernetes::worker roles That was overlook... [11:05:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [11:05:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti6004.drmrs.wmnet [11:08:05] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: " - jbond@cumin1001 - T342534" [11:08:08] T342534: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 [11:08:13] (03PS1) 10Muehlenhoff: Remove LDAP access for vhargyono [puppet] - 10https://gerrit.wikimedia.org/r/954618 [11:08:56] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: " - jbond@cumin1001 - T342534" [11:11:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10jbond) FYi i just checked kubernetes2025 (via install-console), kubernetes2027 and kubernetes2029 and puppet seems to be running well now [11:14:13] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) I've put a very brief summary of using the cookbook on Wikitech here: https://wikitech.wikimedia.org/wiki/ZTP_Ne... 
[11:15:33] (03PS1) 10Urbanecm: beta: Do not reference image-suggestion-api.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) [11:15:54] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetdb7 cross pollination - https://phabricator.wikimedia.org/T338811 (10jbond) [11:15:56] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructure - https://phabricator.wikimedia.org/T341497 (10jbond) 05Open→03Resolved a:03jbond this has been completed [11:16:13] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for vhargyono [puppet] - 10https://gerrit.wikimedia.org/r/954618 (owner: 10Muehlenhoff) [11:16:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) [11:17:03] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [11:17:28] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) 05In progress→03Resolved a:03jbond This is now in place [11:19:08] (03PS1) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) [11:19:32] (03CR) 10CI reject: [V: 04-1] puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:21:56] (03PS2) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) 
[11:21:58] (03PS1) 10Jbond: puppetdb-api: switch dev sevices back to puppetdb-api [puppet] - 10https://gerrit.wikimedia.org/r/954647 (https://phabricator.wikimedia.org/T342214) [11:22:22] (03CR) 10CI reject: [V: 04-1] puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:23:07] (03PS3) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) [11:24:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43130/console" [puppet] - 10https://gerrit.wikimedia.org/r/954647 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:24:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43129/console" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:26:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:27:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:27:52] (03PS4) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) [11:29:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS 
(CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43131/console" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:30:38] (03PS2) 10Muehlenhoff: Fix use of more than one src/dst sets [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) [11:31:52] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:35:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1002.eqiad.wmnet with OS bullseye [11:36:55] 10Puppet, 10SRE: run-puppet-agent --quiet fails - https://phabricator.wikimedia.org/T345548 (10Volans) 05Open→03Resolved Change has been merged and by now deployed everywhere. Resolving. [11:37:26] (03CR) 10Muehlenhoff: [C: 03+1] "Awesome!" 
[puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:37:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:38:07] (03PS1) 10Jbond: puppetmasters: switch to HTTPSUrl [puppet] - 10https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811) [11:38:35] !log hnowlan@deploy1002 Started deploy [restbase/deploy@26bc1a5]: Add new wikis T343543 T343549 T345171 [11:38:40] T343549: Add suwikisource to RESTBase - https://phabricator.wikimedia.org/T343549 [11:38:41] T343543: Add blkwiktionary to RESTBase - https://phabricator.wikimedia.org/T343543 [11:38:41] T345171: Add tlywiki to RESTBase - https://phabricator.wikimedia.org/T345171 [11:42:26] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [11:42:26] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [11:42:26] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.31:7231/en.wikipedia.org/v1/media/math/check/tex: Ti 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:44:12] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/ [11:44:12] Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.100:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.100:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:44:32] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [11:44:32] /v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.113:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http:// [11:44:32] .113:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:44:44] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title 
from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi [11:44:44] /page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.125:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10. [11:44:44] 5:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:45:22] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikip [11:45:22] /v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/re [11:45:26] (03PS2) 10Btullis: Update Presto TLS configuration in production [puppet] - 10https://gerrit.wikimedia.org/r/709713 (https://phabricator.wikimedia.org/T273642) [11:46:10] PROBLEM - 
restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikip [11:46:10] /v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.179:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while down [11:46:10] http://10.64.48.179:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:46:24] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [11:46:24] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url htt [11:46:24] 
4.0.208:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:46:36] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-presto1002.eqiad.wmnet with reason: host reimage [11:46:58] (03PS1) 10Arturo Borrero Gonzalez: openstack: designate: override to enable cloud-private for designate [puppet] - 10https://gerrit.wikimedia.org/r/954654 (https://phabricator.wikimedia.org/T345240) [11:47:00] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{ [11:47:00] Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.165:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.165:7231/ [11:47:00] edia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:47:22] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out 
before a response was received: /en.wikipedi [11:47:22] /page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.97:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.97:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbas [11:47:43] looking [11:47:46] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:48:04] (03PS2) 10Arturo Borrero Gonzalez: openstack: designate: override to enable cloud-private for designate [puppet] - 10https://gerrit.wikimedia.org/r/954654 (https://phabricator.wikimedia.org/T345240) [11:48:32] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:49:18] (03PS1) 10Muehlenhoff: Rebuild Java images to update to latest OpenJDK 11 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954655 [11:49:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-presto1002.eqiad.wmnet with reason: host reimage [11:49:50] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954654 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [11:49:54] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: 
/en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [11:49:54] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [11:49:54] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.104:7231/en.wikipedia.org/v1/media/math/check/tex: T https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:50:02] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Could not fetch url http://10.64.16.38:7231/en.wikipedia.org/v1/page/title/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Generic connection error: HTTPConnectionPool(host=10.64.16.38, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/page/title/User%3ABSitzmann_%28WMF [11:50:02] S%2FTest%2FFrankenstein (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fd9f0695c18: Failed to establish a new connection: [Errno 111] Connection refused)): /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) is CRITICAL: Could not fetch url http://10.64.16.38:7231/en.wikipedia.org/v1/page/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein: Generic connection error: HTTPConnectionP [11:50:02] =10.64.16.38, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/page/html/User%3ABSitzmann_%28WMF%29%2FMCS%2FTest%2FFrankenstein (Caused by NewConnectionError(urlli 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:50:03] rolling back [11:50:38] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [11:50:38] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url htt [11:50:38] 4.16.117:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.117:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:50:42] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wik [11:50:42] rg/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:51:14] !log installing grub2 updates from Bullseye point relese [11:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:17] !log installing grub2 updates from Bullseye point release [11:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:38] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-ht [11:51:38] e} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:52:28] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi [11:52:28] /page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:52:28] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out 
before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [11:52:28] /v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:52:52] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/media/mat [11:52:52] {type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.71:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.71:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:53:07] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@26bc1a5]: Add new wikis T343543 T343549 T345171 (duration: 14m 32s) [11:53:12] T343549: Add suwikisource to RESTBase - https://phabricator.wikimedia.org/T343549 [11:53:12] T343543: Add blkwiktionary to RESTBase - https://phabricator.wikimedia.org/T343543 [11:53:13] T345171: Add tlywiki to RESTBase - https://phabricator.wikimedia.org/T345171 [11:53:14] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: 
/en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [11:53:14] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [11:53:14] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.117:7231/en.wikipedia.org/v1/media/math/check/tex: https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:53:18] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi [11:53:18] /page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbas [11:54:52] (03PS2) 10Jbond: puppetmasters: switch to HTTPSUrl [puppet] - 
10https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811) [11:54:54] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [11:54:54] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [11:54:54] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.31:7231/en.wikipedia.org/v1/media/math/check/tex: Ti https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:55:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet [11:56:12] (03PS1) 10Alexandros Kosiaris: Add a Hiera option to enable ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) [11:56:14] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on api canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) [11:56:16] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on appserver canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954658 (https://phabricator.wikimedia.org/T345561) [11:56:18] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on mwmaint hosts 
[puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) [11:56:20] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/954660 (https://phabricator.wikimedia.org/T345561) [11:56:22] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on dumps hosts [puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) [11:56:24] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954662 (https://phabricator.wikimedia.org/T345561) [11:56:26] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:56:26] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on jobrunner hosts [puppet] - 10https://gerrit.wikimedia.org/r/954663 (https://phabricator.wikimedia.org/T345561) [11:56:28] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/954664 (https://phabricator.wikimedia.org/T345561) [11:56:30] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on api hosts [puppet] - 10https://gerrit.wikimedia.org/r/954665 (https://phabricator.wikimedia.org/T345561) [11:56:32] (03PS1) 10Alexandros Kosiaris: Enable icu67 component on appserver hosts [puppet] - 10https://gerrit.wikimedia.org/r/954666 (https://phabricator.wikimedia.org/T345561) [11:56:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:57:46] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:59:27] (03CR) 10Muehlenhoff: Add a Hiera option to enable ICU67 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954656 
(https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [11:59:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2002.codfw.wmnet [12:00:18] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:00:20] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w [12:00:20] wikimedia.org/wiki/Services/Monitoring/restbase [12:00:20] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:00:28] (03CR) 10Alexandros Kosiaris: Add a Hiera option to enable ICU67 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:00:38] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.6 [12:00:38] 
:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.179:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:01:02] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:01:02] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:01:16] (03PS2) 10Alexandros Kosiaris: Add a Hiera option to enable ICU67 component [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) [12:01:18] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on api canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) [12:01:20] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on appserver canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954658 (https://phabricator.wikimedia.org/T345561) [12:01:22] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) [12:01:24] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/954660 (https://phabricator.wikimedia.org/T345561) [12:01:26] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on dumps hosts 
[puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) [12:01:29] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/954662 (https://phabricator.wikimedia.org/T345561) [12:01:31] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on jobrunner hosts [puppet] - 10https://gerrit.wikimedia.org/r/954663 (https://phabricator.wikimedia.org/T345561) [12:01:33] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/954664 (https://phabricator.wikimedia.org/T345561) [12:01:35] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on api hosts [puppet] - 10https://gerrit.wikimedia.org/r/954665 (https://phabricator.wikimedia.org/T345561) [12:01:37] (03PS2) 10Alexandros Kosiaris: Enable icu67 component on appserver hosts [puppet] - 10https://gerrit.wikimedia.org/r/954666 (https://phabricator.wikimedia.org/T345561) [12:01:44] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:02:02] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:02:28] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:02:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox2002.codfw.wmnet [12:02:47] (03PS3) 10Jbond: 
puppetmasters: switch to HTTPSUrl [puppet] - 10https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811) [12:02:52] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:02:52] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:03:26] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:03:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 27381 [12:04:18] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:04:29] (03PS4) 10Jbond: puppetmasters: switch to HTTPSUrl [puppet] - 10https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811) [12:04:56] (03CR) 10Muehlenhoff: Add a Hiera option to enable ICU67 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954656 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:05:17] 
(03Abandoned) 10Jbond: puppetmasters: switch to HTTPSUrl [puppet] - 10https://gerrit.wikimedia.org/r/954652 (https://phabricator.wikimedia.org/T338811) (owner: 10Jbond) [12:05:46] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:05:46] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:05:50] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:06:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox2002.codfw.wmnet [12:06:32] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:07:08] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected 
status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve anno [12:07:08] s returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:07:28] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:07:40] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:07:46] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:08:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox1002.eqiad.wmnet [12:09:08] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:10:06] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:10:30] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:10:48] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url 
http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:10:58] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:11:22] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:12:00] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:12:24] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:13:08] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:13:22] (03PS5) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) [12:13:24] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:13:24] (03PS1) 10Jbond: puppetmaster: add parameter to change the port that puppetdb runs 
[puppet] - 10https://gerrit.wikimedia.org/r/954669 (https://phabricator.wikimedia.org/T342214) [12:13:50] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:14:18] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:14:32] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:14:34] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:14:36] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:14:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43136/console" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:15:12] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:15:12] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the 
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:15:12] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:16:08] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:16:14] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:16:14] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:16:40] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:16:46] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is 
CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status [12:16:46] pecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:17:02] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:17:25] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb-api: switch dev sevices back to puppetdb-api [puppet] - 10https://gerrit.wikimedia.org/r/954647 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:17:38] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:17:40] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:18:00] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:18:02] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:18:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netbox1002.eqiad.wmnet [12:18:32] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:18:48] 
PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w [12:18:48] wikimedia.org/wiki/Services/Monitoring/restbase [12:18:48] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could n [12:18:48] url http://10.64.48.125:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.125:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:18:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 23): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43137/console" [puppet] - 10https://gerrit.wikimedia.org/r/954669 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:19:04] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:19:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetmaster: add parameter to change the port that puppetdb runs [puppet] - 10https://gerrit.wikimedia.org/r/954669 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:19:52] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:20:12] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:20:35] (03PS6) 10Jbond: puppetmaster: update to use new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) [12:20:46] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:21:54] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w [12:21:54] wikimedia.org/wiki/Services/Monitoring/restbase [12:22:14] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:22:25] (03PS1) 10Hnowlan: hieradata: remove restbase1030 from ratelimit list [puppet] - 10https://gerrit.wikimedia.org/r/954672 [12:22:44] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:22:48] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w [12:22:48] 
wikimedia.org/wiki/Services/Monitoring/restbase [12:22:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: remove restbase1030 from ratelimit list [puppet] - 10https://gerrit.wikimedia.org/r/954672 (owner: 10Hnowlan) [12:23:12] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:15] (03CR) 10Hnowlan: [C: 03+2] hieradata: remove restbase1030 from ratelimit list [puppet] - 10https://gerrit.wikimedia.org/r/954672 (owner: 10Hnowlan) [12:23:22] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status [12:23:22] pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:23:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 27381 [12:23:58] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 136065 [12:24:04] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:24:04] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:24:10] RECOVERY - restbase endpoints health 
on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:24:28] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:24:30] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:24:40] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:24:44] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:24:44] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:25:02] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:25:08] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:25:30] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: 
/en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:25:30] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:25:30] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:25:38] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:26:22] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:26:24] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 136065 [12:26:28] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:26:30] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:26:30] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline 
resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:26:34] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:26:34] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:26:36] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:26:54] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve anno [12:26:54] s returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:27:02] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links 
to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 [12:27:02] ng: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.121:7231/en.wikipedia.org/v1/media/math/c [12:27:02] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:27:14] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:27:18] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:27:48] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:27:54] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:28:20] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:28:20] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:28:52] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the 
unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:28:52] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:28:58] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 138884 [12:29:00] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w [12:29:00] wikimedia.org/wiki/Services/Monitoring/restbase [12:29:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954622 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:29:28] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:29:50] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:29:54] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline 
resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:29:54] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:29:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 138884 [12:30:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 149665 [12:30:12] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:30:12] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:30:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 149665 [12:30:22] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:30:26] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:30:26] 
RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:30:50] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:31:10] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:31:13] (03CR) 10Muehlenhoff: [C: 03+2] Fix use of more than one src/dst sets [puppet] - 10https://gerrit.wikimedia.org/r/954612 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:31:16] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:31:50] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:06] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve anno [12:32:06] s returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:12] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:14] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:18] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:38] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:06] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:30] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:34] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:36] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:58] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:34:06] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[12:34:32] (03PS1) 10Filippo Giunchedi: jaeger: match production opensearch replica/shard settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954675 (https://phabricator.wikimedia.org/T344952) [12:34:38] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve anno [12:34:38] s returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.97:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.97:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:34:38] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:34:38] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:34:50] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading 
http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:04] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:24] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:46] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:46] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status [12:35:46] pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:22] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:26] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:54] (03PS1) 10Hnowlan: hieradata: change restbase seeds to reflect downed node [puppet] - 10https://gerrit.wikimedia.org/r/954676 
[12:36:58] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retriev [12:36:58] cements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:26] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: change restbase seeds to reflect downed node [puppet] - 10https://gerrit.wikimedia.org/r/954676 (owner: 10Hnowlan) [12:37:48] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:58] (03CR) 10Hnowlan: [C: 03+2] hieradata: change restbase seeds to reflect downed node [puppet] - 10https://gerrit.wikimedia.org/r/954676 (owner: 10Hnowlan) [12:39:00] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:39:16] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:39:20] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:39:40] RECOVERY - restbase endpoints health on restbase1023 is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:39:44] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:39:46] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:39:46] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:39:46] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:39:54] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:40:48] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:41:12] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:43:32] 
RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:43:38] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:44:06] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:44:30] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [12:44:30] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:44:30] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:44:56] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:18] 10SRE, 10Infrastructure-Foundations: Cookbook sre.puppet.sync-netbox-hiera sets 'public' var for all IPv6 GUA to true - https://phabricator.wikimedia.org/T345473 (10jbond) @cmooney we did discuss this in the original task (329669#8744920) however there wasn't really a conclusion. the tl;dr is * its not curren... 
[12:45:20] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.117:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.117:7 [12:45:20] ikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:31] !log staggered restarting restbase service on A:restbase [12:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:56] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:12] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi [12:47:12] /page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.or [12:47:12] ia/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.31:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloadin 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:26] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [12:47:26] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [12:47:26] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.104:7231/en.wikipedia.org/v1/media/math/check/tex: T https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:26] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [12:47:26] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed 
out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [12:47:26] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.179:7231/en.wikipedia.org/v1/media/math/check/tex: https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:38] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [12:47:38] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [12:47:38] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.173:7231/en.wikipedia.org/v1/media/math/check/tex: https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:38] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out 
before a response was received: /en.wikip [12:47:38] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [12:47:38] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.0.208:7231/en.wikipedia.org/v1/media/math/check/tex: T https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:51] (03PS1) 10JMeybohm: kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) [12:47:56] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wi [12:47:56] org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while d [12:47:56] ng http://10.64.48.183:7231/en.wikipedia.org/v1/media/math/check/tex 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:07] (ProbeDown) firing: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:12] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:15] (03CR) 10CI reject: [V: 04-1] kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [12:48:20] yo [12:48:34] XioNoX: hnowlan and me are already debugging RESTBase [12:48:40] hnowlan: looks like it's related to your work? Everything under control? 
[12:48:43] it isn't feeling well for >1 h now [12:48:44] akosiaris: cool [12:48:52] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:56] it just paged, probably due to the restarts [12:49:00] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:02] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikip [12:49:02] /v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a res [12:49:02] s received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.38:7231/en.wikipedia.org/v1/media/math/check/tex: T https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:04] yeah that's why I'm here [12:49:11] XioNoX: apologies [12:49:16] no pb at all [12:49:20] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: 
/en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /en.wikipedi [12:49:20] /page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:36] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:36] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:40] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:44] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:52] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} 
(Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.64.16.125:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.64.16.125:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:56] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:56] I acked the page [12:49:58] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:24] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:26] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:26] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:38] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a 
response was received ht [12:50:38] kitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:44] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:56] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:02] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:14] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:14] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:34] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:36] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:36] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:51:50] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:52] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:56] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:52:06] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:52:10] let me know if you need help [12:52:19] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-presto1002.eqiad.wmnet with OS bullseye [12:52:27] will do [12:52:45] I acked the above page (cache_text) [12:53:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:07] (ProbeDown) resolved: Service restbase-https:7443 has failed probes (http_restbase-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#restbase-https:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:54] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:56] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:55:22] (03CR) 10Btullis: [C: 03+2] Add .bash_aliases file for btullis [puppet] - 10https://gerrit.wikimedia.org/r/952475 (owner: 10Btullis) [12:55:52] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudcontro1l005: open memcached to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240) [12:56:02] RECOVERY - 
restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:56:19] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [12:56:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [12:58:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: designate: override to enable cloud-private for designate [puppet] - 10https://gerrit.wikimedia.org/r/954654 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [12:58:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudcontro1l005: open memcached to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [12:59:01] (03PS2) 10Arturo Borrero Gonzalez: openstack: cloudcontro1l005: open memcached to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240) [13:00:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudcontro1l005: open memcached to cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954679 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [13:04:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop-image-suggestions.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:51] (03PS2) 10JMeybohm: kubernetes::master: Validate SA tokens with the certs of all masters [puppet] - 10https://gerrit.wikimedia.org/r/954677 
(https://phabricator.wikimedia.org/T329826) [13:10:18] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43138/console" [puppet] - 10https://gerrit.wikimedia.org/r/954677 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [13:12:01] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:15:18] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:17:52] (03PS1) 10Muehlenhoff: Remove visualdiff client/server from testreduce role [puppet] - 10https://gerrit.wikimedia.org/r/954682 (https://phabricator.wikimedia.org/T345220) [13:18:38] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:22:38] (03PS1) 10Cathal Mooney: Add includes for IPv6 reverse ranges for new linknets from CRs to SSW [dns] - 10https://gerrit.wikimedia.org/r/954684 (https://phabricator.wikimedia.org/T327938) [13:23:38] (03CR) 10CI reject: [V: 04-1] Add includes for IPv6 reverse ranges for new linknets from CRs to SSW [dns] - 10https://gerrit.wikimedia.org/r/954684 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [13:24:22] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw cr<-> ssw links. 
- cmooney@cumin1001" [13:25:03] (03PS2) 10Cathal Mooney: Add includes for IPv6 reverse ranges for new linknets from CRs to SSW [dns] - 10https://gerrit.wikimedia.org/r/954684 (https://phabricator.wikimedia.org/T327938) [13:27:01] (03CR) 10Cathal Mooney: [C: 03+2] Add includes for IPv6 reverse ranges for new linknets from CRs to SSW [dns] - 10https://gerrit.wikimedia.org/r/954684 (https://phabricator.wikimedia.org/T327938) (owner: 10Cathal Mooney) [13:27:22] 10SRE-swift-storage, 10Commons, 10Internet-Archive: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) Completing what @MatthewVernon correctly says, there are my findings: * File existed and was being in use at least in 2... [13:27:38] 10SRE-swift-storage, 10Commons, 10Internet-Archive: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) The file: F37653248 (note it is the original because it's sha1 is 8c6169221e33cb1857f183d46bb4d6d9177240f2 or gebtj7wmiz... [13:30:58] 10SRE-swift-storage, 10Commons, 10Internet-Archive, 10media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) My recommendation is to upload the original attached here as the latest version with a link to this t... 
[13:31:03] 10SRE-swift-storage, 10Commons, 10Internet-Archive, 10media-backups: Original version of File:2008 scalpelless vasectomy, post-op.JPG has disappeared - https://phabricator.wikimedia.org/T345521 (10jcrespo) p:05Triage→03High [13:41:11] (03CR) 10Btullis: Increase the kafka-jumbo maximum message size to 10 MB (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [13:41:29] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:45:46] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw cr<-> ssw links. - cmooney@cumin1001" [13:45:46] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:46:01] (03PS2) 10Muehlenhoff: pmacct: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/954287 [13:46:31] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:47:03] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:47:33] 10SRE, 10Infrastructure-Foundations: Cookbook sre.puppet.sync-netbox-hiera sets 'public' var for all IPv6 GUA to true - https://phabricator.wikimedia.org/T345473 (10cmooney) 05Open→03Resolved a:03cmooney >>! In T345473#9140191, @jbond wrote: > @cmooney we did discuss this in the original task (329669#874... [13:48:06] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:48:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:48:43] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[13:49:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1001.wikimedia.org [13:49:46] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw cr<-> ssw links. - cmooney@cumin1001" [13:49:46] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:50:03] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:50:32] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for codfw cr<-> ssw links. - cmooney@cumin1001" [13:50:32] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:53:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1001.wikimedia.org [13:54:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:54:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:55:25] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:56:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[13:58:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:58:24] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:58:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:58:58] (03PS1) 10Btullis: Update the maximum message size in kafka for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/954690 (https://phabricator.wikimedia.org/T344688) [14:00:31] (03PS1) 10Elukey: ml-services: set minReplicas to 1 for drafttopic's staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/954691 [14:04:02] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add jaeger collector to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/952151 (https://phabricator.wikimedia.org/T344253) (owner: 10Filippo Giunchedi) [14:04:38] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:07:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff) [14:07:43] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: make it talk to cloudcontrol via cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954692 (https://phabricator.wikimedia.org/T345240) [14:08:04] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954692 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [14:08:58] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:35] (03PS8) 10Majavah: openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 [14:09:38] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:11:38] (03CR) 10Elukey: [C: 03+2] ml-services: set minReplicas to 1 for drafttopic's staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/954691 (owner: 10Elukey) [14:11:56] (03CR) 10Majavah: [C: 03+2] openstack: encapi: drop legacy ports [puppet] - 10https://gerrit.wikimedia.org/r/874814 (owner: 10Majavah) [14:11:59] (03PS11) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [14:14:29] (03PS1) 10Jcrespo: dbbackups: Update backup source for s2, x1 to be MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/954693 [14:16:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: make it talk to cloudcontrol via cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/954692 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [14:17:32] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Update backup source for s2, x1 to be MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/954693 (owner: 10Jcrespo) [14:18:28] mine can be merged if there is a conflict [14:18:30] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:18:52] there wasn't [14:18:59] (JobUnavailable) firing: (3) Reduced availability for job nutcracker in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:25] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10Vgutierrez) the key needs to be uploaded to the puppet repo, you could use this CR as an example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/949839 or I could craft a new one fo... [14:21:05] (03PS1) 10Elukey: ml-services: set min/max replicas for Outlink in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954695 [14:23:35] (03CR) 10AikoChou: [C: 03+1] ml-services: set min/max replicas for Outlink in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954695 (owner: 10Elukey) [14:24:26] (03CR) 10Elukey: [C: 03+2] ml-services: set min/max replicas for Outlink in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/954695 (owner: 10Elukey) [14:27:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:48] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [14:28:27] (03PS1) 10Cathal Mooney: Homer YAML additions for new row A/B switches in Codfw [homer/public] - 10https://gerrit.wikimedia.org/r/954697 (https://phabricator.wikimedia.org/T327938) [14:29:14] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: additional keystone overrides for cloud-private migration [puppet] - 10https://gerrit.wikimedia.org/r/954698 (https://phabricator.wikimedia.org/T345240) [14:29:20] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[14:29:43] (03CR) 10JMeybohm: [C: 03+1] mesh: new networkpolicy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/954210 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [14:29:47] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) >>! In T345265#9134920, @kamila wrote: > Thank you @Trizek-WMF ! The message looks good. Maybe I'd... [14:29:58] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954698 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [14:30:36] (03CR) 10JMeybohm: [C: 03+1] Rebuild Java images to update to latest OpenJDK 11 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954655 (owner: 10Muehlenhoff) [14:31:24] !log bounce prometheus@k8s-aux [14:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: additional keystone overrides for cloud-private migration [puppet] - 10https://gerrit.wikimedia.org/r/954698 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [14:32:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:32:35] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10kamila) Thank you @Trizek-WMF, sounds good! 
I will ping you regarding translations :-) [14:32:49] (03PS1) 10Majavah: cr-labs: Remove port filter on Prometheus term [homer/public] - 10https://gerrit.wikimedia.org/r/954699 [14:33:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cr-labs: Remove port filter on Prometheus term [homer/public] - 10https://gerrit.wikimedia.org/r/954699 (owner: 10Majavah) [14:35:19] (03PS1) 10Alexandros Kosiaris: Add temporary buster-based PHP7.4 icu67 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954700 (https://phabricator.wikimedia.org/T329491) [14:38:29] (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:42:40] (03PS1) 10Ayounsi: Add MTU 9000 as valid option for NTT VPLS [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/954702 (https://phabricator.wikimedia.org/T336828) [14:43:28] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [14:43:57] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Rebuild Java images to update to latest OpenJDK 11 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/954655 (owner: 10Muehlenhoff) [14:44:05] (03CR) 10Ayounsi: [C: 03+2] Add MTU 9000 as valid option for NTT VPLS [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/954702 (https://phabricator.wikimedia.org/T336828) (owner: 10Ayounsi) [14:44:38] (03Merged) 10jenkins-bot: Add MTU 9000 as valid option for NTT VPLS [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/954702 (https://phabricator.wikimedia.org/T336828) (owner: 10Ayounsi) [14:44:56] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:45:06] (03PS12) 10Elukey: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 
10Klausman) [14:45:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:46:09] (03PS1) 10Muehlenhoff: Failover irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/954703 [14:47:03] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:48:07] (03CR) 10Ayounsi: [C: 03+1] pmacct: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/954287 (owner: 10Muehlenhoff) [14:51:29] (03CR) 10Elukey: LiftWing: add latency/availability SLO dashboards (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (https://phabricator.wikimedia.org/T327620) (owner: 10Klausman) [14:53:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:54:07] (03CR) 10Muehlenhoff: wikitech: Disable password resets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954076 (https://phabricator.wikimedia.org/T345226) (owner: 10Majavah) [14:54:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, and matches the production ferm rules (i.e. 
traffic from prometheus host is allowed regardless of ports)" [homer/public] - 10https://gerrit.wikimedia.org/r/954699 (owner: 10Majavah) [14:57:15] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: cloudlb: disable older designate backends [puppet] - 10https://gerrit.wikimedia.org/r/954704 (https://phabricator.wikimedia.org/T345240) [14:57:44] !log installing json-c security updates [14:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:53] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954704 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [14:59:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: cloudlb: disable older designate backends [puppet] - 10https://gerrit.wikimedia.org/r/954704 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [14:59:27] 10SRE, 10serviceops-radar, 10Patch-For-Review: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10akosiaris) I 've uploaded changes for icu67 php7.4 images for use with a shellbox deployment. I 'll also create a temporary shellbox deployment based on those. [15:03:58] (03PS1) 10Filippo Giunchedi: hieradata: set jaeger components services to production [puppet] - 10https://gerrit.wikimedia.org/r/954705 (https://phabricator.wikimedia.org/T344253) [15:04:17] 10SRE, 10Infrastructure-Foundations, 10netops: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) 05Open→03Resolved a:03ayounsi This is now working in prod. 
[15:04:23] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) [15:07:32] (03CR) 10Filippo Giunchedi: [C: 03+2] mesh: new networkpolicy version [deployment-charts] - 10https://gerrit.wikimedia.org/r/954210 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [15:07:36] (03PS2) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [15:07:43] (03CR) 10Ayounsi: [C: 03+2] cr-labs: Remove port filter on Prometheus term [homer/public] - 10https://gerrit.wikimedia.org/r/954699 (owner: 10Majavah) [15:08:00] (03PS3) 10AOkoth: vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) [15:08:17] (03Merged) 10jenkins-bot: cr-labs: Remove port filter on Prometheus term [homer/public] - 10https://gerrit.wikimedia.org/r/954699 (owner: 10Majavah) [15:17:45] (03CR) 10Btullis: [V: 03+2 C: 03+2] Build production-images based on spark 3.3.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952476 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [15:21:02] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/954707 [15:24:36] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: allow cloudlb's haproxy connectivity [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) [15:25:03] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [15:26:10] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954709 (https://phabricator.wikimedia.org/T128546) 
[15:26:15] (03CR) 10Majavah: [C: 04-1] cloudservices1006: allow cloudlb's haproxy connectivity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [15:26:19] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [15:27:10] (03PS2) 10Arturo Borrero Gonzalez: cloudservices1006: allow cloudlb's haproxy connectivity [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) [15:27:19] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [15:27:30] (03CR) 10Arturo Borrero Gonzalez: cloudservices1006: allow cloudlb's haproxy connectivity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [15:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T1530). Please do the needful. 
[15:30:46] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954709 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:31:27] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954709 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:33:29] (03PS7) 10Filippo Giunchedi: mesh: add tracing support [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) [15:34:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: allow cloudlb's haproxy connectivity [puppet] - 10https://gerrit.wikimedia.org/r/954708 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [15:34:33] (03CR) 10Filippo Giunchedi: [C: 03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/954707 (owner: 10Muehlenhoff) [15:35:22] (03CR) 10Filippo Giunchedi: mesh: add tracing support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953576 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [15:36:43] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: override deisngate servers [puppet] - 10https://gerrit.wikimedia.org/r/954712 (https://phabricator.wikimedia.org/T345240) [15:37:24] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/954712 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [15:40:42] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:954709| Bumping portals to master (T128546)]] (duration: 07m 01s) [15:40:45] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:43:24] (03CR) 10Elukey: "Added some nits, most of the code is scaffolded so it 
should be fine. Do you install sextant to run the create_service.sh script right? If" [deployment-charts] - 10https://gerrit.wikimedia.org/r/948689 (https://phabricator.wikimedia.org/T339890) (owner: 10Kevin Bazira) [15:43:58] (03PS1) 10AikoChou: ml-services: tune autoscaling for outlink isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/954715 (https://phabricator.wikimedia.org/T344058) [15:44:29] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) [15:46:56] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:954709| Bumping portals to master (T128546)]] (duration: 06m 14s) [15:47:00] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:50:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices1006: override deisngate servers [puppet] - 10https://gerrit.wikimedia.org/r/954712 (https://phabricator.wikimedia.org/T345240) (owner: 10Arturo Borrero Gonzalez) [15:55:07] (03CR) 10Elukey: [C: 03+1] ml-services: tune autoscaling for outlink isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/954715 (https://phabricator.wikimedia.org/T344058) (owner: 10AikoChou) [15:56:51] (03CR) 10AikoChou: [C: 03+2] ml-services: tune autoscaling for outlink isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/954715 (https://phabricator.wikimedia.org/T344058) (owner: 10AikoChou) [15:57:35] (03Merged) 10jenkins-bot: ml-services: tune autoscaling for outlink isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/954715 (https://phabricator.wikimedia.org/T344058) (owner: 10AikoChou) [16:05:53] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[16:06:46] !log setting port 1/1/5 to speed 100G on cr1-codfw [16:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:28] !log setting port 1/1/5 to speed 100G on cr2-codfw [16:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:48] (03PS1) 10DDesouza: Deploy Campaigns Event Discovery survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954720 (https://phabricator.wikimedia.org/T345158) [16:14:39] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [16:28:44] (03PS1) 10FNegri: [toolsdb] Enable parallel replication [puppet] - 10https://gerrit.wikimedia.org/r/954722 (https://phabricator.wikimedia.org/T345450) [16:33:46] (03PS1) 10DDesouza: Pre-deploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954724 (https://phabricator.wikimedia.org/T344393) [16:38:19] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [16:41:55] (03PS1) 10Majavah: hieradata: add cloudservices1006 to all designate fw rules [puppet] - 10https://gerrit.wikimedia.org/r/954726 (https://phabricator.wikimedia.org/T345240) [16:44:05] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43141/console" [puppet] - 10https://gerrit.wikimedia.org/r/954726 (https://phabricator.wikimedia.org/T345240) (owner: 10Majavah) [16:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:58:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T1700) [17:00:04] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T1700). [17:02:36] 10SRE, 10MediaWiki-General, 10MediaWiki-libs-Stats, 10observability, and 5 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Krinkle) [17:04:51] 10SRE, 10SRE-Access-Requests: ppenloglou sharing wmcs and production ssh key - https://phabricator.wikimedia.org/T345132 (10ppenloglou) Could you kindly give me a hand with this @Vgutierrez whenever you have a spare moment? 
[17:11:41] (03CR) 10Subramanya Sastry: [C: 03+1] Remove visualdiff client/server from testreduce role [puppet] - 10https://gerrit.wikimedia.org/r/954682 (https://phabricator.wikimedia.org/T345220) (owner: 10Muehlenhoff) [17:51:20] RECOVERY - snapshot of s2 in eqiad on backupmon1001 is OK: Last snapshot for s2 at eqiad (db1225) taken on 2023-09-04 16:39:05 (1068 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [18:00:46] jouncebot: nowandnext [18:00:46] No deployments scheduled for the next 2 hour(s) and 59 minute(s) [18:00:46] In 2 hour(s) and 59 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T2100) [18:02:50] 10SRE, 10serviceops, 10Datacenter-Switchover: Switchover cookbooks live test - https://phabricator.wikimedia.org/T345588 (10kamila) [18:05:02] (03CR) 10Ladsgroup: [C: 03+1] Failover irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/954703 (owner: 10Muehlenhoff) [18:06:56] (03CR) 10Zabe: "I think this would cause some interwiki prefixes to be removed (like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/952984" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: 10Winston Sung) [18:07:38] (03Abandoned) 10Zabe: Make MediaWiki treat $lang of be_x_oldwiki as be-tarask, just don't change the real DB name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (https://phabricator.wikimedia.org/T111853) (owner: 10Alex Monk) [18:14:34] PROBLEM - Host mw2448 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:02] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:30:07] (03PS1) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 
10https://gerrit.wikimedia.org/r/954739 [18:34:29] (03CR) 10CI reject: [V: 04-1] puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [18:53:05] (03CR) 10Winston Sung: SiteMatrix config: Remove deprecated language codes from the list (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: 10Winston Sung) [18:59:39] (03CR) 10Winston Sung: SiteMatrix config: Remove deprecated language codes from the list (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/953650 (https://phabricator.wikimedia.org/T172035) (owner: 10Winston Sung) [19:08:56] RECOVERY - snapshot of x1 in eqiad on backupmon1001 is OK: Last snapshot for x1 at eqiad (db1225) taken on 2023-09-04 18:27:37 (370 GiB, -0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [19:09:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:14:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:59:14] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Cannot make SSL connection. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:59:24] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:59:36] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:00:20] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:00:48] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:05:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:05:54] PROBLEM - grafana.wikimedia.org on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [20:06:32] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 07 Feb 2027 06:17:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:06:42] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 07 Feb 2027 06:17:23 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:12:47] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 565 bytes in 7.237 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:12:47] RECOVERY - grafana.wikimedia.org on grafana1002 is OK: HTTP OK: HTTP/1.1 200 OK - 128346 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [20:12:47] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:00:05] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230904T2100). [21:11:20] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:42:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:47:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:00:32] 10SRE-swift-storage, 10collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10eoghan) We've wrapped up testing on this for the moment, and we're fairly happy that it's where we want to go in the future. We're going to hold off until a little later in the FY... [22:23:58] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:58:54] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down